Sunday, April 16, 2006

CRC32 hash collision

I was trying to use CRC32() to uniquely identify distinctive domain names because it's probably the most economical in MySQL datatype (int only takes 4 bytes) comparing to MD5() as a 32-char string. However, it seems hash collisions with CRC32 occur too frequently. Just in a set of about 800k string, got a bunch of duplicate ids. For eg:
fc0591 => 1009521187
123rainerbommert => 1009521187
Also, using the length of the string to make additional distinction does not seem to help. For eg, both has length of 9, same CRC32 but different strings
a1sellers => 3605292603
advertees => 3605292603
Some stats: out of over 7M domains, about 6200 collisions (duplicate CRC32 hashes, mostly twice, but 3 cases 3 strings yield the same hash)

2 Comments:

At 7:04 PM, Blogger Brian Stinar said...

Do you have this data available? Now that CRC32 is being supported in the SSE4 instruction set, I'm getting interested in collision rates for this. I haven't seen anything on Google about CRC32 collision rates, but I haven't looked exhaustively.

Thanks for the post.

 
At 5:32 AM, Blogger Dennis Fyodorov said...

The length of string make sence, but the string should not be to small try too use crc32(md5(domain)) helped in my case.

 

Post a Comment

<< Home