I was trying to use CRC32() to uniquely identify distinctive domain names because it’s probably the most economical in MySQL datatype (int only takes 4 bytes) comparing to MD5() as a 32-char string. However, it seems hash collisions with CRC32 occur too frequently. Just in a set of about 800k string, got a bunch of duplicate ids. For eg:
fc0591 => 1009521187 123rainerbommert => 1009521187
Also, using the length of the string to make additional distinction does not seem to help. For eg, both has length of 9, same CRC32 but different strings
a1sellers => 3605292603 advertees => 3605292603
Some stats: out of over 7M domains, about 6200 collisions (duplicate CRC32 hashes, mostly twice, but 3 cases 3 strings yield the same hash)
3:04 am, November 17, 2008Brian Stinar /
Do you have this data available? Now that CRC32 is being supported in the SSE4 instruction set, I’m getting interested in collision rates for this. I haven’t seen anything on Google about CRC32 collision rates, but I haven’t looked exhaustively.
Thanks for the post.
12:32 pm, March 25, 2009Dennis Fyodorov /
The length of string make sence, but the string should not be to small try too use crc32(md5(domain)) helped in my case.