I was trying to use CRC32() to uniquely identify distinctive domain names because it’s probably the most economical in MySQL datatype (int only takes 4 bytes) comparing to MD5() as a 32-char string. However, it seems hash collisions with CRC32 occur too frequently. Just in a set of about 800k string, got a bunch of duplicate ids. For eg:
fc0591 => 1009521187 123rainerbommert => 1009521187
Also, using the length of the string to make additional distinction does not seem to help. For eg, both has length of 9, same CRC32 but different strings
a1sellers => 3605292603 advertees => 3605292603
Some stats: out of over 7M domains, about 6200 collisions (duplicate CRC32 hashes, mostly twice, but 3 cases 3 strings yield the same hash)
Leave a Reply