CRC32 hash collision

I was trying to use CRC32() to uniquely identify distinctive domain names because it’s probably the most economical in MySQL datatype (int only takes 4 bytes) comparing to MD5() as a 32-char string. However, it seems hash collisions with CRC32 occur too frequently. Just in a set of about 800k string, got a bunch of duplicate ids. For eg:

fc0591 => 1009521187
123rainerbommert => 1009521187

Also, using the length of the string to make additional distinction does not seem to help. For eg, both has length of 9, same CRC32 but different strings

a1sellers => 3605292603
advertees => 3605292603

Some stats: out of over 7M domains, about 6200 collisions (duplicate CRC32 hashes, mostly twice, but 3 cases 3 strings yield the same hash)

Posted

April 16, 2006

Uncategorized

admin

Tags:

Comments

2 responses to “CRC32 hash collision”

Brian Stinar

November 17, 2008

Do you have this data available? Now that CRC32 is being supported in the SSE4 instruction set, I’m getting interested in collision rates for this. I haven’t seen anything on Google about CRC32 collision rates, but I haven’t looked exhaustively.

Thanks for the post.

Reply
Dennis Fyodorov

March 25, 2009

The length of string make sence, but the string should not be to small try too use crc32(md5(domain)) helped in my case.

Reply

CRC32 hash collision

Comments

2 responses to “CRC32 hash collision”

Leave a Reply Cancel reply