CRC32 hash collision

I was trying to use CRC32() to uniquely identify distinctive domain names because it’s probably the most economical in MySQL datatype (int only takes 4 bytes) comparing to MD5() as a 32-char string. However, it seems hash collisions with CRC32 occur too frequently. Just in a set of about 800k string, got a bunch of duplicate ids. For eg:

fc0591 => 1009521187
123rainerbommert => 1009521187

Also, using the length of the string to make additional distinction does not seem to help. For eg, both has length of 9, same CRC32 but different strings

a1sellers => 3605292603
advertees => 3605292603

Some stats: out of over 7M domains, about 6200 collisions (duplicate CRC32 hashes, mostly twice, but 3 cases 3 strings yield the same hash)

Comments (2)

  1. 3:04 am, November 17, 2008Brian Stinar  / Reply

    Do you have this data available? Now that CRC32 is being supported in the SSE4 instruction set, I’m getting interested in collision rates for this. I haven’t seen anything on Google about CRC32 collision rates, but I haven’t looked exhaustively.

    Thanks for the post.

  2. 12:32 pm, March 25, 2009Dennis Fyodorov  / Reply

    The length of string make sence, but the string should not be to small try too use crc32(md5(domain)) helped in my case.

Leave a Reply to Dennis Fyodorov Cancel reply

Allowed Tags - You may use these HTML tags and attributes in your comment.

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>

Pingbacks (0)

› No pingbacks yet.