CRC32 hash collision

I was trying to use CRC32() to uniquely identify distinctive domain names because it’s probably the most economical in MySQL datatype (int only takes 4 bytes) comparing to MD5() as a 32-char string. However, it seems hash collisions with CRC32 occur too frequently. Just in a set of about 800k string, got a bunch of duplicate ids. For eg:

fc0591 => 1009521187
123rainerbommert => 1009521187

Also, using the length of the string to make additional distinction does not seem to help. For eg, both has length of 9, same CRC32 but different strings

a1sellers => 3605292603
advertees => 3605292603

Some stats: out of over 7M domains, about 6200 collisions (duplicate CRC32 hashes, mostly twice, but 3 cases 3 strings yield the same hash)


Posted

in

by

Tags:

Comments

2 responses to “CRC32 hash collision”

  1. Brian Stinar Avatar
    Brian Stinar

    Do you have this data available? Now that CRC32 is being supported in the SSE4 instruction set, I’m getting interested in collision rates for this. I haven’t seen anything on Google about CRC32 collision rates, but I haven’t looked exhaustively.

    Thanks for the post.

  2. Dennis Fyodorov Avatar
    Dennis Fyodorov

    The length of string make sence, but the string should not be to small try too use crc32(md5(domain)) helped in my case.

Leave a Reply

Your email address will not be published. Required fields are marked *