Is This An Appropriate Use Of Python's Built-in Hash Function?

June 09, 2024

I need to compare large chunks of data for equality, and I need to compare many pairs per second, fast. Each object is guaranteed to be the same length, it is possible and likely t

Solution 1:

Python's hash function is designed for speed, and on 64-bit builds it maps into a 64-bit space. Due to the birthday paradox, this means you reach roughly even odds of a collision at about 5 billion entries (probably way earlier, since the hash function is not cryptographic). Also, the precise definition of hash is up to the Python implementation, and may be architecture- or even machine-specific; in Python 3, str and bytes hashes are additionally randomized per process unless PYTHONHASHSEED is set. Don't use it if you want the same result on multiple machines.
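To make the birthday-paradox figure concrete, here is a rough sketch of the usual approximation n ≈ sqrt(2 · N · ln(1 / (1 − p))) for a hash space of size N. It assumes the hash values are uniformly distributed, which the built-in hash does not promise, and the function name is made up for illustration.

```python
import math

def entries_for_collision_probability(bits: int, p: float = 0.5) -> float:
    """Approximate how many uniformly random hashes are needed
    before a collision occurs with probability p (birthday bound)."""
    space = 2.0 ** bits
    return math.sqrt(2.0 * space * math.log(1.0 / (1.0 - p)))

n64 = entries_for_collision_probability(64)    # about 5 billion
n128 = entries_for_collision_probability(128)  # about 2e19

print(f"64-bit space:  ~{n64:.2e} entries for a 50% collision chance")
print(f"128-bit space: ~{n128:.2e} entries for a 50% collision chance")
```

For a 64-bit space this lands at roughly 5 billion entries, which matches the estimate above; a non-uniform hash will collide sooner.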

MD5 was designed as a cryptographic hash function; even slight perturbations in the input totally change the output. It also maps into a 128-bit space, which makes it unlikely you'll ever encounter an accidental collision at all. Deliberate MD5 collisions are practical today, though, so if the data may come from someone who is specifically looking for one, prefer SHA-2.
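A small sketch of the "slight perturbation" property: the two inputs below differ in a single bit, and the snippet counts how many of the 128 output bits differ. The inputs are arbitrary examples chosen for illustration.

```python
import hashlib

# 'd' is 0x64 and 'e' is 0x65, so the inputs differ in exactly one bit.
a = hashlib.md5(b"hello world").digest()
b = hashlib.md5(b"hello worle").digest()

# XOR the digests byte by byte and count the set bits.
differing_bits = sum(bin(x ^ y).count("1") for x, y in zip(a, b))

print(len(a) * 8, "bit digest;", differing_bits, "bits differ")
```

For a well-mixed hash you would expect roughly half of the output bits to change, though the exact count depends on the inputs.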

If you can handle collisions (i.e. by testing for equality between all members of a bucket, possibly using a cryptographic hash like MD5 or SHA-2), Python's hash function is perfectly fine.
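One way to handle collisions is to use hash() only as a cheap pre-filter and confirm every candidate match with a real comparison. The following is a minimal sketch, assuming the chunks are bytes objects held in memory; the helper name group_equal_chunks is hypothetical.

```python
from collections import defaultdict

def group_equal_chunks(chunks):
    """Return lists of indices whose chunks are truly equal.

    hash() decides which bucket a chunk lands in; == decides
    whether two chunks in the same bucket are really the same.
    """
    buckets = defaultdict(list)  # hash value -> list of index groups
    for i, chunk in enumerate(chunks):
        candidates = buckets[hash(chunk)]
        for group in candidates:
            if chunks[group[0]] == chunk:  # confirm, do not trust the hash
                group.append(i)
                break
        else:
            candidates.append([i])  # new distinct chunk in this bucket
    return [group for groups in buckets.values() for group in groups]

chunks = [b"aaaa", b"bbbb", b"aaaa", b"cccc", b"bbbb"]
groups = group_equal_chunks(chunks)
print(sorted(groups))
```

Because equality is always confirmed with ==, a hash collision only costs an extra comparison and never produces a wrong answer.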

One more thing: to save space, store the hashes in binary form if you write them to disk, e.g. struct.pack('!q', hash('abc')) or hashlib.md5(b'abc').digest(). Note that in Python 3, hashlib.md5 requires bytes rather than str.
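As a rough illustration of the space saving, the sketch below compares the binary and hexadecimal forms. It assumes a 64-bit CPython build, where hash() fits in a signed 8-byte integer; the variable names are illustrative only.

```python
import hashlib
import struct

data = b"abc"

# Built-in hash packed as a big-endian signed 64-bit integer: 8 bytes.
packed_builtin = struct.pack("!q", hash(data))

# Raw MD5 digest: 16 bytes, versus 32 characters for the hex form.
packed_md5 = hashlib.md5(data).digest()
hex_md5 = hashlib.md5(data).hexdigest()

# The packed value round-trips within the same process.
restored = struct.unpack("!q", packed_builtin)[0]

print(len(packed_builtin), len(packed_md5), len(hex_md5))
```

Keep in mind that the packed built-in hash is only meaningful within the process that produced it unless hash randomization is disabled, whereas the MD5 digest is stable everywhere.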

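As a side note: the is operator is not equivalent to == in Python. is tests whether two names refer to the same object, while == tests whether their values are equal. You mean ==.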
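A quick sketch of the difference, using two separately built bytes objects with identical contents:

```python
# Build two distinct bytes objects that hold the same 1000 bytes.
a = bytes(bytearray(b"x" * 1000))
b = bytes(bytearray(b"x" * 1000))

equal = (a == b)        # compares contents
same_object = (a is b)  # compares object identity

print(equal, same_object)  # True False
```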
