Is This An Appropriate Use Of Python's Built-in Hash Function?
Solution 1:
Python's hash function is designed for speed and, on a 64-bit build, maps into a 64-bit space. Due to the birthday paradox, this means you'll likely get a collision at about 5 billion entries (probably way earlier, since the hash function is not cryptographic). Also, the precise definition of hash is up to the Python implementation, and may be architecture- or even machine-specific. In Python 3.3 and later, str and bytes hashes are also salted per process by default, so they can differ between two runs on the same machine. Don't use it if you want the same result on multiple machines.
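To see where the 5 billion figure comes from, here is a rough sketch of the usual birthday-bound approximation. It assumes hash values are uniformly distributed, which Python's hash does not guarantee, and birthday_threshold is a name made up for this illustration.

```python
import math

def birthday_threshold(bits, p=0.5):
    """Approximate number of uniformly random values drawn from a
    2**bits space before a collision occurs with probability p.

    Uses the standard approximation n ~ sqrt(2 * N * ln(1 / (1 - p))).
    """
    space = 2 ** bits
    return math.sqrt(2 * space * math.log(1 / (1 - p)))

n64 = birthday_threshold(64)    # 64-bit hash space
n128 = birthday_threshold(128)  # 128-bit digest space, e.g. MD5

print(f"64-bit:  about {n64:.2e} entries for a 50% collision chance")
print(f"128-bit: about {n128:.2e} entries for a 50% collision chance")
```

The 64-bit result lands at roughly 5 billion entries, while the 128-bit one is about 2^32 times larger, which is why accidental collisions in a 128-bit digest are not a practical worry.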
MD5 was designed as a cryptographic hash function; even slight perturbations in the input totally change the output. It also maps into a 128-bit space, which makes it unlikely you'll ever encounter a collision at all unless you're specifically looking for one (deliberately constructed MD5 collisions are practical, so don't rely on it against adversarial input).
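As a quick illustration of that behaviour, the sketch below hashes two strings that differ in a single character and counts how many of the 128 output bits differ. md5_hex is a helper name invented here, and encoding the text as UTF-8 is an assumption about how you want to treat strings.

```python
import hashlib

def md5_hex(text):
    """Return the MD5 digest of a string as 32 hex characters.

    hashlib works on bytes, so the text is encoded first (UTF-8 assumed).
    """
    return hashlib.md5(text.encode("utf-8")).hexdigest()

a = md5_hex("abc")
b = md5_hex("abd")  # one character changed

# XOR the two 128-bit values and count the set bits to see how many differ.
differing_bits = bin(int(a, 16) ^ int(b, 16)).count("1")

print(a)
print(b)
print(f"{differing_bits} of 128 bits differ")
```

Unlike hash, the digest depends only on the input bytes, so it comes out the same on every machine and in every process.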
If you can handle collisions (i.e. test for equality between all members of a bucket, possibly by comparing a cryptographic digest such as MD5 or SHA-2), Python's hash function is perfectly fine.
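A minimal sketch of what "handling collisions" can look like: items are grouped by their hash value, and membership is decided by an equality test inside the bucket. HashBuckets is a hypothetical class written for this example; Python's own set and dict already follow the same hash-then-compare idea, so in practice you would normally just use those.

```python
from collections import defaultdict

class HashBuckets:
    """Toy container: bucket by hash(), confirm with == inside the bucket."""

    def __init__(self):
        self._buckets = defaultdict(list)

    def add(self, item):
        """Store item unless an equal one is already present.

        Returns True if the item was added, False if it was a duplicate.
        """
        bucket = self._buckets[hash(item)]
        if item in bucket:  # equality check resolves hash collisions
            return False
        bucket.append(item)
        return True

    def __contains__(self, item):
        # .get avoids creating an empty bucket for a mere lookup
        return item in self._buckets.get(hash(item), [])


seen = HashBuckets()
first = seen.add("abc")
again = seen.add("abc")   # duplicate, rejected
minus_one = seen.add(-1)
minus_two = seen.add(-2)  # kept even if it shares a bucket with -1
```

In CPython, for example, hash(-1) and hash(-2) are both -2, so those two values end up in the same bucket and only the equality test tells them apart.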
One more thing: to save space, you should store the data in binary form if you write it to disk, i.e. struct.pack('!q', hash('abc')) or hashlib.md5(b'abc').digest(). (In Python 3, hashlib.md5 needs bytes rather than str.)
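For a sense of the space saving, this sketch compares the binary forms with their text equivalents. It assumes Python 3, where hashlib.md5 takes bytes, and uses the signed 8-byte format '!q', which is wide enough for hash values on a 64-bit build.

```python
import hashlib
import struct

# Built-in hash packed as a signed 64-bit big-endian integer: 8 bytes.
packed = struct.pack("!q", hash("abc"))

# Raw MD5 digest: 16 bytes, versus 32 characters for the hex form.
digest = hashlib.md5(b"abc").digest()
hex_digest = hashlib.md5(b"abc").hexdigest()

# Reading a packed value back; unpack always returns a tuple.
(restored,) = struct.unpack("!q", packed)

print(len(packed), len(digest), len(hex_digest))
```

Keep in mind that a packed hash value is only meaningful to the process that produced it if string hash randomization is enabled, whereas the MD5 digest can be compared across runs and machines.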
As a side note: the is operator is not equivalent to == in Python. You mean ==.
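The difference is easy to see with two lists that hold the same values but are separate objects. The variable names below are made up for the illustration.

```python
a = [1, 2, 3]
b = list(a)  # a new list with equal contents
alias = a    # a second name for the very same object

same_value = (a == b)       # compares contents
same_object = (a is b)      # compares identity
alias_is_same = (alias is a)

print(same_value, same_object, alias_is_same)
```

So == asks whether two objects are equal, while is asks whether they are the same object. When comparing hash values or digests, equality is what you want.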