Why Does Dict Have Worst Case O(n) For So Many Operations?
Solution 1:
Dict is O(1) for most operations, except for operations that touch all elements, such as iteration and copy (in which case, it's obviously O(n)).
See: http://wiki.python.org/moin/TimeComplexity
It has O(n) worst case, because you can always contrive a pathological example where all the keys have the same hash value.
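A minimal sketch of such a pathological case, using a hypothetical `BadKey` class whose `__hash__` always returns the same value, so every key collides and each lookup degrades to a linear walk:

```python
# A key type whose hash is deliberately constant: every instance
# collides, so dict lookups degrade to a linear probe over entries.
class BadKey:
    def __init__(self, value):
        self.value = value

    def __hash__(self):
        return 42  # every key lands in the same hash slot

    def __eq__(self, other):
        return isinstance(other, BadKey) and self.value == other.value

d = {BadKey(i): i for i in range(1000)}

# Lookups still return the right answer, but each one must walk
# the collision chain -- O(n) instead of the usual O(1).
assert d[BadKey(500)] == 500
assert len(d) == 1000
```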
Solution 2:
Consider even the best hash function in the galaxy. There's still a chance that you may walk up one day with a set of keys whose hash values all happen to be the same. If you put those in a dict, the system has no choice but to perform linear searches.
Using a balanced tree would keep the worst-case time down at O(log n), but the maintenance costs are pretty high. Usually, hash tables perform pretty well.
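Python has no built-in balanced tree, but the O(log n) lookup guarantee can be sketched with a sorted list searched via the standard `bisect` module (the `SortedMap` name is made up for illustration; note that insertion here is O(n), whereas a real balanced tree would also insert in O(log n)):

```python
import bisect

# A toy ordered map: keys kept sorted, lookups by binary search.
# Lookup cost is O(log n) regardless of how the keys hash.
class SortedMap:
    def __init__(self):
        self._keys = []
        self._values = []

    def __setitem__(self, key, value):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            self._values[i] = value  # overwrite existing key
        else:
            self._keys.insert(i, key)  # O(n) shift; a tree avoids this
            self._values.insert(i, value)

    def __getitem__(self, key):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        raise KeyError(key)

m = SortedMap()
for k in [5, 1, 3]:
    m[k] = k * 10
assert m[3] == 30
```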
Solution 3:
I would presume that a better implementation would be O(log(n)) for various operations, using a tree to back the table instead.
Trees and hash tables have very different requirements and performance characteristics.
- Trees require an ordered type.
- Trees require performing order comparisons to find the object. For some objects, like strings, this prevents some significant optimizations: you always need to perform a string comparison, which is nontrivially expensive. This makes the constant factor of O(log n) fairly high.
- Hash tables require a hashable type, and that you can test for equality, but they don't require an ordered type.
- Testing for equality can be optimized significantly. If two strings are interned, you can test whether they're equal in O(1) by comparing their pointers, rather than in O(n) by comparing the whole string. This is a massive optimization: every `foo.bar` lookup gets translated to `foo.__dict__["bar"]`, and `"bar"` is an interned string.
- Hash tables are O(n) in the worst case, but examine what leads to that worst case: a very bad hash table implementation (e.g. you only have one bucket), or a broken hash function that always returns the same value. When you have a proper hash function and a proper bucketing algorithm, lookups are very cheap--very often approaching constant time.
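The interning optimization can be demonstrated directly with `sys.intern` (the attribute name used below is just an example); two interned strings with equal contents are guaranteed to be the same object, so comparing them reduces to a pointer check:

```python
import sys

a = sys.intern("some_attribute_name")

# Build the second string at runtime so it is not the same literal,
# then intern it: interning maps equal contents to one shared object.
part = "_attribute_name"
b = sys.intern("some" + part)

# Identity (pointer) comparison succeeds -- an O(1) check that
# implies equality without scanning the characters.
assert a is b
```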
Trees do have significant advantages:
- They tend to have lower memory requirements, since they don't have to preallocate buckets. The smallest tree node might be 12 bytes (a value pointer and two child pointers), whereas a hash table tends to be 128 bytes or more--`sys.getsizeof({})` on my system is 136.
- They allow ordered traversal; it's extremely useful to be able to iterate over [a,b) in an ordered set, which hash tables don't allow.
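A sketch of that ordered `[a, b)` traversal, again using a sorted list with `bisect` as a stand-in for a tree (`range_query` is a hypothetical helper, not a standard API); a hash table would have to sort all of its keys first to do this:

```python
import bisect

# An ordered "set" backed by a sorted list.
keys = sorted([10, 3, 7, 15, 1, 12])

def range_query(sorted_keys, a, b):
    # Return all keys in the half-open interval [a, b), in order.
    lo = bisect.bisect_left(sorted_keys, a)
    hi = bisect.bisect_left(sorted_keys, b)
    return sorted_keys[lo:hi]

assert range_query(keys, 3, 12) == [3, 7, 10]
```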
I do consider it a flaw that Python doesn't have a standard binary tree container, but for the performance characteristics needed by the Python core, like `__dict__` lookups, a hash table does make more sense.
Solution 4:
Reliable sources of information about the hash functions and collision-resolution strategy that are actually used include the comments in the source file `dictobject.c` and the whole of the file `dictnotes.txt`.