
Python: Calculate Cosine Similarity Of Two Dicts Faster

I have two dicts: d1 = {1234: 4, 125: 7, ...} and d2 = {1234: 8, 1288: 5, ...}. The lengths of the dicts vary from 10 to 40000 entries. To calculate the cosine similarity I use this function: from …
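The asker's function is cut off above. A plain-Python baseline consistent with the expression a_value * b.get(key, 0) quoted in Solution 2 might look like the following; this is a reconstruction for context, not the original code:

import math

def cosine_sim(a, b):
    # Dot product over a's keys; b.get(key, 0) handles keys missing from b.
    up = sum(a_value * b.get(key, 0) for key, a_value in a.items())
    if up == 0:
        return 0
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return up / (norm_a * norm_b)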

Solution 1:

If the key indexes are not too high, you could convert each dictionary into an array. If they are very big, you could use a sparse array. The cosine similarity then reduces to a dot product of the two vectors divided by the product of their norms. This method pays off best if you have to reuse the same dictionary for several calculations.
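A minimal sketch of the dense variant (it assumes the keys are non-negative integers small enough to index an array; for very large indexes a scipy.sparse vector would replace np.zeros):

import numpy as np

def cosine_sim_dense(d1, d2):
    size = max(max(d1), max(d2)) + 1        # assumes non-negative integer keys
    v1 = np.zeros(size)
    v2 = np.zeros(size)
    v1[list(d1)] = list(d1.values())        # keys() and values() iterate in step
    v2[list(d2)] = list(d2.values())
    return v1.dot(v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

Building the vectors is the expensive part, so the conversion is worth it mainly when each vector is reused across many comparisons.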

If this is not an option, Cython should be pretty fast, as long as you annotate a_value and b_value with C types.

Edit: Looking at your Cython rewriting, I see a few improvements. The first thing to do is run cython -a to generate an HTML report of the compilation and see which lines have been translated to plain C and which haven't. First of all, you declare "up" as long, but you are summing integers. Also, in your example the keys are integers, but you declare them as double. Another easy win is to type the inputs as dicts.
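For instance, assuming the module is called fast_cosine.pyx:

cython -a fast_cosine.pyx

This writes an annotated fast_cosine.html next to the source; lines highlighted in yellow still go through the Python C API, so those are the ones worth typing.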

Also, inspecting the generated C code, it seems there is some None checking, which you can disable with @cython.nonecheck(False).

Actually, the implementation of dictionaries is quite efficient, so in the general case you will probably not do much better than that. If you need to squeeze the most out of your code, it may be worth replacing some calls with the C API: http://docs.python.org/2/c-api/dict.html

cpython.PyDict_GetItem(a, key)

But then you would be responsible for reference counting and for casting from PyObject * to int, for a dubious performance gain.
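A hedged sketch of how that lookup could look in Cython (PyDict_GetItem returns a borrowed reference, or NULL when the key is missing, so no explicit decref is needed here; the helper name is made up):

from cpython.dict cimport PyDict_GetItem
from cpython.ref cimport PyObject

cdef int dict_get_int(dict d, key, int default):
    cdef PyObject *item = PyDict_GetItem(d, key)   # borrowed reference
    if item is NULL:
        return default
    # <object> hands the borrowed pointer back to Cython, which manages
    # the refcount and coerces the value to int on return.
    return <object>item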

Anyway, the beginning of the code would look like this:

cimport cython

@cython.nonecheck(False)
@cython.cdivision(True)
def fast_cosine_sim(dict a, dict b):
    if len(b) < len(a):
        a, b = b, a

    cdef int up, key
    cdef int a_value, b_value

Yet another concern: are your dictionaries big? Because if they are not, computing the norms could actually be a significant overhead.
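If the dicts are small but compared many times, one mitigation (a plain-Python sketch; dict_norm is a made-up helper) is to compute each norm once up front and reuse it:

import math

def dict_norm(d):
    return math.sqrt(sum(v * v for v in d.values()))

d1 = {1234: 4, 125: 7}
norm_d1 = dict_norm(d1)   # compute once, reuse in every comparison involving d1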

Edit 2: Another possible approach is to look only at the keys that are actually needed, i.e. those present in both dicts. Say:

from scipy.linalg import norm
cimport cython

@cython.nonecheck(False)
@cython.cdivision(True)
def fast_cosine_sim(dict a, dict b):
    cdef int up, key
    cdef int a_value, b_value

    up = 0
    for key in set(a.keys()).intersection(b.keys()):
        a_value = a[key]
        b_value = b[key]
        up += a_value * b_value
    if up == 0:
        return 0
    # Under Python 3, wrap the views: norm(list(a.values())).
    return up / norm(a.values()) / norm(b.values())

This is very efficient in Cython. The actual performance will probably depend on how much overlap there is between the keys.

Solution 2:

From an algorithmic standpoint, no. You are already at complexity O(N). There are some computational tricks you could use, though.

You could use the multiprocessing module to dispatch the a_value * b.get(key, 0) multiplications to several workers and thus make use of all the cores your machine has. Note that you won't get this effect using threads, because Python has a Global Interpreter Lock.

The easiest way to do it would be with multiprocessing.Pool and the Pool object's map method.
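A minimal sketch of that idea, under the assumption that the per-chunk work is large enough to amortize pickling both dicts to each worker (the function names are made up):

from multiprocessing import Pool

def partial_dot(args):
    # Each worker sums its share of the dot product.
    a, b, keys = args
    return sum(a[k] * b[k] for k in keys)

def parallel_dot(a, b, workers=4):
    # Call this from under an `if __name__ == "__main__":` guard.
    keys = list(set(a).intersection(b))
    chunk = max(1, len(keys) // workers)
    jobs = [(a, b, keys[i:i + chunk]) for i in range(0, len(keys), chunk)]
    with Pool(workers) as pool:
        return sum(pool.map(partial_dot, jobs))

In practice the serialization overhead often dominates for dicts of only a few thousand entries, so measure before and after.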

I strongly advise using the built-in cProfile module to check for hot spots in the code. It's super easy. Just run:

python -m cProfile myscript.py
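To sort the report by cumulative time, add the -s flag (it accepts any pstats sort key):

python -m cProfile -s cumulative myscript.py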
