How To Get Cython And Gensim To Work With Pyspark

I'm running a Lubuntu 16.04 Machine with gcc installed. I'm not getting gensim to work with cython because when I train a doc2vec model, it is only ever trained with one worker whi

Solution 1:

I dug deeper and tried things like loading the whole corpus into memory and running gensim in a different environment, all with no effect. It seems to be a problem with gensim itself: the code is only partially parallelized, so the workers cannot fully utilize the CPU. See the issues on GitHub (link).

Solution 2:

You have probably done this already, but could you please check that you are using the parallel Cythonised version by running assert gensim.models.doc2vec.FAST_VERSION > -1? A value of -1 means gensim fell back to the slow pure-Python code path.
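The check above can be wrapped in a small script. This is only a sketch: the helper name is_cython_build is made up for illustration, and the import is guarded so the script still runs on a machine where gensim is missing.

```python
def is_cython_build(fast_version):
    """Return True when FAST_VERSION indicates the compiled (Cython) routines.

    Hypothetical helper: -1 is the value gensim reports when it falls back
    to the slow pure-Python implementation.
    """
    return fast_version > -1


try:
    from gensim.models import doc2vec

    # getattr with a default, in case a given gensim release does not
    # expose FAST_VERSION on this module.
    fast_version = getattr(doc2vec, "FAST_VERSION", -1)
    if is_cython_build(fast_version):
        print("Cython routines are active, FAST_VERSION =", fast_version)
    else:
        print("Slow pure-Python path in use; reinstall gensim with a C compiler available")
except ImportError:
    print("gensim is not installed in this environment")
```

If the slow path is reported, extra workers will not help much, so fix the installation before looking at anything else.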

The gensim doc2vec code is parallelized, but unfortunately the I/O code outside of gensim isn't. If reading and tokenizing the corpus is slow, the worker threads spend most of their time waiting for documents. For example, in the GitHub issue you linked, parallelization is indeed achieved once the corpus is loaded into RAM with doclist = [doc for doc in documents].
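A minimal sketch of that idea, assuming the corpus fits in RAM. The names stream_documents, load_into_memory and train_doc2vec are hypothetical helpers, not part of gensim, and the (tokens, tag) document shape is an assumption. Only Doc2Vec and TaggedDocument come from the library.

```python
import multiprocessing


def stream_documents(lines):
    """Hypothetical streaming reader: yields (tokens, tag) pairs one by one.

    In a real setup this would be the slow part (disk reads, tokenizing).
    """
    for tag, line in enumerate(lines):
        yield line.split(), tag


def load_into_memory(documents):
    """Consume the iterator once so training never waits on I/O again."""
    return [doc for doc in documents]


def train_doc2vec(doclist, workers=None):
    """Train on an in-memory list; gensim is imported only when called."""
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    tagged = [TaggedDocument(words=words, tags=[tag]) for words, tag in doclist]
    # Default to one worker per CPU core; min_count=1 keeps tiny demo
    # corpora from being filtered down to an empty vocabulary.
    return Doc2Vec(
        documents=tagged,
        workers=workers or multiprocessing.cpu_count(),
        min_count=1,
    )


doclist = load_into_memory(stream_documents(["first tiny document", "second one"]))
print(len(doclist), "documents held in memory")
```

Calling train_doc2vec(doclist) then trains from the list, so the worker threads are fed from RAM rather than from the streaming reader.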