How To Use infer_vector In Gensim Doc2Vec?
Solution 1:
As you've noticed, infer_vector() requires its doc_words argument to be a list of tokens, matching the same kind of tokenization that was used when training the model. (Passing it a string causes it to see each individual character as an item in a tokenized list, and even if a few of those one-character tokens are known vocabulary words – as with 'a' and 'I' in English – you're unlikely to get good results.)
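For example, here's a minimal sketch of preparing doc_words, assuming the training corpus was tokenized with gensim's simple_preprocess() utility (the tokenizer choice and the model variable are illustrative – use whatever tokenization your training actually used):

from gensim.utils import simple_preprocess

# Tokenize new text the same way the training texts were tokenized.
text = "This is a new document to infer a vector for."
tokens_list = simple_preprocess(text)  # a list of word-tokens, not a raw string
vec = model.infer_vector(doc_words=tokens_list)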
Additionally, the default parameters of infer_vector() may be far from optimal for many models. In particular, a larger steps value (at least as large as the number of model training iterations, but perhaps even many times larger) is often helpful. Also, a smaller starting alpha, perhaps just the common bulk-training default of 0.025, may give better results.
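For example, a sketch of such non-default inference parameters (the specific numbers are illustrative, and in later gensim releases the steps parameter was renamed epochs):

ivec = model.infer_vector(
    doc_words=tokens_list,
    steps=100,    # many more passes than the small default
    alpha=0.025,  # start at the common bulk-training rate
)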
Your test of whether inference gets a vector close to the same document's vector from bulk training is a reasonable sanity-check of both your inference parameters and the earlier training: is the model as a whole learning generalizable patterns in the data? But because most modes of Doc2Vec inherently use randomness, or (during bulk training) can be affected by the randomness introduced by multiple-thread scheduling jitter, you shouldn't expect identical results. The two vectors will just tend to get closer the more training iterations/inference steps you do.
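One way to run that sanity-check, sketched here under the assumption that a document bulk-trained under the doctag 'doc_0' is re-inferred and compared by cosine similarity (the doctag and parameter values are hypothetical):

import numpy as np

# Re-infer a vector for a document that was in the training set...
ivec = model.infer_vector(doc_words=tokens_list, steps=100, alpha=0.025)
# ...and fetch the vector learned for the same document in bulk training.
tvec = model.docvecs['doc_0']

# Expect a high-but-not-1.0 similarity: inference starts from a
# random vector, so the two will never match exactly.
cos = np.dot(ivec, tvec) / (np.linalg.norm(ivec) * np.linalg.norm(tvec))
print('cosine similarity:', cos)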
Finally, note that the most_similar() method on the docvecs component of a Doc2Vec model can also take a raw vector, giving back a list of the most-similar already-known doc-vectors. So you can try the following...
ivec = model.infer_vector(doc_words=tokens_list, steps=20, alpha=0.025)
print(model.docvecs.most_similar(positive=[ivec], topn=10))
...and get back a ranked list of the top-10 most-similar (doctag, similarity_score) pairs.
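Putting the pieces together, here's a self-contained sketch (the tiny corpus and all parameter values are purely illustrative, and the steps/docvecs spellings follow the older gensim 3.x API):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess

raw_docs = [
    "the quick brown fox jumps over the lazy dog",
    "doc2vec learns fixed length vectors for documents",
    "inference repeats training steps on a frozen model",
]
corpus = [TaggedDocument(words=simple_preprocess(d), tags=[str(i)])
          for i, d in enumerate(raw_docs)]

model = Doc2Vec(vector_size=50, min_count=1, epochs=40)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

# Re-infer the first document and look up its nearest known doctags.
ivec = model.infer_vector(corpus[0].words, steps=100, alpha=0.025)
print(model.docvecs.most_similar(positive=[ivec], topn=3))

If training and inference are both behaving, the re-inferred document's own doctag ('0' here) should appear at or near the top of the returned list.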