
NLTK Lemmatizer, Extract Meaningful Words

Currently, I am going to create machine-learning code that automatically maps categories, and I plan to do some natural language processing first. There are several words ...

Solution 1:

TL;DR

It's an XY problem: the lemmatizer fails to meet your expectation because the lemmatizer you're using was built to solve a different problem.


In Long

Q: What is a lemma?

Lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form. - Wikipedia

Q: What is the "dictionary form"?

NLTK uses the morphy algorithm, which uses WordNet as the basis of its "dictionary forms".

See also How does spacy lemmatizer works?. Note that SpaCy has additional hacks put in to handle more irregular words.
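You can also call morphy directly. A minimal REPL sketch (assumes NLTK's wordnet corpus has been downloaded; the outputs shown are what WordNet 3.0 typically returns):

>>> from nltk.corpus import wordnet as wn
>>> wn.morphy('heard', wn.VERB)   # the exception list maps heard -> hear
'hear'
>>> wn.morphy('moisturizing', wn.VERB)   # -ing stripped; 'moisturize' is in WordNet
'moisturize'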

Q: Why moisture -> moisture and moisturizing -> moisturizing?

Because there are synsets (a sort of "dictionary form") for "moisture" and "moisturizing":

>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('moisture')
[Synset('moisture.n.01')]
>>> wn.synsets('moisture')[0].definition()
'wetness caused by water'

>>> wn.synsets('moisturizing')
[Synset('humidify.v.01')]
>>> wn.synsets('moisturizing')[0].definition()
'make (more) humid'
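Meanwhile, "moist" lives in its own adjective synset, so WordNet offers no path from the noun "moisture" to the adjective "moist". A quick check (outputs from WordNet 3.0; your version may differ slightly):

>>> wn.synsets('moist')
[Synset('damp.s.01')]
>>> wn.synsets('moist')[0].lemma_names()
['damp', 'dampish', 'moist']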

Q: How could I get moisture -> moist?

Not really useful. But you could try a stemmer (don't expect too much of it):

>>> from nltk.stem import PorterStemmer
>>> porter = PorterStemmer()
>>> porter.stem("moisture")
'moistur'

>>> porter.stem("moisturizing")
'moistur'
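Note that a stem need not be a real word; its typical use is grouping related surface forms under one key:

>>> porter.stem("moisture") == porter.stem("moisturizing")
True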

Q: Then how do I get moisturizing/moisture -> moist?!!

There's no well-founded way to do that. But before even trying, ask what the eventual purpose of mapping moisturizing/moisture -> moist is.

Is it really necessary to do that?

If you really want to, you can try word vectors and look for the most similar words, but a whole other world of caveats comes with word vectors.
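As a taste of that route, here is a hedged sketch using gensim's downloader API with pretrained GloVe vectors (assumes gensim is installed; the model is fetched and cached on first use, and the exact neighbours depend on which model you pick):

# Look up nearest neighbours of 'moisture' in a pretrained vector space.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")   # small pretrained GloVe model
print(vectors.most_similar("moisture", topn=5))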

Q: Wait a minute, but heard -> heard is ridiculous?!

Yeah, the POS tagger isn't tagging heard correctly, most probably because the sentence is not a proper sentence, so the POS tags for the words in the sentence are wrong:

>>> from nltk import word_tokenize, pos_tag
>>> sent
'The laughs you two heard were triggered by memories of his own high j-flying moist moisture moisturize moisturizing.'
>>> pos_tag(word_tokenize(sent))
[('The', 'DT'), ('laughs', 'NNS'), ('you', 'PRP'), ('two', 'CD'), ('heard', 'NNS'), ('were', 'VBD'), ('triggered', 'VBN'), ('by', 'IN'), ('memories', 'NNS'), ('of', 'IN'), ('his', 'PRP$'), ('own', 'JJ'), ('high', 'JJ'), ('j-flying', 'NN'), ('moist', 'NN'), ('moisture', 'NN'), ('moisturize', 'VB'), ('moisturizing', 'NN'), ('.', '.')]

We see that heard is tagged as NNS (a noun). If we lemmatize it as a verb:

>>> from nltk.stem import WordNetLemmatizer
>>> wnl = WordNetLemmatizer()
>>> wnl.lemmatize('heard', pos='v')
'hear'
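The usual workaround is to map Penn Treebank tags from pos_tag() to WordNet POS constants before lemmatizing. A sketch (the helper name penn_to_wordnet is ours, not an NLTK API, and the result still inherits any tagger mistakes, e.g. heard tagged as NNS still lemmatizes to heard):

from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer

def penn_to_wordnet(tag):
    # Penn tags start with J/V/R/N for adjectives/verbs/adverbs/nouns.
    if tag.startswith('J'):
        return wn.ADJ
    if tag.startswith('V'):
        return wn.VERB
    if tag.startswith('R'):
        return wn.ADV
    return wn.NOUN  # default for nouns and everything else

wnl = WordNetLemmatizer()
sent = 'The laughs you two heard were triggered by memories.'
print([wnl.lemmatize(word, penn_to_wordnet(tag))
       for word, tag in pos_tag(word_tokenize(sent))])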

Q: Then how do I get a correct POS tag?!

Probably with SpaCy, you get ('heard', 'VERB'):

>>> import spacy
>>> nlp = spacy.load('en_core_web_sm')
>>> sent
'The laughs you two heard were triggered by memories of his own high j-flying moist moisture moisturize moisturizing.'
>>> doc = nlp(sent)
>>> [(word.text, word.pos_) for word in doc]
[('The', 'DET'), ('laughs', 'VERB'), ('you', 'PRON'), ('two', 'NUM'), ('heard', 'VERB'), ('were', 'VERB'), ('triggered', 'VERB'), ('by', 'ADP'), ('memories', 'NOUN'), ('of', 'ADP'), ('his', 'ADJ'), ('own', 'ADJ'), ('high', 'ADJ'), ('j', 'NOUN'), ('-', 'PUNCT'), ('flying', 'VERB'), ('moist', 'NOUN'), ('moisture', 'NOUN'), ('moisturize', 'NOUN'), ('moisturizing', 'NOUN'), ('.', 'PUNCT')]

But note, in this case, SpaCy got ('moisturize', 'NOUN') and NLTK got ('moisturize', 'VB').

Q: But can't I get moisturize -> moist with SpaCy?

Let's not go back to the start where we defined what a lemma is. In short:

>>> import spacy
>>> nlp = spacy.load('en_core_web_sm')
>>> sent
'The laughs you two heard were triggered by memories of his own high j-flying moist moisture moisturize moisturizing.'
>>> doc = nlp(sent)
>>> [word.lemma_ for word in doc]
['the', 'laugh', '-PRON-', 'two', 'hear', 'be', 'trigger', 'by', 'memory', 'of', '-PRON-', 'own', 'high', 'j', '-', 'fly', 'moist', 'moisture', 'moisturize', 'moisturizing', '.']

See also How does spacy lemmatizer works? (again)

Q: Okay, fine. I can't get moisturize -> moist... And the POS tag is not perfect for heard -> hear. But why can't I get j-flying -> fly?

Back to the question of why you need to convert j-flying -> fly: there are counterexamples of why you wouldn't want to separate something that looks like a compound.

For example:

  • Should Classical-sounding go to sound?
  • Should X-fitting go to fit?
  • Should crash-landing go to landing?

Depending on the ultimate purpose of your application, converting a token to your desired form may or may not be necessary.
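If your application does decide it wants j-flying -> fly, a hedged do-it-yourself option is to split hyphenated tokens and lemmatize the head, accepting that the counterexamples above get mangled too (lemmatize_compound is our own illustrative helper):

from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()

def lemmatize_compound(token, pos='v'):
    head = token.split('-')[-1]          # 'j-flying' -> 'flying'
    return wnl.lemmatize(head, pos=pos)  # 'flying' -> 'fly'

print(lemmatize_compound('j-flying'))       # fly
print(lemmatize_compound('crash-landing'))  # land -- maybe not what you want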

Q: Then what is a good way to extract meaningful words?

I sound like a broken record, but it depends on your ultimate goal.

If your goal is really to understand the meaning of words, then you have to ask yourself the question, "What is the meaning of meaning?"

Does an individual word have a meaning out of its context? Or does it have the sum of the meanings from all the possible contexts it could occur in?

Currently, the state of the art basically treats all meanings as arrays of floats, and comparisons between arrays of floats are what give meaning its meaning. But is that really meaning, or just a means to an end? (Pun intended.)
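To make "arrays of floats" concrete, a toy sketch with made-up 3-dimensional vectors (real embeddings have hundreds of dimensions; the numbers here are hypothetical):

import numpy as np

def cosine(u, v):
    # Cosine similarity: 1.0 means same direction, ~0 means unrelated.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

moist = np.array([0.9, 0.1, 0.3])      # hypothetical vector for 'moist'
moisture = np.array([0.8, 0.2, 0.4])   # hypothetical vector for 'moisture'
print(cosine(moist, moisture))          # close to 1.0 => "similar meaning"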

Q: Why am I getting more questions than answers?

Welcome to the world of computational linguistics, which has its roots in philosophy (like computer science). Natural language processing is commonly known as the application of computational linguistics.


Food for thought

Q: Is a lemmatizer better than a stemmer?

A: No definite answer. (cf. Stemmers vs Lemmatizers)
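A quick side-by-side on the words from the question (neither column is "better" in the abstract; it depends on the downstream task):

from nltk.stem import PorterStemmer, WordNetLemmatizer

porter, wnl = PorterStemmer(), WordNetLemmatizer()
for w in ['heard', 'moisture', 'moisturizing', 'laughs']:
    print(w, porter.stem(w), wnl.lemmatize(w, pos='v'))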

Solution 2:

Lemmatization is not an easy task, and you should not expect perfect results. You can, however, see whether you like the results of other lemmatization libraries better.

SpaCy is an obvious Python option to evaluate. Stanford CoreNLP is another (JVM-based and GPL-licensed).
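Stanford's pipeline is also reachable from Python via the stanza package (a separate neural pipeline, distinct from the JVM CoreNLP). A minimal sketch, assuming the package is installed and the English model has been downloaded:

import stanza

# stanza.download('en')  # one-time English model download
nlp = stanza.Pipeline('en', processors='tokenize,pos,lemma')
doc = nlp('The laughs you two heard were triggered by memories.')
print([word.lemma for sent in doc.sentences for word in sent.words])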

There are other options; none will be perfect.
