How to Convert a Token List Into a WordNet Lemma List Using NLTK?
I have a list of tokens extracted out of a PDF source. I am able to preprocess the text and tokenize it, but I want to loop through the tokens and convert each token in the list to its lemma in the WordNet corpus.
Solution 1:
You are calling wordnet.synsets(text) with a list of words (check what text contains at that point), but you should call it with a single word. The preprocessing inside wordnet.synsets tries to apply .lower() to its argument, hence the error (AttributeError: 'list' object has no attribute 'lower').
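To see the difference, here is a minimal check (assuming the WordNet corpus has been downloaded via nltk.download('wordnet')):
from nltk.corpus import wordnet
wordnet.synsets("grass")    # a single word works: returns a list of Synset objects
wordnet.synsets(["grass"])  # raises AttributeError: 'list' object has no attribute 'lower'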
Below is a working version of clean_text with this problem fixed:
import string
import re
import nltk
from nltk.corpus import wordnet

stopwords = nltk.corpus.stopwords.words('english')
wn = nltk.WordNetLemmatizer()

def clean_text(text):
    # Remove punctuation characters, then split on any non-word sequence
    text = "".join([char for char in text if char not in string.punctuation])
    tokens = re.split(r"\W+", text)
    # Lemmatize each token and drop stopwords
    text = [wn.lemmatize(word) for word in tokens if word not in stopwords]
    # Look up each token one word at a time, collecting the first lemma
    # name of every synset it belongs to
    lemmas = []
    for token in text:
        lemmas += [synset.lemmas()[0].name() for synset in wordnet.synsets(token)]
    return lemmas
text = "The grass was greener."
print(clean_text(text))
Returns:
['grass', 'Grass', 'supergrass', 'eatage', 'pot', 'grass', 'grass', 'grass', 'grass', 'grass', 'denounce', 'green', 'green', 'green', 'green', 'fleeceable']
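Note that the result contains duplicates, because each synset of a token contributes its first lemma name. If you only want the distinct lemmas in order of first appearance, one possible variation is to deduplicate the returned list, for example:
# dict.fromkeys preserves insertion order (Python 3.7+)
print(list(dict.fromkeys(clean_text(text))))
# ['grass', 'Grass', 'supergrass', 'eatage', 'pot', 'denounce', 'green', 'fleeceable']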