
Error Extracting Text From Website: AttributeError 'NoneType' Object Has No Attribute 'get_text'

I am scraping this website and get 'title' and 'category' as text using .get_text().strip(). I have a problem using the same approach for extracting the 'author' as text. data2 =

Solution 1:

You're very close; there are a couple of things I recommend. First, take a closer look at the HTML. In this case the author names are actually in a ul, where each li contains a span whose itemprop is 'name'. However, not all articles have author names. For those articles, with your current code, the call to links.find('div', {'itemprop': 'name'}) returns None, and None has no attribute get_text. That line therefore raises an AttributeError, which your try/except swallows, so no value is appended to the data2['author'] list. I'd recommend storing the author(s) in a list like so:

authors = []
ul = links.find('ul', itemprop='creator')
if ul is not None:  # find() returns None when there is no match
    for author in ul.find_all('span', itemprop='name'):
        authors.append(author.text.strip())
data2['authors'].append(authors)

This handles the case where there are no authors as we would expect: authors is simply an empty list.
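
To see the None case in isolation, here is a minimal self-contained sketch. The HTML snippet, the article tags, and the helper name extract_authors are invented for illustration and are not the real page's markup; only the ul/span/itemprop structure follows the description above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one article with authors, one without.
HTML = """
<article><h2>First</h2>
  <ul itemprop="creator">
    <li><span itemprop="name"> Ada Lovelace </span></li>
    <li><span itemprop="name">Alan Turing</span></li>
  </ul>
</article>
<article><h2>Second</h2></article>
"""


def extract_authors(article):
    """Return a list of author names, or [] when the article lists none."""
    ul = article.find("ul", itemprop="creator")
    if ul is None:  # find() returns None when nothing matches
        return []
    return [span.get_text(strip=True)
            for span in ul.find_all("span", itemprop="name")]


soup = BeautifulSoup(HTML, "html.parser")
authors_per_article = [extract_authors(a) for a in soup.find_all("article")]
print(authors_per_article)
```

Each article contributes exactly one entry, so the authors list stays aligned with the title and category lists even when an article has no authors.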

As a side note, putting your code inside a

try:
    ...
except:
    pass

construct is generally considered poor practice, for exactly the reason you're seeing now. Ignoring errors silently can give your program the appearance of running properly, while in fact any number of things could be going wrong. At the very least it's rarely a bad idea to print error info to stdout. Even just doing something like this is better than nothing:

try:
    ...
except Exception as exc:
    print(exc.__class__.__name__, exc)

For debugging, however, having the full traceback is often desirable as well. For this you can use the traceback module.

import traceback
try:
    ...
except Exception:
    traceback.print_exc()
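
If you want the loop to keep going but still know what went wrong, one option is to catch only the exception you expect and record the traceback per item. The sketch below is an illustration under assumptions: parse_item and parse_all are hypothetical stand-ins for your per-article parsing, not part of the original code.

```python
import logging
import traceback

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("scraper")


def parse_item(raw):
    """Hypothetical stand-in for per-article parsing; fails on None."""
    return raw.strip()


def parse_all(items):
    """Parse every item, collecting failures instead of hiding them."""
    results, failures = [], []
    for index, raw in enumerate(items):
        try:
            results.append(parse_item(raw))
        except AttributeError:
            # Remember which item failed and keep the formatted traceback.
            failures.append((index, traceback.format_exc()))
            log.warning("item %d could not be parsed", index)
    return results, failures


results, failures = parse_all([" a ", None, "b "])
print(results)
print(len(failures))
```

Catching AttributeError alone means any other, unexpected error still stops the program loudly instead of being silently skipped.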

Solution 2:

Instead of using the strip method, collect all the matching items in a variable, then loop over them and use .text:

author = links.find_all('span', {"itemprop": "name"})
for i in author:
    data2["author"].append(i.text)

prints

'author': ['Mark Zastrow', 'Barbara Mühlemann', 'Terry C. Jones', 'Peter de Barros Damgaard', 'Morten E. Allentoft', 'Irina Shevnina', 'Andrey Logvin', 'Emma Usmanova', ......
