
The U Before Strings

Using BeautifulSoup, I parsed some values from an HTML table as follows: for string in soup.stripped_strings: all_tds.append(string). When I simply print the strings as fo

Solution 1:

Use codecs.open() or io.open() to open a text file with an appropriate text encoding (i.e. encoding="...") instead of opening a byte file with open().
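A minimal sketch of this suggestion, assuming the parsed strings are already collected in a list. The sample values and the output path are hypothetical and only stand in for the real data:

```python
import io
import os
import tempfile

# Hypothetical sample data standing in for the strings parsed from the table.
all_tds = [u"text\u2026", u"caf\u00e9"]

# Hypothetical output path; any writable location works.
path = os.path.join(tempfile.gettempdir(), "all_tds.txt")

# io.open() in text mode encodes Unicode strings on write, so no manual
# .encode() call is needed and no UnicodeEncodeError is raised for
# characters that UTF-8 can represent.
with io.open(path, "w", encoding="utf-8") as f:
    for s in all_tds:
        f.write(s + u"\n")

# Reading back with the same encoding returns Unicode strings again.
with io.open(path, "r", encoding="utf-8") as f:
    lines = f.read().splitlines()
```

io.open() is used here because it behaves the same way on Python 2 and Python 3; codecs.open() accepts the same encoding argument.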

Solution 2:

You see the representation of Unicode strings that are contained in the list. When you print a list, repr() is called on each item in it:

>>> s = u'text…'
>>> s
u'text\u2026'
>>> print(s)
text…
>>> print([s]) # <-- a list with a single item (the string)
[u'text\u2026']

u'' is the syntax for Unicode literals; it is used to define Unicode strings in Python source code. Note: if you use non-ASCII characters inside a string literal, you should declare the source code encoding at the top of the module, e.g., # -*- coding: utf-8 -*-.
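The difference between the list's repr() and the text of its items can be sketched as follows. The sample list is hypothetical and stands in for all_tds:

```python
# Hypothetical sample list standing in for all_tds.
all_tds = [u"text\u2026", u"caf\u00e9"]

# This is the text that print(all_tds) displays: repr() is applied to every
# item, which on Python 2 adds the u prefix and the \uXXXX escapes.
as_list = repr(all_tds)

# Joining the items yields a single Unicode string; printing it shows the
# characters themselves, one item per line, with no u prefix.
as_text = u"\n".join(all_tds)
```

Calling print(as_text), or printing each item in a loop, displays the text rather than its representation. The strings stored in the list are identical in both cases; only the way they are displayed differs.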

To fix UnicodeEncodeError when writing to a file, you need to convert Unicode strings to bytes. BeautifulSoup provides several html-specific ways to do it.
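One such way is sketched below, assuming the bs4 package is installed and using the standard-library html.parser backend. The sample markup and its windows-1252 declaration are hypothetical:

```python
from bs4 import BeautifulSoup

# Hypothetical document whose <meta> tag declares a different charset.
html = (u'<html><head><meta charset="windows-1252"></head>'
        u'<body><p>text\u2026</p></body></html>')

soup = BeautifulSoup(html, "html.parser")

# encode() serializes the whole tree to bytes in the requested encoding
# and rewrites the declared charset in the <meta> tag to match it.
html_bytes = soup.encode("utf-8")
```

The resulting bytes can be written to a file opened in binary mode ("wb"). soup.prettify("utf-8") returns bytes in the same way, with indentation added.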

Note: In general, the generic codecs.open() or io.open() suggested by @Ignacio Vazquez-Abrams won't be appropriate for HTML text, e.g., they don't modify the <meta charset="..."> tag.

Solution 3:

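Try converting them to byte strings with str() (on Python 2 this only works if the text is ASCII-only; other characters raise UnicodeEncodeError):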

for string in soup.stripped_strings:
    all_tds.append(str(string))

Here it is with a list comprehension:

all_tds = [str(string) for string in soup.stripped_strings]
