How Can I Download And Read A Url With Universal Newlines?
I was using urllib.urlopen with Python 2.7, but I need to process the downloaded HTML document and the newlines it contains (within a <pre> element). The urllib docs indicate...
Solution 1:
Unless the HTML file is already on your disk, urlopen() should correctly handle all newline formats (\n, \r\n and \r) in the HTML file you want to parse, that is, it will convert them to \n. This reading is based on the urllib docs, which single out only local files as being opened without universal newlines:
"If the URL does not have a scheme identifier, or if it has file: as its scheme identifier, this opens a local file (without universal newlines)"
E.g.
>>> from urllib import urlopen
>>> urlopen("http://****.com/win_new_lines.htm").read()
'line 1\nline 2\n\n\nline 3'
>>> urlopen("http://****.com/unix_new_lines.htm").read()
'line 1\nline 2\n\n\nline 3'
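If you would rather not rely on urlopen() translating newlines for you, the conversion can also be done explicitly on the downloaded data. This is only a minimal sketch: normalize_newlines is a hypothetical helper name, and a literal byte string stands in for the result of read() so that no network access is needed.

```python
def normalize_newlines(data):
    """Convert \\r\\n and bare \\r line endings in a byte string to \\n."""
    # Replace \r\n first so that it does not turn into two newlines.
    return data.replace(b'\r\n', b'\n').replace(b'\r', b'\n')


# Stand-in for urlopen(...).read(); it mixes all three newline styles.
raw = b'line 1\r\nline 2\rline 3\n'
normalized = normalize_newlines(raw)
```

The order of the two replace() calls matters: handling the bare \r first would turn every \r\n into \n\n.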
Solution 2:
When you process the contents of the pre tags, use splitlines to normalize the line endings:
'\n'.join(contents.splitlines())
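As a small illustration of the one-liner above, the sketch below applies it to two sample strings. The function name normalize_pre and the sample strings are made up for the example; contents is assumed to be the text already extracted from a pre tag.

```python
def normalize_pre(contents):
    """Rejoin the lines of `contents` using \\n as the only separator."""
    # splitlines() splits on \n, \r\n and \r and drops the terminators.
    return '\n'.join(contents.splitlines())


windows_style = 'line 1\r\nline 2\r\n\r\nline 3\r\n'
old_mac_style = 'line 1\rline 2\r\rline 3'

from_windows = normalize_pre(windows_style)
from_old_mac = normalize_pre(old_mac_style)
```

Both calls return 'line 1\nline 2\n\nline 3'. Note that a trailing line ending is dropped, because splitlines() does not produce an empty final element. On unicode strings, splitlines() also splits at a few other boundaries, such as form feed.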