
Dealing With Windows Line-endings In Python

I've got a 700MB XML file coming from a Windows provider. As one might expect, the line endings are '\r\n' (or ^M in vi). What is the most efficient way to deal with this situatio

Solution 1:

Why are the DOS line-endings a problem? Most things can deal with them just fine, including XML parsers. If you really want to get rid of them, open the file in universal line-endings mode:

open(filename, 'rU')

Python will convert all line-endings to UNIX line-endings for you. If you really can't use that (which I find a little surprising), there's no way to get Python to do the work for you. You will have to open the file regardless, though, so your objection to #2 seems a little odd.
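For readers on Python 3, here is a minimal sketch of the same idea. There, text mode already translates '\r\n' to '\n' on read by default (newline=None), and the 'U' flag was deprecated and then removed in Python 3.11. The helper name convert_to_unix, the file names, and the UTF-8 encoding are assumptions for illustration, not part of the original answer. The file is streamed line by line, so a large input never has to fit in memory.

```python
import os
import tempfile


def convert_to_unix(src_path, dst_path, encoding="utf-8"):
    """Copy src_path to dst_path, rewriting every line ending as LF."""
    # newline=None (the default) translates '\r\n' and '\r' to '\n' on read.
    # newline="\n" on write stops Python from translating '\n' to os.linesep.
    with open(src_path, "r", encoding=encoding, newline=None) as src, \
            open(dst_path, "w", encoding=encoding, newline="\n") as dst:
        for line in src:  # streams one line at a time
            dst.write(line)


# Small demonstration with a throwaway file that uses Windows line endings.
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "windows.xml")
dst_path = os.path.join(tmp_dir, "unix.xml")

with open(src_path, "wb") as f:
    f.write(b"<root>\r\n<item/>\r\n</root>\r\n")

convert_to_unix(src_path, dst_path)

with open(dst_path, "rb") as f:
    converted = f.read()

print(converted)
```

If the data only needs to be read, not rewritten, the plain open(filename) call is enough and the copy step can be skipped.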

Solution 2:

Are you opening the file in text mode or binary mode? I'm pretty sure I've counted on universal newlines on my Leopard install, but maybe I got an updated Python from somewhere too...

Anyway- I've seen this sort of thing biting many programmers in the bum, because they just reach for the 'b' key. Use a 't' if you're opening text files known to be created on your platform, 'U' instead of 't' if you need universal newlines.

with open(filename, 'rt') as f:
    content = f.read()

Edit: The comments note that 'rt' is the default. Fair point, but Python style tends to prefer explicit over implicit, so I'm going with that.
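To make the text-versus-binary point concrete, here is a small sketch, assuming Python 3 and a throwaway temporary file (the sample content is made up). Binary mode hands back the bytes untouched, while text mode translates the line endings.

```python
import os
import tempfile

# Write a tiny sample with Windows line endings, byte for byte.
fd, path = tempfile.mkstemp(suffix=".xml")
with os.fdopen(fd, "wb") as f:
    f.write(b"<a>\r\n</a>\r\n")

# Binary mode: no newline translation, '\r\n' survives.
with open(path, "rb") as f:
    raw = f.read()

# Text mode: '\r\n' is translated to '\n' on read.
with open(path, "rt", encoding="utf-8") as f:
    text = f.read()

os.remove(path)

print(repr(raw))
print(repr(text))
```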

Solution 3:

Allegedly: """This guy has \r\n right in the middle of tag descriptors like so: <ParentRedirec tSequenceID>""".

I see no \r\n here. Perhaps you mean repr(xml) contains things like

"<ParentRedirec\r\ntSequenceID>"

If not, try to say precisely what you mean, with repr-fashion examples.

The following should work:

>>> import re
>>> guff = """<atag>\r\n<bt\r\nag c="2">"""
>>> re.sub(r"(<[^>]*)\r\n([^>]*>)", r"\1\2", guff)
'<atag>\r\n<btag c="2">'

If a tag contains more than one line break, e.g. <foo\r\nbar\r\nzot>, this will remove only one of them per pass. Alternatives: (1) loop until the guff stops shrinking; (2) write a smarter regexp yourself :-)
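Both alternatives can be sketched roughly as follows. The function names are hypothetical, and the second version assumes that no '>' appears inside attribute values, comments, or CDATA sections, since the pattern treats everything from '<' to the next '>' as one tag.

```python
import re

ONE_BREAK = re.compile(r"(<[^>]*)\r\n([^>]*>)")
WHOLE_TAG = re.compile(r"<[^>]*>")


def strip_by_looping(text):
    """Alternative (1): re-run the substitution until nothing changes."""
    while True:
        new_text = ONE_BREAK.sub(r"\1\2", text)
        if new_text == text:
            return text
        text = new_text


def strip_in_one_pass(text):
    """Alternative (2): match each whole tag and drop every break in it."""
    return WHOLE_TAG.sub(lambda m: m.group(0).replace("\r\n", ""), text)


guff = '<atag>\r\n<foo\r\nbar\r\nzot c="2">'
looped = strip_by_looping(guff)
one_pass = strip_in_one_pass(guff)

print(repr(looped))
print(repr(one_pass))
```

The line break between the two tags is left alone in both versions; only breaks inside a tag are removed.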

Solution 4:

What are you trying to do with this file? Whitespace between tags is usually ignored in XML, so the only place where line endings matter is in the tags' content.
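As a quick check that a parser copes with the line endings on its own, here is a small sketch using the standard library's xml.etree.ElementTree (the sample document is made up). An XML parser normalizes '\r\n' to '\n' before handing text to the application, so even inside element content the carriage returns do not reach your code.

```python
import xml.etree.ElementTree as ET

# A document with Windows line endings between tags and inside content.
doc = b"<root>\r\n<item>line one\r\nline two</item>\r\n</root>\r\n"

root = ET.fromstring(doc)
item_text = root.find("item").text

print(repr(item_text))
```

For a file too large to load at once, ET.iterparse can process it incrementally in the same way.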
