Splitlines() And Iterating Over An Opened File Give Different Results
Solution 1:
Why don't you just split it:
input = b'\nabc\r\r\r\nd\ref\nghi\r\njkl'
result = input.split(b'\n')
print(result)
[b'', b'abc\r\r\r', b'd\ref', b'ghi\r', b'jkl']
You will lose the trailing \n
; it can be re-added to every line later, if you really need it. For the last line you have to check whether the original input actually ended with \n. Like:
fixed = [bstr + b'\n' for bstr in result]
if not input.endswith(b'\n'):
    fixed[-1] = fixed[-1][:-1]
print(fixed)
[b'\n', b'abc\r\r\r\n', b'd\ref\n', b'ghi\r\n', b'jkl']
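For comparison, here is a short sketch of how bytes.splitlines() treats the same input differently from split(b'\n'), which is exactly the discrepancy the question is about:

```python
data = b'\nabc\r\r\r\nd\ref\nghi\r\njkl'

# split(b'\n') only breaks on b'\n'; lone b'\r' characters stay inside the pieces
by_newline = data.split(b'\n')
print(by_newline)      # [b'', b'abc\r\r\r', b'd\ref', b'ghi\r', b'jkl']

# splitlines() also treats every lone b'\r' as a line break, producing more pieces
by_splitlines = data.splitlines()
print(by_splitlines)   # [b'', b'abc', b'', b'', b'd', b'ef', b'ghi', b'jkl']
```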
Another variant uses a generator. This way it is memory-efficient on huge files, and the syntax stays similar to the original: for l in bin_split(input):
def bin_split(input_str):
    start = 0
    while start >= 0:
        found = input_str.find(b'\n', start) + 1
        if 0 < found < len(input_str):
            yield input_str[start:found]
            start = found
        else:
            yield input_str[start:]
            break
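A quick usage sketch of this generator (the definition is repeated here so the snippet runs on its own):

```python
def bin_split(input_str):
    # yield each chunk up to and including the next b'\n'
    start = 0
    while start >= 0:
        found = input_str.find(b'\n', start) + 1
        if 0 < found < len(input_str):
            yield input_str[start:found]
            start = found
        else:
            yield input_str[start:]
            break

data = b'\nabc\r\r\r\nd\ref\nghi\r\njkl'
for line in bin_split(data):
    print(line)

# the pieces joined back together reproduce the original input
assert b''.join(bin_split(data)) == data
```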
Solution 2:
There are a couple ways to do this, but none are especially fast.
If you want to keep the line endings, you might try the re
module:
import re

lines = re.findall(r'[\r\n]+|[^\r\n]+[\r\n]*', text)
# or equivalently
line_split_regex = re.compile(r'[\r\n]+|[^\r\n]+[\r\n]*')
lines = line_split_regex.findall(text)
If you need the endings and the file is really big, you may want to iterate instead:
for r in re.finditer(r'[\r\n]+|[^\r\n]+[\r\n]*', text):
    line = r.group()
    # do stuff with line here
If you don't need the endings, then you can do it much more easily:
lines = list(filter(None, text.splitlines()))
You can omit the list()
part if you just iterate over the results (or if using Python 2):
for line in filter(None, text.splitlines()):
    pass  # do stuff with line
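Applied to a sample in the style of the question's input (chosen here just for illustration), the two approaches from this answer give:

```python
import re

text = '\nabc\r\r\ndef\nghi\r\njkl'

# keep the line endings: runs of '\r' before '\n' stay attached to the line
with_endings = re.findall(r'[\r\n]+|[^\r\n]+[\r\n]*', text)
print(with_endings)      # ['\n', 'abc\r\r\n', 'def\n', 'ghi\r\n', 'jkl']

# drop the endings (and the empty lines that splitlines() produces)
without_endings = list(filter(None, text.splitlines()))
print(without_endings)   # ['abc', 'def', 'ghi', 'jkl']
```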
Solution 3:
I would iterate through like this:
text = "b'abc\r\r\ndef'"
results = text.split('\r\r\n')
for r in results:
    print(r)
Solution 4:
This is a for l in f:
solution. The key to this is the newline
argument on the open
call. From the documentation:
[screenshot of the open() documentation describing the newline argument]
Therefore, you should use newline=''
when writing to suppress newline translation, and then when reading use newline='\n'
, which will work if all your lines terminate with zero or more '\r'
characters followed by a '\n'
character:
with open('test.txt', 'w', newline='') as f:
    f.write('abc\r\r\ndef')

with open('test.txt', 'r', newline='\n') as f:
    for line in f:
        print(repr(line))
Prints:
'abc\r\r\n'
'def'
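For contrast, reading the same data with the default newline=None enables universal-newline translation, which is what mangles '\r\r\n' endings in the first place. A minimal sketch, using a temporary file rather than the fixed filename above:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'test.txt')

# newline='' suppresses translation on write, so the '\r' characters land in the file
with open(path, 'w', newline='') as f:
    f.write('abc\r\r\ndef')

# default newline=None: '\r', '\n' and '\r\n' are all translated and treated as line breaks
with open(path, 'r') as f:
    print(list(f))   # ['abc\n', '\n', 'def']

# newline='\n': only '\n' ends a line; '\r' characters are preserved
with open(path, 'r', newline='\n') as f:
    print(list(f))   # ['abc\r\r\n', 'def']
```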
A quasi-splitlines solution:
This is, strictly speaking, not a splitlines
solution: to handle arbitrary line endings, a regular-expression version of split
would have to be used, capturing the line endings and then re-assembling the lines and their endings. Instead, this solution just uses a regular expression to break up the input text, allowing line endings that consist of any number of '\r'
characters followed by a '\n'
character:
import re

input = '\nabc\r\r\ndef\nghi\r\njkl'

with open('test.txt', 'w', newline='') as f:
    f.write(input)

with open('test.txt', 'r', newline='') as f:
    text = f.read()

lines = re.findall(r'[^\r\n]*\r*\n|[^\r\n]+$', text)
for line in lines:
    print(repr(line))
Prints:
'\n'
'abc\r\r\n'
'def\n'
'ghi\r\n'
'jkl'
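For comparison, splitlines(keepends=True) on the same text splits the '\r\r\n' ending apart, which is exactly why the regular expression is used instead:

```python
import re

text = '\nabc\r\r\ndef\nghi\r\njkl'

# splitlines() treats a lone '\r' as its own line break, so 'abc\r\r\n'
# becomes two pieces: 'abc\r' and '\r\n'
print(text.splitlines(keepends=True))
# ['\n', 'abc\r', '\r\n', 'def\n', 'ghi\r\n', 'jkl']

# the regex keeps the whole run of '\r' before '\n' attached to its line
print(re.findall(r'[^\r\n]*\r*\n|[^\r\n]+$', text))
# ['\n', 'abc\r\r\n', 'def\n', 'ghi\r\n', 'jkl']
```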