How To Split A Huge CSV File Based On Content Of First Column?
Solution 1:
awk is capable of this:
awk -F"," '{print $0>> ("FILE"$1)}' HUGE.csv
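For comparison, here is a rough Python sketch of the same single-pass idea, which also does not need sorted input. The function name split_unsorted and its parameters are made up for illustration. It keeps one open file per distinct first-column value, so with a very large number of distinct values it may run into the operating system's limit on open files.

```python
import csv
import os


def split_unsorted(source_path, out_dir, prefix="FILE"):
    """Append each row to <out_dir>/<prefix><first column> in one pass.

    Hypothetical helper; assumes first-column values are safe as file names.
    """
    handles = {}  # first-column value -> (file object, csv writer)
    try:
        with open(source_path, newline="") as source:
            for row in csv.reader(source):
                if not row:  # csv.reader yields [] for blank lines
                    continue
                key = row[0]
                if key not in handles:
                    path = os.path.join(out_dir, prefix + key)
                    f = open(path, "a", newline="")
                    handles[key] = (f, csv.writer(f, lineterminator="\n"))
                handles[key][1].writerow(row)
    finally:
        # Close every output file, even if reading failed part-way.
        for f, _ in handles.values():
            f.close()
    return sorted(handles)
```

Calling split_unsorted("HUGE.csv", ".") would mirror the awk command above, appending to FILE1, FILE2 and so on in the current directory.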
Solution 2:
If the file is already sorted by group_id, you can do something like:
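import csv
from itertools import groupby

# Assumes foo.csv is already sorted by its first column (group_id).
with open("foo.csv", newline="") as source:
    # groupby yields one (key, rows) pair per run of equal first-column values.
    for key, rows in groupby(csv.reader(source), lambda row: row[0]):
        with open("%s.txt" % key, "w") as output:
            for row in rows:
                output.write(",".join(row) + "\n")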
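A small illustration of why the sorting matters: groupby only groups consecutive rows, so a key that reappears later starts a new group. Because the output file is opened in "w" mode, that second group would overwrite the first. The sample rows below are invented for the demonstration.

```python
from itertools import groupby

rows = [["1", "a"], ["2", "b"], ["1", "c"]]  # not sorted by first column

# groupby starts a new group every time the key changes,
# so the key "1" shows up twice here.
keys = [key for key, _ in groupby(rows, key=lambda row: row[0])]
print(keys)  # ['1', '2', '1']
```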
Solution 3:
Sed one-liner:
sed -e '/^1,/wFile1' -e '/^2,/wFile2' -e '/^3,/wFile3' ... OriginalFile
The only downside is that you need to put in n -e expressions, one per distinct value in the first column (represented by the ellipsis, which shouldn't appear in the final version). So this one-liner might be a pretty long line.
The upsides, though, are that it only makes one pass through the file, no sorting is assumed, and no Python is needed. Plus, it's a one-freaking-liner!
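If typing out the -e expressions by hand gets tedious, the command could be generated. This is only a sketch: build_sed_command is a hypothetical helper, and it assumes the keys are known in advance and contain no characters that are special in a sed regular expression. Note also that the command as written echoes every input line to standard output; adding -n to sed suppresses that.

```python
import shlex


def build_sed_command(keys, source="OriginalFile", prefix="File"):
    """Return the sed command as an argument list (hypothetical helper)."""
    args = ["sed"]
    for key in keys:
        # One expression per key: lines starting with "<key>," go to <prefix><key>.
        args += ["-e", "/^%s,/w%s%s" % (key, prefix, key)]
    args.append(source)
    return args


command = build_sed_command(["1", "2", "3"])
# Print a copy-pasteable shell command line.
print(" ".join(shlex.quote(arg) for arg in command))
```

The argument list could also be passed directly to subprocess.run instead of being printed.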
Solution 4:
If the rows are sorted by group_id, then itertools.groupby would be useful here. Because it's an iterator, you won't have to load the whole file into memory; you can still write each file line by line. Use csv to load the file (in case you didn't already know about it).
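A minimal sketch of that combination, assuming the input is sorted by its first column. The name split_sorted, the out_dir parameter and the ".txt" suffix are illustrative choices. It writes with csv.writer rather than joining fields by hand, so fields that contain commas stay quoted.

```python
import csv
import os
from itertools import groupby
from operator import itemgetter


def split_sorted(source_path, out_dir="."):
    """Write one file per first-column value; input must be sorted by it."""
    with open(source_path, newline="") as source:
        # Drop blank lines, which csv.reader reports as empty lists.
        rows = (row for row in csv.reader(source) if row)
        for key, group in groupby(rows, key=itemgetter(0)):
            path = os.path.join(out_dir, "%s.txt" % key)
            with open(path, "w", newline="") as output:
                # Rows are streamed one at a time; nothing is held in memory.
                csv.writer(output, lineterminator="\n").writerows(group)
```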
Solution 5:
If they are sorted by the group id, you can use the csv module to iterate over the rows in the file and write them out. The module is described in the Python standard library documentation.
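One way this could look with the csv module alone, without groupby: remember the current group id and switch to a new output file whenever it changes. The function name and the ".csv" suffix are made up for the example, and the input is assumed to be sorted by its first column.

```python
import csv
import os


def split_by_first_column(source_path, out_dir="."):
    """Hypothetical helper; expects rows sorted by the first column."""
    current_key = None
    output = None
    writer = None
    try:
        with open(source_path, newline="") as source:
            for row in csv.reader(source):
                if not row:  # skip blank lines
                    continue
                if row[0] != current_key:
                    # The group id changed: close the old file, open the next.
                    if output is not None:
                        output.close()
                    current_key = row[0]
                    path = os.path.join(out_dir, current_key + ".csv")
                    output = open(path, "w", newline="")
                    writer = csv.writer(output, lineterminator="\n")
                writer.writerow(row)
    finally:
        if output is not None:
            output.close()
```

Only one output file is open at any time, so the number of distinct group ids does not matter.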