
How To Split A Huge Csv File Based On Content Of First Column?

I have a huge CSV file (250MB+) to upload. The file format is group_id, application_id, reading, and the data could look like:

1, a1, 0.1
1, a1, 0.2
1, a1, 0.4
1, a1, 0.3
1, a1, 0.0
1, a1, ...

How can I split this file into separate files based on the content of the first column (group_id)?

Solution 1:

awk is capable:

awk -F"," '{ print $0 >> ("FILE" $1) }' HUGE.csv
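This appends each line to a file named FILE followed by the value of the first column (FILE1, FILE2, and so on), creating each file on first use; awk keeps the output files open, so the whole split happens in a single pass. Note that >> also appends to files left over from an earlier run, so delete any old FILE* output before re-running.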

Solution 2:

If the file is already sorted by group_id, you can do something like:

import csv
from itertools import groupby

# group consecutive rows on the value of the first column (group_id)
for key, rows in groupby(csv.reader(open("foo.csv")),
                         lambda row: row[0]):
    with open("%s.txt" % key, "w") as output:
        for row in rows:
            output.write(",".join(row) + "\n")
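One caveat: ",".join writes fields back verbatim, so a field that itself contains a comma or quotes will not be re-quoted correctly; using csv.writer instead avoids that. On Python 3 you would also normally pass newline="" when opening files for the csv module.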

Solution 3:

Sed one-liner:

sed -e '/^1,/w File1' -e '/^2,/w File2' -e '/^3,/w File3' ... OriginalFile

The only downside is that you need n -e statements, one per group_id (represented by the ellipsis, which shouldn't appear in the final version). So this one-liner might be a pretty long line.

The upsides, though, are that it only makes one pass through the file, no sorting is assumed, and no Python is needed. Plus, it's a one-freaking-liner!

Solution 4:

If the rows are sorted by group_id, then itertools.groupby would be useful here. Because it's an iterator, you won't have to load the whole file into memory; you can still write each file line by line. Use csv to load the file (in case you didn't already know about it).
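A minimal sketch of what that might look like (the input name foo.csv is an assumption; csv.writer also handles quoting, which a plain ",".join does not):

import csv
from itertools import groupby

with open("foo.csv", newline="") as source:
    # groupby yields (key, rows) pairs for consecutive rows that share
    # the same first column, so only one group is iterated at a time
    for key, rows in groupby(csv.reader(source), lambda row: row[0]):
        with open("%s.csv" % key, "w", newline="") as target:
            csv.writer(target).writerows(rows)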

Solution 5:

If the rows are sorted by the group id, you can use the csv module to iterate over them and write each row out to the appropriate file. You can find more information in the Python documentation for the csv module.
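If you'd rather avoid groupby, here is a sketch of the same single-pass idea with explicit key tracking (again assuming an input file named foo.csv):

import csv

current_key = None
output = None
writer = None
with open("foo.csv", newline="") as source:
    for row in csv.reader(source):
        # start a new output file whenever the group id changes;
        # this relies on the input being sorted by group_id
        if row[0] != current_key:
            if output is not None:
                output.close()
            current_key = row[0]
            output = open("%s.csv" % current_key, "w", newline="")
            writer = csv.writer(output)
        writer.writerow(row)
if output is not None:
    output.close()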
