Delete Multiple Columns From 500 MB TSV File With Python (or Perl Etc)
Solution 1:
You can use del to delete slices of a list.
import csv

with open('in.tsv', 'r', newline='') as fin, open('out.tsv', 'w', newline='') as fout:
    reader = csv.reader(fin, dialect='excel-tab')
    writer = csv.writer(fout, dialect='excel-tab')
    for row in reader:
        # delete indices in reverse order to avoid shifting earlier indices
        del row[653321:689513+1]
        del row[628715:650181+1]
        writer.writerow(row)
Solution 2:
You can do this with very little memory using Python.
First, define a dialect describing your TSV format. See the documentation on dialects for more information.
import csv

class TsvDialect(csv.Dialect):
    delimiter = '\t'
    quoting = csv.QUOTE_NONE
    escapechar = None
    lineterminator = '\n'  # csv.Dialect refuses to validate without this

# you can just pass this class around, or you can register it under a name
csv.register_dialect('tsv', TsvDialect)
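Either form works, since csv.reader and csv.writer accept a dialect class as well as a registered name (a small illustration, not from the original answer; fileobj stands for any open file):

csv.reader(fileobj, dialect=TsvDialect)  # pass the class directly
csv.reader(fileobj, dialect='tsv')       # or use the registered name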
Then you can walk through each line and copy to a new tsv:
with open('source.tsv', 'r', newline='') as src, open('result.tsv', 'w', newline='') as res:
    csrc = csv.reader(src, dialect='tsv')
    cres = csv.writer(res, dialect='tsv')
    for row in csrc:
        cres.writerow(row)
This does a simple copy. Since you only want some columns, let's only copy those.
Python's lists are zero-indexed (the first column is column 0, not column 1), and index slicing does not include the last item (wholelist[:2] is the same as [wholelist[0], wholelist[1]]). Keep these in mind to avoid off-by-one errors!
with open('source.tsv', 'r', newline='') as src, open('result.tsv', 'w', newline='') as res:
    csrc = csv.reader(src, dialect='tsv')
    cres = csv.writer(res, dialect='tsv')
    for row in csrc:
        # remove [628714:650181] and [653320:689512]
        newrow = row[:628714]               # columns before index 628714
        newrow.extend(row[650181:653320])   # columns between the two removed ranges
        newrow.extend(row[689512:])         # columns after the second removed range, if any
        cres.writerow(newrow)
Alternatively, instead of copying the columns you want to a new row, you can save some memory at the expense of code clarity by deleting the columns you don't want:
for row in csrc:
    # remove [628714:650181] and [653320:689512]
    # be sure to delete in reverse order!
    del row[653320:689512]
    del row[628714:650181]
    cres.writerow(row)
You can abstract column cutting (either method, using any indexing you're comfortable with) into a function if you need to do this very often.
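For example, here is a minimal sketch of such a helper (the name cut_columns and the half-open (start, stop) range convention are my own, not from the answer):

def cut_columns(row, ranges):
    """Delete the given half-open [start, stop) column ranges from row, in place."""
    # delete the rightmost range first so earlier indices don't shift
    for start, stop in sorted(ranges, reverse=True):
        del row[start:stop]
    return row

# usage inside the loop above:
# cut_columns(row, [(628714, 650181), (653320, 689512)])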
You might also want to take a look at the csvkit python library and command-line tools, in particular its command-line tool csvcut, which appears to do exactly what you want from the command line.
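For instance, the invocation might look something like this (a hedged sketch based on csvkit's documented flags: -t reads tab-delimited input, -C excludes 1-based column ranges, and csvformat -T converts csvkit's comma-separated output back to tabs):

csvcut -t -C 628715-650181,653321-689513 source.tsv | csvformat -T > result.tsv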
Solution 3:
With 2 GB of RAM or more, it should be possible to load the dataset into memory, delete the columns you want, and write the contents to a file. This can be done easily in either R or Python. For R:
dat = read.table("spam.tsv", ...)
dat = dat[-c(1, 5)]  # delete columns 1 and 5
write.csv(dat, ....)
Doing this in chunks can easily be done using either an apply loop or a for loop. I use the apply style:
read_chunk = function(chunk_index, chunk_size, fname) {
    dat = read.table(fname, nrow = chunk_size, skip = (chunk_index - 1) * chunk_size, ...)
    dat = dat[-c(1, 5)]  # delete columns 1 and 5
    write.csv(dat, append = TRUE, ....)
}

tot_no_lines = 10000  # for example
chunk_size = 1000
sapply(1:(tot_no_lines / chunk_size), read_chunk, chunk_size = chunk_size, fname = "spam.tsv")
Note that this is R-style code useful as inspiration, not working R code.
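A rough Python equivalent of the same chunked approach, using pandas (my sketch, not from the answer; the column indices are the ones used elsewhere in this thread):

import pandas as pd

drop = list(range(628714, 650181)) + list(range(653320, 689512))
first = True
# read 1000 rows at a time so the 500 MB file never sits fully in memory
for chunk in pd.read_csv('source.tsv', sep='\t', header=None, chunksize=1000):
    chunk = chunk.drop(columns=drop)
    chunk.to_csv('result.tsv', sep='\t', header=False, index=False,
                 mode='w' if first else 'a')
    first = False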
Solution 4:
You can build the output row dynamically:
for r in rdr:                      # rdr: a csv.reader over the input file
    outrow = []
    for i in range(0, 628714):
        outrow.append(r[i])
    for i in range(650181, 653320):
        outrow.append(r[i])
    wtr.writerow(outrow)           # wtr: a csv.writer over the output file
I imagine you can do this even more concisely with slices of the input row r, along the lines of:
outrow = r[0:628714]
outrow.extend(r[650181:653320])
wtr.writerow(outrow)
Perhaps not the fastest to execute, but certainly easier to write.
Solution 5:
Are you on Linux? Then save yourself the hassle and use csvtool from the shell:

csvtool col 1-500,502-1000 input.csv > output.csv

You can also set the delimiter and so on; just type csvtool --help. Quite easy to use.
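If csvtool isn't available, plain cut (my suggestion, not part of the original answer) handles TSV natively, since TAB is its default field delimiter:

cut -f1-500,502-1000 input.tsv > output.tsv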