
Remove Duplicate Rows From A Large File In Python

I have a CSV file that I want to remove duplicate rows from, but it's too large to fit into memory. I found a way to get it done, but my guess is that it's not the best way. Each row consists of 15 fields; my current approach concatenates the fields, hashes the result, and stores the hashes in a dict (htable) to detect duplicates.

Solution 1:

If you want a really simple way to do this, just create a sqlite database:

import sqlite3
conn = sqlite3.connect('single.db')
cur = conn.cursor()
cur.execute("""create table test(
f1 text,
f2 text,
f3 text,
f4 text,
f5 text,
f6 text,
f7 text,
f8 text,
f9 text,
f10 text,
f11 text,
f12 text,
f13 text,
f14 text,
f15 text,
primary key(f1,  f2,  f3,  f4,  f5,  f6,  f7,  
            f8,  f9,  f10,  f11,  f12,  f13,  f14,  f15))
"""
conn.commit()

# simplified / pseudo code: reader is a csv.reader over the input file
for row in reader:
    # assuming row is a list-like object of 15 fields
    try:
        cur.execute('''insert into test values(?, ?, ?, ?, ?, ?, ?, 
                       ?, ?, ?, ?, ?, ?, ?, ?)''', row)
        conn.commit()
    except sqlite3.IntegrityError:
        # duplicate primary key means the row was already seen
        pass

conn.commit()
cur.execute('select * from test')

for row in cur:
    # write row to the output csv file, e.g. with csv.writer

Then you wouldn't have to worry about any of the comparison logic yourself - just let sqlite take care of it for you. It probably won't be much faster than hashing the strings, but it's probably a lot easier. You could change the column types if you wanted, and since you're already converting the data to a string you could just use a single field instead. Plenty of options here.
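As a minimal sketch of that single-field variant (the function name and file handling here are assumptions, not part of the original answer):

import sqlite3

def dedupe_lines(infilename, outfilename):
    # Sketch: a single TEXT primary key column holds each whole line,
    # so sqlite's uniqueness constraint does the duplicate detection.
    conn = sqlite3.connect('dedupe.db')
    cur = conn.cursor()
    cur.execute('create table if not exists seen(line text primary key)')
    with open(infilename) as ins, open(outfilename, 'w') as outs:
        for line in ins:
            try:
                cur.execute('insert into seen values (?)', (line,))
                outs.write(line)   # first occurrence: keep it
            except sqlite3.IntegrityError:
                pass               # duplicate line: skip it
    conn.commit()
    conn.close()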

Solution 2:

You are basically doing a merge sort, and removing duplicated entries.

Breaking the input into memory-sized pieces, sorting each piece, then merging the pieces while removing duplicates is a sound idea in general.

Actually, up to a couple of gigs I would let the virtual memory system handle it and just write:

import itertools

input = open(infilename, 'rb')
output = open(outfile, 'wb')

# sorted() reads the whole file (leaning on virtual memory); groupby then
# collapses runs of identical lines, so each distinct line is written once
for key, group in itertools.groupby(sorted(input)):
    output.write(key)
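If the data really is too big even for that, the chunked sort-and-merge described above might look roughly like this (a sketch; the chunk size and temporary-file handling are assumptions):

import heapq
import itertools
import tempfile

def sort_and_dedupe(infilename, outfilename, chunk_lines=1000000):
    # Sketch: sort memory-sized chunks into temporary files, then merge the
    # sorted chunks lazily while dropping duplicate lines.
    chunks = []
    with open(infilename, 'rb') as ins:
        while True:
            lines = list(itertools.islice(ins, chunk_lines))
            if not lines:
                break
            lines.sort()
            tmp = tempfile.TemporaryFile()
            tmp.writelines(lines)
            tmp.seek(0)
            chunks.append(tmp)

    with open(outfilename, 'wb') as outs:
        # heapq.merge keeps only one line per chunk in memory at a time;
        # groupby collapses identical neighbouring lines from the merge
        for line, _ in itertools.groupby(heapq.merge(*chunks)):
            outs.write(line)

    for tmp in chunks:
        tmp.close()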

Solution 3:

Your current method is not guaranteed to work properly.

Firstly, there is a small probability that two lines that are actually different can produce the same hash value: hash(a) == hash(b) does not always mean that a == b.

Secondly, you are making the probability higher with your "reduce/lambda" caper:

>>> reduce(lambda x, y: x + y, ['foo', '1', '23'])
'foo123'
>>> reduce(lambda x, y: x + y, ['foo', '12', '3'])
'foo123'
>>>

BTW, wouldn't "".join(['foo', '1', '23']) be somewhat clearer?

BTW2, why not use a set instead of a dict for htable?
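For example, the hash bookkeeping with a set might look like this (a sketch only; the asker's actual loop isn't shown above, and this still has the collision problem just described):

seen = set()   # just the hashes; no need for a dict with dummy values
with open('infile.csv') as ins, open('outfile.csv', 'w') as outs:
    for line in ins:
        h = hash(line)
        if h not in seen:       # still subject to hash collisions
            seen.add(h)
            outs.write(line)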

Here's a practical solution: get the "core utils" package from the GnuWin32 site, and install it. Then:

  1. write a copy of your file without the headings to (say) infile.csv
  2. c:\gnuwin32\bin\sort --unique -ooutfile.csv infile.csv
  3. read outfile.csv and write a copy with the headings prepended

For each of steps 1 & 3, you could use a Python script, or some of the other GnuWin32 utilities (head, tail, tee, cat, ...).
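A sketch of steps 1 and 3 in Python, with step 2 run via subprocess (the file names original.csv and final.csv are placeholders, assuming a single heading line):

import subprocess

# Step 1: copy the file without its heading line
with open('original.csv') as ins, open('infile.csv', 'w') as outs:
    header = ins.readline()        # remember the heading for step 3
    for line in ins:
        outs.write(line)

# Step 2: let GNU sort do the sorting and deduplication
subprocess.check_call([r'c:\gnuwin32\bin\sort', '--unique',
                       '-o', 'outfile.csv', 'infile.csv'])

# Step 3: write the result with the heading prepended
with open('outfile.csv') as ins, open('final.csv', 'w') as outs:
    outs.write(header)
    for line in ins:
        outs.write(line)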

Solution 4:

Your original solution is slightly incorrect: you could have different lines hashing to the same value (a hash collision), and your code would leave one of them out.

In terms of algorithmic complexity, if you're expecting relatively few duplicates, I think the fastest solution would be to scan the file line by line, adding the hash of each line (as you did), but also storing the location of that line. Then when you encounter a duplicate hash, seek to the original place to make sure that it is a duplicate and not just a hash collision, and if so, seek back and skip the line.

By the way, if the CSV values are normalized (i.e., records are considered equal iff the corresponding CSV rows are equivalent byte-for-byte), you need not involve CSV parsing here at all, just deal with plain text lines.
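A sketch of that idea, working on plain text lines as suggested (the variable names and binary-mode handling are assumptions, not the asker's original code):

def dedupe_with_offsets(infilename, outfilename):
    # Sketch: remember hash -> byte offsets of kept lines; on a hash match,
    # seek back and compare the actual lines to rule out a collision.
    seen = {}
    with open(infilename, 'rb') as ins, open(outfilename, 'wb') as outs:
        while True:
            offset = ins.tell()
            line = ins.readline()
            if not line:
                break
            h = hash(line)
            is_duplicate = False
            for prev_offset in seen.get(h, []):
                ins.seek(prev_offset)
                earlier = ins.readline()
                ins.seek(offset + len(line))   # jump back to where we were
                if earlier == line:
                    is_duplicate = True
                    break
            if not is_duplicate:
                seen.setdefault(h, []).append(offset)
                outs.write(line)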

Solution 5:

Since I suppose you'll have to do this on a somewhat regular basis (or you'd have hacked together a one-off script), and you mentioned you were interested in a theoretical solution, here's a possibility.

Read the input lines into B-Trees, ordered by each input line's hash value, writing them to disk when memory fills. We take care to store, on the B-Trees, the original lines attached to the hash (as a set, since we only care about unique lines). When we read a duplicate element, we check the lines set on the stored element and add it if it's a new line that happens to hash to the same value.

Why B-Trees? They require fewer disk reads when you can (or want to) read only parts of them into memory. The degree (number of children) of each node depends on the available memory and the number of lines, but you don't want to have too many nodes.

Once we have those B-Trees on disk, we compare the lowest element from each of them. We remove the lowest of all, from all B-Trees that have it. We merge their lines sets, meaning that we have no duplicates left for those lines (and also that we have no more lines that hash to that value). We then write the lines from this merge into the output csv structure.

We can separate half of the memory for reading the B-Trees, and half to keep the output csv in memory for some time. We flush the csv to disk when its half is full, appending to whatever has already been written. How much of each B-Tree we read on each step can be roughly calculated by (available_memory / 2) / number_of_btrees, rounded so we read full nodes.

In pseudo-Python:

ins = DictReader(...)
i = 0
while ins.still_has_lines_to_be_read():
    tree = BTree(i)
    while fits_into_memory:
        line = ins.readline()
        tree.add(line, key=hash)
    tree.write_to_disc()
    i += 1
n_btrees = i

# At this point, we have several (n_btrees) B-Trees on disk
while n_btrees:
    n_bytes = (available_memory / 2) / n_btrees
    btrees = [read_btree_from_disk(i, n_bytes)
              for i in range(n_btrees)]
    lowest_candidates = [get_lowest(b) for b in btrees]
    lowest = min(lowest_candidates)
    lines = set()
    for i in range(n_btrees):
        tree = btrees[i]
        if lowest == lowest_candidates[i]:
            node = tree.pop_lowest()
            lines.update(node.lines)
        if tree.is_empty():
            n_btrees -= 1
    if output_memory_is_full or n_btrees == 0:
        outs.append_on_disk(lines)
