
Reading Huge File In Python

I have a 384MB text file with 50 million lines. Each line contains two space-separated integers: a key and a value. The file is sorted by key. I need an efficient way of looking up the values for a list of about 200 keys.

Solution 1:

If you only need 200 of the 50 million lines, then reading all of them into memory is a waste. I would sort the list of search keys and then apply a binary search to the file using seek() or something similar. That way you don't read the entire file into memory, which should speed things up.
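A minimal sketch of that idea (iterative rather than recursive; the helper names are mine, and it assumes every line is exactly two space-separated integers):

import os

def lookup(f, target, size):
    # Check the very first line separately: the probes below always read
    # the first complete line *after* an offset, so offset 0 is never probed.
    f.seek(0)
    first = f.readline()
    if first:
        key, value = map(int, first.split())
        if key == target:
            return value
    lo, hi = 0, size
    while lo < hi:
        mid = (lo + hi) // 2
        f.seek(mid)
        f.readline()            # may have landed mid-line; discard up to the next line start
        line = f.readline()     # first complete line after offset mid
        if not line:            # ran off the end of the file
            hi = mid
            continue
        key, value = map(int, line.split())
        if key == target:
            return value
        if key < target:
            lo = mid + 1
        else:
            hi = mid
    raise KeyError(target)

def lookup_keys(path, keys):
    # Sorting the search keys keeps successive probes close together in the file.
    size = os.path.getsize(path)
    with open(path) as f:
        return {k: lookup(f, k, size) for k in sorted(keys)}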

Solution 2:

Slight optimization of S.Lott's answer:

from collections import defaultdict

keyValues = defaultdict(list)
targetKeys = []  # some list of keys, as strings
for line in fin:
    key, value = line.split()
    if key in targetKeys:
        keyValues[key].append(value)

Since we're using a dictionary rather than a list, the keys don't have to be numbers. This saves the map() operation and a string-to-integer conversion for each line. If you want the keys to be numbers, do the conversion at the end, when you only have to do it once per key, rather than once for each of the 50 million lines.
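For example, a sketch of that final step, assuming keyValues was built as above:

# Convert once per key at the end, not once per line while scanning.
numericKeyValues = {int(key): values for key, values in keyValues.items()}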

Solution 3:

It's not clear what "list[pointer]" is all about. Consider this, however.

from collections import defaultdict

keyValues = defaultdict(list)
targetKeys = []  # some list of keys
for line in fin:
    key, value = map(int, line.split())
    if key in targetKeys:
        keyValues[key].append(value)

Solution 4:

I would use memory mapping: http://docs.python.org/library/mmap.html. This way you can use the file as if it were stored in memory, but the OS decides which pages should actually be read from the file.
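A minimal sketch of how that might look (the file name is hypothetical; mmap exposes the file as a byte buffer, so lines come back as bytes):

import mmap

with open('sampledata.txt', 'rb') as f:  # hypothetical file name
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        # The OS pages data in on demand. Iterating still touches every line,
        # so combining mmap with the seek-based binary search above avoids
        # reading untouched pages at all.
        for raw in iter(mm.readline, b''):
            key, value = map(int, raw.split())
            # ... check key against the search keys here ...
    finally:
        mm.close()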

Solution 5:

Here is a recursive binary search on the text file:

import os, stat

class IntegerKeyTextFile(object):
    def __init__(self, filename):
        self.filename = filename
        self.f = open(self.filename, 'r')
        self.getStatinfo()

    def getStatinfo(self):
        self.statinfo = os.stat(self.filename)
        self.size = self.statinfo[stat.ST_SIZE]

    def parse(self, line):
        key, value = line.split()
        k = int(key)
        v = int(value)
        return (k, v)

    def __getitem__(self, key):
        return self.findKey(key)

    def findKey(self, keyToFind, startpoint=0, endpoint=None):
        "Recursively search a text file"
        if endpoint is None:
            endpoint = self.size

        currentpoint = (startpoint + endpoint) // 2
        while True:
            self.f.seek(currentpoint)
            if currentpoint != 0:
                # may not start at a line break! Discard.
                baddata = self.f.readline()

            linestart = self.f.tell()
            keyatpoint = self.f.readline()

            if not keyatpoint:
                # read returned empty - end of file
                raise KeyError('key %d not found' % (keyToFind,))

            k, v = self.parse(keyatpoint)

            if k == keyToFind:
                print('key found at ', linestart, ' with value ', v)
                return v

            if endpoint == startpoint:
                raise KeyError('key %d not found' % (keyToFind,))

            if k > keyToFind:
                return self.findKey(keyToFind, startpoint, currentpoint)
            else:
                return self.findKey(keyToFind, currentpoint, endpoint)

A sample text file created in jEdit seems to work:

>>> i = integertext.IntegerKeyTextFile('c:\\sampledata.txt')
>>> i[1]
key found at  0  with value  345
345

It could definitely be improved by caching found keys and using the cache to determine future starting seek points.
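For example, a minimal sketch of the exact-hit part of that caching idea (a hypothetical subclass; using cached offsets to narrow future start and end points is left out):

class CachedIntegerKeyTextFile(IntegerKeyTextFile):
    # Remember values already found so repeated lookups skip the search.
    def __init__(self, filename):
        IntegerKeyTextFile.__init__(self, filename)
        self.cache = {}

    def __getitem__(self, key):
        if key not in self.cache:
            self.cache[key] = self.findKey(key)
        return self.cache[key]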
