How To Filter Overlap Of Two Rows In A Big File In Python
Solution 1:
I've been thinking about how to solve this problem in a better way, without all the reversing and indexing and stuff, and I've come up with a solution that's longer and more involved, but easier to read, prettier, more maintainable and extendable, IMHO.
First we need a special kind of list that we can iterate over "correctly", even if an item in it is deleted. Here is a blog post going into more detail about how lists and iterators work, and reading it will help you understand what's going on here:
classSmartList(list):def__init__(self, *args, **kwargs):
super(SmartList, self).__init__(*args, **kwargs)
self.iterators = []
def__iter__(self):
return SmartListIter(self)
def__delitem__(self, index):
super(SmartList, self).__delitem__(index)
for iterator inself.iterators:
iterator.item_deleted(index)
We extend the built-in list
and make it return a custom iterator instead of the default. Whenever an item in the list is deleted, we call the item_deleted
method of every item in the self.iterators
list. Here's the code for SmartListIter
:
classSmartListIter(object):
def__init__(self, smartlist, index=0):
self.smartlist = smartlist
smartlist.iterators.append(self)
self.index = index
def__iter__(self):
return self
defnext(self):
try:
item = self.smartlist[self.index]
except IndexError:
self.smartlist.iterators.remove(self)
raise StopIteration
index = self.index
self.index += 1return (index, item)
defitem_deleted(self, index):
if index >= self.index:
return
self.index -= 1
So the iterator adds itself to the list of iterators, and removes itself from the same list when it is done. If an item with an index less than the current index is deleted, we decrement the current index by one so that we won't skip an item like a normal list iterator would do.
The next
method returns a tuple (index, item)
instead of just the item, because that makes things easier when it's time to use these classes -- we won't have to mess around with enumerate
.
So that should take care of having to go backwards, but we're still having to use a lot of indexes to juggle between four different lines in every loop. Since two and two lines go together, let's make a class for that:
classLinePair(object):
def__init__(self, pair):
self.pair = pair
self.sets = [set(line.split()) for line in pair]
self.c = len(self.sets[0]) * len(self.sets[1])
defoverlap(self, other):
ab = float(len(self.sets[0] & other.sets[0]) * \
len(self.sets[1] & other.sets[1]))
overlap = ab / (self.c + other.c - ab)
return overlap
def__str__(self):
return"".join(self.pair)
The pair
attribute is a tuple of two lines read directly from the input file, complete with newlines. We use it later to write that pair back to a file. We also convert both lines to a set and calculate the c
attribute, which is a property of every pair of lines. Finally we make a method that will compute the overlap between one LinePair and another. Notice that d
is gone, since that is just the c
attribute of the other pair.
Now for the grand finale:
from itertools import izip
withopen("iuputfile.txt") as fileobj:
pairs = SmartList([LinePair(pair) for pair in izip(fileobj, fileobj)])
for first_index, first_pair in pairs:
for second_index, second_pair in SmartListIter(pairs, first_index + 1):
if first_pair.overlap(second_pair) > 0.25:
del pairs[second_index]
withopen("outputfile.txt", "w") as fileobj:
for index, pair in pairs:
fileobj.write(str(pair))
Notice how easy it is to read the central loop here, and how short it is. If you need to change this algorithm in the future, it's likely much more easily done with this code than with my other code. The izip
is used to group two and two lines of the input file, as explained here.
Solution 2:
Stack Overflow is not here to program for you or to solve general debugging tasks. It's for specific problems that you've tried to solve yourself but can't. You're asking questions you should, as a programmer, be able to figure out yourself. Start your program like this:
python -m pdb my_script.py
Now you can step through your script line by line using the n
command. If you want to see what's inside a variable, simply type the name of that variable. By using this method you will find out why things don't work yourself. There are lots of other smart things you can do using pdb (the python debugger) but for this case the n
command is sufficient.
Please put in some more effort towards solving your problem yourself before asking another question here.
That being said, here is what was wrong with your modified script:
withopen("iuputfile.txt") as fileobj:
sets = [set(line.split()) for line in fileobj]
for first_index inrange(len(sets) - 4, -2, -2):
c = len(sets[first_index]) * len(sets[first_index + 1])
for second_index inrange(len(sets) - 2, first_index, -2):
d = len(sets[second_index]) * len(sets[second_index + 1])
# Error 1:
ab = len(sets[first_index] & sets[second_index]) * \
len(sets[first_index + 1] & sets[second_index + 1])
# Error 2:
overlap = (float(ab) / (c + d - ab))
if overlap > 0.25:
# Error 3:del sets[second_index + 1]
del sets[second_index]
withopen("outputfile.txt", "w") as fileobj:
for set_ in sets:
# You've removed the sorting, I assume it's because the order# is unimportant
output = " ".join(set_)
fileobj.write("{0}\n".format(output))
The mistakes were:
- Error 1: Intersection is
&
. Union is|
. - Error 2: Since all of the variables are integers, the result will be an integer too, unless you're using python 3. If you are, this is not an error. If you're not, you need to make sure one of the variables are a float to force the result to be a float as well. Hence the
float(ab)
. - Error 3: Remember to always work from the back and forwards. When you delete
sets[second_index]
, what used to be atsets[second_index + 1]
takes it place, so deletingsets[second_index + 1]
afterwards will delete what used to be atsets[second_index + 2]
, which is not what you want. So we delete the largest index first.
Post a Comment for "How To Filter Overlap Of Two Rows In A Big File In Python"