
Improve Performance: Remove All Strings In A (big) List Appearing Only Once

I have a big list (25000 items, 14000 words) like this: texts = ['Lorem hello ipsum', 'Lorem ipsum generator machine', ...]. I want to remove every word that appears only once across the whole list, but my current approach is too slow.

Solution 1:

You can use collections.Counter here to store the count of each word and then filter out words based on that count. With collections.Counter you get the counts of all items in O(N) time, while your current approach (list.count) takes O(N**2) time.

And never use sum for flattening a list of lists; it is very slow. (In your actual code you have a list of strings, and sum() will raise an error for that.) I've used a nested list comprehension in my answer; if you actually have a list of lists, it's better to use itertools.chain.from_iterable here.

>>> from collections import Counter
>>> texts = ['Lorem hello ipsum', 'Lorem ipsum generator machine', 'hello Lorem ipsum']
>>> c = Counter(word for x in texts for word in x.split())
>>> [' '.join(y for y in x.split() if c[y] > 1) for x in texts]
['Lorem hello ipsum', 'Lorem ipsum', 'hello Lorem ipsum']

Timing comparison:

In [8]: texts = ['Lorem hello ipsum', 'Lorem ipsum generator machine', 'hello Lorem ipsum']

In [9]: huge_texts = [x.split()*100 for x in texts]*1000  #list of lists                            

In [10]: %%timeit
from collections import Counter
from itertools import chain
c = Counter(chain.from_iterable(huge_texts))
texts = [[word for word in x if c[word]>1] for x in huge_texts]

1 loops, best of 3: 791 ms per loop

In [11]: %%timeit 
all_tokens = sum(huge_texts, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
texts = [[word for word in text if word not in tokens_once] for text in huge_texts]                                                        

1 loops, best of 3: 20.4 s per loop 

Solution 2:

Both your code and the Counter approach have a lot of hidden machinery: I recommend coming up with a good algorithm first and then thinking about the Python specifics.

Let's take a look at your code:

tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)

Here we are making a set of all_tokens. If Python's set uses a hash table then you can build the set in O(N) time. If Python's set uses a binary tree of some sort, then you can build the set in O(N log N) time.

Okay, so you have N' unique elements. For each of them you count how many times it appears in all_tokens, and every call to count scans the whole list. That means you're doing O(N*N') work, or O(N^2) for short.

That's bad.

How can we do better?

One way would be to sort all_tokens.

Now we know that duplicate elements must be adjacent to each other in the sorted list. Therefore, we just have to walk along the list and check whether an element is the same as the one preceding it: if so, we have a duplicate. Sorting takes O(N log N) time and finding the duplicates afterwards takes O(N) time. Copying each duplicated element into a new list takes O(1) amortized time, so the whole algorithm runs in O(N log N) time, which is quite good.

lorem="lorem ipsum dolor sit amet consetetur sadipscing elitr sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat sed diam voluptua at vero eos et accusam et justo duo dolores et ea rebum stet clita kasd gubergren no sea takimata sanctus est lorem ipsum dolor sit amet"

The algorithm for that looks like this:

lorem="lorem ipsum dolor sit amet consetetur sadipscing elitr sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat sed diam voluptua at vero eos et accusam et justo duo dolores et ea rebum stet clita kasd gubergren no sea takimata sanctus est lorem ipsum dolor sit amet"

all_tokens = lorem.split()

all_tokens.sort()

duplicate_count = 0
more_than_once = []
for i in range(1, len(all_tokens)):
  if all_tokens[i-1] == all_tokens[i]:
    duplicate_count += 1
    if duplicate_count == 1:
      # record the word the first time we see it repeated
      more_than_once.append(all_tokens[i])
  else:
    duplicate_count = 0
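
A small follow-up sketch, not part of the original answer: assuming texts is the question's list of space-separated strings, more_than_once can then be used to drop the words that occur only once. Converting it to a set keeps the membership tests O(1).

keep = set(more_than_once)
filtered = [' '.join(w for w in t.split() if w in keep) for t in texts]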

Can we do better? Yes!

Imagine that we have a hash table that allows us to insert/retrieve an element in O(1) time. We can walk down a list of items in O(N) time and increment their entry in the hash table every time we see the item. We can then retrieve the list of duplicated items in O(N) time.

What does that look like?

hash_table = {}
for i in all_tokens:
  if i in hash_table:   # dict.has_key() no longer exists in Python 3
    hash_table[i] += 1
  else:
    hash_table[i] = 1

more_than_once = [i for i in hash_table if hash_table[i] >= 2]

If Python's dictionary were not implemented as a hash table, it would be some kind of binary search tree, and this code would fall back to being an O(N log N) algorithm. (In CPython, dict really is a hash table, so the O(N) bound holds.)

Of course there are all kinds of fancy containers and libraries that Python could use to abstract this process, but, on a deep level, those will all be doing something a lot like what I've just described.
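
For concreteness, here is a sketch of that abstraction: collections.Counter wraps exactly this counting pattern, so the hand-rolled hash table above collapses to a couple of lines. The variable names simply mirror the ones used earlier.

from collections import Counter

hash_table = Counter(all_tokens)   # one O(N) counting pass
more_than_once = [i for i in hash_table if hash_table[i] >= 2]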

Personally, I prefer to find a good algorithm for solving a problem before looking for the language-specific features that will speed up an operation.

Solution 3:

The slow part is here:

all_tokens.count(word) == 1

For every single word, you are asking the code to walk the entire list and count how many items equal the current word. For example:

list1 = [1,2,3,1,5]

So, the first element is 1. To know how many 1s are in this list, we have to check all 5 entries to see if they are 1. We get back a count of 2, and we've done 5 comparisons so far. Then we take 2: again we have to compare it against every entry to count how many are 2, and the total is 1. Doing a full scan for each of the N elements is exactly what makes this O(N**2).

The Counter is the right method; I just wanted to explain more of why your version is slow.
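
To make the contrast concrete, a small sketch of my own reusing list1 from above: Counter does all the counting in a single pass instead of re-scanning the list once per element.

from collections import Counter

list1 = [1, 2, 3, 1, 5]
counts = Counter(list1)
print(counts[1])   # 2, obtained without walking list1 again for each element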
