
Check If Words In One Dataframe Appear In Another (python 3, Pandas)

Problem: I have two data frames, DF1 and DF2, each with a single **Phrases** column, and I want to remove any duplicates/partial duplicates between them (e.g. "Little Red" in DF1 is a partial duplicate of "Little Red Corvette" in DF2).

Solution 1:

I found the solution below. It's not very elegant, but it works.

import pandas as pd

df1 = pd.DataFrame(['Little Red', 'Grow Your', 'James Bond', 'Tom Brady'])
df2 = pd.DataFrame(['Little Red Corvette', 'Grow Your Beans', 'James Dean'])

# For each element of df1, if left(df2, len(df1 element)) equals the df1 element,
# replace the df1 element with the matching df2 element.
# Note that the column name is 0.
for i in range(len(df1)):
    for j in range(len(df2)):
        if df1.loc[i, 0] == df2.loc[j, 0][:len(df1.loc[i, 0])]:
            df1.loc[i, 0] = df2.loc[j, 0]

# Finally, merge df1 and df2 by union of the keys.
# Here the column name is also 0.
df3 = df2.merge(df1, how='outer', on=0, sort=True, copy=False)

The DataFrame df3 is what you need.
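For comparison, here is a shorter way to express the same prefix-matching idea with pandas string methods. This is my own sketch, not part of the original answer, and it assumes both frames keep the default column name 0:

import pandas as pd

df1 = pd.DataFrame(['Little Red', 'Grow Your', 'James Bond', 'Tom Brady'])
df2 = pd.DataFrame(['Little Red Corvette', 'Grow Your Beans', 'James Dean'])

# Keep only the df1 phrases that are not a prefix of any df2 phrase,
# then append the survivors to df2.
mask = df1[0].apply(lambda phrase: not df2[0].str.startswith(phrase).any())
df3 = pd.concat([df2, df1[mask]], ignore_index=True)
print(df3)  # 'James Bond' and 'Tom Brady' are kept; the prefix matches are dropped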

Solution 2:

You could bisect the values after sorting:

import pandas as pd

df1 = pd.DataFrame(['Little Red', 'Grow Your', 'James Bond', 'Tom Brady'])
df2 = pd.DataFrame(['Little Red Corvette', 'Grow Your Beans', 'James Dean'])

from bisect import bisect_left

def find_common(df1, df2):
    # Sort df2's values so they can be binary-searched.
    vals = df2.values
    vals.sort(0)
    for i, row in df1.iterrows():
        val = row.values
        # Find where this df1 phrase would be inserted in the sorted df2 values.
        ind = bisect_left(vals, val, hi=len(vals) - 1)
        # Keep the df1 phrase only if it is not a substring of the closest df2 value.
        if val[0] not in vals[ind][0]:
            yield val[0]


# DataFrame.append was removed in pandas 2.0, so use pd.concat instead.
df3 = pd.concat([df2, pd.DataFrame(find_common(df1, df2))], ignore_index=True)
print(df3)

Output:

                     0
0      Grow Your Beans
1           James Dean
2  Little Red Corvette
3           James Bond
4            Tom Brady

Sorting gives you an O(N log N) solution, as opposed to the O(N^2) approach of iterating over every string in df1 each time you check a string from df2.
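To make the binary-search step concrete, here is a small standalone illustration (my own sketch, using a plain sorted list rather than the DataFrame values):

from bisect import bisect_left

phrases = sorted(['Little Red Corvette', 'Grow Your Beans', 'James Dean'])

# Locate where 'Little Red' would be inserted; the phrase at that position
# is the closest candidate for a partial-duplicate match.
ind = bisect_left(phrases, 'Little Red', hi=len(phrases) - 1)
print(phrases[ind])                    # Little Red Corvette
print('Little Red' in phrases[ind])    # True -> treated as a partial duplicate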

Solution 3:

I would first do an outer merge on the dataframes. I am not sure whether DF1 refers to the column name or the dataframe variable name in your posting, but for simplicity I assume you have two dataframes which have columns with strings:

df1
#         words
# 0  little red
# 1   grow your
# 2  james bond
# 3  tom brandy

df2
#                  words
# 0  little red corvette
# 1      grow your beans
# 2           james dean
# 3               little
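For reference, a minimal construction of the two frames displayed above (the original answer only shows their printed form, so the 'words' column name and the values are taken from that output):

import pandas

df1 = pandas.DataFrame({'words': ['little red', 'grow your', 'james bond', 'tom brandy']})
df2 = pandas.DataFrame({'words': ['little red corvette', 'grow your beans', 'james dean', 'little']})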

Next, make a new dataframe that merges these two (use an outer merge). This takes care of the exact duplicates:

df3 = pandas.merge(df1, df2, on='words', how='outer')
#                  words
# 0           little red
# 1            grow your
# 2           james bond
# 3           tom brandy
# 4  little red corvette
# 5      grow your beans
# 6           james dean
# 7               little

Next you want to use the Series.str.get_dummies method:

dummies = df3.words.str.get_dummies(sep='')
#    grow your  grow your beans  james bond  james dean  little  little red  \
# 0          0                0           0           0       1           1
# 1          1                0           0           0       0           0
# 2          0                0           1           0       0           0
# 3          0                0           0           0       0           0
# 4          0                0           0           0       1           1
# 5          1                1           0           0       0           0
# 6          0                0           0           1       0           0
# 7          0                0           0           0       1           0
#
#    little red corvette  tom brandy
# 0                    0           0
# 1                    0           0
# 2                    0           0
# 3                    0           1
# 4                    1           0
# 5                    0           0
# 6                    0           0
# 7                    0           0

Notice that if a string is not a sub-string of any other entry in the words column (i.e. it is either a standalone phrase or the super-string of one or more sub-strings), its column will sum to 1; otherwise it will sum to a number greater than 1 (the column sums are worked through after the drop below). Now you can use this dummies dataframe to find the indices corresponding to the sub-strings and remove them:

from numpy import where

bad_rows = [where(df3.words == word)[0][0]
            for word in list(dummies)
            if dummies[word].sum() > 1]  # only sub-strings will sum to greater than 1
# [1, 7, 0]

df3.drop(df3.index[bad_rows], inplace=True)
#                  words
# 2           james bond
# 3           tom brandy
# 4  little red corvette
# 5      grow your beans
# 6           james dean
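As a quick check on the reasoning above, the column sums of dummies (worked out by hand from the table shown earlier) look like this:

print(dummies.sum())
# grow your              2   <- sub-string of 'grow your beans'     -> dropped
# grow your beans        1
# james bond             1
# james dean             1
# little                 3   <- sub-string of two longer phrases    -> dropped
# little red             2   <- sub-string of 'little red corvette' -> dropped
# little red corvette    1
# tom brandy             1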

Note - this also takes care of the case where you have more than one sub-string of a super-string. For instance, 'little' and 'little red' are both sub-strings of the super-string 'little red corvette', so I assume you would only keep the super-string.
