Check If Words In One Dataframe Appear In Another (python 3, Pandas)
Solution 1:
I found this solution below. For now, it's not very elegant but it works.
import pandas as pd
df1 = pd.DataFrame(['Little Red', 'Grow Your', 'James Bond', 'Tom Brady'])
df2 = pd.DataFrame(['Little Red Corvette', 'Grow Your Beans', 'James Dean'])
# For each element of df1, if we found a left(df2, len(df1)) = df1, we# apply df1 = df2# Remark that the column name is 0for i inrange(int(df1.count())):
for j inrange(int(df2.count())):
if df1.loc[i, 0] == df2.loc[j, 0][:len(df1.loc[i, 0])]:
df1.loc[i, 0] = df2.loc[j, 0]
# Finaly we merge df1 and df2 by union of the keys.# Here the column name is 0
df3 = df2.merge(df1, how='outer', on=0, sort=True, copy=False)
The DataFrame df3 is what you need.
Solution 2:
You could bisect
the values after sorting:
import pandas as pd
df1 = pd.DataFrame(['Little Red', 'Grow Your', 'James Bond', 'Tom Brady'])
df2 = pd.DataFrame(['Little Red Corvette', 'Grow Your Beans', 'James Dean'])
from bisect import bisect_left
deffind_common(df1, df2):
vals = df2.values
vals.sort(0)
for i, row in df1.iterrows():
val = row.values
ind = bisect_left(vals, val, hi=len(vals) - 1)
if val[0] notin vals[ind][0]:
yield val[0]
df3 = df2.append(pd.DataFrame(find_common(df1, df2)),ignore_index=True)
print(df3)
Output:
0
0 Grow Your Beans
1 James Dean
2 Little Red Corvette
3 James Bond
4 Tom Brady
sorting gives you a O(N log N)
solution as opposed to O(n^2)
iterating over every string in df1 every time you get a check a string from df2
Solution 3:
I would first do an outer merge on the dataframes. I am not sure whether DF1
refers to the column name or the dataframe varaiable name in your posting, but for simplicity I assume you have two dataframes which have columns with strings:
df1
# words#0 little red#1 grow your#2 james bond#3 tom brandy
df2
# words#0 little red corvette#1 grow your beans#2 james dean#3 little
Next, make a new dataframe that merges these two (use an outer merge). This takes care of the duplicates
df3 = pandas.merge( df1, df2, on='words', how='outer')
# words#0 little red#1 grow your#2 james bond#3 tom brandy#4 little red corvette#5 grow your beans#6 james dean#7 little
Next you want to use the Series.str.get_dummies
method:
dummies = df3.words.str.get_dummies(sep='')
# grow your grow your beans james bond james dean little little red \#0 0 0 0 0 1 1 #1 1 0 0 0 0 0 #2 0 0 1 0 0 0 #3 0 0 0 0 0 0 #4 0 0 0 0 1 1 #5 1 1 0 0 0 0 #6 0 0 0 1 0 0 #7 0 0 0 0 1 0 # little red corvette tom brandy #0 0 0 #1 0 0 #2 0 0 #3 0 1 #4 1 0 #5 0 0 #6 0 0 #7 0 0
Notice, if a string contains no other sub-strings in the words
column, or if is the super-string of 1 or more sub-strings, then it's column will sum to 1 - otherwise it will sum to a number > 1. Now you can use this dummies
dataframe to find the indices corresponding to the sub-strings and remove them:
bad_rows = [where(df3.words==word)[0][0]
for word in list(dummies)
if dummies[word].sum() > 1 ] # only substrings will sum to greater than 1#[1, 7, 0]
df3.drop( df3.index[bad_rows] , inplace=True)
# words#2 james bond#3 tom brandy#4 little red corvette#5 grow your beans#6 james dean
Note- this takes care of the case where you have more than 1 sub-string of a super-string. For instance 'little'
, 'little red'
are both sub-strings of the super-string 'little red corvette'
, so I assume you would only keep the super-string.
Post a Comment for "Check If Words In One Dataframe Appear In Another (python 3, Pandas)"