Search For Text Contained In Any Row Of A Pandas Dataframe
I have the following DataFrame:
pred[['right_context', 'PERC']]
Out[247]:
right_context PERC
0 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 0.000197
1
Solution 1:
Series.str.contains & str.upper
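You can use Series.str.contains and join the values of the Address column in _direcciones into one string with | as the separator, so that the addresses become alternatives in a single regular expression.
Note that the strings in pred are cast to uppercase with str.upper first, so this only finds addresses that are themselves stored in uppercase.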
pred['found?'] = pred['right_context'].str.upper()\
.str.contains('|'.join(_direcciones['Address']))
print(pred)
right_context PERC found?
0 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 0.000197 False
1 San Pedro xxxxxxxxxxxx 0.572630 True
2 zxxxxxxxxxxxxxxxxxxxxxxxxxxx 0.572630 False
3 de San Pedro Este parcela 0.572630 True
4 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 0.035577 False
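For anyone who wants to run this end to end, here is a minimal sketch of the same idea. The sample values below are hypothetical stand-ins shaped like the frames in the question, not the asker's real data.

```python
import pandas as pd

# Hypothetical sample data shaped like the question's frames.
pred = pd.DataFrame({
    "right_context": [
        "xxxx",
        "San Pedro xxxx",
        "zxxxx",
        "de San Pedro Este parcela",
        "xxxxx",
    ],
    "PERC": [0.000197, 0.572630, 0.572630, 0.572630, 0.035577],
})
_direcciones = pd.DataFrame({"Address": ["SAN PEDRO", "bbbb", "yyyy"]})

# Join all addresses into one regex alternation: 'SAN PEDRO|bbbb|yyyy'.
pattern = "|".join(_direcciones["Address"])

# Uppercase the text so it can match the uppercase address.
pred["found?"] = pred["right_context"].str.upper().str.contains(pattern)
print(pred["found?"].tolist())  # [False, True, False, True, False]
```

Because the text is uppercased before matching, the lowercase placeholders "bbbb" and "yyyy" could never match here; only addresses stored in uppercase can be found this way.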
Only get T & F
pred['found?'] = pred['right_context'].str.upper()\
.str.contains('|'.join(_direcciones['Address']))\
.astype(str).str[:1]
print(pred)
right_context PERC found?
0 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 0.000197 F
1 San Pedro xxxxxxxxxxxx 0.572630 T
2 zxxxxxxxxxxxxxxxxxxxxxxxxxxx 0.572630 F
3 de San Pedro Este parcela 0.572630 T
4 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 0.035577 F
Output of '|'.join
'|'.join(_direcciones['Address'])
'SAN PEDRO|bbbbbbbbbbbbbbbbbbbbbb|yyyyyyyyyyyyyyyyyyy'
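One caveat with the joined string: str.contains treats it as a regular expression by default, so an address containing characters such as . would be interpreted as regex syntax. A possible safeguard, sketched below with made-up addresses, is to pass each value through re.escape before joining. This is an optional hardening step, not part of the original answer.

```python
import re
import pandas as pd

# Hypothetical addresses; "AV. 5" contains a regex metacharacter (the dot).
addresses = pd.Series(["SAN PEDRO", "AV. 5"])
text = pd.Series(["AV. 5 NORTE", "AVX 5 NORTE"])

raw = "|".join(addresses)                             # dot matches any character
escaped = "|".join(re.escape(a) for a in addresses)   # dot matches only a literal dot

print(text.str.contains(raw).tolist())      # [True, True]
print(text.str.contains(escaped).tolist())  # [True, False]
```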
Solution 2:
Wrap each string in word boundaries (\b), join the strings with |, and pass the resulting pattern to Series.str.contains with the parameter case=False:
pat = '|'.join(r"\b{}\b".format(x) for x in _direcciones['entity_content'])
pred['found?'] = pred['right_context'].str.contains(pat, case=False)
print (pred)
right_context PERC found?
0 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 0.000197 False
1 San Pedro xxxxxxxxxxxx 0.572630 True
2 zxxxxxxxxxxxxxxxxxxxxxxxxxxx 0.572630 False
3 de San Pedro Este parcela 0.572630 True
4 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 0.035577 False
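To see what the word boundaries buy you, here is a small sketch comparing the pattern with and without \b. The texts and the single address are hypothetical examples chosen to show the difference.

```python
import pandas as pd

# Hypothetical texts: the second one only contains "San Pedro" as part of a longer word.
texts = pd.Series(["San Pedro Este parcela", "San Pedrosa norte"])
addresses = ["SAN PEDRO"]

plain = "|".join(addresses)
bounded = "|".join(r"\b{}\b".format(x) for x in addresses)

print(texts.str.contains(plain, case=False).tolist())    # [True, True]
print(texts.str.contains(bounded, case=False).tolist())  # [True, False]
```

Without the boundaries, "San Pedrosa" counts as a hit because it contains the substring; with them, only the whole-word occurrence matches.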
If necessary, add numpy.where to turn the boolean result into T and F:
pat = '|'.join(r"\b{}\b".format(x) for x in _direcciones['entity_content'])
pred['found?'] = np.where(pred['right_context'].str.contains(pat, case=False), 'T', 'F')
print (pred)
right_context PERC found?
0 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 0.000197 F
1 San Pedro xxxxxxxxxxxx 0.572630 T
2 zxxxxxxxxxxxxxxxxxxxxxxxxxxx 0.572630 F
3 de San Pedro Este parcela 0.572630 T
4 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 0.035577 F
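If right_context can contain missing values, the mask returned by str.contains may carry those missing values along instead of False, depending on the pandas version and column dtype. A minimal sketch, assuming you want missing rows reported as F, is to pass na=False before handing the mask to np.where. The data and pattern below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in the text column.
pred = pd.DataFrame({"right_context": ["San Pedro xxxx", None, "zxxxx"]})
pat = r"\bSAN PEDRO\b"

# na=False makes rows with missing text count as "not found".
mask = pred["right_context"].str.contains(pat, case=False, na=False)
pred["found?"] = np.where(mask, "T", "F")
print(pred["found?"].tolist())  # ['T', 'F', 'F']
```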
Solution 3:
Try this approach; it works for me on a small data sample:
from pprint import pprint
import numpy as np
import pandas as pd

def main():
    # Sample data
    df_right = pd.DataFrame({'right_context': 'San Jose, San Pedro, San Pedro Este, Santani, Honolulu'.split(','),
                             'PERC': np.arange(5)})
    # Note: directions is built as sample data but is not used in the search below
    directions = pd.DataFrame({'address': 'SAN PEDRO, Djiloboji, Torres'.split(','),
                               'value': np.arange(3)})
    # Generate the found result (case-insensitive search for a hard-coded string)
    found = df_right['right_context'].str.contains('San Pedro', case=False).tolist()
    # Insert the result as a new column at position 2 of the original DataFrame
    df_right.insert(2, "found", found)
    pprint(df_right)

if __name__ == "__main__":
    main()
Output:
right_context PERC found
0 San Jose 0 False
1 San Pedro 1 True
2 San Pedro Este 2 True
3 Santani 3 False
4 Honolulu 4 False
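The snippet above searches for the hard-coded string 'San Pedro' and never reads the directions frame. A possible variation, sketched here under the assumption that every value in directions['address'] should be searched for, builds the pattern from that column instead. The use of re.escape is my addition, not part of the original answer.

```python
import re
import pandas as pd

df_right = pd.DataFrame({
    "right_context": ["San Jose", "San Pedro", "San Pedro Este", "Santani", "Honolulu"],
    "PERC": range(5),
})
directions = pd.DataFrame({
    "address": ["SAN PEDRO", "Djiloboji", "Torres"],
    "value": range(3),
})

# Build one case-insensitive alternation from every address.
pattern = "|".join(re.escape(a) for a in directions["address"])
df_right["found"] = df_right["right_context"].str.contains(pattern, case=False)
print(df_right["found"].tolist())  # [False, True, True, False, False]
```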