
Search For Text Contained In Any Row Of A Pandas Dataframe

I have the following DataFrame pred[['right_context', 'PERC']] Out[247]: right_context PERC 0 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 0.000197 1

Solution 1:

Series.str.contains & str.upper

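You can use Series.str.contains and join the Address column of _direcciones into a single string with | as the separator, so the pattern acts as a regex OR over all addresses.

Note that the strings in pred also have to be converted to uppercase with str.upper, because the match is case-sensitive by default and addresses such as SAN PEDRO are stored in uppercase.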

pred['found?'] = pred['right_context'].str.upper()\
                                      .str.contains('|'.join(_direcciones['Address']))

print(pred)
                          right_context      PERC  found?
0  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  0.000197   False
1                San Pedro xxxxxxxxxxxx  0.572630    True
2          zxxxxxxxxxxxxxxxxxxxxxxxxxxx  0.572630   False
3             de San Pedro Este parcela  0.572630    True
4   xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  0.035577   False

Only get T & F

pred['found?'] = pred['right_context'].str.upper()\
                                      .str.contains('|'.join(_direcciones['Address']))\
                                      .astype(str).str[:1]

print(pred)
                          right_context      PERC found?
0  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  0.000197      F
1                San Pedro xxxxxxxxxxxx  0.572630      T
2          zxxxxxxxxxxxxxxxxxxxxxxxxxxx  0.572630      F
3             de San Pedro Este parcela  0.572630      T
4   xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  0.035577      F

Output of '|'.join

'|'.join(_direcciones['Address'])

'SAN PEDRO|bbbbbbbbbbbbbbbbbbbbbb|yyyyyyyyyyyyyyyyyyy'
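Because str.contains treats the joined string as a regular expression by default, an address containing characters such as . or ( would be read as regex syntax rather than literal text. A minimal sketch of one way to guard against that, assuming made-up sample data (the frames and the address "AV. 5 (NORTE)" below are hypothetical, not the asker's data), escapes each address with re.escape before joining:

```python
import re

import pandas as pd

# Hypothetical sample data standing in for pred and _direcciones.
pred = pd.DataFrame({
    'right_context': ['xxxxxxxx', 'San Pedro xxxx', 'Av. 5 (Norte) parcela'],
    'PERC': [0.000197, 0.572630, 0.035577],
})
_direcciones = pd.DataFrame({'Address': ['SAN PEDRO', 'AV. 5 (NORTE)']})

# Escape regex metacharacters so every address is matched literally,
# then join the escaped addresses with | (regex OR).
pattern = '|'.join(re.escape(address) for address in _direcciones['Address'])

# Uppercase the text first, as in the answer above, then search.
pred['found?'] = pred['right_context'].str.upper().str.contains(pattern)

print(pred['found?'].tolist())  # [False, True, True]
```

Without the escaping step, the parentheses in the second address would be interpreted as a regex group and the third row would not match.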

Solution 2:

Use word boundaries with all strings joined by | with Series.str.contains and parameter case=False:

pat = '|'.join(r"\b{}\b".format(x) for x in _direcciones['entity_content'])
pred['found?'] = pred['right_context'].str.contains(pat, case=False)
print (pred)
                          right_context      PERC  found?
0  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  0.000197   False
1                San Pedro xxxxxxxxxxxx  0.572630    True
2          zxxxxxxxxxxxxxxxxxxxxxxxxxxx  0.572630   False
3             de San Pedro Este parcela  0.572630    True
4   xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  0.035577   False
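To see what the \b word boundaries change, here is a small sketch on hypothetical sample strings (not the asker's data). It compares a plain substring search with a word-bounded one for the term "san":

```python
import pandas as pd

# Hypothetical rows; "Santani" contains "san" only as part of a longer word.
contexts = pd.Series(['San Pedro Este', 'Santani', 'de san pedro'])

# Plain substring search: matches "san" anywhere, including inside "Santani".
plain = contexts.str.contains('san', case=False)

# Word-bounded search: "san" must stand alone as a whole word.
bounded = contexts.str.contains(r'\bsan\b', case=False)

print(plain.tolist())    # [True, True, True]
print(bounded.tolist())  # [True, False, True]
```

The word-bounded pattern rejects "Santani", which is usually what you want when the search terms are place names.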

If necessary, add numpy.where to map the boolean result to T and F:

pat = '|'.join(r"\b{}\b".format(x) for x in _direcciones['entity_content'])
pred['found?'] = np.where(pred['right_context'].str.contains(pat, case=False), 'T', 'F')
print (pred)
                          right_context      PERC found?
0  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  0.000197      F
1                San Pedro xxxxxxxxxxxx  0.572630      T
2          zxxxxxxxxxxxxxxxxxxxxxxxxxxx  0.572630      F
3             de San Pedro Este parcela  0.572630      T
4   xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  0.035577      F
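If right_context can contain missing values, the result of str.contains may carry a missing value for those rows instead of False, depending on the column dtype. A hedged sketch, assuming a hypothetical frame with one missing row and a hypothetical pattern, passes na=False so the mask is strictly boolean before it goes into numpy.where:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the last row has no text at all.
pred = pd.DataFrame({
    'right_context': ['xxxxxxxx', 'San Pedro xxxx', None],
    'PERC': [0.000197, 0.572630, 0.035577],
})

# Hypothetical pattern, built the same way as pat above would be.
pat = r'\bsan pedro\b'

# na=False makes rows with missing text count as "not found".
mask = pred['right_context'].str.contains(pat, case=False, na=False)

pred['found?'] = np.where(mask, 'T', 'F')

print(pred['found?'].tolist())  # ['F', 'T', 'F']
```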

Solution 3:

Try this approach; it works for me on a small data sample:

from pprint import pprint
import numpy as np
import pandas as pd

def main():
    #Sample Data
    df_right = pd.DataFrame({'right_context':'San Jose, San Pedro, San Pedro Este, Santani, Honolulu'.split(','),
                       'PERC': np.arange(5)})
    directions = pd.DataFrame({'address':'SAN PEDRO, Djiloboji, Torres'.split(','),
                       'value': np.arange(3)})
    # generate found result
    found=(df_right['right_context'].str.contains('San Pedro', case=False)).tolist()
    # Insert into original dataframe
    df_right.insert(2,"found",found)
    pprint(df_right)

if __name__== "__main__":
    main()

Output:

     right_context  PERC  found
0         San Jose     0  False
1        San Pedro     1   True
2   San Pedro Este     2   True
3          Santani     3  False
4         Honolulu     4  False
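The snippet above searches for the hard-coded literal 'San Pedro' and never uses the directions frame. A possible variation, sketched here under the assumption that every entry of directions['address'] should be searched for, strips the whitespace left over from split(',') and builds the pattern from that column. The escaping and na=False are additions for safety, not part of the original answer:

```python
import re

import pandas as pd

# Same sample values as above; split(',') leaves leading spaces on some entries.
df_right = pd.DataFrame({
    'right_context': 'San Jose, San Pedro, San Pedro Este, Santani, Honolulu'.split(','),
})
directions = pd.DataFrame({
    'address': 'SAN PEDRO, Djiloboji, Torres'.split(','),
})

# Remove surrounding whitespace so ' Torres' does not require a leading space.
terms = directions['address'].str.strip()

# One alternation pattern covering every address, matched literally.
pattern = '|'.join(re.escape(term) for term in terms)

df_right['found'] = df_right['right_context'].str.contains(
    pattern, case=False, na=False
)

print(df_right['found'].tolist())  # [False, True, True, False, False]
```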
