How To Convert Text In Pandas Dataframe (delete Punctuation, Split Text Into One Word Per Entry)

April 06, 2024

I am cleaning data from a .txt source. Every line of the file contains a WhatsApp message, including its date and time stamp. I already split all of that into one column holding dat

Solution 1:

Use:

import re

# Emoji ranges taken from https://stackoverflow.com/a/49146722
emoji_pattern = re.compile("["
                       u"\U0001F600-\U0001F64F"  # emoticons
                       u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                       u"\U0001F680-\U0001F6FF"  # transport & map symbols
                       u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                       u"\U00002702-\U000027B0"
                       u"\U000024C2-\U0001F251"
                       "]+", flags=re.UNICODE)

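# regex=True is needed on pandas >= 2.0, where str.replace treats the pattern as a literal by default
df['new'] = (df['text_new'].str.lower()  # lowercase
                           .str.replace(r'[^\w\s]+', '', regex=True)  # remove punctuation
                           .str.replace(emoji_pattern, '', regex=True)  # remove emoji
                           .str.strip()  # remove leading and trailing whitespace
                           .str.split())  # split on whitespace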
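
To check what each step of the chain does before running it on the whole column, the same steps can be tried on a single string with the standard `re` module. This is only a sketch: the helper name `clean_message` is made up for illustration and is not part of the answer.

```python
import re

# Same character ranges as the emoji pattern in the answer.
emoji_pattern = re.compile("["
                           "\U0001F600-\U0001F64F"  # emoticons
                           "\U0001F300-\U0001F5FF"  # symbols & pictographs
                           "\U0001F680-\U0001F6FF"  # transport & map symbols
                           "\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           "\U00002702-\U000027B0"
                           "\U000024C2-\U0001F251"
                           "]+")


def clean_message(text):
    """Hypothetical helper: apply the answer's steps to one string."""
    text = text.lower()                   # lowercase
    text = re.sub(r'[^\w\s]+', '', text)  # remove punctuation
    text = emoji_pattern.sub('', text)    # remove emoji
    return text.strip().split()           # trim, then split on whitespace


print(clean_message('Okay let us do that. \U0001f602'))
```

In this sample the punctuation step already removes the emoji, because U+1F602 is neither a word character nor whitespace; the explicit emoji pattern is kept here only to mirror the answer.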

Sample:

import re
import pandas as pd

df = pd.DataFrame({'text_new':['How are you?',
                               'I am fine, we should meet this afternoon!',
                               'Okay let us do that. \U0001f602']})

emoji_pattern = re.compile("["
                       u"\U0001F600-\U0001F64F"  # emoticons
                       u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                       u"\U0001F680-\U0001F6FF"  # transport & map symbols
                       u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                       u"\U00002702-\U000027B0"
                       u"\U000024C2-\U0001F251""]+", flags=re.UNICODE)

# regex=True is needed on pandas >= 2.0
df['new'] = (df['text_new'].str.lower()
                           .str.replace(r'[^\w\s]+', '', regex=True)
                           .str.replace(emoji_pattern, '', regex=True)
                           .str.strip()
                           .str.split())
print(df)
                                    text_new  \
0                               How are you?   
1  I am fine, we should meet this afternoon!   
2                     Okay let us do that. 😂   

                                                new  
0                                   [how, are, you]  
1  [i, am, fine, we, should, meet, this, afternoon]  
2                         [okay, let, us, do, that]
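
The title asks for one word per entry. If that means one word per row rather than one list per row, `Series.explode` (available since pandas 0.25) is one possible way to get there. The sketch below is an assumption about the intended result, not part of the original answer; the column names `word` and `message_id` are invented for illustration.

```python
import pandas as pd

# Sample data mirroring the 'text_new' column used in the answer.
df = pd.DataFrame({'text_new': ['How are you?',
                                'I am fine, we should meet this afternoon!']})

# Lowercase, drop punctuation, trim, and split into a list of words per message.
words = (df['text_new'].str.lower()
                       .str.replace(r'[^\w\s]+', '', regex=True)
                       .str.strip()
                       .str.split())

# explode() gives every list element its own row and repeats the original index,
# so each word can still be traced back to its message.
one_word_per_row = (words.explode()
                         .rename('word')
                         .reset_index()
                         .rename(columns={'index': 'message_id'}))
print(one_word_per_row)
```

Keeping the original index as `message_id` makes it possible to group the words back by message later, for example to count words per message.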

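EDIT: To keep each message as one cleaned string instead of a list of words, omit the final .str.split():

df['new'] = (df['text_new'].str.lower()
                           .str.replace(r'[^\w\s]+', '', regex=True)
                           .str.replace(emoji_pattern, '', regex=True)
                           .str.strip())
print(df)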
                                    text_new  \
0                               How are you?   
1  I am fine, we should meet this afternoon!   
2                     Okay let us do that. 😂   

                                       new  
0                              how are you  
1  i am fine we should meet this afternoon  
2                      okay let us do that
