Skip to content Skip to sidebar Skip to footer

Clean Wrong Header Inside Dataframe With Python/pandas

I've got a corrupt data frame with random header duplicates inside the data frame. How to ignore or delete these rows while loading the data frame? Since this random header is in t

Solution 1:

Throw in na_filter = False to typecast your columns into strings. Then locate all rows with bad data then filter them out your dataframe.

>>>df = pd.read_csv('sample.csv', header = 0, na_filter = False)>>>df
   col1  col2  col3
0     0     1     1
1     0     0     0
2     1     1     1
3  col1  col2  col3
4     0     1     1
5     0     0     0
6     1     1     1
>>>type(df.iloc[0,0])
<class 'str'>

Now that you parsed your data in each column as strings, locate all col1, col2, and col3 values in your df, create a new column if you find them each column using np.where() as such:

>>> df['Tag'] = np.where(((df['col1'] != '0') & (df['col1'] != '1')) & ((df['col2'] != '0') & (df['col2'] != '1')) & ((df['col3'] != '0') & (df['col3'] != '1')), ['Remove'], ['Don\'t remove'])
>>> df
   col1  col2  col3           Tag
0     0     1     1  Don't remove
1     0     0     0  Don't remove
2     1     1     1  Don't remove
3  col1  col2  col3        Remove
4     0     1     1  Don't remove
5     0     0     0  Don't remove
6     1     1     1  Don't remove

Now, filter out the one tagged as Removed in the Tag column using isin().

>>> df2 = df[~df['Tag'].isin(['Remove'])]
>>> df2
  col1 col2 col3           Tag
0011  Don't remove
1000  Don't remove
2111  Don't remove
4011  Don't remove
5000  Don't remove
6111  Don't remove

Drop the Tag column:

>>>df2 = df2[['col1', 'col2', 'col3']]>>>df2
  col1 col2 col3
0    0    1    1
1    0    0    0
2    1    1    1
4    0    1    1
5    0    0    0
6    1    1    1

Finally typecast your dataframe into int, if you need it to be an integer:

>>>df2 = df2.astype(int)>>>df2
   col1  col2  col3
0     0     1     1
1     0     0     0
2     1     1     1
4     0     1     1
5     0     0     0
6     1     1     1
>>>type(df2['col1'][0])
<class 'numpy.int32'>

Note: If you want standard index use:

>>>df2.reset_index(inplace = True, drop = True)>>>df2
   col1  col2  col3
0     0     1     1
1     0     0     0
2     1     1     1
3     0     1     1
4     0     0     0
5     1     1     1

Solution 2:

You just need to do the following: Assuming df_raw is your original dataframe with the column headers present both as the column names and repeating in several other rows, your corrected dataframe is df.

# Filter out only the rows without the headers in them.headers = df_raw.columns.tolist()
df = df_raw[df_raw[headers[0]]!=headers[0]].reset_index(drop=True)

Assumption: - We assume that the appearance of the first column header means that row has to be dropped.

In Detail Now a detailed code block for anyone to - create the data, - write it into a csv file, - load it as a dataframe, and then - remove rows that are headers.

import numpy as np
import pandas as pd

# make a csv file to load as dataframe
data = '''col1, col2, col3
0, 1, 1
0, 0, 0
1, 1, 1
col1, col2, col3
0, 1, 1
0, 0, 0
1, 1, 1'''# Write the data to a csv filewithopen('data.csv', 'w') as f:
    f.write(data)
# Load your data with header=None
df_raw = pd.read_csv('data.csv', header=None)

# Declare which row to find the header data: #    assuming the top one, we set this to zero.
header_row_number = 0# Read in columns headers
headers = df_raw.iloc[header_row_number].tolist()

# Set new column headers
df_raw.columns = headers

# Filter out only the rows without the headers in them# We assume that the appearance of the # first column header means that row has to be dropped# And reset index (and drop the old index column)
df = df_raw[df_raw[headers[0]]!=headers[0]].reset_index(drop=True)
df

Solution Output

Post a Comment for "Clean Wrong Header Inside Dataframe With Python/pandas"