
Pandas. Picking A Column Name Based On Row Data

In my previous question, I was trying to count blanks and build a dataframe with new columns for the subsequent analysis. The question became too exhaustive and I decided to split

Solution 1:

Here is a function that may be helpful, IIUC.

import pandas as pd

# create test data
t = pd.DataFrame({'x': [10, 20] + [None] * 3 + [30, 40, 50, 60] + [None] * 5 + [70]})

Create a function to find start location, end location, and size of each 'group', where a group is a sequence of repeated values (e.g., NaNs):

def extract_nans(df, field):
    df = df.copy()

    # identify NaNs
    df['is_na'] = df[field].isna()

    # identify groups (sequence of identical values is a group):  X Y X => 3 groups
    # fill_value=False keeps the shifted column boolean, so ^ is a plain boolean XOR
    df['group_id'] = (df['is_na'] ^ df['is_na'].shift(1, fill_value=False)).cumsum()

    # how many members in this group?
    df['group_size'] = df.groupby('group_id')['group_id'].transform('size')

    # initial, final index of each group
    df['min_index'] = df.reset_index().groupby('group_id')['index'].transform('min')
    df['max_index'] = df.reset_index().groupby('group_id')['index'].transform('max')

    return df

Results:

summary = extract_nans(t, 'x')
print(summary)

       x  is_na  group_id  group_size  min_index  max_index
0   10.0  False         0           2          0          1
1   20.0  False         0           2          0          1
2    NaN   True         1           3          2          4
3    NaN   True         1           3          2          4
4    NaN   True         1           3          2          4
5   30.0  False         2           4          5          8
6   40.0  False         2           4          5          8
7   50.0  False         2           4          5          8
8   60.0  False         2           4          5          8
9    NaN   True         3           5          9         13
10   NaN   True         3           5          9         13
11   NaN   True         3           5          9         13
12   NaN   True         3           5          9         13
13   NaN   True         3           5          9         13
14  70.0  False         4           1         14         14

Now you can drop 'x' from the summary, keep only the NaN rows (is_na == True), and drop duplicates, so that each remaining row summarizes one NaN run: the first row describes the first run, the second row the second run, etc. You can also filter on group_size to keep only runs above a certain length (e.g., at least 3 consecutive NaN values).
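
The length filter is not shown in this answer, so here is a minimal sketch of one way it might look. Note that it is a variant, not the function above: it aggregates once per run with groupby instead of annotating every row and dropping duplicates, and its group ids start at 1 rather than 0. The names nan_runs and min_length are made up for illustration.

```python
import pandas as pd

# same test data as in the answer
t = pd.DataFrame({'x': [10, 20] + [None] * 3 + [30, 40, 50, 60] + [None] * 5 + [70]})


def nan_runs(df, field, min_length=1):
    """One row per run of consecutive NaNs in df[field] (hypothetical helper)."""
    is_na = df[field].isna()

    # a new run starts wherever is_na differs from the previous row
    group_id = is_na.ne(is_na.shift()).cumsum()

    # the index labels as a Series, so they can be aggregated per run
    pos = pd.Series(df.index, index=df.index)

    runs = pd.DataFrame({
        'is_na': is_na.groupby(group_id).first(),
        'group_size': pos.groupby(group_id).size(),
        'min_index': pos.groupby(group_id).min(),
        'max_index': pos.groupby(group_id).max(),
    })

    # keep only the NaN runs that are long enough
    keep = runs['is_na'] & (runs['group_size'] >= min_length)
    return runs.loc[keep, ['group_size', 'min_index', 'max_index']].reset_index(drop=True)


long_runs = nan_runs(t, 'x', min_length=4)
print(long_runs)
```

With min_length=4 only the second NaN run (index 9 through 13) should remain; with the default min_length=1 both runs are returned.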

Finally, you can use this with apply() to process the whole data frame, if this is what you need.
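
It is not entirely clear what "process the whole data frame" should mean here, so the following is only a sketch under one assumption: that you want a per-column result, for example the length of the longest NaN run in each column. The helper longest_nan_run and the column names 'a', 'b', 'c' are hypothetical and not part of the original answer.

```python
import pandas as pd


def longest_nan_run(col):
    """Length of the longest run of consecutive NaNs in a Series (hypothetical helper)."""
    is_na = col.isna()
    if not is_na.any():
        return 0

    # same run-labelling idea as above: a new run starts where is_na changes
    group_id = is_na.ne(is_na.shift()).cumsum()

    # summing the boolean flags per run gives the run length for NaN runs, 0 otherwise
    return int(is_na.groupby(group_id).sum().max())


df = pd.DataFrame({
    'a': [1, None, None, 4, None],
    'b': [None, None, None, 4, 5],
    'c': [1, 2, 3, 4, 5],
})

# DataFrame.apply calls the function once per column by default (axis=0)
result = df.apply(longest_nan_run)
print(result)
```

The result is a Series indexed by column name, so something like result.idxmax() could then pick the column with the longest gap, if that is what you are after.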

Short version of results, for the test data frame:

print(summary[summary['is_na']].drop(columns='x').drop_duplicates())
   is_na  group_id  group_size  min_index  max_index
2   True         1           3          2          4
9   True         3           5          9         13
