Pandas. Picking A Column Name Based On Row Data
In my previous question, I was trying to count blanks and build a dataframe with new columns for the subsequent analysis. The question became too exhaustive, and I decided to split
Solution 1:
Here is a function that may be helpful, IIUC.
import pandas as pd
# create test data
t = pd.DataFrame({'x': [10, 20] + [None] * 3 + [30, 40, 50, 60] + [None] * 5 + [70]})
Create a function to find start location, end location, and size of each 'group', where a group is a sequence of repeated values (e.g., NaNs):
def extract_nans(df, field):
    df = df.copy()
    # identify NaNs
    df['is_na'] = df[field].isna()
    # identify groups (sequence of identical values is a group): X Y X => 3 groups
    # fill_value=False keeps the shifted column boolean, so XOR is well defined on the first row
    df['group_id'] = (df['is_na'] ^ df['is_na'].shift(1, fill_value=False)).cumsum()
    # how many members in this group?
    df['group_size'] = df.groupby('group_id')['group_id'].transform('size')
    # initial, final index of each group (assumes the default RangeIndex)
    df['min_index'] = df.reset_index().groupby('group_id')['index'].transform('min')
    df['max_index'] = df.reset_index().groupby('group_id')['index'].transform('max')
    return df
Results:
summary = extract_nans(t, 'x')
print(summary)
       x  is_na  group_id  group_size  min_index  max_index
0   10.0  False         0           2          0          1
1   20.0  False         0           2          0          1
2    NaN   True         1           3          2          4
3    NaN   True         1           3          2          4
4    NaN   True         1           3          2          4
5   30.0  False         2           4          5          8
6   40.0  False         2           4          5          8
7   50.0  False         2           4          5          8
8   60.0  False         2           4          5          8
9    NaN   True         3           5          9         13
10   NaN   True         3           5          9         13
11   NaN   True         3           5          9         13
12   NaN   True         3           5          9         13
13   NaN   True         3           5          9         13
14  70.0  False         4           1         14         14
Now, you can drop 'x' from the summary, filter to keep only NaN rows (is_na == True), filter to keep only sequences above a certain length (e.g., at least 3 consecutive NaN values), etc. Once you drop duplicates, the first row summarizes the first NaN run, the second row summarizes the second NaN run, and so on.
Finally, you can use this with apply() to process the whole data frame, if this is what you need.
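As a rough sketch of processing a whole data frame, the same run-detection idea can be wrapped in a per-column helper. The helper name nan_runs and the two-column test frame are hypothetical, and this version stacks the per-column results with pd.concat rather than apply(). It also records positions (0..n-1) instead of index labels, which is an assumption made here so the result does not depend on the frame's index.

```python
import pandas as pd


def nan_runs(series):
    """One row per run of consecutive NaNs in `series` (hypothetical helper)."""
    is_na = series.isna()
    # A new run starts wherever is_na differs from the previous row.
    run_id = is_na.ne(is_na.shift()).cumsum()
    # Positions 0..n-1, carried on the original index so the masks align.
    pos = pd.Series(range(len(series)), index=series.index)
    # Keep NaN rows only, then summarize each run.
    return (
        pos[is_na]
        .groupby(run_id[is_na])
        .agg(group_size="count", min_index="min", max_index="max")
        .reset_index(drop=True)
    )


# Made-up test frame with NaN runs in different places per column.
df = pd.DataFrame({
    "a": [1.0, None, None, 4.0, None],
    "b": [None, 2.0, 3.0, None, None],
})

# Summarize every column and stack the results under a 'column' key.
all_runs = pd.concat(
    {col: nan_runs(df[col]) for col in df.columns},
    names=["column", "run"],
)
print(all_runs)
```

Each column contributes one row per NaN run, so all_runs.loc["a"] gives the runs found in column 'a' only.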
Short version of results, for the test data frame:
print(summary[summary['is_na']].drop(columns='x').drop_duplicates())
   is_na  group_size  min_index  max_index
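The length filter mentioned earlier (keep only NaN runs of at least some minimum size) could look roughly like this. It is a compact, standalone sketch on the same test data; the threshold min_run is a hypothetical parameter, and the run ids are computed with ne/shift instead of XOR, so they start at 1 rather than 0.

```python
import pandas as pd

# Same test data as above.
t = pd.DataFrame({'x': [10, 20] + [None] * 3 + [30, 40, 50, 60] + [None] * 5 + [70]})

is_na = t['x'].isna()
# Consecutive identical is_na values share one run id.
group_id = is_na.ne(is_na.shift()).cumsum()
# Size of the run each row belongs to.
group_size = group_id.groupby(group_id).transform('size')

min_run = 4  # hypothetical threshold: keep NaN runs of at least 4 rows
long_gap = is_na & (group_size >= min_run)
print(t[long_gap])
```

With min_run = 4, only the five-row NaN run is kept; lowering it to 3 would also keep the three-row run.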