
Sort Pandas Dataframe By String Column That Represents (mostly) Numbers?

I have data similar to this. data = [ dict(name = 'test1', index = '1' , status='fail'), dict(name = 'test3', index = '3', status='pass'), dict(name = 'test1', index = '11', status

Solution 1:

This will sort by the name and a temporary column (__ix) that is the first integer found (consecutive digits) in each 'index' string:

Update: You can also use:

df = (
    df
    .assign(
        # expand=False returns a Series (not a one-column DataFrame)
        __ix=df['index'].str.extract(r'([0-9]+)', expand=False).astype(int)
    )
    .sort_values(['name', '__ix'])
    .drop('__ix', axis=1)  # optional: remove the tmp column
    .reset_index(drop=True)  # optional: renumber rows; without it the index stays scrambled
)
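To try the update end to end, here is a minimal sketch. The rows below are made up for illustration (the data in the question is truncated), and the helper name sort_by_first_int is hypothetical, not part of pandas:

```python
import pandas as pd

# Hypothetical rows, loosely modelled on the question; not the asker's data.
df = pd.DataFrame([
    dict(name='test1', index='1', status='fail'),
    dict(name='test3', index='3', status='pass'),
    dict(name='test1', index='11', status='pass'),
    dict(name='test3', index='20', status='fail'),
    dict(name='test1', index='2', status='fail'),
    dict(name='test3', index='5:1:50', status='pass'),
])

def sort_by_first_int(frame, col='index', by='name'):
    """Sort by `by`, then by the first run of digits found in `col`."""
    # expand=False gives a Series for the single capture group.
    key = frame[col].str.extract(r'([0-9]+)', expand=False).astype(int)
    return (
        frame
        .assign(__ix=key)
        .sort_values([by, '__ix'])
        .drop(columns='__ix')
        .reset_index(drop=True)
    )

result = sort_by_first_int(df)
print(result)
```

Note that astype(int) will raise if any string contains no digits at all, since str.extract yields NaN for those rows.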

Original:

import re

df = (
    df
    .assign(
        __ix=df['index']
        # group(1) is the captured digits; group(0) would include any leading non-digits
        .apply(lambda s: int(re.match(r'\D*(\d+)', s).group(1)))
    )
    .sort_values(['name', '__ix'])
    .drop('__ix', axis=1)
    .reset_index(drop=True)
)
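Since the column is only "mostly" numbers, the lambda above fails on a string with no digits (re.match returns None). A hedged sketch of a more forgiving key function follows; the name first_int and the default of -1 are assumptions chosen for illustration:

```python
import re

def first_int(s, default=-1):
    """Return the first run of digits in `s` as an int, or `default` if none."""
    m = re.search(r'\d+', s)
    return int(m.group()) if m else default

print(first_int('5:1:50'))
print(first_int('abc12x'))
print(first_int('n/a'))
```

It could be used in place of the lambda, for example __ix=df['index'].apply(first_int), in which case rows without digits sort first within each name.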

On your data (thanks for providing an easy reproducible example), first check what that __ix column is:

df['index'].apply(lambda s: int(re.match(r'\D*(\d+)', s).group(1)))
# out:
# 0     1
# 1     3
# 2    11
# 3     1
# 4    20
# 5     2
# 6     5

After sorting, your df becomes:

    name   index status
0  test1       1   fail
1  test1  121456   fail
2  test1       2   fail
3  test1      11   pass
4  test3       3   pass
5  test3  5:1:50   pass
6  test3      20   fail

Solution 2:

One possibility is to sort on two helper columns: the length of each index string, then its first character.

df['sort'] = df['index'].str.len()
df['sort2'] = df['index'].str[0]
df1 = df.sort_values(by=['name','sort','sort2'])
df1 = df1.drop(columns = ['sort','sort2'])
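As a rough check of how this behaves, the sketch below runs the same steps on a few made-up rows (assumed values, not the asker's data). It suggests the length-based key orders plain integers as expected but places a mixed string such as '5:1:50' after '20', unlike Solution 1:

```python
import pandas as pd

# Hypothetical rows for illustration only.
df = pd.DataFrame({
    'name': ['test3', 'test3', 'test3'],
    'index': ['20', '3', '5:1:50'],
})

df['sort'] = df['index'].str.len()   # shorter strings first
df['sort2'] = df['index'].str[0]     # then by leading character
df1 = df.sort_values(by=['name', 'sort', 'sort2'])
df1 = df1.drop(columns=['sort', 'sort2'])

print(df1)
```

Two strings with the same length and the same first character (say '20' and '21') tie under this key, so it is best treated as an approximation of numeric order.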
