Skip to content Skip to sidebar Skip to footer

In Pandas, After Groupby The Grouped Column Is Gone

I have the following dataframe named ttm: usersidid clienthostid eventSumTotal LoginDaysSum score 0 12 1 60 3 1728

Solution 1:

For return DataFrame after groupby are 2 possible solutions:

  1. parameter as_index=False what works nice with count, sum, mean functions

  2. reset_index for create new column from levels of index, more general solution

df = ttm.groupby(['clienthostid'], as_index=False, sort=False)['LoginDaysSum'].count()
print (df)
   clienthostid  LoginDaysSum
0             1             4
1             3             2
df = ttm.groupby(['clienthostid'], sort=False)['LoginDaysSum'].count().reset_index()
print (df)
   clienthostid  LoginDaysSum
0             1             4
1             3             2

For second need remove as_index=False and instead add reset_index:

#output is `Series`
a = ttm.groupby(['clienthostid'], sort=False)['LoginDaysSum'] \
         .apply(lambda x: x.iloc[0] / x.iloc[1])
print (a)
clienthostid
11.031.5
Name: LoginDaysSum, dtype: float64

print (type(a))
<class'pandas.core.series.Series'>

print (a.index)
Int64Index([1, 3], dtype='int64', name='clienthostid')


df1 = ttm.groupby(['clienthostid'], sort=False)['LoginDaysSum']
         .apply(lambda x: x.iloc[0] / x.iloc[1]).reset_index(name='ratio')
print (df1)
   clienthostid  ratio
011.0131.5

Why some columns are gone?

I think there can be problem automatic exclusion of nuisance columns:

#convert column to str
ttm.usersidid = ttm.usersidid.astype(str) + 'aa'print (ttm)
  usersidid  clienthostid  eventSumTotal  LoginDaysSum  score
0      12aa             160317281      11aa             1240313313       5aa             1531254       6aa             11622162      10aa             3270310005       8aa             3182512#removed str column userid
a = ttm.groupby(['clienthostid'], sort=False).sum()
print (a)
              eventSumTotal  LoginDaysSum  score
clienthostid                                    
1321113400328851512

What is the difference between size and count in pandas?

Solution 2:

count is a built in method for the groupby object and pandas knows what to do with it. There are two other things specified that goes into determining what the out put looks like.

#                         For a built in method, when#                         you don't want the group column#                         as the index, pandas keeps it in#                         as a column.#                             |----||||----|
ttm.groupby(['clienthostid'], as_index=False, sort=False)['LoginDaysSum'].count()

   clienthostid  LoginDaysSum
014132

#                         For a built in method, when#                         you do want the group column#                         as the index, then...#                             |----||||---|
ttm.groupby(['clienthostid'], as_index=True, sort=False)['LoginDaysSum'].count()
#                                                       |-----||||-----|#                                                 the single brackets tells#                                                 pandas to operate on a series#                                                 in this case, count the series

clienthostid
1432
Name: LoginDaysSum, dtype: int64

ttm.groupby(['clienthostid'], as_index=True, sort=False)[['LoginDaysSum']].count()
#                                                       |------||||------|#                                             the double brackets tells pandas#                                                to operate on the dataframe#                                              specified by these columns and will#                                                return a dataframe

              LoginDaysSum
clienthostid              
1432

When you used apply pandas no longer knows what to do with the group column when you say as_index=False. It has to trust that if you use apply you want returned exactly what you say to return, so it will just throw it away. Also, you have single brackets around your column which says to operate on a series. Instead, use as_index=True to keep the grouping column information in the index. Then follow it up with a reset_index to transfer it from the index back into the dataframe. At this point, it will not have mattered that you used single brackets because after the reset_index you'll have a dataframe again.

ttm.groupby(['clienthostid'], as_index=True, sort=False)['LoginDaysSum'].apply(lambda x: x.iloc[0] / x.iloc[1])

01.011.5
dtype: float64

ttm.groupby(['clienthostid'], as_index=True, sort=False)['LoginDaysSum'].apply(lambda x: x.iloc[0] / x.iloc[1]).reset_index()

   clienthostid  LoginDaysSum
011.0131.5

Solution 3:

Reading the groupy documentarion, a found out that automatic exclusion of columns after groupby usually caused by the presence of null values in that columns excluded.

Try fill the 'null' with some value.

Like this:

df.fillna('')

Solution 4:

You simply need this instead:

ttm.groupby(['clienthostid'], as_index=False, sort=False)[['LoginDaysSum']].apply(lambda x: x.iloc[0] / x.iloc[1]).reset_index()

The double [[]] will turn the output into a pd.Dataframe instead of a pd.Series.

Post a Comment for "In Pandas, After Groupby The Grouped Column Is Gone"