In Pandas, After Groupby The Grouped Column Is Gone
Solution 1:
For return DataFrame
after groupby
are 2 possible solutions:
parameter
as_index=False
what works nice withcount
,sum
,mean
functionsreset_index
for create new column from levels ofindex
, more general solution
df = ttm.groupby(['clienthostid'], as_index=False, sort=False)['LoginDaysSum'].count()
print (df)
clienthostid LoginDaysSum
0 1 4
1 3 2
df = ttm.groupby(['clienthostid'], sort=False)['LoginDaysSum'].count().reset_index()
print (df)
clienthostid LoginDaysSum
0 1 4
1 3 2
For second need remove as_index=False
and instead add reset_index
:
#output is `Series`
a = ttm.groupby(['clienthostid'], sort=False)['LoginDaysSum'] \
.apply(lambda x: x.iloc[0] / x.iloc[1])
print (a)
clienthostid
11.031.5
Name: LoginDaysSum, dtype: float64
print (type(a))
<class'pandas.core.series.Series'>
print (a.index)
Int64Index([1, 3], dtype='int64', name='clienthostid')
df1 = ttm.groupby(['clienthostid'], sort=False)['LoginDaysSum']
.apply(lambda x: x.iloc[0] / x.iloc[1]).reset_index(name='ratio')
print (df1)
clienthostid ratio
011.0131.5
Why some columns are gone?
I think there can be problem automatic exclusion of nuisance columns:
#convert column to str
ttm.usersidid = ttm.usersidid.astype(str) + 'aa'print (ttm)
usersidid clienthostid eventSumTotal LoginDaysSum score
0 12aa 160317281 11aa 1240313313 5aa 1531254 6aa 11622162 10aa 3270310005 8aa 3182512#removed str column userid
a = ttm.groupby(['clienthostid'], sort=False).sum()
print (a)
eventSumTotal LoginDaysSum score
clienthostid
1321113400328851512
Solution 2:
count
is a built in method for the groupby
object and pandas knows what to do with it. There are two other things specified that goes into determining what the out put looks like.
# For a built in method, when# you don't want the group column# as the index, pandas keeps it in# as a column.# |----||||----|
ttm.groupby(['clienthostid'], as_index=False, sort=False)['LoginDaysSum'].count()
clienthostid LoginDaysSum
014132
# For a built in method, when# you do want the group column# as the index, then...# |----||||---|
ttm.groupby(['clienthostid'], as_index=True, sort=False)['LoginDaysSum'].count()
# |-----||||-----|# the single brackets tells# pandas to operate on a series# in this case, count the series
clienthostid
1432
Name: LoginDaysSum, dtype: int64
ttm.groupby(['clienthostid'], as_index=True, sort=False)[['LoginDaysSum']].count()
# |------||||------|# the double brackets tells pandas# to operate on the dataframe# specified by these columns and will# return a dataframe
LoginDaysSum
clienthostid
1432
When you used apply
pandas no longer knows what to do with the group column when you say as_index=False
. It has to trust that if you use apply
you want returned exactly what you say to return, so it will just throw it away. Also, you have single brackets around your column which says to operate on a series. Instead, use as_index=True
to keep the grouping column information in the index. Then follow it up with a reset_index
to transfer it from the index back into the dataframe. At this point, it will not have mattered that you used single brackets because after the reset_index
you'll have a dataframe again.
ttm.groupby(['clienthostid'], as_index=True, sort=False)['LoginDaysSum'].apply(lambda x: x.iloc[0] / x.iloc[1])
01.011.5
dtype: float64
ttm.groupby(['clienthostid'], as_index=True, sort=False)['LoginDaysSum'].apply(lambda x: x.iloc[0] / x.iloc[1]).reset_index()
clienthostid LoginDaysSum
011.0131.5
Solution 3:
Reading the groupy documentarion, a found out that automatic exclusion of columns after groupby usually caused by the presence of null values in that columns excluded.
Try fill the 'null' with some value.
Like this:
df.fillna('')
Solution 4:
You simply need this instead:
ttm.groupby(['clienthostid'], as_index=False, sort=False)[['LoginDaysSum']].apply(lambda x: x.iloc[0] / x.iloc[1]).reset_index()
The double [[]]
will turn the output into a pd.Dataframe instead of a pd.Series.
Post a Comment for "In Pandas, After Groupby The Grouped Column Is Gone"