Aggregate By Repeated Datetime Index With Different Identifiers In A Column On A Pandas Dataframe

I have a data frame in this form:

               value  identifier
2007-01-01  0.781611          55
2007-01-01  0.766152          56
2007-01-01  0.766152          57
2007-02-01  0.705615          55

Solution 1:

For the groupby to return a DataFrame instead of a Series, use double subscription [[]]:

by_date = df.groupby(df.index.date)[['value']].mean()

This then allows you to group by month and generate a boxplot:

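# by_date.index holds datetime.date objects, so convert it back to datetimes to read .month
by_month = by_date.groupby(pd.to_datetime(by_date.index).month)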
by_month.boxplot(subplots=False)
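The two steps above can be sketched end to end as follows. This is only an illustration: the sample frame is built from the four rows visible in the question, and the names months and monthly_groups are introduced here and are not part of the original answer. The plotting call is left out so the sketch runs without matplotlib.

```python
import pandas as pd

# Hypothetical sample data: only the four rows visible in the question.
df = pd.DataFrame(
    {
        "value": [0.781611, 0.766152, 0.766152, 0.705615],
        "identifier": [55, 56, 57, 55],
    },
    index=pd.to_datetime(
        ["2007-01-01", "2007-01-01", "2007-01-01", "2007-02-01"]
    ),
)

# One mean per calendar date, kept as a one-column DataFrame.
by_date = df.groupby(df.index.date)[["value"]].mean()

# The grouped index holds datetime.date objects, so convert it back to
# datetimes before reading the month number (1 = January, 2 = February).
months = pd.to_datetime(by_date.index).month
by_month = by_date.groupby(months)

# Collect the daily means that fall into each month.
monthly_groups = {int(month): group["value"].tolist() for month, group in by_month}
print(monthly_groups)
```

Calling by_month.boxplot(subplots=False) on the same object would then draw one box per month, assuming matplotlib is installed.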

Double subscription is a subtle feature that is not immediately obvious. Generally, df[col] returns a single column as a Series, whereas passing a list of columns, df[col_list], returns a DataFrame; expanded, that is the same as df[[col_a, col_b]]. It follows that df[[col_a]] also returns a DataFrame, because we have passed a list with a single element. This is not the same as df[col_a], where we pass a label to perform column indexing.
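A minimal sketch of the difference, using a small made-up frame (the variable names are illustrative only):

```python
import pandas as pd

# Hypothetical two-row frame, just to compare the two selection styles.
df = pd.DataFrame({"value": [0.781611, 0.766152], "identifier": [55, 56]})

as_series = df["value"]    # a single label selects one column as a Series
as_frame = df[["value"]]   # a one-element list selects a one-column DataFrame

print(type(as_series).__name__)
print(type(as_frame).__name__)

# The same rule applies to the object returned by groupby.
grouped_series = df.groupby("identifier")["value"].mean()
grouped_frame = df.groupby("identifier")[["value"]].mean()

print(type(grouped_series).__name__)
print(type(grouped_frame).__name__)
```

The DataFrame form is what keeps the column name attached, which matters when the result is grouped or plotted again.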

Solution 2:

When you did the groupby on date, you converted the index from a Timestamp to a datetime.date.

>>> type(df.index[0])
pandas.tslib.Timestamp

>>> type(by_date.index[0])
datetime.date

If you convert the index to Periods, you can groupby easily.

>>> df.index = pd.DatetimeIndex(df.index).to_period('M')
>>> df.groupby(df.index).value.sum()
2007-01-01    2.313915
2007-02-01    0.769883
2008-01-01    2.012760
2008-02-01    0.294140
Name: value, dtype: float64
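For reference, a self-contained sketch of the period approach. It assumes the conversion is applied to a copy of df with its own index converted, and it uses only the four rows visible in the question, so the February total differs from the output above, which was computed on the full data. The names monthly and totals are illustrative.

```python
import datetime

import pandas as pd

# Hypothetical sample data: only the four rows visible in the question.
df = pd.DataFrame(
    {
        "value": [0.781611, 0.766152, 0.766152, 0.705615],
        "identifier": [55, 56, 57, 55],
    },
    index=pd.to_datetime(
        ["2007-01-01", "2007-01-01", "2007-01-01", "2007-02-01"]
    ),
)

# Grouping on df.index.date turns the index labels into datetime.date objects.
by_date = df.groupby(df.index.date)[["value"]].mean()
print(type(df.index[0]).__name__, type(by_date.index[0]).__name__)

# Work on a copy so the original DatetimeIndex is left untouched.
monthly = df.copy()
monthly.index = monthly.index.to_period("M")

# Every row in the same calendar month now shares one Period label.
totals = monthly.groupby(level=0)["value"].sum()
print(totals)
```

Grouping on level=0 is equivalent to grouping on monthly.index here; it just avoids repeating the index expression.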
