Pandas Dataframe: Return Row And Column Of Maximum Value(s)
I have a dataframe in which all values are of the same variety (e.g. a correlation matrix -- but where we expect a unique maximum). I'd like to return the row and the column of the
Solution 1:
I figured out the first part:
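import numpy as np
npa = df.to_numpy()  # as_matrix() was removed in pandas 1.0
rows, cols = np.where(npa == np.amax(npa))  # np.where returns row positions first
([df.columns[c] for c in cols], [df.index[r] for r in rows])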
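For the single-maximum case, a pandas-only route may also work. The following is a minimal sketch, assuming the 3x3 example frame used elsewhere in this thread and assuming the maximum is unique (with ties, idxmax reports only the first one):

```python
import pandas as pd

# Example frame, assumed here for illustration.
df = pd.DataFrame([[1, 2, 3], [4, 9, 5], [6, 7, 8]],
                  columns=list('abc'), index=list('def'))

# stack() turns the frame into a Series indexed by (row label, column label);
# idxmax() then returns the label pair of the first maximum.
row_label, col_label = df.stack().idxmax()
print(row_label, col_label)  # e b
```

This returns labels directly, so no conversion from integer positions is needed.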
Now I need a way to get the top n. One naive idea is to copy the array and iteratively replace the top values with NaN, grabbing the index as you go. That seems inefficient. Is there a better way to get the top n values of a numpy array? Fortunately there is, through np.argpartition, but we have to use flattened indexing.
def topn(df, n):
    npa = df.to_numpy()
    topn_ind = np.argpartition(npa, -n, axis=None)[-n:]  # flattened indices, unsorted
    topn_ind = topn_ind[np.argsort(npa.flat[topn_ind])][::-1]  # sort in descending order of value
    # unflatten in row-major order, matching how argpartition flattened the array
    # (column-major unraveling only gives the right answer for square frames)
    rows, cols = np.unravel_index(topn_ind, npa.shape)
    return ([df.columns[c] for c in cols], [df.index[r] for r in rows])
Trying this on the example:
>>> import pandas as pd
>>> df = pd.DataFrame([[1,2,3],[4,9,5],[6,7,8]],columns=list('abc'),index=list('def'))
>>> topn(df,3)
(['b', 'c', 'b'], ['e', 'f', 'f'])
As desired. Mind you, the sorting was not originally asked for, but it adds little overhead if n is not large.
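The example frame is square, which can hide a mix-up between row and column positions. Here is a sketch of the same flattened-indexing idea on a 2x3 frame; `topn_labels` and `wide` are hypothetical names introduced only for this illustration:

```python
import numpy as np
import pandas as pd

def topn_labels(df, n):
    """Hypothetical helper: (row label, column label, value) of the n largest cells."""
    arr = df.to_numpy()
    # argpartition over the flattened (row-major) array: the last n entries
    # are the positions of the n largest values, in no particular order.
    flat = np.argpartition(arr, -n, axis=None)[-n:]
    # Order those n positions by value, largest first.
    flat = flat[np.argsort(arr.ravel()[flat])[::-1]]
    # Convert flat positions back to (row, column) positions.
    rows, cols = np.unravel_index(flat, arr.shape)
    return [(df.index[r], df.columns[c], arr[r, c]) for r, c in zip(rows, cols)]

# A 2x3 frame, so that rows and columns cannot be confused silently.
wide = pd.DataFrame([[1, 9, 3], [7, 2, 8]], columns=list('abc'), index=list('de'))
result = topn_labels(wide, 2)
print(result)
```

Here the two largest cells are 9 at (d, b) and 8 at (e, c). Returning the value next to the labels makes it easy to check that the labels really point at the right cells.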
Solution 2:
What you want to use is stack:
import pandas as pd
df = pd.DataFrame([[1,2,3],[4,9,5],[6,7,8]],columns=list('abc'),index=list('def'))
df = df.stack()
df = df.sort_values(ascending=False)  # replaces the removed in-place Series.sort()
df.head(4)
e  b    9
f  c    8
   b    7
   a    6
dtype: int64
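On recent pandas versions the sort-then-head steps can probably be collapsed into one call. A minimal sketch, assuming the same example frame and using Series.nlargest:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 9, 5], [6, 7, 8]],
                  columns=list('abc'), index=list('def'))

# Series.nlargest returns the n largest values, sorted in descending order,
# keeping the (row label, column label) MultiIndex produced by stack().
top4 = df.stack().nlargest(4)
print(top4)
print(list(top4.index))
```

The index of the result holds the (row, column) label pairs, so `list(top4.index)` gives the locations without any further lookup.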
Solution 3:
I guess for what you are trying to do a DataFrame might not be the best choice, since the idea of the columns in the DataFrame is to hold independent data.
>>> def topn(df, n):
...     # pull the data out of the DataFrame
...     # and flatten it to an array (column-major order)
...     vals = df.values.flatten(order='F')
...     # next we sort the array and store the sort mask
...     p = np.argsort(vals)
...     # create two arrays with the column names and indexes
...     # in the same order as vals
...     cols = np.array([[col]*len(df.index) for col in df.columns]).flatten()
...     # repeat the index once per column (not once per row), so non-square frames work
...     idxs = np.array([list(df.index) for _ in df.columns]).flatten()
...     # sort and return cols and idxs, largest value first
...     return cols[p][:-(n+1):-1], idxs[p][:-(n+1):-1]
>>> topn(df,3)
(array(['b', 'c', 'b'],
dtype='|S1'),
array(['e', 'f', 'f'],
dtype='|S1'))
>>> %timeit(topn(df,3))
10000 loops, best of 3: 29.9 µs per loop
watsonic's solution takes slightly less time:
%timeit(topn(df,3))
10000 loops, best of 3: 24.6 µs per loop
but both are way faster than stack:
def topStack(df, n):
    df = df.stack()
    df = df.sort_values(ascending=False)  # replaces the removed in-place Series.sort()
    return df.head(n)
%timeit(topStack(df,3))
1000 loops, best of 3: 1.91 ms per loop
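The timings above come from IPython's %timeit on a 3x3 frame, so they mostly measure fixed overhead. To repeat the comparison outside IPython, something like the following sketch with the standard-library timeit module might do; `top_numpy` and `top_stack` are hypothetical names, and the numbers will differ by machine and library version:

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 9, 5], [6, 7, 8]],
                  columns=list('abc'), index=list('def'))

def top_numpy(df, n):
    """Hypothetical name: argpartition-based top-n, returning (row, col) label pairs."""
    arr = df.to_numpy()
    flat = np.argpartition(arr, -n, axis=None)[-n:]
    flat = flat[np.argsort(arr.ravel()[flat])[::-1]]
    rows, cols = np.unravel_index(flat, arr.shape)
    return [(df.index[r], df.columns[c]) for r, c in zip(rows, cols)]

def top_stack(df, n):
    """Hypothetical name: stack-based top-n, returning (row, col) label pairs."""
    return list(df.stack().sort_values(ascending=False).head(n).index)

# Both approaches should agree before their speed is compared.
assert top_numpy(df, 3) == top_stack(df, 3)

for fn in (top_numpy, top_stack):
    # Total seconds for 200 calls, reported as microseconds per call.
    seconds = timeit.timeit(lambda: fn(df, 3), number=200)
    print(f"{fn.__name__}: {seconds / 200 * 1e6:.1f} µs per call")
```

For a fairer picture it would be worth trying a much larger frame as well, since the relative cost of the approaches may change with size.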