
Get Cumulative Count Per 2d Array

I have general data, e.g. strings:

    np.random.seed(343)
    arr = np.sort(np.random.randint(5, size=(10, 10)), axis=1).astype(str)
    print (arr)
    [['0' '1' '1' '2' '2' '3' '3' '4' '4' '4'

Solution 1:

General Idea

Consider the generic case where we perform this cumulative counting or if you think of them as ranges, we could call them - Grouped ranges.

Now, the idea starts off simple - Compare one-off slices along the respective axis to look for inequalities. Pad with True at the start of each row/col (depending on axis of counting).

Then, it gets complicated - Setup an ID array with the intention that a final cumsum over it, in flattened order, would be the desired output. So, the setup starts off by initializing a 1s array with the same shape as the input array. At each group start in the input, offset the ID array with the previous group lengths. Follow the code (should give more insight) on how we would do it for each row -

import numpy as np

def grp_range_2drow(a, start=0):
    # Get grouped ranges along each row, resetting at places where
    # consecutive elements differ
    # Input(s) : a is a 2D input array

    # Store shape info
    m,n = a.shape

    # Compare one-off slices for each row and pad with True's at starts
    # Those True's indicate start of each group
    p = np.ones((m,1),dtype=bool)
    a1 = np.concatenate((p, a[:,:-1] != a[:,1:]),axis=1)

    # Get indices of group starts in flattened version
    d = np.flatnonzero(a1)

    # Setup ID array to be cumsumed finally for desired o/p
    # Assign into starts with previous group lengths.
    # Thus, when cumsumed on the flattened version, it would give us the
    # flattened desired output. Finally reshape back to 2D
    c = np.ones(m*n,dtype=int)
    c[d[1:]] = d[:-1]-d[1:]+1
    c[0] = start
    return c.cumsum().reshape(m,n)
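To see the ID-array trick in isolation, here is a minimal 1D sketch of the same idea (the helper name `grp_range_1d` is just for illustration, not part of the answer above):

```python
import numpy as np

def grp_range_1d(a, start=0):
    # Mark group starts: True at index 0 and wherever a value
    # differs from its predecessor
    a1 = np.concatenate(([True], a[:-1] != a[1:]))
    d = np.flatnonzero(a1)            # indices of group starts
    c = np.ones(len(a), dtype=int)    # default step of +1
    c[d[1:]] = d[:-1] - d[1:] + 1     # offset each new group back to its start
    c[0] = start
    return c.cumsum()

a = np.array([3, 3, 3, 5, 5, 7])
print(grp_range_1d(a, start=1))   # [1 2 3 1 2 1]
```

Each group start is assigned a negative offset exactly cancelling the length of the previous group, so the cumulative sum restarts there.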

We would extend this to solve for the generic case of rows and columns. For the columns case, we would simply transpose, feed to the earlier row solution, and finally transpose back, like so -

def grp_range_2d(a, start=0, axis=1):
    # Get grouped ranges along specified axis with resetting at places where
    # consecutive elements differ
    # Input(s) : a is a 2D input array

    if axis not in [0,1]:
        raise Exception("Invalid axis")

    if axis==1:
        return grp_range_2drow(a, start=start)
    else:
        return grp_range_2drow(a.T, start=start).T

Sample run

Let's consider a sample run that finds grouped ranges along each column, with each group starting at 1 -

In [330]: np.random.seed(0)

In [331]: a = np.random.randint(1,3,(10,10))

In [333]: a
Out[333]: 
array([[1, 2, 2, 1, 2, 2, 2, 2, 2, 2],
       [2, 1, 1, 2, 1, 1, 1, 1, 1, 2],
       [1, 2, 2, 1, 1, 2, 2, 2, 2, 1],
       [2, 1, 2, 1, 2, 2, 1, 2, 2, 1],
       [1, 2, 1, 2, 2, 2, 2, 2, 1, 2],
       [1, 2, 2, 2, 2, 1, 2, 1, 1, 2],
       [2, 1, 2, 1, 2, 1, 1, 1, 1, 1],
       [2, 2, 1, 1, 1, 2, 2, 1, 2, 1],
       [1, 2, 1, 2, 2, 2, 2, 2, 2, 1],
       [2, 2, 1, 1, 2, 1, 1, 2, 2, 1]])

In [334]: grp_range_2d(a, start=1, axis=0)
Out[334]: 
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 2],
       [1, 1, 1, 1, 2, 1, 1, 1, 1, 1],
       [1, 1, 2, 2, 1, 2, 1, 2, 2, 2],
       [1, 1, 1, 1, 2, 3, 1, 3, 1, 1],
       [2, 2, 1, 2, 3, 1, 2, 1, 2, 2],
       [1, 1, 2, 1, 4, 2, 1, 2, 3, 1],
       [2, 1, 1, 2, 1, 1, 1, 3, 1, 2],
       [1, 2, 2, 1, 1, 2, 2, 1, 2, 3],
       [1, 3, 3, 1, 2, 1, 1, 2, 3, 4]])

Thus, to solve our case for dataframe input & output, it would be -

out = grp_range_2d(df.values, start=1, axis=0)
pd.DataFrame(out, columns=df.columns, index=df.index)
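As a sanity check, the same per-column count can be produced with plain pandas; it is slower but a handy reference. A sketch with a tiny hypothetical frame standing in for the question's `df` (which isn't shown in full here):

```python
import pandas as pd

# Hypothetical small frame, purely for illustration
df = pd.DataFrame({'A': list('aabba'), 'B': list('xxxyy')})

# Per column: restart the count whenever the value changes
ref = df.apply(lambda x: x.groupby((x != x.shift()).cumsum()).cumcount() + 1)
print(ref['A'].tolist())   # [1, 2, 1, 2, 1]
print(ref['B'].tolist())   # [1, 2, 3, 1, 2]
```

`(x != x.shift()).cumsum()` labels each run of equal values with a distinct group id, and `cumcount() + 1` counts within each run starting at 1.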

Solution 2:

And the numba solution. For such a tricky problem, it always wins, here by a 7x factor vs NumPy, since only one pass over res is done.

import numpy as np
from numba import njit

@njit
def thefunc(arrc):
    m,n = arrc.shape
    res = np.empty((m+1,n),np.uint32)
    res[0] = 1
    for i in range(1,m+1):
        for j in range(n):
            if arrc[i-1,j]:
                res[i,j] = res[i-1,j]+1
            else:
                res[i,j] = 1
    return res

def numbering(arr):
    return thefunc(arr[1:] == arr[:-1])

I need to externalize arr[1:]==arr[:-1] since numba doesn't support strings.

In [75]: %timeit numbering(arr)
13.7 µs ± 373 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [76]: %timeit grp_range_2dcol(arr)
111 µs ± 18.3 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

For a bigger array (100,000 rows x 100 cols), the gap is not as wide:

In [168]: %timeit a=grp_range_2dcol(arr)
1.54 s ± 11.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [169]: %timeit a=numbering(arr)
625 ms ± 43.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

If arr can be converted to 'S8', we can save a lot of time:

In [398]: %timeit arr[1:]==arr[:-1]
584 ms ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [399]: %timeit arr.view(np.uint64)[1:]==arr.view(np.uint64)[:-1]
196 ms ± 18.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
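A minimal sketch of why the view trick works: after converting to fixed-width 'S8' bytes, every cell occupies exactly 8 bytes and can be reinterpreted as a single uint64, so the row comparison becomes a cheap integer comparison. This assumes every string fits in 8 bytes:

```python
import numpy as np

arr = np.array([['a', 'b'], ['a', 'c'], ['d', 'c']])
s = np.ascontiguousarray(arr.astype('S8'))   # fixed-width 8-byte strings
v = s.view(np.uint64)                        # one uint64 per cell
eq_view = v[1:] == v[:-1]
eq_str = arr[1:] == arr[:-1]
print(np.array_equal(eq_view, eq_str))   # True
```

Equal strings pad to identical byte patterns, so the integer comparison agrees with the string comparison cell for cell.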

Solution 3:

Using the method of Divakar column-wise is quite a bit faster, even though there is probably a fully vectorized way.

# function of Divakar
def grp_range(a):
    idx = a.cumsum()
    id_arr = np.ones(idx[-1],dtype=int)
    id_arr[0] = 0
    id_arr[idx[:-1]] = -a[:-1]+1
    return id_arr.cumsum()
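For instance, fed the group lengths found down one column, grp_range emits a 0-based within-group counter; spelled out inline on a concrete input:

```python
import numpy as np

counts = np.array([3, 2, 4])          # group lengths down one column
idx = counts.cumsum()                 # group end positions: [3 5 9]
id_arr = np.ones(idx[-1], dtype=int)  # default step of +1
id_arr[0] = 0
id_arr[idx[:-1]] = -counts[:-1] + 1   # jump back to 0 at each new group
out = id_arr.cumsum()
print(out)   # [0 1 2 0 1 0 1 2 3]
```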

#create the equivalent of (df != df.shift()).cumsum() but faster
arr_sum = np.vstack([np.ones(10), np.cumsum((arr != np.roll(arr, 1, 0))[1:],0)+1])

#use grp_range column wise on arr_sum
arr_result = np.array([grp_range(np.unique(arr_sum[:,i],return_counts=1)[1]) 
                       for i in range(arr_sum.shape[1])]).T+1

To check the equality:

# of the cumsum
print (((df != df.shift()).cumsum() == 
         np.vstack([np.ones(10), np.cumsum((arr != np.roll(arr, 1, 0))[1:],0)+1]))
         .all().all())
#True

print ((df.apply(lambda x: x.groupby((x != x.shift()).cumsum()).cumcount() + 1) ==
        np.array([grp_range(np.unique(arr_sum[:,i],return_counts=1)[1]) 
                  for i in range(arr_sum.shape[1])]).T+1)
        .all().all())
#True

and the speed:

%timeit df.apply(lambda x: x.groupby((x != x.shift()).cumsum()).cumcount() + 1)
#19.4 ms ± 2.97 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
arr_sum = np.vstack([np.ones(10), np.cumsum((arr != np.roll(arr, 1, 0))[1:],0)+1])
arr_res = np.array([grp_range(np.unique(arr_sum[:,i],return_counts=1)[1]) 
                    for i in range(arr_sum.shape[1])]).T+1
#562 µs ± 82.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

EDIT: with Numpy, you can also use np.maximum.accumulate with np.arange.

def accumulate(arr):
    n,m = arr.shape
    arr_arange = np.arange(1,n+1)[:,np.newaxis]
    return np.concatenate([np.ones((1,m)), 
                           arr_arange[1:] - np.maximum.accumulate(arr_arange[:-1]*
                           (arr[:-1,:] != arr[1:,:]))], axis=0)
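To see the mechanics on one column: the running maximum tracks the 1-based row of the most recent group change per column, and subtracting it from the current row number gives the within-group count. A small hand-made column, purely for illustration:

```python
import numpy as np

col = np.array(list('aabba'))[:, None]   # one column: a a b b a
n = col.shape[0]
rows = np.arange(1, n + 1)[:, None]      # 1-based row numbers
change = col[:-1] != col[1:]             # True where a new group starts
last_start = np.maximum.accumulate(rows[:-1] * change)  # row of last change
counts = np.concatenate([np.ones((1, 1)), rows[1:] - last_start])
print(counts.ravel())   # [1. 2. 1. 2. 1.]
```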

Some TIMING

arr_100 = np.sort(np.random.randint(50, size=(100000, 100)), axis=1).astype(str)

Solution with np.maximum.accumulate

%timeit accumulate(arr_100)
#520 ms ± 72 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Solution of Divakar

%timeit grp_range_2drow(arr_100.T, start=1).T
#1.15 s ± 64.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Solution with Numba of B. M.

%timeit numbering(arr_100)
#228 ms ± 31.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
