Sample Rows Of Pandas Dataframe In Proportion To Counts In A Column
Solution 1:
You can use groupby and sample:
sample_df = df.groupby('group_id').apply(lambda x: x.sample(frac=0.1))
Solution 2:
The following samples a total of N rows so that each group appears in its original proportion, rounded to the nearest integer, then shuffles the result and resets the index. Example data:
df = pd.DataFrame(dict(
A=[1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4],
B=range(20)
))
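Short and sweet (note that weights='A' uses the values in column A as per-row sampling weights, not the group counts):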
df.sample(n=N, weights='A', random_state=1).reset_index(drop=True)
Long version (needs numpy imported as np):
df.groupby('A', group_keys=False).apply(lambda x: x.sample(int(np.rint(N*len(x)/len(df))))).sample(frac=1).reset_index(drop=True)
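To make the rounding in the long version easier to inspect, here is a minimal sketch that computes the per-group sample sizes first and then draws the rows. The value N = 10 and the fixed random_state are assumptions chosen for illustration; they are not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(
    A=[1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4],
    B=range(20),
))
N = 10  # assumed total sample size, for illustration only

# Rows to draw per group: N times the group's share of the data,
# rounded to the nearest integer (np.rint rounds halves to even).
sizes = np.rint(N * df['A'].value_counts() / len(df)).astype(int)

# Sample each group separately, then shuffle and reset the index.
parts = [g.sample(n=int(sizes.loc[key]), random_state=1)
         for key, g in df.groupby('A')]
sample_df = pd.concat(parts).sample(frac=1, random_state=1).reset_index(drop=True)

# Rows per group in the sample, ordered by group label.
print(sample_df['A'].value_counts().sort_index().tolist())  # [4, 3, 1, 2]
```

Because each group size is rounded independently, the total can differ from N for other data or other values of N; it happens to equal 10 here.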
Solution 3:
I was looking for a similar solution. The code provided by @Vaishali works absolutely fine. What @Abdou is trying to do also makes sense when we want to extract samples from each group based on their proportions of the full data.
# original: 10% from each group
sample_df = df.groupby('group_id').apply(lambda x: x.sample(frac=0.1))
# modified: sample size based on proportions of group size
n = df.shape[0]
sample_df = df.groupby('group_id').apply(lambda x: x.sample(frac=len(x)/n))
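It may help to check what the modified version actually draws. Since frac is applied inside each group, a group of size len(x) contributes about len(x)**2 / n rows. The sketch below uses hypothetical group sizes (80, 40, 40) picked so the arithmetic is exact, and a plain loop instead of apply so that it does not depend on how a given pandas version handles the grouping column.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 80 rows in A, 40 in B, 40 in C.
df = pd.DataFrame({'group_id': np.repeat(['A', 'B', 'C'], (80, 40, 40)),
                   'vals': np.arange(160)})
n = df.shape[0]

# Same idea as the modified snippet: each group's sampling fraction
# is its share of the full data.
parts = [g.sample(frac=len(g) / n, random_state=0)
         for _, g in df.groupby('group_id')]
sample_df = pd.concat(parts)

counts = sample_df['group_id'].value_counts().sort_index()
print(counts.to_dict())  # {'A': 40, 'B': 10, 'C': 10}
```

Here A makes up half of the data but two thirds of the sample, so larger groups end up over-represented relative to their original share.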
Solution 4:
This is not as simple as just grouping and using .sample. You need to actually get the fractions first. Since you said that you are looking to grab 10% of the total number of rows in different proportions, you will need to calculate how much each group has to take from the main dataframe. For instance, if we use the split you mentioned in the question, then group A will end up with 1/20 as its fraction of the total number of rows, group B will get 1/30 and group C ends up with 1/60. You can put these fractions in a dictionary and then use .groupby and pd.concat to concatenate the rows from each group into a dataframe. You will be using the n parameter of the .sample method instead of the frac parameter.
fracs = {'A': 1/20, 'B': 1/30, 'C': 1/60}
N = len(df)
pd.concat(dff.sample(n=int(fracs.get(i)*N)) for i,dff in df.groupby('group_id'))
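The same idea can be wrapped in a small function. This is only a sketch: the function name sample_by_total_fraction is hypothetical, and it uses round() rather than int() as a guard against a float product such as 3.9999999 being truncated, which is a deviation from the snippet above.

```python
import numpy as np
import pandas as pd

def sample_by_total_fraction(df, key, fracs, random_state=None):
    """Draw round(fracs[g] * len(df)) rows from each group g of df[key].

    Hypothetical helper; groups missing from fracs contribute no rows,
    and a group is never asked for more rows than it has.
    """
    total = len(df)
    parts = []
    for name, group in df.groupby(key):
        n = min(round(fracs.get(name, 0) * total), len(group))
        parts.append(group.sample(n=n, random_state=random_state))
    return pd.concat(parts)

# Example data: 40 rows for A, 60 for B, 20 for C.
df = pd.DataFrame({'group_id': np.repeat(['A', 'B', 'C'], (40, 60, 20)),
                   'vals': np.arange(120)})
fracs = {'A': 1/20, 'B': 1/30, 'C': 1/60}

out = sample_by_total_fraction(df, 'group_id', fracs, random_state=0)
print(out['group_id'].value_counts().sort_index().to_dict())
# {'A': 6, 'B': 4, 'C': 2}
```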
Edit:
This edit highlights the importance of fulfilling the requirement that group_id A should have half of the sampled rows, group_id B one third of the sampled rows and group_id C one sixth of the sampled rows, regardless of the original group sizes.
Starting with equal portions: each group starts with 40 rows
df1 = pd.DataFrame({'group_id': ['A','B', 'C']*40,
'vals': np.random.randn(120)})
N = len(df1)
fracs = {'A': 1/20, 'B': 1/30, 'C': 1/60}
print(pd.concat(dff.sample(n=int(fracs.get(i) * N)) for i,dff in df1.groupby('group_id')))
#     group_id      vals
# 12         A -0.175109
# 51         A -1.936231
# 81         A  2.057427
# 111        A  0.851301
# 114        A  0.669910
# 60         A  1.226954
# 73         B -0.166516
# 82         B  0.662789
# 94         B -0.863640
# 31         B  0.188097
# 101        C  1.802802
# 53         C  0.696984
print(df1.groupby('group_id').apply(lambda x: x.sample(frac=0.1)))
#              group_id      vals
# group_id
# A        24         A  0.161328
#          21         A -1.399320
#          30         A -0.115725
#          114        A  0.669910
# B        34         B -0.348558
#          7          B -0.855432
#          106        B -1.163899
#          79         B  0.532049
# C        65         C -2.836438
#          95         C  1.701192
#          80         C -0.421549
#          74         C -1.089400
First solution: 6 rows for group A (1/2 of the sampled rows), 4 rows for group B (one third of the sampled rows) and 2 rows for group C (one sixth of the sampled rows).
Second solution: 4 rows for each group (each one third of the sampled rows)
Working with differently sized groups: 40 for A, 60 for B and 20 for C
df2 = pd.DataFrame({'group_id': np.repeat(['A', 'B', 'C'], (40, 60, 20)),
'vals': np.random.randn(120)})
N = len(df2)
print(pd.concat(dff.sample(n=int(fracs.get(i) * N)) for i,dff in df2.groupby('group_id')))
#     group_id      vals
# 29         A  0.306738
# 35         A  1.785479
# 21         A -0.119405
# 4          A  2.579824
# 5          A  1.138887
# 11         A  0.566093
# 80         B  1.207676
# 41         B -0.577513
# 44         B  0.286967
# 77         B  0.402427
# 103        C -1.760442
# 114        C  0.717776
print(df2.groupby('group_id').apply(lambda x: x.sample(frac=0.1)))
#              group_id      vals
# group_id
# A        4          A  2.579824
#          32         A  0.451882
#          5          A  1.138887
#          17         A -0.614331
# B        47         B -0.308123
#          52         B -1.504321
#          42         B -0.547335
#          84         B -1.398953
#          61         B  1.679014
#          66         B  0.546688
# C        105        C  0.988320
#          107        C  0.698790
First solution: consistent.
Second solution: now group B has taken 6 of the sampled rows when it is supposed to take only 4.
Working with another set of differently sized groups: 60 for A, 40 for B and 20 for C
df3 = pd.DataFrame({'group_id': np.repeat(['A', 'B', 'C'], (60, 40, 20)),
'vals': np.random.randn(120)})
N = len(df3)
print(pd.concat(dff.sample(n=int(fracs.get(i) * N)) for i,dff in df3.groupby('group_id')))
#     group_id      vals
# 48         A  1.214525
# 19         A -0.237562
# 0          A  3.385037
# 11         A  1.948405
# 8          A  0.696629
# 39         A -0.422851
# 62         B  1.669020
# 94         B  0.037814
# 67         B  0.627173
# 93         B  0.696366
# 104        C  0.616140
# 113        C  0.577033
print(df3.groupby('group_id').apply(lambda x: x.sample(frac=0.1)))
#              group_id      vals
# group_id
# A        4          A  0.284448
#          11         A  1.948405
#          8          A  0.696629
#          0          A  3.385037
#          31         A  0.579405
#          24         A -0.309709
# B        70         B -0.480442
#          69         B -0.317613
#          96         B -0.930522
#          80         B -1.184937
# C        101        C  0.420421
#          106        C  0.058900
This is the only time the second solution offered some consistency (out of sheer luck, I might add).
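The three comparisons can also be summarised by counting rows per group instead of reading the printed frames. The sketch below is an illustration under assumptions: the helper names are hypothetical, round() replaces int() to avoid float truncation, and a loop with pd.concat stands in for groupby(...).apply so the result does not depend on the pandas version.

```python
import numpy as np
import pandas as pd

fracs = {'A': 1/20, 'B': 1/30, 'C': 1/60}

def group_counts(sample):
    """Rows per group_id, in A, B, C order."""
    return sample['group_id'].value_counts().sort_index().tolist()

def fixed_fractions(df):
    # Each group gets a fixed share of the total row count.
    N = len(df)
    return pd.concat(g.sample(n=round(fracs[k] * N), random_state=0)
                     for k, g in df.groupby('group_id'))

def ten_percent_each(df):
    # Each group gives up 10% of its own rows.
    return pd.concat(g.sample(frac=0.1, random_state=0)
                     for _, g in df.groupby('group_id'))

results = {}
for sizes in [(40, 40, 40), (40, 60, 20), (60, 40, 20)]:
    df = pd.DataFrame({'group_id': np.repeat(['A', 'B', 'C'], sizes),
                       'vals': np.random.randn(120)})
    results[sizes] = (group_counts(fixed_fractions(df)),
                      group_counts(ten_percent_each(df)))
    print(sizes, results[sizes])
# (40, 40, 40) ([6, 4, 2], [4, 4, 4])
# (40, 60, 20) ([6, 4, 2], [4, 6, 2])
# (60, 40, 20) ([6, 4, 2], [6, 4, 2])
```

The fixed-fraction counts stay at 6, 4 and 2 in every layout, while the 10%-per-group counts follow the group sizes and only match in the last layout.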
I hope this proves useful.