
How To Convert A Column Of String To Numerical?

I have this pandas dataframe from a query:

| name   | event   |
----------------------------
| name_1 | event_1 |
| name_1 | event_2 |
| name_2 | event_

Solution 1:

Some ways of doing it

1)

In [366]: pd.crosstab(df.name, df.event)
Out[366]:
event   event_1  event_2
name
name_1        1        1
name_2        1        0

2)

In [367]: df.groupby(['name', 'event']).size().unstack(fill_value=0)
Out[367]:
event   event_1  event_2
name
name_1        1        1
name_2        1        0

3)

In [368]: df.pivot_table(index='name', columns='event', aggfunc='size', fill_value=0)
Out[368]:
event   event_1  event_2
name
name_1        1        1
name_2        1        0

4)

In [369]: df.assign(v=1).pivot(index='name', columns='event', values='v').fillna(0)
Out[369]:
event   event_1  event_2
name
name_1      1.0      1.0
name_2      1.0      0.0
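
To try the four variants side by side, here is a minimal sketch. The frame `df` is rebuilt by hand from the outputs shown in the answers, so treat the sample data as an assumption, and `pivot_table` is called with `aggfunc='size'` as in the `john3` function further down. It also checks one caveat of variant 4: `pivot` only reshapes, so it raises a `ValueError` as soon as a (name, event) pair repeats, and it marks presence rather than counting.

```python
import pandas as pd

# Assumed sample data, rebuilt from the outputs shown in the answers.
df = pd.DataFrame({
    'name': ['name_1', 'name_1', 'name_2'],
    'event': ['event_1', 'event_2', 'event_1'],
})

a = pd.crosstab(df.name, df.event)
b = df.groupby(['name', 'event']).size().unstack(fill_value=0)
c = df.pivot_table(index='name', columns='event', aggfunc='size', fill_value=0)
d = df.assign(v=1).pivot(index='name', columns='event', values='v').fillna(0)

# The first three hold the same counts; the pivot variant is float
# because the missing cell is NaN before fillna(0).
same_counts = (a.values == b.values).all() and (a.values == c.values).all()

# pivot cannot reshape repeated (name, event) pairs.
dup = pd.concat([df, df], ignore_index=True)
try:
    dup.assign(v=1).pivot(index='name', columns='event', values='v')
    pivot_failed = False
except ValueError:
    pivot_failed = True
```

If the real data can contain repeated pairs and you want counts, variants 1 to 3 are probably the safer choice.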

Solution 2:

Option 1 (pir1 and pir1_5)

df.set_index('name').event.str.get_dummies()

        event_1  event_2
name                    
name_1        1        0
name_1        0        1
name_2        1        0

Then you could sum across the index

df.set_index('name').event.str.get_dummies().groupby(level=0).sum()

        event_1  event_2
name                    
name_1        1        1
name_2        1        0

Option 2 (pir2)

Or you could use a dot product

pd.get_dummies(df.name).T.dot(pd.get_dummies(df.event))

        event_1  event_2
name_1        1        1
name_2        1        0
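
Why the dot product gives counts: `pd.get_dummies(df.name)` is a rows-by-names indicator matrix and `pd.get_dummies(df.event)` is a rows-by-events one, so the transposed product adds up, over all rows, the cases where a given name and a given event occur together. A small sketch, using a hand-built sample frame (an assumption based on the outputs shown). `dtype=int` is passed explicitly; on recent pandas versions, where `get_dummies` returns booleans by default, leaving it out would, as far as I can tell, produce True/False instead of counts.

```python
import pandas as pd

# Assumed sample data, rebuilt from the outputs shown in the answers.
df = pd.DataFrame({
    'name': ['name_1', 'name_1', 'name_2'],
    'event': ['event_1', 'event_2', 'event_1'],
})

N = pd.get_dummies(df.name, dtype=int)    # shape (rows, names)
E = pd.get_dummies(df.event, dtype=int)   # shape (rows, events)

# (names x rows) . (rows x events) -> (names x events) co-occurrence counts
counts = N.T.dot(E)
```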

Option 3 (pir3): Advanced Mode

i, r = pd.factorize(df.name.values)
j, c = pd.factorize(df.event.values)
n, m = r.size, c.size

b = np.bincount(i * m + j, minlength=n * m).reshape(n, m)

pd.DataFrame(b, r, c)

        event_1  event_2
name_1        1        1
name_2        1        0
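
To unpack what Option 3 does: `pd.factorize` turns each label into an integer code, `i * m + j` is the position of each record in a flattened n-by-m grid (row code times number of columns, plus column code), and `np.bincount` counts how many records fall on each position. Here is the same computation with the intermediate step named; the sample arrays are an assumption rebuilt from the outputs shown.

```python
import numpy as np
import pandas as pd

# Assumed sample data, as plain object arrays.
names = np.array(['name_1', 'name_1', 'name_2'], dtype=object)
events = np.array(['event_1', 'event_2', 'event_1'], dtype=object)

i, r = pd.factorize(names)    # i: row code per record, r: unique names
j, c = pd.factorize(events)   # j: column code per record, c: unique events
n, m = r.size, c.size

flat = i * m + j              # cell index of each record in a flattened n x m grid
b = np.bincount(flat, minlength=n * m).reshape(n, m)

table = pd.DataFrame(b, index=r, columns=c)
```

`minlength=n * m` matters: without it, trailing cells that never occur would be missing and the reshape would fail.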

Timing

res.plot(loglog=True)


res.div(res.min(1), 0)

            pir1      pir2  pir3      john1     john2      john3
10      9.948396  3.399913   1.0  20.478368  4.460466  10.642113
30      9.350524  2.681178   1.0  16.589248  3.847666   9.168907
100    11.414536  3.079463   1.0  18.076040  4.277752   9.949305
300    15.769594  2.940529   1.0  16.745889  3.945470   9.069265
1000   26.869451  2.617564   1.0  12.789570  3.236390   7.279205
3000   42.229542  2.099541   1.0   8.716600  2.429847   4.785814
10000  52.571678  1.716088   1.0   4.597598  1.691989   2.800455
30000  58.644764  1.469827   1.0   2.818744  1.535012   1.929452

Functions

pir1 = lambda df: df.set_index('name').event.str.get_dummies().groupby(level=0).sum()
pir1_5 = lambda df: pd.get_dummies(df.set_index('name').event).groupby(level=0).sum()
pir2 = lambda df: pd.get_dummies(df.name).T.dot(pd.get_dummies(df.event))

def pir3(df):
    i, r = pd.factorize(df.name.values)
    j, c = pd.factorize(df.event.values)
    n, m = r.size, c.size

    b = np.bincount(i * m + j, minlength=n * m).reshape(n, m)

    return pd.DataFrame(b, r, c)

john1 = lambda df: pd.crosstab(df.name, df.event)
john2 = lambda df: df.groupby(['name', 'event']).size().unstack(fill_value=0)
john3 = lambda df: df.pivot_table(index='name', columns='event', aggfunc='size', fill_value=0)

Test

from timeit import timeit

res = pd.DataFrame(
    index=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
    columns='pir1 pir2 pir3 john1 john2 john3'.split(),
    dtype=float
)

for i in res.index:
    d = pd.concat([df] * i, ignore_index=True)
    for j in res.columns:
        stmt = '{}(d)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        res.at[i, j] = timeit(stmt, setp, number=100)
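
The loop above builds import strings against `__main__`, which only works when everything lives in the main module. As a hedged alternative, the sketch below passes callables to `timeit` directly. The function subset, the two sizes, and the repeat count are illustrative choices (much smaller than in the test above) and the sample frame is an assumption, so the numbers it produces are not comparable to the table.

```python
from timeit import timeit

import pandas as pd

# Assumed sample data, rebuilt from the outputs shown in the answers.
df = pd.DataFrame({
    'name': ['name_1', 'name_1', 'name_2'],
    'event': ['event_1', 'event_2', 'event_1'],
})

# Two of the functions above, as an illustrative subset.
funcs = {
    'john1': lambda d: pd.crosstab(d.name, d.event),
    'john2': lambda d: d.groupby(['name', 'event']).size().unstack(fill_value=0),
}

sizes = [10, 100]  # arbitrary small multipliers to keep the run short
res = pd.DataFrame(index=sizes, columns=list(funcs), dtype=float)

for k in sizes:
    d = pd.concat([df] * k, ignore_index=True)
    for name, f in funcs.items():
        # timeit accepts a zero-argument callable as the statement
        res.at[k, name] = timeit(lambda: f(d), number=5)

# Ratio to the fastest function at each size, as in res.div(res.min(1), 0)
relative = res.div(res.min(axis=1), axis=0)
```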

Solution 3:

You asked for a Pythonic way. I think the Pythonic approach here is a technique called one-hot encoding, which is well implemented in libraries such as sklearn. After one-hot encoding, group the dataframe by the first column and apply the sum function.

Here is the code:

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# reproduce your dataframe
dataset = pd.DataFrame([['name_1', 'event_1'], ['name_1', 'event_2'], ['name_2', 'event_1']], columns=['name', 'event'], index=[1, 2, 3])
data = dataset['event']
enc = LabelBinarizer(neg_label=0)
# with exactly two event values, fit_transform returns a single 0/1 column that is 1 for 'event_2'
dataset['event_2'] = enc.fit_transform(data).ravel()
event_two = dataset['event_2']
# event_1 is the complement of event_2 (np.bool was removed from NumPy, so use the builtin bool)
dataset['event_1'] = (~event_two.astype(bool)).astype(np.int64)
# sum only the indicator columns so the string column 'event' is left out
dataset = dataset.groupby('name')[['event_1', 'event_2']].sum()
dataset.reset_index(inplace=True)

The output has one row per name, with the event_1 and event_2 columns summed.
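
Note that the LabelBinarizer approach above yields a single 0/1 column only because there are exactly two event values. As a sketch of the same one-hot-then-sum idea that does not depend on that, here is a variant using `pd.get_dummies` in place of scikit-learn (a substitution for illustration, not this answer's own code). The `event_3` row is a made-up addition to the sample data, there only to show a third category.

```python
import pandas as pd

# Sample frame from the answer, plus one hypothetical extra row (event_3).
dataset = pd.DataFrame(
    [['name_1', 'event_1'], ['name_1', 'event_2'],
     ['name_2', 'event_1'], ['name_2', 'event_3']],
    columns=['name', 'event'],
)

# One indicator column per distinct event value.
onehot = pd.get_dummies(dataset['event'], dtype=int)

# Sum the indicators within each name, then turn the index back into a column.
counts = onehot.groupby(dataset['name']).sum().reset_index()
```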
