Label Encoding in Pandas

Why it matters: machine learning models are built on mathematical functions, and those functions operate on numbers, not strings. Categorical values must therefore be converted to numbers before a model can use them. Label encoding is the process of mapping each distinct string (or any other category value) to an integer.

This post shows a pandas alternative to sklearn.preprocessing.LabelEncoder. For comparison, here is the LabelEncoder approach first.

In [126]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

X = {'countries':['india','nepal','china'],'a':[134,4422,333],'b':[22,33,44]}
sdf = pd.DataFrame(X)
sdf.head()
Out[126]:
countries a b
0 india 134 22
1 nepal 4422 33
2 china 333 44
In [127]:
labelencoder_X = LabelEncoder()

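Here the dataset is converted into an n-dimensional NumPy array first, so that a single column can be overwritten in place. (LabelEncoder itself only needs a 1-D array-like, so a single DataFrame column would also work.)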

In [128]:
sdf = sdf.values
In [129]:
sdf
Out[129]:
array([['india', 134, 22],
       ['nepal', 4422, 33],
       ['china', 333, 44]], dtype=object)
In [133]:
sdf[:,0] = labelencoder_X.fit_transform(sdf[:,0])

Converting back into a DataFrame:

In [135]:
X = pd.DataFrame(sdf)
X
Out[135]:
0 1 2
0 1 134 22
1 2 4422 33
2 0 333 44
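
The same integer codes can be reproduced with pandas alone. The following is a minimal sketch, assuming the three country names from the example above; pd.factorize with sort=True numbers the unique values in sorted order, which matches the codes LabelEncoder produced here.

```python
import pandas as pd

countries = pd.Series(['india', 'nepal', 'china'])

# sort=True assigns codes by the sorted order of the unique values
# (china=0, india=1, nepal=2), matching the LabelEncoder output above.
codes, uniques = pd.factorize(countries, sort=True)

print(codes.tolist())    # [1, 2, 0]
print(list(uniques))     # ['china', 'india', 'nepal']
```

Without sort=True, factorize numbers values in order of first appearance instead, so the codes would differ from LabelEncoder's.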
In [24]:
import pandas as pd
import numpy as np
In [2]:
bridge_type= "Arch cantilever suspended Truss Beam".split()
In [3]:
df = pd.DataFrame(bridge_type,columns=['bridge_type'])
In [4]:
df.head()
Out[4]:
bridge_type
0 Arch
1 cantilever
2 suspended
3 Truss
4 Beam

Convert bridge_type to the category dtype, then create a new column to store the encoded values:

In [18]:
df['bridge_type'] = df['bridge_type'].astype('category')
In [76]:
df['bridge_type_encoded']= df['bridge_type'].cat.codes
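
It is often useful to know which code belongs to which label. A small sketch, assuming the same five bridge types as above (the names bridges, code_to_label and decoded are only for illustration): the position of a label in cat.categories is its code, so the mapping can be read off directly and used to decode again.

```python
import pandas as pd

bridges = pd.Series("Arch cantilever suspended Truss Beam".split(),
                    dtype='category')

# cat.categories holds the labels; a label's position is its code.
code_to_label = dict(enumerate(bridges.cat.categories))
print(code_to_label)

# Decode a column of integer codes back to the original strings.
decoded = bridges.cat.codes.map(code_to_label)
print(decoded.tolist())
```

Note that the categories are sorted with uppercase letters first, which is why Beam and Truss receive smaller codes than cantilever and suspended.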

We don't want the model to conclude that Beam is better than Arch, or that suspended is better than cantilever, just because one code is larger than another. To avoid this we can create dummy variables: instead of a single column whose values imply an order (1, 2, 3, ...), each category gets its own 0/1 column, so every category carries the same weight for the training model, i.e. Arch is as good as Beam.

In [28]:
vectored_X = pd.get_dummies(df['bridge_type_encoded'])
vectored_X
Out[28]:
0 1 2 3 4
0 1 0 0 0 0
1 0 0 0 1 0
2 0 0 0 0 1
3 0 0 1 0 0
4 0 1 0 0 0
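
The dummy columns above are named after the integer codes, which can be hard to read. As a possible variation (a sketch, not the approach used in this post), get_dummies can be applied to the string column directly, with a prefix so the column names stay readable. The prefix 'bridge' is an arbitrary choice; dtype=int is passed because recent pandas versions return True/False rather than 1/0 by default.

```python
import pandas as pd

df = pd.DataFrame({'bridge_type': "Arch cantilever suspended Truss Beam".split()})

# One 0/1 column per bridge type, named bridge_<label>.
one_hot = pd.get_dummies(df['bridge_type'], prefix='bridge', dtype=int)

print(one_hot.columns.tolist())
print(one_hot)
```

Each row contains exactly one 1, in the column for that row's bridge type.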

We might want to concatenate the dummy table with our original table. Let's copy df and drop the string column:

In [36]:
df_copy = df.copy()
In [37]:
df_copy = df_copy.drop('bridge_type',axis='columns')
In [38]:
df_copy.head()
Out[38]:
bridge_type_encoded
0 0
1 3
2 4
3 2
4 1
In [39]:
df_cleaned = pd.concat([vectored_X,df_copy],axis='columns')
In [40]:
df_cleaned.head()
Out[40]:
0 1 2 3 4 bridge_type_encoded
0 1 0 0 0 0 0
1 0 0 0 1 0 3
2 0 0 0 0 1 4
3 0 0 1 0 0 2
4 0 1 0 0 0 1
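
The copy, drop and concat steps can also be collapsed into one call. A minimal sketch, assuming a DataFrame with one extra numeric column (span_m and its values are made up purely for illustration): passing columns= to get_dummies replaces the listed columns with their dummy columns and leaves the rest untouched.

```python
import pandas as pd

df = pd.DataFrame({
    'bridge_type': "Arch cantilever suspended Truss Beam".split(),
    'span_m': [120, 300, 850, 90, 40],  # hypothetical values, not from the post
})

# bridge_type is replaced by bridge_type_<label> columns; span_m is kept as-is.
encoded = pd.get_dummies(df, columns=['bridge_type'], dtype=int)

print(encoded.columns.tolist())
print(encoded.shape)
```

Unlike the concatenated table above, this result does not keep a separate integer-code column next to the dummies.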

Another case: maybe we don't want our categorical data to be dummied, because the categories carry a natural order. If the data describes heights [short, medium, tall], that order is meaningful: short can be encoded as 0, medium as 1 and tall as 2. Any value that is not one of the declared categories (such as 'Girrafe' below) becomes NaN and receives the code -1.

In [45]:
heights="Short Medium Tall".split()
In [72]:
data={'Name':['Ashish','Anish','Milan','Gaurav','Nirmal','Avash','Ram'],'height':['short','medium','medium','Girrafe','tall','short','medium']}
In [73]:
df_people = pd.DataFrame(data)
In [74]:
df_height_categorize = pd.Categorical(df_people['height'].values,categories=['short','medium','tall'],ordered=True)
In [75]:
df_height_categorize
Out[75]:
[short, medium, medium, NaN, tall, short, medium]
Categories (3, object): [short < medium < tall]
In [84]:
df_height_categorize = pd.Series(df_height_categorize)
In [87]:
df_people['comprehendable Heights'] = df_height_categorize.cat.codes
In [88]:
df_people
Out[88]:
Name height comprehendable Heights
0 Ashish short 0
1 Anish medium 1
2 Milan medium 1
3 Gaurav Girrafe -1
4 Nirmal tall 2
5 Avash short 0
6 Ram medium 1
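
The -1 for 'Girrafe' is easy to mistake for a real level that sits below short. One way to handle it, sketched here under the assumption that unknown values should be treated as missing, is to find the rows with code -1 and mask them out. The names unknown_mask and clean_codes are illustrative.

```python
import pandas as pd

heights = pd.Series(['short', 'medium', 'medium', 'Girrafe',
                     'tall', 'short', 'medium'])
ordered = pd.Categorical(heights, categories=['short', 'medium', 'tall'],
                         ordered=True)
codes = pd.Series(ordered.codes)

# Values outside the declared categories are encoded as -1.
unknown_mask = codes == -1
print(heights[unknown_mask].tolist())   # ['Girrafe']

# Use a nullable integer dtype so the unknown rows become <NA>
# instead of a misleading -1.
clean_codes = codes.astype('Int64').mask(unknown_mask)
print(clean_codes)
```

Whether to drop such rows, fix the typo at the source, or add a new category depends on the data; this only makes the problem visible.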

This is Ashish Thapa. It's just a name after all. Just had a thought that our names should signify our situation and should be dynamic. What if our names upgraded with the experiences we acquire?
