Label Encoding
Importance: Machine learning models are built on mathematical functions, and mathematical functions cannot operate on strings. Inputs must be numeric before they can be added or subtracted. Label encoding is the process of converting strings (or values of any other type) into numbers.
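To make the idea concrete, here is a minimal sketch of label encoding in plain Python. The helper name `label_encode` is hypothetical, and assigning codes in sorted label order is an assumption chosen to mirror how LabelEncoder orders its classes.

```python
def label_encode(values):
    """Map each distinct label to an integer, assigned in sorted label order."""
    classes = sorted(set(values))
    mapping = {label: code for code, label in enumerate(classes)}
    # Replace every value by the integer assigned to its label
    return [mapping[v] for v in values], mapping


codes, mapping = label_encode(['india', 'nepal', 'china', 'india'])
print(codes)    # [1, 2, 0, 1]
print(mapping)  # {'china': 0, 'india': 1, 'nepal': 2}
```

The same label always receives the same code, which is all a label encoder has to guarantee.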
The pandas approach shown further down is an alternative to sklearn.preprocessing.LabelEncoder, which is demonstrated first.
import pandas as pd
from sklearn.preprocessing import LabelEncoder
X = {'countries':['india','nepal','china'],'a':[134,4422,333],'b':[22,33,44]}
sdf = pd.DataFrame(X)
sdf.head()
labelencoder_X = LabelEncoder()
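Here the dataset is first converted into an n-dimensional NumPy array so that the column can be overwritten by position. (LabelEncoder also accepts a pandas Series directly, so this step is a choice rather than a requirement.)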
sdf = sdf.values
sdf
sdf[:,0] = labelencoder_X.fit_transform(sdf[:,0])
Convert the array back into a DataFrame. The original column names are lost in the round trip through the array.
X = pd.DataFrame(sdf)
X
import pandas as pd
import numpy as np
bridge_type= "Arch cantilever suspended Truss Beam".split()
df = pd.DataFrame(bridge_type,columns=['bridge_type'])
df.head()
Convert bridge_type to the category dtype and create a new column to store the encoded values.
df['bridge_type'] = df['bridge_type'].astype('category')
df['bridge_type_encoded']= df['bridge_type'].cat.codes
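It is often useful to know which label each code stands for. The sketch below rebuilds the same column under a separate, hypothetical name (`bridges`) and reads the mapping back from `cat.categories`. The sorted order shown in the comment assumes plain string labels, where uppercase letters sort before lowercase ones.

```python
import pandas as pd

# Hypothetical stand-alone copy of the bridge data
bridges = pd.DataFrame({'bridge_type': "Arch cantilever suspended Truss Beam".split()})
bridges['bridge_type'] = bridges['bridge_type'].astype('category')
bridges['bridge_type_encoded'] = bridges['bridge_type'].cat.codes

# cat.categories lists the labels in code order: position i holds the label for code i
code_to_label = dict(enumerate(bridges['bridge_type'].cat.categories))
print(code_to_label)
# {0: 'Arch', 1: 'Beam', 2: 'Truss', 3: 'cantilever', 4: 'suspended'}
```

Note that the codes follow the sorted order of the labels, not the order in which they appear in the data.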
We don't want the model to infer that Beam is better than Arch, or that suspended is better than cantilever, just because one code is larger than another. To avoid this we can create dummy variables (one-hot encoding): instead of a single column whose values imply an order (1, 2, 3, ...), each category gets its own indicator column, so all categories carry the same weight in training, i.e. Arch is as good as Beam.
vectored_X = pd.get_dummies(df['bridge_type_encoded'])
vectored_X
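Dummying the integer codes produces columns named 0, 1, 2, ..., which are hard to read. As a possible variation, the sketch below applies `pd.get_dummies` to the string column directly, with a `prefix` so each indicator column names its category. The names `bridges` and `one_hot` and the prefix `bridge` are illustrative choices, not part of the original workflow.

```python
import pandas as pd

# Hypothetical stand-alone copy of the bridge data
bridges = pd.DataFrame({'bridge_type': "Arch cantilever suspended Truss Beam".split()})

# One indicator column per category, named bridge_<label>
one_hot = pd.get_dummies(bridges['bridge_type'], prefix='bridge')
print(one_hot.columns.tolist())

# Every row has exactly one active indicator
print(one_hot.sum(axis=1).tolist())
```

Because each row activates exactly one column, no category is numerically larger than another.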
We might want to concatenate the dummied table with our original table. Let's copy df and drop the string column.
df_copy = df.copy()
df_copy = df_copy.drop('bridge_type',axis='columns')
df_copy.head()
df_cleaned = pd.concat([vectored_X,df_copy],axis='columns')
df_cleaned.head()
Another case: sometimes we don't want the categorical data to be dummied, because the order of the categories is meaningful. If the data describes heights [short, medium, tall], the order matters: short can be encoded as 0, medium as 1 and tall as 2.
heights = "short medium tall".split()
data = {'Name': ['Ashish', 'Anish', 'Milan', 'Gaurav', 'Nirmal', 'Avash', 'Ram'], 'height': ['short', 'medium', 'medium', 'Girrafe', 'tall', 'short', 'medium']}
df_people = pd.DataFrame(data)
# 'Girrafe' is not one of the listed categories, so it becomes NaN and gets code -1
df_height_categorize = pd.Categorical(df_people['height'].values, categories=heights, ordered=True)
df_height_categorize
df_height_categorize = pd.Series(df_height_categorize)
df_people['comprehendable Heights'] = df_height_categorize.cat.codes
df_people
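A value that is not in the category list ends up with code -1, which a model could mistake for "smaller than short". The sketch below is one possible way to make that explicit: it masks unknown values first and then lists the affected rows. The names `people`, `levels`, `known` and `height_code` are hypothetical.

```python
import pandas as pd

levels = ['short', 'medium', 'tall']
people = pd.DataFrame({'height': ['short', 'medium', 'Girrafe', 'tall']})

# Replace anything outside the known levels by NaN before encoding
known = people['height'].where(people['height'].isin(levels))
ordered = pd.Categorical(known, categories=levels, ordered=True)

# Missing values are encoded as -1
people['height_code'] = ordered.codes

# Rows that need cleaning before training
unknown = people[people['height_code'] == -1]
print(unknown)
```

Whether to drop such rows, correct them, or add a new category depends on the data. The point here is only to surface them rather than train on -1 silently.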