Introduction
Any structured dataset includes both numerical and categorical variables. But machine learning algorithms only understand numbers, so we need to convert categorical variables to numerical values. This process is called categorical encoding.
There are two common techniques for encoding categorical variables: One-Hot Encoding and Label Encoding. The choice of encoding method can significantly impact the performance of a machine learning model.
Let's first understand categorical variables and their types.
Categorical variables are variables that reflect qualitative characteristics. They can be either nominal or ordinal.
Nominal categorical variable: one whose values are categories with no inherent order. For example: colors, gender, ethnicity.
Ordinal categorical variable: one whose categories have a natural order. For example: class rank, T-shirt size, temperature range, degree in education.
What is Label Encoding?
Label Encoding is a technique for converting categorical variables into numerical values: each category is assigned a unique integer.
Example of Label Encoding
Suppose we are working with a categorical variable ‘gender’ with the categories male and female. Here, we encode the labels simply by assigning each category a number (1 for male and 2 for female).
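As a quick sketch, scikit-learn's `LabelEncoder` does this automatically; note that it numbers categories from 0 in alphabetical order, so the exact integers differ from the 1/2 used above:

```python
from sklearn.preprocessing import LabelEncoder

genders = ["male", "female", "female", "male"]

le = LabelEncoder()
encoded = le.fit_transform(genders)

# LabelEncoder sorts categories alphabetically and numbers them from 0,
# so female -> 0 and male -> 1 here
print(list(encoded))      # [1, 0, 0, 1]
print(list(le.classes_))  # ['female', 'male']
```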
In ordinal encoding, each category is assigned a unique integer based on its order.
Example of Ordinal Encoding
Let’s assume we’re working with a categorical variable ‘Weather’ with the categories ‘cold’, ‘warm’, and ‘hot’. In ordinal encoding, we assign each category a number that reflects its order (1 for ‘cold’, 2 for ‘warm’, and 3 for ‘hot’).
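A minimal way to do this in pandas is an explicit mapping dictionary, which preserves the intended order (the category names and codes here are just the ones from the example):

```python
import pandas as pd

df = pd.DataFrame({"Weather": ["cold", "hot", "warm", "cold"]})

# Explicit mapping preserves the intended order: cold < warm < hot
order = {"cold": 1, "warm": 2, "hot": 3}
df["Weather_encoded"] = df["Weather"].map(order)

print(df["Weather_encoded"].tolist())  # [1, 3, 2, 1]
```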
Limitation of Label Encoding
In label encoding, the integer codes impose an artificial order on the categories, so the model gives higher weight to higher numbers even when no real order exists. Let’s understand this with an example:
Imagine we have three categories of colors: red, green, and blue. Using label encoding, we assign each a number: red = 1, green = 2, and blue = 3. Now, if the model computes the average of these categories, it gets (1 + 2 + 3) / 3 = 2, which is the code for green. According to the model, then, the “average” color is green, which is obviously a meaningless relationship. To overcome this limitation, we use one-hot encoding.
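The arithmetic above can be checked directly (the codes are the ones assigned in the example):

```python
codes = {"red": 1, "green": 2, "blue": 3}

# The "average" of the encoded colors equals the code for green,
# even though averaging colors is meaningless
average = sum(codes.values()) / len(codes)
print(average)  # 2.0
```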
One Hot Encoding
One-hot encoding is a process of converting categorical variables into a form that machine learning algorithms can use to improve predictions.
One-hot encoding is used when the features are nominal (they have no order). For every category, a new binary variable is created, containing either 0 or 1: 0 represents the absence of that category and 1 its presence.
These newly created binary features are known as dummy variables, which is why this is also called dummy encoding. The number of dummy variables depends on the number of categories present.
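As a small sketch, `pandas.get_dummies` creates one binary column per category (columns come out in alphabetical order):

```python
import pandas as pd

colors = pd.Series(["red", "green", "blue", "green"], name="Color")
dummies = pd.get_dummies(colors)

# One column per category; a single 1 per row marks the category present
print(list(dummies.columns))  # ['blue', 'green', 'red']
print(dummies.astype(int).values.tolist())
# [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```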
Let’s understand One Hot Encoding by solving one problem.
Predict the price of a Mercedes Benz that is 4 years old with a mileage of 45,000.
For this, I am using the dataset “carprices.csv”, which I downloaded from GitHub.
It involves the following steps:
1. First, import the pandas library in Jupyter notebook.
2. Upload and read the file.
3. Get the dummy variables.
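In case the original screenshots aren't visible, steps 1–3 look roughly like this. I'm using a small inline DataFrame as a stand-in for carprices.csv, so the column names and values here are assumptions:

```python
import pandas as pd

# Stand-in for pd.read_csv("carprices.csv"); values are made up for illustration
df = pd.DataFrame({
    "Car Model": ["BMW X5", "Audi A5", "Mercedez Benz C class"],
    "Mileage": [69000, 59000, 52000],
    "Sell Price($)": [18000, 29400, 32000],
    "Age(yrs)": [6, 5, 5],
})

# Step 3: one dummy column per car model, in alphabetical order
dummies = pd.get_dummies(df["Car Model"])
print(list(dummies.columns))
# ['Audi A5', 'BMW X5', 'Mercedez Benz C class']
```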
Here, a new variable is created for each category (Audi A5, BMW X5, Mercedez Benz C class). These newly created binary features are the dummy variables.
4. Merge the dummy variables with the original data frame.
5. Now, drop the categorical column “Car Model” and any one dummy variable. Here I am dropping “Mercedez Benz C class” to avoid the dummy variable trap: with all dummies present, any one column is perfectly predictable from the others, which causes multicollinearity.
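Steps 4 and 5 can be sketched like this, again with a hypothetical stand-in for the real data:

```python
import pandas as pd

# Hypothetical stand-in rows for carprices.csv
df = pd.DataFrame({
    "Car Model": ["BMW X5", "Audi A5", "Mercedez Benz C class"],
    "Mileage": [69000, 59000, 52000],
})

dummies = pd.get_dummies(df["Car Model"])

# Step 4: merge the dummy columns with the original frame
merged = pd.concat([df, dummies], axis="columns")

# Step 5: drop the original column and one dummy to avoid the trap
final = merged.drop(["Car Model", "Mercedez Benz C class"], axis="columns")

print(list(final.columns))  # ['Mileage', 'Audi A5', 'BMW X5']
```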
6. For building a model, we need dependent and independent variables. In our example, Mileage, Age, and the dummy variables are the independent variables, whereas Sell Price is the dependent variable.
Independent variables
Dependent variable
7. The next step is to import the Linear Regression model from the sklearn library.
8. Fit the model with the independent and dependent variables. This trains the model on X and y.
9. Check the accuracy of the model.
Here, we can see the model scores about 0.94, i.e. an R² of 94% on the training data.
Now, applying this model, let’s predict the price of a Mercedes Benz that is 4 years old with a mileage of 45,000.
10. Predict the Price
So, by using one-hot encoding, we find that the price of a Mercedes Benz that is 4 years old with a mileage of 45,000 is around $37,000.
The above example demonstrates the use of one-hot encoding to build a model on data with categorical values.
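The whole walkthrough (steps 1–10) can be sketched end to end as follows. The DataFrame below is a hypothetical stand-in for carprices.csv, so the fitted score and predicted price will differ from the article’s numbers:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for carprices.csv; all values are made up
df = pd.DataFrame({
    "Car Model": ["BMW X5", "BMW X5", "Audi A5", "Audi A5",
                  "Mercedez Benz C class", "Mercedez Benz C class"],
    "Mileage": [69000, 87600, 59000, 52000, 72000, 91000],
    "Sell Price($)": [18000, 12000, 29400, 32000, 31500, 26000],
    "Age(yrs)": [6, 8, 5, 5, 6, 8],
})

# One-hot encode, merge, and drop one dummy to avoid the trap
dummies = pd.get_dummies(df["Car Model"])
merged = pd.concat([df, dummies], axis="columns")
final = merged.drop(["Car Model", "Mercedez Benz C class"], axis="columns")

X = final.drop("Sell Price($)", axis="columns")   # independent variables
y = final["Sell Price($)"]                        # dependent variable

model = LinearRegression()
model.fit(X, y)
print(model.score(X, y))  # R^2 on the training data

# A 4-yr-old Mercedes with 45,000 miles: both remaining dummies are 0
query = pd.DataFrame([[45000, 4, 0, 0]], columns=X.columns)
print(model.predict(query)[0])
```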
When to Use One-Hot Encoding
When the categorical data has no ordinal relationship (e.g., colors, product categories).
For machine learning models that require numerical input.
When the number of categories is manageable, so the encoding does not produce very high-dimensional data.
Drawbacks of One-Hot Encoding
It increases dimensionality, as a separate column is created for each category of the variable. This makes the model more complex and slower to train.
It discards order information, so it is not useful for ordinal categorical features.