Encoding Categorical Variables: Label vs One-Hot Encoding

When working on Machine Learning models, one common challenge you’ll encounter is handling categorical data. Most ML algorithms work only with numbers, so converting categorical variables (like “Gender”, “City”, or “Yes/No” types) into a numerical format is essential.

In this post, we’ll explore two popular encoding techniques:

  • Label Encoding
  • One-Hot Encoding

Let’s understand how and when to use each.


What is Categorical Data?

Categorical data refers to variables that contain label values rather than numeric values. Examples include:

  • Gender: Male, Female
  • City: Delhi, Mumbai, Kolkata
  • Education Level: High School, Graduate, Postgraduate

Since most ML models can’t process strings directly, we convert these categories to numbers.


1. Label Encoding

Label Encoding converts each category into a unique integer value.

Example:

City      Encoded
Delhi     0
Kolkata   1
Mumbai    2

Python implementation using scikit-learn:

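from sklearn.preprocessing import LabelEncoder

data = ['Delhi', 'Mumbai', 'Kolkata']
encoder = LabelEncoder()
# Labels are sorted alphabetically before codes are assigned:
# Delhi -> 0, Kolkata -> 1, Mumbai -> 2
encoded = encoder.fit_transform(data)
print(encoded)  # [0 2 1]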
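
It often helps to check which integer each category received, and to map codes back to labels later. The sketch below is one way to do that, assuming a small made-up list of cities; `classes_` and `inverse_transform` are part of scikit-learn's LabelEncoder, while the `mapping` dictionary is just an illustrative helper.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up sample data with a repeated category
cities = ['Delhi', 'Mumbai', 'Kolkata', 'Delhi']

encoder = LabelEncoder()
codes = encoder.fit_transform(cities)

# classes_ holds the categories in sorted order; a label's index is its code
mapping = {str(label): int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)  # {'Delhi': 0, 'Kolkata': 1, 'Mumbai': 2}

# inverse_transform converts integer codes back to the original labels
decoded = encoder.inverse_transform(codes)
print(decoded.tolist())  # ['Delhi', 'Mumbai', 'Kolkata', 'Delhi']
```

Note that the codes follow the sorted order of the labels, not the order in which they appear in the data.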

Pros:

  • Simple and quick.
  • Does not increase dataset size.

Cons:

  • Introduces an ordinal relationship between values, which may mislead the model.
    For instance, Mumbai > Delhi numerically, but this has no real-world meaning.
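
To make that concern concrete, here is a small sketch in plain Python. It assumes the codes were assigned alphabetically (Delhi 0, Kolkata 1, Mumbai 2); the `code_distance` helper is hypothetical and only there to show what a model that reads the codes as magnitudes would "see".

```python
# Integer codes, assuming alphabetical assignment
codes = {'Delhi': 0, 'Kolkata': 1, 'Mumbai': 2}

def code_distance(a, b):
    """Absolute difference between two city codes."""
    return abs(codes[a] - codes[b])

# A distance-based or linear model treats these gaps as meaningful,
# as if Delhi were "closer" to Kolkata than to Mumbai.
print(code_distance('Delhi', 'Kolkata'))  # 1
print(code_distance('Delhi', 'Mumbai'))   # 2

# Averaging codes is equally meaningless: the "mean" of Delhi and Mumbai
# is 1.0, which happens to be Kolkata's code.
mean_code = (codes['Delhi'] + codes['Mumbai']) / 2
print(mean_code)  # 1.0
```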

2. One-Hot Encoding

One-Hot Encoding creates a new binary column for each category.

Example:

City      Delhi  Mumbai  Kolkata
Delhi     1      0       0
Mumbai    0      1       0
Kolkata   0      0       1

Python implementation using pandas:

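import pandas as pd

df = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Kolkata']})
# Creates one column per category, named City_<value>.
# dtype=int yields 0/1 instead of the True/False that recent pandas versions return.
encoded = pd.get_dummies(df, columns=['City'], dtype=int)
print(encoded)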
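
If the same encoding has to be reused on new data, scikit-learn's OneHotEncoder may be a better fit than a one-off `get_dummies` call, because it remembers the categories it was fitted on. The following is a minimal sketch under that assumption; the `train` and `new` arrays are made up, and 'Chennai' stands in for a category that was never seen during fitting.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up training data and new data; 'Chennai' was not seen during fit
train = np.array([['Delhi'], ['Mumbai'], ['Kolkata']])
new = np.array([['Mumbai'], ['Chennai']])

# handle_unknown='ignore' encodes unseen categories as an all-zero row
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train)

# transform returns a sparse matrix by default; toarray() makes it dense
encoded_new = encoder.transform(new).toarray()

print(encoder.categories_[0].tolist())  # ['Delhi', 'Kolkata', 'Mumbai']
print(encoded_new.tolist())             # [[0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
```

The column order follows the sorted categories, so 'Mumbai' lands in the third column and the unseen 'Chennai' gets all zeros.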

Pros:

  • No ordinal relationship introduced.
  • Good for nominal categorical data.

Cons:

  • Increases dataset size, especially with many unique values.
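
To get a feel for how quickly the column count grows, here is a rough sketch with a hypothetical `user_id` column holding 1,000 distinct values (the column name and values are invented for illustration).

```python
import pandas as pd

# Hypothetical column with 1,000 distinct values
n_unique = 1000
df = pd.DataFrame({'user_id': [f'user_{i}' for i in range(n_unique)]})

# One new column is created per distinct value
encoded = pd.get_dummies(df, columns=['user_id'])

print(df.shape)       # (1000, 1)
print(encoded.shape)  # (1000, 1000)
```

A single column turns into as many columns as it has unique values.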

Which One Should You Use?

Situation                                        Use
Categories have meaningful order (Low to High)   Label
No natural order in categories                   One-Hot
Limited number of unique categories              One-Hot
Large number of categories                       Label or advanced methods like Embedding
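
One caveat for the ordered case: scikit-learn's LabelEncoder assigns codes alphabetically and is intended for target labels, so it will not necessarily respect a Low-to-High order. A possible alternative for feature columns, sketched below, is OrdinalEncoder with the order spelled out explicitly. The `levels` list and the sample array `X` are assumptions made up for this example.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Explicit order, lowest to highest; alphabetical sorting would instead
# put 'Graduate' before 'High School'
levels = ['High School', 'Graduate', 'Postgraduate']

# Made-up feature column (OrdinalEncoder expects 2D input)
X = np.array([['Graduate'], ['High School'], ['Postgraduate'], ['Graduate']])

encoder = OrdinalEncoder(categories=[levels])
encoded = encoder.fit_transform(X)

print(encoded.ravel().tolist())  # [1.0, 0.0, 2.0, 1.0]
```

With the order given explicitly, a larger code really does mean a higher education level.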

Conclusion

Encoding categorical variables correctly is key to building accurate ML models. While Label Encoding is simple and space-efficient, One-Hot Encoding avoids false assumptions about category order. Understanding the nature of your data will help you choose the right method.