Encoding Categorical Variables: Label vs One-Hot Encoding

When working on Machine Learning models, one common challenge you’ll encounter is handling categorical data. Most ML algorithms work only with numbers, so converting categorical variables (like “Gender”, “City”, or “Yes/No” types) into a numerical format is essential.

In this post, we’ll explore two popular encoding techniques:

  • Label Encoding
  • One-Hot Encoding

Let’s understand how and when to use each.


What is Categorical Data?

Categorical data refers to variables that contain label values rather than numeric values. Examples include:

  • Gender: Male, Female
  • City: Delhi, Mumbai, Kolkata
  • Education Level: High School, Graduate, Postgraduate

Since most ML models can’t process strings directly, we convert these values to numbers.


1. Label Encoding

Label Encoding converts each category into a unique integer value.

Example:

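City      Encoded
Delhi     0
Kolkata   1
Mumbai    2

Python implementation using scikit-learn:

from sklearn.preprocessing import LabelEncoder

data = ['Delhi', 'Mumbai', 'Kolkata']
encoder = LabelEncoder()
# fit_transform learns the categories (numbered in sorted order) and returns their integer codes.
# Note: scikit-learn intends LabelEncoder for target labels; OrdinalEncoder is its counterpart for feature columns.
encoded = encoder.fit_transform(data)
print(encoded)           # [0 2 1]
print(encoder.classes_)  # ['Delhi' 'Kolkata' 'Mumbai']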
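
To make the mechanics concrete, here is a minimal pure-Python sketch of the same idea. The helper name `label_encode` is hypothetical, and the sketch assumes categories are numbered in sorted order, which is how scikit-learn's LabelEncoder assigns its codes. It is an illustration, not a replacement for the library.

```python
def label_encode(values):
    """Map each distinct value to an integer, numbering categories in sorted order."""
    categories = sorted(set(values))
    mapping = {category: index for index, category in enumerate(categories)}
    codes = [mapping[value] for value in values]
    return codes, mapping


codes, mapping = label_encode(['Delhi', 'Mumbai', 'Kolkata', 'Delhi'])
print(mapping)  # {'Delhi': 0, 'Kolkata': 1, 'Mumbai': 2}
print(codes)    # [0, 2, 1, 0]
```

Keeping the mapping around matters in practice: the same mapping has to be reused on any new data so that each city keeps the same integer.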

Pros:

  • Simple and quick.
  • Does not increase dataset size.

Cons:

  • Implies an order among the categories that the model may treat as meaningful, which can mislead it.
    For instance, Mumbai > Delhi numerically, but this has no real-world meaning.
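
The effect of that artificial ordering is easy to see with a little arithmetic. The sketch below assumes the codes Delhi = 0, Kolkata = 1, Mumbai = 2 (an illustrative assignment) and compares the gap between cities under integer codes with the gap under one binary column per city. The function names are hypothetical.

```python
# Assumed integer codes, for illustration only
label_codes = {'Delhi': 0, 'Kolkata': 1, 'Mumbai': 2}

# The same cities with one binary column each
one_hot = {
    'Delhi':   (1, 0, 0),
    'Kolkata': (0, 1, 0),
    'Mumbai':  (0, 0, 1),
}


def label_gap(a, b):
    """Absolute difference between two integer codes."""
    return abs(label_codes[a] - label_codes[b])


def one_hot_gap(a, b):
    """Squared Euclidean distance between two one-hot vectors."""
    return sum((x - y) ** 2 for x, y in zip(one_hot[a], one_hot[b]))


print(label_gap('Delhi', 'Kolkata'), label_gap('Delhi', 'Mumbai'))      # 1 2
print(one_hot_gap('Delhi', 'Kolkata'), one_hot_gap('Delhi', 'Mumbai'))  # 2 2
```

With integer codes, Mumbai looks twice as far from Delhi as Kolkata does. With binary columns, every pair of cities is equally far apart, which is what we want for categories with no natural order.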

2. One-Hot Encoding

One-Hot Encoding creates a new binary column for each category.

Example:

City      Delhi   Mumbai   Kolkata
Delhi     1       0        0
Mumbai    0       1        0
Kolkata   0       0        1

Python implementation using pandas:

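import pandas as pd

df = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Kolkata']})
# dtype=int gives 0/1 columns; recent pandas versions default to True/False
encoded = pd.get_dummies(df, columns=['City'], dtype=int)
# Resulting columns: City_Delhi, City_Kolkata, City_Mumbai
print(encoded)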
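
If you want to see what happens behind the scenes, the following pure-Python sketch builds the binary columns by hand. The helper name `one_hot_encode` is hypothetical, and ordering the columns alphabetically is an assumption made here for predictability.

```python
def one_hot_encode(values):
    """Return the category list and one 0/1 row per input value."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


cities = ['Delhi', 'Mumbai', 'Kolkata']
categories, rows = one_hot_encode(cities)
print(categories)  # ['Delhi', 'Kolkata', 'Mumbai']
for city, row in zip(cities, rows):
    print(city, row)
# Delhi [1, 0, 0]
# Mumbai [0, 0, 1]
# Kolkata [0, 1, 0]
```

Each row contains exactly one 1, in the column that matches its category.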

Pros:

  • No ordinal relationship introduced.
  • Good for nominal categorical data.

Cons:

  • Adds one column per category, so the dataset grows quickly when a variable has many unique values.
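
To get a feel for that growth, here is a small sketch using a made-up column with 200 distinct values. The data and the `city_N` names are invented purely for illustration.

```python
import pandas as pd

# Hypothetical high-cardinality column: 1,000 rows, 200 distinct city IDs
cities = pd.Series([f"city_{i % 200}" for i in range(1000)], name="City")

encoded = pd.get_dummies(cities, dtype=int)

print(cities.nunique())  # 200
print(encoded.shape)     # (1000, 200)
```

A single column has turned into 200 columns, almost all of them filled with zeros.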

Which One Should You Use?

Situation                                         Use
Categories have meaningful order (Low to High)    Label
No natural order in categories                    One-Hot
Limited number of unique categories               One-Hot
Large number of categories                        Label or advanced methods like Embedding
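
One caveat for ordered categories: numbering them alphabetically would place Graduate below High School. A minimal sketch of one way to keep the real order, assuming pandas and the Education Level values used earlier in this post, is to state the order explicitly:

```python
import pandas as pd

# Order stated explicitly, from lowest to highest
levels = ['High School', 'Graduate', 'Postgraduate']

education = pd.Series(['Graduate', 'High School', 'Postgraduate', 'Graduate'])

# An ordered Categorical assigns codes by position in `levels`
ordered = pd.Categorical(education, categories=levels, ordered=True)
codes = ordered.codes.tolist()

print(codes)  # [1, 0, 2, 1]
```

Here High School = 0, Graduate = 1 and Postgraduate = 2, so the integers follow the real-world order rather than the alphabet.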

Conclusion

Encoding categorical variables correctly is key to building accurate ML models. While Label Encoding is simple and space-efficient, One-Hot Encoding avoids false assumptions about category order. Understanding the nature of your data will help you choose the right method.