Encoding Categorical Variables: Label vs One-Hot Encoding
When building machine learning models, one common challenge you’ll encounter is handling categorical data. Most ML algorithms work only with numbers, so converting categorical variables (such as “Gender”, “City”, or Yes/No fields) into a numerical format is essential.
In this post, we’ll explore two popular encoding techniques:
- Label Encoding
- One-Hot Encoding
Let’s understand how and when to use each.
What is Categorical Data?
Categorical data refers to variables that contain label values rather than numeric values. Examples include:
- Gender: Male, Female
- City: Delhi, Mumbai, Kolkata
- Education Level: High School, Graduate, Postgraduate
Since most ML models can’t process strings directly, we convert these values to numbers.
1. Label Encoding
Label Encoding converts each category into a unique integer value.
Example:
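City | Encoded |
---|---|
Delhi | 0 |
Kolkata | 1 |
Mumbai | 2 |
Python implementation using scikit-learn:
from sklearn.preprocessing import LabelEncoder
data = ['Delhi', 'Mumbai', 'Kolkata']
encoder = LabelEncoder()
# fit_transform learns the categories and returns one integer code per value
encoded = encoder.fit_transform(data)
# Codes follow the sorted (alphabetical) order of the labels, not input order
print(encoded)  # [0 2 1]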
Pros:
- Simple and quick.
- Does not increase dataset size.
Cons:
- Introduces an artificial ordinal relationship between values, which can mislead models that treat the numbers as magnitudes (for example, linear or distance-based models).
For instance, Mumbai > Delhi numerically, but this has no real-world meaning.
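To see where the integer codes come from, the sketch below inspects a fitted `LabelEncoder`. The input list (with a repeated city) is made up for illustration; `classes_` and `inverse_transform` are part of scikit-learn's `LabelEncoder`.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative input; 'Delhi' appears twice on purpose.
cities = ['Delhi', 'Mumbai', 'Kolkata', 'Delhi']

encoder = LabelEncoder()
codes = encoder.fit_transform(cities)

# classes_ holds the learned categories in sorted order;
# each category's integer code is its index in this array.
print(encoder.classes_.tolist())  # ['Delhi', 'Kolkata', 'Mumbai']
print(codes.tolist())             # [0, 2, 1, 0]

# inverse_transform maps the codes back to the original labels.
restored = encoder.inverse_transform(codes)
print(restored.tolist())          # ['Delhi', 'Mumbai', 'Kolkata', 'Delhi']
```

The codes reflect alphabetical position only, which is why comparisons such as Mumbai > Delhi carry no meaning.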
2. One-Hot Encoding
One-Hot Encoding creates a new binary column for each category.
Example:
City | Delhi | Mumbai | Kolkata |
---|---|---|---|
Delhi | 1 | 0 | 0 |
Mumbai | 0 | 1 | 0 |
Kolkata | 0 | 0 | 1 |
Python implementation using pandas:
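import pandas as pd
df = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Kolkata']})
# dtype=int yields 0/1 values; pandas 2.x returns True/False by default
encoded = pd.get_dummies(df, columns=['City'], dtype=int)
# Resulting columns: City_Delhi, City_Kolkata, City_Mumbai
print(encoded)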
Pros:
- No ordinal relationship introduced.
- Good for nominal categorical data.
Cons:
- Increases dataset size: it adds one column per category, so features with many unique values make the dataset much wider.
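The growth in dataset size is easy to check by comparing shapes before and after encoding. This is a minimal sketch; the extra cities and the `Age` column are hypothetical values added only to make the effect visible.

```python
import pandas as pd

# Hypothetical data: 6 rows, 5 distinct cities, plus one numeric column.
df = pd.DataFrame({
    'City': ['Delhi', 'Mumbai', 'Kolkata', 'Chennai', 'Pune', 'Delhi'],
    'Age': [25, 32, 41, 29, 37, 52],
})

encoded = pd.get_dummies(df, columns=['City'], dtype=int)

n_categories = df['City'].nunique()
print(n_categories)    # 5
print(df.shape)        # (6, 2)
print(encoded.shape)   # (6, 6): 'Age' plus one column per distinct city
```

The single `City` column becomes five columns here; a feature with thousands of distinct values would add thousands of columns in the same way.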
Which One Should You Use?
Situation | Use |
---|---|
Categories have meaningful order (Low to High) | Label (with integers assigned in that order) |
No natural order in categories | One-Hot |
Limited number of unique categories | One-Hot |
Large number of categories | Label, or advanced methods such as embeddings |
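For ordered categories, the integers only help if they follow the real order. `LabelEncoder` assigns codes alphabetically, which would put Graduate before High School. One possible approach, sketched below, is scikit-learn's `OrdinalEncoder` with an explicit `categories` list; scikit-learn documents `OrdinalEncoder` for input features and `LabelEncoder` for target labels. The sample rows are made up for illustration.

```python
from sklearn.preprocessing import OrdinalEncoder

# Explicit order from lowest to highest education level.
levels = ['High School', 'Graduate', 'Postgraduate']

# OrdinalEncoder expects 2D input: one inner list per row.
X = [['Graduate'], ['High School'], ['Postgraduate'], ['Graduate']]

# categories takes one list per feature column, in the desired order.
encoder = OrdinalEncoder(categories=[levels])
codes = encoder.fit_transform(X)

# Codes are floats by default and follow the order given in `levels`.
print(codes.ravel().tolist())  # [1.0, 0.0, 2.0, 1.0]
```

With this mapping, High School < Graduate < Postgraduate holds numerically, so the order the model sees matches the order in the data.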
Conclusion
Encoding categorical variables correctly is key to building accurate ML models. While Label Encoding is simple and space-efficient, One-Hot Encoding avoids false assumptions about category order. Understanding the nature of your data will help you choose the right method.