Handling Missing Data in ML Datasets
Missing data is one of the most common challenges in any machine learning or data analysis project. If not handled properly, missing values can lead to incorrect insights, biased models, and poor predictions.
In this blog post, you’ll learn why data can be missing, how to detect missing values, and common strategies to handle them effectively using Python.
Why Does Data Go Missing?
Data can be incomplete for several reasons:
- Human error during data entry
- Sensor failure or data collection issues
- Data corruption or loss in storage
- Intentional omission, e.g., respondents choosing not to answer survey questions
Understanding why data is missing helps you decide how best to handle it.
Types of Missing Data
- Missing Completely at Random (MCAR)
The missing values have no relationship with any variable in the dataset.
- Missing at Random (MAR)
The missingness depends on other observed variables.
- Missing Not at Random (MNAR)
The missingness is related to the value itself (e.g., high income not reported).
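To make the three mechanisms concrete, here is a minimal simulation sketch. The column names (`age`, `income`), the sample size, and the drop probabilities are all made up for illustration; real data will rarely tell you which mechanism applies this cleanly.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Hypothetical complete dataset (no missing values yet)
df_sim = pd.DataFrame({
    "age": rng.integers(20, 65, size=n),
    "income": rng.normal(50_000, 15_000, size=n),
})

# MCAR: every income value has the same 20% chance of going missing
mcar = df_sim["income"].mask(rng.random(n) < 0.2)

# MAR: income goes missing more often for older people (an observed column)
mar = df_sim["income"].mask((df_sim["age"] > 50) & (rng.random(n) < 0.5))

# MNAR: income goes missing more often when income itself is high
high = df_sim["income"] > df_sim["income"].quantile(0.8)
mnar = df_sim["income"].mask(high & (rng.random(n) < 0.7))

# Under MNAR the observed mean is biased downward
print(df_sim["income"].mean(), mnar.mean())
```

In the MNAR case the remaining values no longer represent the full distribution, which is why simple mean imputation can mislead there.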
Detecting Missing Values in Python
You can use pandas to count the missing values in each column.
import pandas as pd
df = pd.read_csv('data.csv')
# Check for missing values
print(df.isnull().sum())
Example output:
Age 3
Salary 2
Gender 0
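Raw counts are easier to judge next to percentages. The sketch below builds a small summary table and pulls out the affected rows; `df_demo` and its values are invented for illustration and are not the `data.csv` file used above.

```python
import numpy as np
import pandas as pd

# Hypothetical example data
df_demo = pd.DataFrame({
    "Age": [25, np.nan, 31, np.nan, 40],
    "Salary": [50000, 62000, np.nan, 58000, 71000],
    "Gender": ["F", "M", "M", "F", "F"],
})

# Count and percentage of missing values per column, worst column first
missing_summary = pd.DataFrame({
    "missing_count": df_demo.isna().sum(),
    "missing_pct": (df_demo.isna().mean() * 100).round(1),
}).sort_values("missing_pct", ascending=False)
print(missing_summary)

# Rows that contain at least one missing value
rows_with_missing = df_demo[df_demo.isna().any(axis=1)]
print(rows_with_missing)
```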
Common Strategies for Handling Missing Data
1. Removing Data
- Drop Rows with Missing Values
df_cleaned = df.dropna()
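- Drop Columns with Too Many Missing Values
# Plain df.dropna(axis=1) drops every column that has any missing value.
# thresh keeps columns with at least that many non-missing values (the 50% cutoff is only an example).
df_cleaned = df.dropna(axis=1, thresh=int(0.5 * len(df)))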
Use this only when you have sufficient data left after removal.
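One way to check whether enough data is left is to measure it before committing to the removal. This is a rough sketch: the `df_demo` frame and the 50% cutoff (`max_missing_fraction`) are assumptions chosen for illustration, not recommended values.

```python
import numpy as np
import pandas as pd

# Hypothetical example data; "Notes" is mostly empty
df_demo = pd.DataFrame({
    "Age": [25, np.nan, 31, 47, 40],
    "Salary": [50000, 62000, np.nan, 58000, 71000],
    "Notes": [np.nan, np.nan, np.nan, "ok", np.nan],
})

# Step 1: drop columns whose missing fraction exceeds the cutoff
max_missing_fraction = 0.5  # assumed cutoff
keep_cols = df_demo.columns[df_demo.isna().mean() <= max_missing_fraction]
df_cols = df_demo[keep_cols]

# Step 2: drop the remaining rows with missing values and see how much survives
df_rows = df_cols.dropna()
retained = len(df_rows) / len(df_demo)
print(f"Columns kept: {list(df_cols.columns)}")
print(f"Rows retained: {retained:.0%}")
```

Dropping the sparse column first usually preserves more rows than calling `dropna()` on the full frame.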
2. Imputation (Filling in Missing Values)
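- Fill with Mean or Median (Numerical Columns)
# Assign the result back: fillna(..., inplace=True) on a selected column may not update df in recent pandas versions.
# Use .median() instead of .mean() if the column is skewed or has outliers.
df['Age'] = df['Age'].fillna(df['Age'].mean())
- Fill with Mode (Categorical Columns)
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])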
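- Forward Fill
Fills each missing value with the previous non-missing value in the column (most useful for ordered data such as time series):
# fillna(method='ffill') is deprecated in recent pandas; use ffill() directly
df = df.ffill()
- Backward Fill
Fills each missing value with the next non-missing value:
df = df.bfill()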
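The simple imputation options above behave quite differently, which a toy example makes easier to see. The `salaries` and `readings` data below are invented; the `limit=1` argument is shown only as one possible safeguard against filling long gaps.

```python
import numpy as np
import pandas as pd

# Mean vs. median: one extreme salary pulls the mean far from typical values
salaries = pd.Series([40000, 42000, 45000, np.nan, 1_000_000])
mean_value = salaries.mean()      # skips NaN by default
median_value = salaries.median()  # robust to the outlier
print(mean_value, median_value)

# Forward and backward fill on ordered (time-indexed) data
readings = pd.DataFrame(
    {"temperature": [20.5, np.nan, np.nan, 22.0, np.nan]},
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)
forward = readings["temperature"].ffill()   # carry the last known value forward
backward = readings["temperature"].bfill()  # pull the next known value back

# limit caps how many consecutive gaps are filled
forward_limited = readings["temperature"].ffill(limit=1)
print(pd.DataFrame({"ffill": forward, "bfill": backward, "ffill_limit1": forward_limited}))
```

Note that backward fill leaves a trailing gap unfilled because there is no later value to copy, and forward fill does the same for a leading gap.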
3. Predictive Imputation
Advanced techniques involve predicting missing values using machine learning models, such as k-NN or regression:
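- Use scikit-learn’s KNNImputer:
from sklearn.impute import KNNImputer
# KNNImputer works on numeric data only, so restrict it to the numeric columns
num_cols = df.select_dtypes(include='number').columns
imputer = KNNImputer(n_neighbors=3)
df_imputed = df.copy()
df_imputed[num_cols] = imputer.fit_transform(df[num_cols])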
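Here is a small self-contained sketch of k-NN imputation on a frame that mixes numeric and categorical columns. The data in `df_demo` and the choice of `n_neighbors=2` are assumptions for illustration. Because k-NN is distance based, columns on very different scales (such as age and salary) may need scaling first; that step is left out here to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical example data with one categorical column
df_demo = pd.DataFrame({
    "Age": [25, 27, np.nan, 52, 55],
    "Salary": [40000, 42000, 41000, 90000, np.nan],
    "Gender": ["F", "M", "M", "F", "F"],
})

# Impute only the numeric columns; the categorical column is left untouched
numeric_cols = df_demo.select_dtypes(include="number").columns
imputer = KNNImputer(n_neighbors=2)

df_knn = df_demo.copy()
df_knn[numeric_cols] = imputer.fit_transform(df_demo[numeric_cols])
print(df_knn)
```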
4. Using Flags for Missingness
Sometimes, it helps to mark missing values as a separate category or add an indicator column:
df['Age_missing'] = df['Age'].isnull()
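This is useful when the fact that a value is missing carries information the model can learn from. Create the flag before imputing the column, otherwise the missingness is already gone.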
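A flag is often combined with imputation so the model sees both a usable value and the fact that it was filled in. The sketch below does this for every numeric column that has gaps; the `df_demo` data, the `_missing` suffix, and the median fill are illustrative choices.

```python
import numpy as np
import pandas as pd

# Hypothetical example data
df_demo = pd.DataFrame({
    "Age": [25, np.nan, 31, np.nan],
    "Salary": [50000, 62000, np.nan, 71000],
    "Gender": ["F", "M", "M", "F"],
})

numeric_cols = df_demo.select_dtypes(include="number").columns
for col in numeric_cols:
    if df_demo[col].isna().any():
        # Record missingness first, while the NaNs are still there
        df_demo[f"{col}_missing"] = df_demo[col].isna().astype(int)
        # Then fill the gaps (median chosen only as an example)
        df_demo[col] = df_demo[col].fillna(df_demo[col].median())

print(df_demo)
```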
Best Practices
✅ Always explore why data is missing before deciding on a strategy.
✅ Avoid dropping rows if it leads to losing too much data.
✅ Impute numerical and categorical columns differently.
✅ Be consistent in handling missing values during training and prediction.
✅ Document your choices for reproducibility.
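Consistency between training and prediction usually means learning the fill values from the training data only and reusing them later. A minimal sketch with scikit-learn's `SimpleImputer` follows; the `train` and `test` frames are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical training and prediction-time data
train = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0]})
test = pd.DataFrame({"Age": [np.nan, 90.0]})

imputer = SimpleImputer(strategy="mean")

# Learn the fill value from the training data only
train_filled = imputer.fit_transform(train)

# Reuse the same learned value at prediction time (no refitting on test data)
test_filled = imputer.transform(test)

print(imputer.statistics_)  # fill value learned from train
print(test_filled)
```

Refitting the imputer on the test data would fill the gap with a different value and leak information from data the model is not supposed to have seen.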
Conclusion
Handling missing data properly is essential for building robust machine learning models. Whether you remove, impute, or flag missing values, the key is to make thoughtful decisions based on the nature of your data and the problem you are solving.