Handling Missing Data in ML Datasets

Published on August 3, 2025 by @mritxperts

Missing data is one of the most common challenges in any machine learning or data analysis project. If not handled properly, missing values can lead to incorrect insights, biased models, and poor predictions.

In this blog post, you’ll learn why data can be missing, how to detect missing values, and common strategies to handle them effectively using Python.

Why Does Data Go Missing?

Data can be incomplete for several reasons:

Human error during data entry
Sensor failure or data collection issues
Data corruption or loss in storage
Intentional omission, e.g., respondents choosing not to answer survey questions

Understanding why data is missing helps you decide how best to handle it.

Types of Missing Data

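Missing Completely at Random (MCAR)
The probability that a value is missing is unrelated to any variable, observed or unobserved (e.g., a survey form lost in the mail).
Missing at Random (MAR)
The missingness depends on other observed variables, but not on the missing value itself (e.g., older respondents skip the income question more often, and age is recorded).
Missing Not at Random (MNAR)
The missingness is related to the value itself (e.g., high income not reported).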
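
As a rough illustration of the three mechanisms, the sketch below removes values from a synthetic table in three different ways. The column names (age, income), the probabilities, and the thresholds are all invented for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Synthetic data: every value here is made up for illustration.
toy = pd.DataFrame({
    "age": rng.integers(20, 65, size=n),
    "income": rng.normal(50_000, 15_000, size=n),
})

# MCAR: every row has the same 20% chance of losing its income value.
mcar_mask = rng.random(n) < 0.2

# MAR: the chance of a missing income depends on age, which is observed.
mar_mask = rng.random(n) < np.where(toy["age"] > 50, 0.4, 0.05)

# MNAR: the chance of a missing income depends on the income itself.
mnar_mask = rng.random(n) < np.where(toy["income"] > 60_000, 0.5, 0.05)

# Series.mask replaces the flagged entries with NaN.
income_mcar = toy["income"].mask(mcar_mask)
income_mar = toy["income"].mask(mar_mask)
income_mnar = toy["income"].mask(mnar_mask)

print("True mean:          ", toy["income"].mean())
print("Observed mean, MCAR:", income_mcar.mean())
print("Observed mean, MNAR:", income_mnar.mean())
```

In this toy setup the MNAR version understates the average income, because high earners are the rows most likely to be missing.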

Detecting Missing Values in Python

You can use pandas to quickly count the missing values in each column.

import pandas as pd

df = pd.read_csv('data.csv')

# Check for missing values
print(df.isnull().sum())

Example output:

Age         3
Salary      2
Gender      0
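
Raw counts are easier to judge next to the share of each column that is missing. Here is a minimal sketch on a small hand-made table; the values are invented, and only the column names follow the example output above.

```python
import numpy as np
import pandas as pd

# Hypothetical data used only to demonstrate the calls.
sample = pd.DataFrame({
    "Age": [25, np.nan, 31, np.nan, 40],
    "Salary": [50000, 62000, np.nan, 58000, 61000],
    "Gender": ["F", "M", "M", "F", "F"],
})

# Fraction of missing values per column (mean of a boolean mask).
missing_share = sample.isna().mean()
print(missing_share)

# Rows that contain at least one missing value, for manual inspection.
rows_with_gaps = sample[sample.isna().any(axis=1)]
print(rows_with_gaps)
```

Looking at the affected rows themselves often hints at why the values are missing.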

Common Strategies for Handling Missing Data

1. Removing Data

Drop Rows with Missing Values

df_cleaned = df.dropna()

Drop Columns with Too Many Missing Values

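# Keep only columns with at least 50% non-missing values (the 50% cutoff is illustrative)
df_cleaned = df.dropna(axis=1, thresh=int(0.5 * len(df)))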

Use removal only when enough data remains afterwards. Dropping rows can also bias the results unless the values are missing completely at random.
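
One way to check how much data survives removal is to compare row counts before and after. The sketch below uses an invented table with a mostly empty "Notes" column; the column names and the 50% cutoff are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "Notes" is almost entirely empty.
sample = pd.DataFrame({
    "Age": [25, np.nan, 31, np.nan, 40, 29],
    "Salary": [50000, 62000, np.nan, 58000, 61000, 47000],
    "Notes": [np.nan, np.nan, np.nan, "ok", np.nan, np.nan],
})

# Dropping every incomplete row right away leaves nothing here.
naive = sample.dropna()

# Step 1: drop columns with fewer than 50% non-missing values.
min_non_missing = int(np.ceil(0.5 * len(sample)))
fewer_columns = sample.dropna(axis=1, thresh=min_non_missing)

# Step 2: drop the rows that still contain a gap.
complete_rows = fewer_columns.dropna()

# Share of the original rows that survived.
retained = len(complete_rows) / len(sample)
print(f"Naive dropna keeps {len(naive)} rows")
print(f"Column-first approach keeps {len(complete_rows)} rows ({retained:.0%})")
```

Removing the sparse column first keeps half of the rows in this example, whereas dropping rows first would discard all of them.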

2. Imputation (Filling in Missing Values)

Fill with Mean or Median (Numerical Columns)

# Assign the result back; use .median() instead of .mean() if the column has outliers
df['Age'] = df['Age'].fillna(df['Age'].mean())
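
The choice between mean and median matters most when a column contains outliers. A small sketch with made-up salary figures shows the difference:

```python
import numpy as np
import pandas as pd

# Hypothetical salaries with one extreme value and one gap.
salaries = pd.Series([40_000, 42_000, 45_000, 1_000_000, np.nan])

# The mean is pulled upwards by the single outlier.
filled_mean = salaries.fillna(salaries.mean())

# The median stays close to the typical values.
filled_median = salaries.fillna(salaries.median())

print("Filled with mean:  ", filled_mean.iloc[-1])
print("Filled with median:", filled_median.iloc[-1])
```

Here the mean-based fill is 281,750 while the median-based fill is 43,500, so the median is usually the safer default for skewed columns.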

Fill with Mode (Categorical Columns)

# mode() returns a Series, so take its first entry
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])

Forward Fill

Fills each missing value with the previous value in the column, which suits ordered data such as time series:

# fillna(method='ffill') is deprecated in recent pandas versions
df = df.ffill()

Backward Fill

Fills each missing value with the next value in the column:

# fillna(method='bfill') is deprecated in recent pandas versions
df = df.bfill()
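
Forward and backward fill only make sense when the row order carries meaning. The sketch below applies both to a hypothetical daily temperature series (the dates and readings are invented) and uses the limit parameter to cap how far a value is carried.

```python
import numpy as np
import pandas as pd

# Hypothetical sensor readings, one per day.
readings = pd.DataFrame(
    {"temperature": [np.nan, 21.5, np.nan, np.nan, 23.0, np.nan]},
    index=pd.date_range("2025-01-01", periods=6, freq="D"),
)

# Make sure the rows are in time order before filling.
readings = readings.sort_index()

# Carry the last known value forward; a leading gap stays empty.
forward = readings["temperature"].ffill()

# Pull the next known value backward; a trailing gap stays empty.
backward = readings["temperature"].bfill()

# Carry a value forward by at most one row.
capped = readings["temperature"].ffill(limit=1)

print(pd.DataFrame({"forward": forward, "backward": backward, "capped": capped}))
```

Neither method can fill a gap at the edge it starts from, so a leading or trailing missing value may still need another strategy.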

3. Predictive Imputation

Advanced techniques predict missing values from the other columns using machine learning models such as k-NN or regression. For example, scikit-learn’s KNNImputer fills each gap using the most similar rows:

from sklearn.impute import KNNImputer

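# KNNImputer only accepts numeric input, so restrict it to the numeric columns
num_cols = df.select_dtypes(include='number').columns
imputer = KNNImputer(n_neighbors=3)
df_imputed = df.copy()
df_imputed[num_cols] = imputer.fit_transform(df[num_cols])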
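
To see what the imputer does, here is a small worked sketch on invented data. The table, the column names, and n_neighbors=2 are assumptions chosen so the result is easy to verify by hand.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical data: row 2 has no Age but a Salary close to rows 0 and 1.
people = pd.DataFrame({
    "Age": [25.0, 27.0, np.nan, 52.0, 55.0],
    "Salary": [40_000.0, 42_000.0, 41_000.0, 90_000.0, 95_000.0],
    "Gender": ["F", "M", "F", "M", "F"],
})

# Only numeric columns can be passed to KNNImputer.
numeric_cols = people.select_dtypes(include="number").columns

# The two nearest rows (by Salary) supply the missing Age; their mean is used.
imputer = KNNImputer(n_neighbors=2)
people[numeric_cols] = imputer.fit_transform(people[numeric_cols])

print(people)
```

The missing Age becomes 26.0, the average of the two most similar rows. Because k-NN relies on distances, features on very different scales are often standardized first so that no single column dominates.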

4. Using Flags for Missingness

Sometimes, it helps to mark missing values as a separate category or add an indicator column:

df['Age_missing'] = df['Age'].isnull()

This is useful when the fact that a value is missing is itself predictive, as is often the case with MNAR data.
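
Both ideas can be combined with imputation, as long as the flag is created before the gaps are filled. A minimal sketch with invented values; the "Unknown" label is an arbitrary choice:

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in a numeric and a categorical column.
sample = pd.DataFrame({
    "Age": [25, np.nan, 31, np.nan, 40],
    "Gender": ["F", None, "M", "F", None],
})

# Record where Age was missing BEFORE filling it, or the information is lost.
sample["Age_missing"] = sample["Age"].isna().astype(int)
sample["Age"] = sample["Age"].fillna(sample["Age"].median())

# Treat missing categories as their own explicit category.
sample["Gender"] = sample["Gender"].fillna("Unknown")

print(sample)
```

The model then sees a complete Age column plus a 0/1 flag that tells it which values were imputed.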

Best Practices

✅ Always explore why data is missing before deciding on a strategy.
✅ Avoid dropping rows if it leads to losing too much data.
✅ Impute numerical and categorical columns differently.
✅ Be consistent during training and prediction: compute fill values on the training data and reuse them on new data.
✅ Document your choices for reproducibility.
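
One way to keep training and prediction consistent is to let a fitted imputer object store the fill values. The sketch below uses scikit-learn's SimpleImputer on invented data; the tables and the median strategy are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical training data and new data arriving at prediction time.
train = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0]})
new_data = pd.DataFrame({"Age": [np.nan, 70.0]})

# Learn the median from the training data only.
imputer = SimpleImputer(strategy="median")
imputer.fit(train)

# Apply the same learned value to both datasets.
train_filled = imputer.transform(train)
new_filled = imputer.transform(new_data)

print("Learned fill value:", imputer.statistics_[0])
print(new_filled)
```

The gap in the new data is filled with 30.0, the training median, rather than a value recomputed from the new rows. This avoids leaking information from prediction-time data into preprocessing.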

Conclusion

Handling missing data properly is essential for building robust machine learning models. Whether you remove, impute, or flag missing values, the key is to make thoughtful decisions based on the nature of your data and the problem you are solving.