How to Work with CSV and JSON Datasets in Python
Introduction
In machine learning and data analysis projects, CSV and JSON are among the most commonly used file formats for datasets. Python makes it easy to load, parse, and manipulate both, using built-in modules such as json and external packages such as pandas. In this post, you’ll learn how to work with CSV and JSON data efficiently.
What is a CSV File?
CSV (Comma-Separated Values) is a simple text file format used to store tabular data like spreadsheets. Each line represents a row, and commas separate the columns.
Example:
Name,Age,Gender
Alice,30,Female
Bob,25,Male
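To make the row and column structure concrete, here is a minimal sketch that parses the sample above with Python's built-in csv module. The data is held in an in-memory string (an assumption made so the example runs without a file on disk).

```python
import csv
import io

# Same sample data as above, kept in memory so the example is self-contained.
csv_text = "Name,Age,Gender\nAlice,30,Female\nBob,25,Male\n"

# DictReader treats the first line as column names and yields one dict per row.
rows = list(csv.DictReader(io.StringIO(csv_text)))

print(rows[0]["Name"])  # Alice
print(rows[1]["Age"])   # 25 (still a string: CSV carries no type information)
```

Note that every value comes back as a string, so numeric columns need an explicit conversion unless a library such as pandas infers the types for you.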
What is a JSON File?
JSON (JavaScript Object Notation) is a format used to store and transport structured data using key-value pairs. It’s commonly used in APIs and web data.
Example:
{
"Name": "Alice",
"Age": 30,
"Gender": "Female"
}
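As a quick illustration, the object above can be parsed from a string with json.loads. This is a small sketch using an inline string rather than a file.

```python
import json

# The example object from above, written as a JSON string.
json_text = '{"Name": "Alice", "Age": 30, "Gender": "Female"}'

# json.loads turns a JSON object into a Python dict.
record = json.loads(json_text)

print(record["Name"])       # Alice
print(type(record["Age"]))  # <class 'int'>
```

Unlike CSV, JSON keeps basic types: the number 30 arrives as an int, not as text.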
Reading CSV Files Using Pandas
import pandas as pd
# Load CSV file
df = pd.read_csv('data.csv')
# Display first 5 rows
print(df.head())
You can also specify:
delimiter: the separator character, if it is not a comma.
usecols: the subset of columns to load.
na_values: additional strings to treat as missing values.
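The three options above can be combined in one call. The following sketch assumes a hypothetical semicolon-separated dataset in which "?" marks a missing value. The data is read from an in-memory string so the example runs on its own.

```python
import io

import pandas as pd

# Hypothetical data: semicolons as separators, "?" for a missing age.
raw = "Name;Age;Gender\nAlice;30;Female\nBob;?;Male\n"

df = pd.read_csv(
    io.StringIO(raw),
    delimiter=";",            # the separator is not a comma
    usecols=["Name", "Age"],  # load only these two columns
    na_values=["?"],          # treat "?" as a missing value
)

print(df)
print(df["Age"].isna().sum())  # 1
```

Because one age is missing, pandas stores the Age column as floating point, with NaN in Bob's row.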
Writing to a CSV File
df.to_csv('output.csv', index=False)
Setting index=False ensures the index column is not saved in the file.
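To see what index=False changes, this sketch builds a small illustrative DataFrame and compares the two outputs. When no path is given, to_csv returns the CSV text instead of writing a file, which is convenient for inspection.

```python
import pandas as pd

# Small illustrative DataFrame (not from a real dataset).
df = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [30, 25]})

# With no path argument, to_csv returns the CSV text as a string.
with_index = df.to_csv()
without_index = df.to_csv(index=False)

print(with_index.splitlines()[0])     # ,Name,Age
print(without_index.splitlines()[0])  # Name,Age
```

With the default settings, the header starts with an empty column name and each row starts with its index value. With index=False, only the data columns are written.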
Reading JSON Files
Python provides a built-in json module.
import json
# Load JSON file
with open('data.json', 'r') as file:
    data = json.load(file)
print(data)
For a JSON dataset structured as a list of dictionaries, you can load it into a DataFrame:
df = pd.DataFrame(data)
print(df.head())
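For reference, "a list of dictionaries" means JSON shaped like the string in this sketch: an array whose elements are objects sharing the same keys. The sample records are illustrative and parsed from a string so the example needs no file.

```python
import json

import pandas as pd

# A JSON array of objects: each object becomes one row.
json_text = '[{"Name": "Alice", "Age": 30}, {"Name": "Bob", "Age": 25}]'

data = json.loads(json_text)  # a Python list of dicts
df = pd.DataFrame(data)       # keys become column names

print(df.shape)  # (2, 2)
```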
Writing JSON Files
with open('output.json', 'w') as file:
    json.dump(data, file, indent=4)
The indent=4 argument makes the output more readable.
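The effect of indent is easiest to see side by side. This sketch uses json.dumps, which returns a string instead of writing to a file, on a small illustrative dict.

```python
import json

data = {"Name": "Alice", "Age": 30}

compact = json.dumps(data)           # everything on one line
pretty = json.dumps(data, indent=4)  # one key per line, indented by 4 spaces

print(compact)  # {"Name": "Alice", "Age": 30}
print(pretty)
```

Both strings parse back to the same data. The indentation only affects how the file looks and how large it is.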
Using Pandas to Read JSON
df = pd.read_json('data.json')
print(df.head())
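You can also convert a DataFrame to JSON. With orient='records' and lines=True, each row is written as a separate JSON object on its own line (the JSON Lines format), so the file must be read back with pd.read_json('output.json', lines=True):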
df.to_json('output.json', orient='records', lines=True)
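A short round trip shows what this call produces. The sketch below omits the file path so that to_json returns the text directly, then reads it back from an in-memory buffer. The DataFrame contents are illustrative.

```python
import io

import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [30, 25]})

# orient="records" with lines=True writes one JSON object per row, one per line.
json_lines = df.to_json(orient="records", lines=True)
print(json_lines.splitlines()[0])  # {"Name":"Alice","Age":30}

# The same lines=True flag is needed to read this format back.
restored = pd.read_json(io.StringIO(json_lines), lines=True)
print(restored)
```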
Key Differences Between CSV and JSON
Feature | CSV | JSON |
---|---|---|
Structure | Tabular | Hierarchical |
Readability | Human-readable | Human-readable |
Best for | Tables/Spreadsheets | Nested or complex data |
Size | Usually smaller | Usually larger, since keys repeat in every record |
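The tabular versus hierarchical difference matters when nested JSON has to go into a DataFrame. One possible approach is pandas' json_normalize, which flattens nested objects into dotted column names. The records below, including the Address field, are hypothetical.

```python
import pandas as pd

# Hypothetical nested records: each person has an Address object.
records = [
    {"Name": "Alice", "Age": 30, "Address": {"City": "Paris", "Zip": "75001"}},
    {"Name": "Bob", "Age": 25, "Address": {"City": "Berlin", "Zip": "10115"}},
]

# Nested keys are flattened into columns such as "Address.City".
flat = pd.json_normalize(records)

print(sorted(flat.columns))
print(flat.loc[0, "Address.City"])  # Paris
```

Once flattened, the data can be saved as CSV like any other table.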
Conclusion
Understanding how to handle CSV and JSON datasets is essential for any data-driven project. Python’s pandas library and built-in json module make it simple to read, write, and manipulate data stored in these formats. Mastering this skill will help you handle real-world data more effectively in your machine learning journey.