Pandas (Part 4): Data cleaning and preparation
Following our lesson on data selection and indexing, this article covers data cleaning and preparation in Pandas. Real-world data is often messy: values are missing, rows are duplicated, and text is formatted inconsistently. Cleaning it is a crucial step in any data analysis workflow.
📚 Prerequisites
- Understanding of Pandas DataFrames.
🎯 Article Outline: What You'll Master
- ✅ Handling Missing Data: How to find, drop, and fill missing values.
- ✅ Removing Duplicates: How to identify and remove duplicate rows.
- ✅ Data Transformation: Applying functions to transform data.
- ✅ String Methods: Working with text data.
🧠 Section 1: The Core Concepts of Data Cleaning
Data cleaning, a core part of data wrangling, is the process of transforming raw data into a clean and usable format. This includes handling missing values, correcting errors, removing duplicates, and making the data consistent.
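Before handling missing values you usually need to find them. The following is a minimal sketch of that first step, assuming a small made-up DataFrame (the `sample` data and its column names are hypothetical and not part of this article's walkthrough). It uses `isna()` to count missing values per column and to pick out the affected rows.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, used only for illustration
sample = pd.DataFrame({
    "price": [9.99, np.nan, 4.50, np.nan],
    "qty": [1, 2, np.nan, 4],
    "sku": ["a1", "b2", "c3", "d4"],
})

# isna() returns a boolean DataFrame; summing it counts the True values per column
missing_per_column = sample.isna().sum()

# Select the rows that contain at least one missing value
rows_with_missing = sample[sample.isna().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```

Counting first makes it easier to decide whether dropping or filling is the more reasonable choice for each column.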
💻 Section 2: Deep Dive - Implementation and Walkthrough
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 2, np.nan, 4],
                   'B': [5, np.nan, np.nan, 8],
                   'C': [10, 20, 30, 40],
                   'D': [10, 20, 20, 40]},
                  index=['foo', 'bar', 'baz', 'qux'])
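# Handling missing data
# Drop rows that contain at least one missing value (returns a new DataFrame; df is unchanged)
df.dropna(how='any')
# Replace every missing value with a constant, here 5
df.fillna(value=5)
# Removing duplicates
# Flag rows whose 'D' value already appeared in an earlier row
df.duplicated('D')
# Drop rows with a repeated 'D' value, keeping the last occurrence
df.drop_duplicates('D', keep='last')
# Data transformation
# Double each value in column 'C'
df['C'].apply(lambda x: x * 2)
# String methods
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
# Lowercase every string; missing values stay NaN
s.str.lower()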
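The `.str` methods skip missing values instead of raising an error, and they can be chained. The sketch below illustrates this with a shortened Series (the padded `' Aaba '` entry and the variable names are assumptions added for illustration). The `na=False` argument to `str.startswith` tells Pandas to treat missing values as non-matches.

```python
import numpy as np
import pandas as pd

# Shortened, illustrative Series with one padded string and one missing value
s = pd.Series(['A', ' Aaba ', np.nan, 'dog'])

# Chain string methods: strip surrounding whitespace, then lowercase
cleaned = s.str.strip().str.lower()

# Boolean mask of entries starting with 'a'; missing values count as False
starts_with_a = cleaned.str.startswith('a', na=False)

print(cleaned)
print(starts_with_a)
```

Because the result of `str.startswith` is a boolean Series, it can be used directly to filter, for example `cleaned[starts_with_a]`.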
💡 Conclusion & Key Takeaways
You've learned some of the most common data cleaning techniques in Pandas. These skills are essential for preparing your data for analysis and modeling.
Let's summarize the key takeaways:
- Pandas provides powerful tools for handling missing data (dropna, fillna).
- You can easily find and remove duplicate data (duplicated, drop_duplicates).
- The .apply() method is useful for transforming data.
- Pandas has a rich set of string methods for working with text data.
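The takeaways above can be combined into one small cleaning pass. This is a sketch rather than a recommended recipe: filling column A with its mean and column B with 0 are illustrative choices, and the `C_doubled` column name is made up for this example. Each method returns a new object, so the results are assigned to variables and the original DataFrame stays untouched.

```python
import numpy as np
import pandas as pd

# Same DataFrame as in the walkthrough
df = pd.DataFrame({'A': [1, 2, np.nan, 4],
                   'B': [5, np.nan, np.nan, 8],
                   'C': [10, 20, 30, 40],
                   'D': [10, 20, 20, 40]},
                  index=['foo', 'bar', 'baz', 'qux'])

# Fill each column with its own value (illustrative choices: mean for A, 0 for B)
filled = df.fillna({'A': df['A'].mean(), 'B': 0})

# Drop rows that repeat an earlier 'D' value, keeping the first occurrence
deduped = filled.drop_duplicates(subset='D', keep='first')

# Vectorized arithmetic gives the same values as .apply(lambda x: x * 2)
result = deduped.assign(C_doubled=deduped['C'] * 2)

print(result)
```

For simple arithmetic like this, the vectorized form is usually faster than `.apply()` with a lambda, which is better kept for logic that has no built-in equivalent.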
➡️ Next Steps
In the next article, "Pandas (Part 5): Grouping and aggregation", we'll learn how to group data and perform aggregate calculations.
Glossary
- Data Cleaning: The process of detecting and correcting (or removing) corrupt or inaccurate records in a record set, table, or database.
- Missing Data: Data that is not available or not recorded.