
Pandas (Part 3): Data selection and indexing

This article follows our lesson on reading and writing data and covers data selection and indexing in Pandas. Selecting and filtering data is one of the most common tasks in data analysis.


📚 Prerequisites

  • Understanding of Pandas DataFrames.

🎯 Article Outline: What You'll Master

  • Selecting Columns: How to select a single column or multiple columns.
  • Selecting Rows: Using .loc for label-based indexing and .iloc for position-based indexing.
  • Boolean Indexing: Filtering data based on conditions.
  • Setting Data: How to set new values in a DataFrame.

🧠 Section 1: The Core Concepts of Data Selection

Pandas provides a variety of ways to select data. The most common are:

  • []: Selects columns by name. Given a slice or a boolean mask, it selects rows instead.
  • .loc[]: Selects rows and columns by label.
  • .iloc[]: Selects rows and columns by integer position.
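The difference between a label and a position is easiest to see when the index labels are not simply 0, 1, 2. The following is a minimal sketch using a small hypothetical DataFrame; the name `scores` and its values are illustrative and not part of the article's own example.

```python
import pandas as pd

# Hypothetical data: integer labels that deliberately differ from positions
scores = pd.DataFrame(
    {"name": ["Ann", "Bob", "Cy"], "score": [90, 75, 82]},
    index=[10, 20, 30],
)

by_column = scores["score"]         # [] with a column name returns a Series
by_label = scores.loc[10, "name"]   # row labelled 10, column "name"
by_position = scores.iloc[0, 0]     # first row, first column

# Label slices include both endpoints; position slices exclude the stop
label_slice = scores.loc[10:20]     # rows labelled 10 and 20
position_slice = scores.iloc[0:1]   # only the first row

print(by_label, by_position)
print(label_slice)
print(position_slice)
```

Here `scores.loc[0]` would raise a KeyError because no row carries the label 0, while `scores.iloc[0]` returns the first row regardless of how it is labelled.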

💻 Section 2: Deep Dive - Implementation and Walkthrough

import pandas as pd
import numpy as np

# Create a sample DataFrame
dates = pd.date_range('20230101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

# Selecting a single column
print(df['A'])

# Selecting multiple columns
print(df[['A', 'B']])

# Selecting rows by label (a .loc slice includes both endpoints)
print(df.loc['20230102':'20230104'])

# Selecting a single row by position (the fourth row, returned as a Series)
print(df.iloc[3])

# Boolean indexing
print(df[df['A'] > 0])

# Setting a new column
df['F'] = ['foo', 'bar', 'baz', 'qux', 'quux', 'corge']
print(df)
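
Setting data is not limited to adding a column: the same .loc and .iloc accessors can be used on the left-hand side of an assignment. The sketch below assumes a small frame with fixed, illustrative values (instead of random numbers) so the effect of each assignment is easy to check.

```python
import pandas as pd

# Small deterministic frame with illustrative values
dates = pd.date_range("20230101", periods=3)
df = pd.DataFrame(
    {"A": [1.0, -2.0, 3.0], "B": [0.5, 0.5, 0.5]},
    index=dates,
)

# Set a single cell by label
df.loc[dates[0], "B"] = 9.9

# Set a single cell by position (second row, second column)
df.iloc[1, 1] = 7.7

# Set every value in column A that matches a condition
df.loc[df["A"] < 0, "A"] = 0.0

print(df)
```

Doing the conditional update in a single .loc call, rather than chaining two selections such as df[df['A'] < 0]['A'] = 0.0, avoids assigning to a temporary copy that is then discarded.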

💡 Conclusion & Key Takeaways

You've learned the fundamental techniques for selecting and filtering data in a Pandas DataFrame.

Let's summarize the key takeaways:

  • Use [] for selecting columns.
  • Use .loc for label-based selection and .iloc for position-based selection.
  • Boolean indexing filters rows by passing a condition, such as df['A'] > 0, inside [].
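
Boolean conditions can also be combined. The sketch below uses a hypothetical frame with fixed values; note that Pandas expects the element-wise operators &, | and ~ here rather than Python's and, or and not, and that each condition needs its own parentheses.

```python
import pandas as pd

# Illustrative values chosen so each filter keeps a different set of rows
df = pd.DataFrame({"A": [1.5, -0.3, 0.8, -2.0], "B": [10, 20, 30, 40]})

both = df[(df["A"] > 0) & (df["B"] > 15)]     # AND: both conditions hold
either = df[(df["A"] > 1) | (df["B"] == 40)]  # OR: at least one holds
negated = df[~(df["A"] > 0)]                  # NOT: condition is false
members = df[df["B"].isin([10, 30])]          # value is in a given list

print(both)
print(either)
```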

➡️ Next Steps

In the next article, "Pandas (Part 4): Data cleaning and preparation", we'll learn how to handle missing data and perform other data cleaning tasks.


Glossary

  • Indexing: The process of selecting data from a DataFrame.
  • Label: The name of a row or column.
  • Position: The integer location of a row or column.

Further Reading