
Fancy Indexing & Boolean Masking: Extract

Fancy indexing and boolean masking are NumPy's high-level selection mechanisms for extracting subsets of array data without writing explicit loops. Boolean masks filter arrays by logical conditions (arr > 5), while fancy indexing uses integer arrays to select elements in any order, including non-contiguous and repeated positions. These techniques enable concise, vectorized data cleaning and transformation pipelines widely used in data analysis, ML preprocessing, and numerical algorithms.

Boolean Masking: Filter Data with Logical Conditions

A boolean mask is a boolean-typed array the same shape as the target array, with True marking elements to keep. Creating and applying masks is the foundation of vectorized data filtering.

import numpy as np

# Create sample data
scores = np.array([45, 82, 91, 67, 88, 73, 95, 58])

# Create boolean mask: elements greater than 80
mask = scores > 80
print(mask) # [False True True False True False True False]
print(type(mask), mask.dtype) # <class 'numpy.ndarray'> bool

# Apply mask to extract matching elements
passing = scores[mask]
print(passing) # [82 91 88 95]

# Chaining conditions with & (AND), | (OR), ~ (NOT)
high_and_even = (scores > 80) & (scores % 2 == 0)
print(scores[high_and_even]) # [82 88]

# NOT operator for exclusion
not_high = ~(scores > 80)
print(scores[not_high]) # [45 67 73 58]

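Reading with a boolean mask returns a copy, not a view: if you modify the result, the original array is unaffected. Assigning through a mask, as in scores[mask] = 0, is different and does modify the original in place. Boolean indexing is still efficient, because both the comparison and the selection run in NumPy's compiled C loops, which is typically far faster than filtering with a Python loop.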
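The copy behavior can be checked directly. The following is a minimal sketch that reuses the scores array from above; the names passing and capped are illustrative only.

```python
import numpy as np

scores = np.array([45, 82, 91, 67, 88, 73, 95, 58])

# Reading through a mask produces an independent copy
passing = scores[scores > 80]
passing[0] = 0
print(scores[1])  # 82 (original unchanged)

# Assigning through a mask writes into the target array in place
capped = scores.copy()
capped[capped > 90] = 90
print(capped)  # [45 82 90 67 88 73 90 58]
```

The explicit scores.copy() keeps the original data intact while still showing that masked assignment mutates whatever array it is applied to.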

Multi-Dimensional Masking

Boolean masks work on multi-dimensional arrays. A mask with the same shape as the target array returns a 1D array of the matching elements; a 1D mask whose length equals the first axis selects whole rows.

import numpy as np

# 2D temperature data (days x cities)
temps = np.array([[18, 22, 25],
                  [16, 20, 28],
                  [19, 21, 26]])

# Find all temps above 23 degrees (flattens result)
hot = temps > 23
print(temps[hot]) # [25 28 26]

# A 1D mask over the first axis selects whole rows: rows with mean > 21.5
row_means = temps.mean(axis=1) # approximately [21.67 21.33 22.]
warm_rows = row_means > 21.5
print(temps[warm_rows])
# Rows where mean > 21.5:
# [[18 22 25]
#  [19 21 26]]
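Because indexing with a full-shape mask flattens the result, it can help to see one way of keeping the 2D layout. The sketch below uses the three-argument form of np.where and a sum over the mask; the fill value 0 is an arbitrary choice for illustration.

```python
import numpy as np

temps = np.array([[18, 22, 25],
                  [16, 20, 28],
                  [19, 21, 26]])

hot = temps > 23

# temps[hot] flattens; np.where keeps the 2D layout and fills the rest
kept = np.where(hot, temps, 0)
print(kept)
# [[ 0  0 25]
#  [ 0  0 28]
#  [ 0  0 26]]

# Count hot readings per city (column) by summing the mask
print(hot.sum(axis=0))  # [0 0 3]
```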

Fancy Indexing: Select with Integer Arrays

Fancy indexing uses integer arrays (or lists) to specify which indices to extract. Unlike basic slicing, which selects regularly spaced elements and returns a view, fancy indexing accepts arbitrary index sequences and returns a copy.

import numpy as np

arr = np.array([10, 20, 30, 40, 50])

# Select by explicit indices
indices = np.array([0, 2, 4])
result = arr[indices]
print(result) # [10 30 50]

# Repeated indices are allowed (the same element can be selected more than once)
repeated_indices = np.array([1, 1, 3, 0])
result = arr[repeated_indices]
print(result) # [20 20 40 10]

# Negative indices work (count from end)
result = arr[np.array([-1, -2])]
print(result) # [50 40]
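The view-versus-copy distinction between slicing and fancy indexing can be verified with np.shares_memory. This is a small sketch; the names sliced and picked are illustrative.

```python
import numpy as np

arr = np.array([10, 20, 30, 40, 50])

sliced = arr[1:4]        # basic slice: a view onto arr
picked = arr[[1, 2, 3]]  # fancy index: a new array with the same values

print(np.shares_memory(arr, sliced))  # True
print(np.shares_memory(arr, picked))  # False

sliced[0] = 99  # visible through arr
picked[1] = -1  # not visible through arr
print(arr)      # [10 99 30 40 50]
```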

Multi-Dimensional Fancy Indexing

With 2D arrays, a single index array selects whole rows. When index arrays are given for both axes, NumPy broadcasts them against each other, so their shapes must be compatible:

import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

# Select specific rows
row_indices = np.array([0, 2])
print(matrix[row_indices])
# [[1 2 3]
# [7 8 9]]

# Select specific columns with explicit index arrays for both axes
col_indices = np.array([0, 2])
row_idx, col_idx = np.meshgrid(np.arange(3), col_indices, indexing='ij')
result = matrix[row_idx, col_idx]
print(result)
# [[1 3]
#  [4 6]
#  [7 9]]

# Simpler approach: combine a slice with the index array
result = matrix[:, col_indices] # All rows, specific columns
print(result)
# [[1 3]
#  [4 6]
#  [7 9]]
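A common surprise is that passing two index arrays pairs them element by element rather than forming a grid. The sketch below contrasts that with np.ix_, which builds broadcastable indices for the full row/column cross product; the names rows and cols are illustrative.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

rows = np.array([0, 2])
cols = np.array([0, 2])

# Two index arrays are paired element by element: (0, 0) and (2, 2)
print(matrix[rows, cols])  # [1 9]

# np.ix_ selects every combination of the given rows and columns
print(matrix[np.ix_(rows, cols)])
# [[1 3]
#  [7 9]]
```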

Combining Fancy Indexing and Boolean Masking

Real-world data filtering often combines both techniques: first filter by a condition, then extract specific columns or reorder results.

import numpy as np

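# Sample: student data (name, score, attendance)
# Mixing strings and numbers makes NumPy store every field as a string
students = np.array([
    ['Alice', 85, 95],
    ['Bob', 92, 80],
    ['Charlie', 78, 90],
    ['Diana', 88, 92]
])

# Find rows where score > 80 (convert the score column back to int first)
scores = students[:, 1].astype(int)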
high_scorers = students[scores > 80]
print(high_scorers)
# [['Alice' '85' '95']
# ['Bob' '92' '80']
# ['Diana' '88' '92']]

# Numeric data: extract and process
numeric_data = np.array([[1, 2, 3],
                         [4, 5, 6],
                         [7, 8, 9],
                         [10, 11, 12]])

# Find rows where column 0 > 3, then get columns 1 and 2
mask = numeric_data[:, 0] > 3
filtered = numeric_data[mask] # rows 1-3
selected_cols = filtered[:, [1, 2]] # columns 1 and 2
print(selected_cols)
# [[ 5 6]
# [ 8 9]
# [11 12]]
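The "filter, then reorder" half of this pattern can be sketched with np.argsort. Keeping names and scores in separate arrays here is an assumption made for the example, so that the scores stay numeric instead of being coerced to strings.

```python
import numpy as np

names = np.array(['Alice', 'Bob', 'Charlie', 'Diana'])
scores = np.array([85, 92, 78, 88])

# Step 1: boolean mask keeps students scoring above 80
mask = scores > 80
kept_names = names[mask]    # ['Alice' 'Bob' 'Diana']
kept_scores = scores[mask]  # [85 92 88]

# Step 2: fancy indexing with argsort reorders by score, highest first
order = np.argsort(kept_scores)[::-1]
print(kept_names[order])   # ['Bob' 'Diana' 'Alice']
print(kept_scores[order])  # [92 88 85]
```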

Searchsorted and Binning

For sorted data, np.searchsorted() efficiently finds insertion positions, enabling fast lookups and binning without explicit loops.

import numpy as np

# Sorted array of values
sorted_values = np.array([10, 20, 30, 40, 50])

# Find where new values would be inserted to maintain order
new_values = np.array([15, 35, 5, 50])
positions = np.searchsorted(sorted_values, new_values)
print(positions) # [1 3 0 4]

# Binning data into categories
data = np.array([12, 25, 38, 8, 45, 31, 19])
bins = np.array([10, 20, 30, 40, 50])
bin_indices = np.searchsorted(bins, data)
print(bin_indices) # [1 2 3 0 4 3 1]

# Categorize into labels
# Note: a value above 50 would map to index 5, which is outside labels
labels = np.array(['low', 'medium', 'high', 'very_high', 'extreme'])
data_categories = labels[bin_indices]
print(data_categories)
# ['medium' 'high' 'very_high' 'low' 'extreme' 'very_high' 'medium']
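Two edge cases are worth checking when binning this way: values that equal a bin edge, and values beyond the last edge. The sketch below shows the side parameter of np.searchsorted and one possible way to guard the label lookup; clipping to the last label is an assumption for illustration, not the only reasonable policy.

```python
import numpy as np

bins = np.array([10, 20, 30, 40, 50])
data = np.array([20, 55])

# side='left' (default): a value equal to an edge lands before that edge
print(np.searchsorted(bins, data))                # [1 5]
# side='right': a value equal to an edge lands after that edge
print(np.searchsorted(bins, data, side='right'))  # [2 5]

labels = np.array(['low', 'medium', 'high', 'very_high', 'extreme'])
# Index 5 is out of range for labels, so clip before the lookup
safe = np.clip(np.searchsorted(bins, data), 0, len(labels) - 1)
print(labels[safe])  # ['medium' 'extreme']
```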

Performance: Vectorized Filtering vs Loops

Boolean masking and fancy indexing run in compiled C loops; an equivalent Python loop is often 50–100x slower on large arrays, though the exact ratio depends on hardware, array size, and NumPy version:

import numpy as np
import timeit

data = np.random.randn(1000000)

# Vectorized (boolean mask)
def filter_vectorized(arr):
    return arr[arr > 0]

# Pure Python loop
def filter_loop(arr):
    result = []
    for x in arr:
        if x > 0:
            result.append(x)
    return np.array(result)

# timeit returns the total for all runs, so divide to get per-iteration time
runs = 10
vectorized_time = timeit.timeit(lambda: filter_vectorized(data), number=runs) / runs
loop_time = timeit.timeit(lambda: filter_loop(data), number=runs) / runs

print(f"Vectorized: {vectorized_time:.4f}s per iteration")
print(f"Loop: {loop_time:.4f}s per iteration")
print(f"Speedup: {loop_time / vectorized_time:.1f}x")
# Often 50-100x; exact figures vary by machine

Key Takeaways

  • Boolean masks filter arrays by condition; build them with comparison operators (>, <, ==) and combine them with &, |, and ~.
  • Fancy indexing uses integer arrays to select non-contiguous elements; allows replication and arbitrary ordering.
  • Combine boolean masking and fancy indexing for complex data extraction: first filter, then select columns or reorder.
  • np.searchsorted() finds insertion positions in sorted data by binary search, O(log n) per lookup, which makes it well suited to binning.
  • Boolean masking and fancy indexing execute at C speed, often delivering a 50–100x speedup over Python loops on large arrays.

Frequently Asked Questions

Does boolean masking return a view or a copy?

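Reading with a boolean index always returns a copy, not a view. The selected elements generally cannot be described by a fixed stride, so NumPy cannot expose them as a view of the original buffer. If you modify the result, the original is unaffected. Assignment through a mask (arr[mask] = value) is the exception: it writes into the original array.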

How do I create a boolean mask with multiple conditions?

Use the & (AND), | (OR), and ~ (NOT) operators: (arr > 5) & (arr < 10) or (arr == 0) | (arr == 1). Wrap each condition in parentheses, because & and | bind more tightly than comparison operators. The Python keywords and, or, and not do not work element-wise on arrays.
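A short sketch of the parenthesized form, with np.logical_and shown as an equivalent spelling; the array values are arbitrary.

```python
import numpy as np

arr = np.array([3, 6, 9, 12])

in_range = (arr > 5) & (arr < 10)
print(arr[in_range])  # [6 9]

# Equivalent, using the named function instead of the operator
same = np.logical_and(arr > 5, arr < 10)
print(np.array_equal(in_range, same))  # True

# Without parentheses, arr > 5 & arr < 10 parses as arr > (5 & arr) < 10,
# a chained comparison that raises ValueError on arrays.
```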

Can I use fancy indexing to assign values?

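Yes: arr[indices] = new_values writes into the original array in place. If an index appears more than once, that element is written more than once and only one value survives. In practice this is usually the last one, but NumPy's documentation does not guarantee the order, so avoid relying on it. Similarly, arr[indices] += 1 increments a repeated index only once; np.add.at(arr, indices, 1) accumulates over duplicates.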
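The difference between buffered and unbuffered updates with repeated indices can be seen in a small counting sketch; the names buffered and accumulated are illustrative.

```python
import numpy as np

idx = np.array([0, 1, 1, 3])

# In-place += through a fancy index: index 1 is incremented only once
buffered = np.zeros(4, dtype=int)
buffered[idx] += 1
print(buffered)  # [1 1 0 1]

# np.add.at applies the operation once per occurrence of each index
accumulated = np.zeros(4, dtype=int)
np.add.at(accumulated, idx, 1)
print(accumulated)  # [1 2 0 1]
```

For plain counting of non-negative integer indices, np.bincount(idx, minlength=4) gives the same totals.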

What is the difference between arr[mask] and np.where(mask)?

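arr[mask] returns a 1D array of the matching elements. np.where(mask), called with a single argument, returns a tuple of index arrays (one per dimension) marking where the mask is True. arr[np.where(mask)] yields the same elements as arr[mask], but arr[mask] is simpler and avoids building the intermediate index arrays.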
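A minimal sketch of the two forms side by side, on an arbitrary 1D array:

```python
import numpy as np

arr = np.array([4, 9, 2, 7])
mask = arr > 5

print(arr[mask])  # [9 7]  (the values)

idx = np.where(mask)  # tuple with one index array per dimension
print(idx[0])         # [1 3]  (the positions)

# Indexing with the positions recovers the same values
print(np.array_equal(arr[idx], arr[mask]))  # True
```

The index form is the one to reach for when the positions themselves are needed, for example to look up matching entries in a second array.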

How do I find the indices of non-zero elements?

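Use np.nonzero() or np.argwhere(). np.nonzero(arr) returns a tuple of index arrays, one per dimension; np.argwhere(arr) returns a 2D array of shape (N, arr.ndim) with one row of coordinates per non-zero element. For a boolean condition, np.where(condition) with a single argument is equivalent to np.nonzero(condition) and is the most common spelling.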
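The two return layouts are easiest to compare on a small 2D array. This is an illustrative sketch; grid is a made-up example.

```python
import numpy as np

grid = np.array([[0, 5, 0],
                 [7, 0, 0]])

# nonzero: one index array per axis, usable directly for indexing
rows, cols = np.nonzero(grid)
print(rows, cols)        # [0 1] [1 0]
print(grid[rows, cols])  # [5 7]

# argwhere: one (row, col) coordinate pair per non-zero element
print(np.argwhere(grid))
# [[0 1]
#  [1 0]]
```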

Further Reading