Fancy Indexing & Boolean Masking: Extract
Fancy indexing and boolean masking are NumPy's high-level selection mechanisms for extracting subsets of array data based on arbitrary conditions—without writing explicit loops. Boolean masks filter arrays by logical conditions (arr > 5), while fancy indexing uses integer arrays to select non-contiguous elements. These techniques enable concise, vectorized data cleaning and transformation pipelines widely used in data analysis, ML preprocessing, and numerical algorithms.
Boolean Masking: Filter Data with Logical Conditions
A boolean mask is a boolean-typed array with the same shape as the target array, in which True marks the elements to keep. Creating and applying masks is the foundation of vectorized data filtering.
import numpy as np
# Create sample data
scores = np.array([45, 82, 91, 67, 88, 73, 95, 58])
# Create boolean mask: elements greater than 80
mask = scores > 80
print(mask)  # [False  True  True False  True False  True False]
print(type(mask), mask.dtype) # <class 'numpy.ndarray'> bool
# Apply mask to extract matching elements
passing = scores[mask]
print(passing) # [82 91 88 95]
# Chaining conditions with & (AND), | (OR), ~ (NOT)
high_and_even = (scores > 80) & (scores % 2 == 0)
print(scores[high_and_even]) # [82 88]
# NOT operator for exclusion
not_high = ~(scores > 80)
print(scores[not_high]) # [45 67 73 58]
Boolean masking returns a copy, not a view, so modifying the result leaves the original array unaffected. It is still efficient: NumPy evaluates the comparison and performs the selection in compiled C code, which is typically far faster than filtering with a Python loop.
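The copy behaviour described above can be checked directly. The following is a minimal sketch that reuses the scores array from the example; the name capped is introduced here only for illustration.

```python
import numpy as np

scores = np.array([45, 82, 91, 67, 88, 73, 95, 58])

# Indexing with a mask produces a new array (a copy)
passing = scores[scores > 80]
passing[0] = 0                  # modifies only the copy
print(scores[1])                # 82 (original unchanged)

# Assigning through a mask writes into the indexed array in place
capped = scores.copy()          # hypothetical working copy, keeps `scores` intact
capped[capped > 90] = 90
print(capped)                   # [45 82 90 67 88 73 90 58]
```

Reading through a mask copies; writing through a mask updates the array being indexed.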
Multi-Dimensional Masking
Boolean masks also work on multi-dimensional arrays. A mask with the same shape as the target selects individual elements and returns them as a 1D array; a 1D mask whose length matches the first axis selects whole rows.
import numpy as np
# 2D temperature data (days x cities)
temps = np.array([[18, 22, 25],
                  [16, 20, 28],
                  [19, 21, 26]])
# Find all temps above 23 degrees (flattens result)
hot = temps > 23
print(temps[hot]) # [25 28 26]
# Boolean indexing also works on rows: find rows with mean > 21.5
row_means = temps.mean(axis=1)  # approximately [21.67 21.33 22.]
warm_rows = row_means > 21.5
print(temps[warm_rows])
# Rows where mean > 21.5:
# [[18 22 25]
#  [19 21 26]]
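Masks can also be built along the other axis, and the 2D shape can be preserved when dropping elements is not wanted. A minimal sketch, assuming the same temps array; the names steady_cols and clipped are illustrative only.

```python
import numpy as np

temps = np.array([[18, 22, 25],
                  [16, 20, 28],
                  [19, 21, 26]])

# Column mask: cities (columns) whose temperature never drops below 20
steady_cols = (temps >= 20).all(axis=0)
print(steady_cols)              # [False  True  True]
print(temps[:, steady_cols])
# [[22 25]
#  [20 28]
#  [21 26]]

# Keep the 2D shape by replacing non-matching values instead of dropping them
clipped = np.where(temps > 23, temps, 0)
print(clipped)
# [[ 0  0 25]
#  [ 0  0 28]
#  [ 0  0 26]]
```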
Fancy Indexing: Select with Integer Arrays
Fancy indexing uses integer arrays to specify which indices to extract. Unlike slicing, which selects regularly spaced elements and returns a view, fancy indexing accepts arbitrary index sequences and returns a copy.
import numpy as np
arr = np.array([10, 20, 30, 40, 50])
# Select by explicit indices
indices = np.array([0, 2, 4])
result = arr[indices]
print(result) # [10 30 50]
# Repeated indices are allowed (allows replication)
repeated_indices = np.array([1, 1, 3, 0])
result = arr[repeated_indices]
print(result) # [20 20 40 10]
# Negative indices work (count from end)
result = arr[np.array([-1, -2])]
print(result) # [50 40]
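One property the examples above do not show is that the result takes the shape of the index array, not of the source. A short sketch, with idx and picked as illustrative names:

```python
import numpy as np

arr = np.array([10, 20, 30, 40, 50])

# A 2D index array yields a 2D result of the same shape
idx = np.array([[0, 1],
                [3, 4]])
picked = arr[idx]
print(picked)
# [[10 20]
#  [40 50]]

# The result is a copy: changing it leaves `arr` untouched
picked[0, 0] = -1
print(arr[0])    # 10
```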
Multi-Dimensional Fancy Indexing
With 2D arrays, fancy indexing on rows and columns requires careful shape alignment:
import numpy as np
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])
# Select specific rows
row_indices = np.array([0, 2])
print(matrix[row_indices])
# [[1 2 3]
#  [7 8 9]]
# Select specific columns using explicit row and column index grids
col_indices = np.array([0, 2])
row_idx, col_idx = np.meshgrid(np.arange(3), col_indices, indexing='ij')
result = matrix[row_idx, col_idx]
print(result)
# [[1 3]
#  [4 6]
#  [7 9]]
# Simpler approach: slice all rows and fancy-index only the columns
result = matrix[:, col_indices]  # All rows, specific columns
print(result)
# [[1 3]
#  [4 6]
#  [7 9]]
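The shape alignment mentioned above matters most when index arrays are passed for both axes: NumPy pairs them element by element instead of forming a grid. The sketch below contrasts the two behaviours using np.ix_; the names rows, cols, pairs, and block are illustrative.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

rows = np.array([0, 2])
cols = np.array([0, 2])

# Two index arrays are paired element by element: positions (0, 0) and (2, 2)
pairs = matrix[rows, cols]
print(pairs)        # [1 9]

# np.ix_ builds broadcastable indices that select the full row/column grid
block = matrix[np.ix_(rows, cols)]
print(block)
# [[1 3]
#  [7 9]]
```

Use the paired form to pick scattered cells and np.ix_ to pick a sub-matrix.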
Combining Fancy Indexing and Boolean Masking
Real-world data filtering often combines both techniques: first filter by a condition, then extract specific columns or reorder results.
import numpy as np
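# Sample: student data (name, score, attendance)
# Mixing strings and numbers makes NumPy store every field as a string
students = np.array([
    ['Alice', 85, 95],
    ['Bob', 92, 80],
    ['Charlie', 78, 90],
    ['Diana', 88, 92]
])
# Find rows where score > 80 (convert the score column back to int first)
scores = students[:, 1].astype(int)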
high_scorers = students[scores > 80]
print(high_scorers)
# [['Alice' '85' '95']
#  ['Bob' '92' '80']
#  ['Diana' '88' '92']]
# Numeric data: extract and process
numeric_data = np.array([[1, 2, 3],
                         [4, 5, 6],
                         [7, 8, 9],
                         [10, 11, 12]])
# Find rows where column 0 > 3, then get columns 1 and 2
mask = numeric_data[:, 0] > 3
filtered = numeric_data[mask] # rows 1-3
selected_cols = filtered[:, [1, 2]] # columns 1 and 2
print(selected_cols)
# [[ 5  6]
#  [ 8  9]
#  [11 12]]
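The section mentions reordering results after filtering, which the example above stops short of. One possible way to do it, sketched here with np.argsort, is to compute an ordering on the filtered rows and apply it as a fancy index; the name order is illustrative.

```python
import numpy as np

numeric_data = np.array([[1, 2, 3],
                         [4, 5, 6],
                         [7, 8, 9],
                         [10, 11, 12]])

# Step 1: boolean mask keeps rows whose first column exceeds 3
mask = numeric_data[:, 0] > 3
filtered = numeric_data[mask]

# Step 2: argsort gives the indices that sort column 0; reversing them
# yields descending order, applied as a fancy index on the rows
order = np.argsort(filtered[:, 0])[::-1]
print(order)             # [2 1 0]
print(filtered[order])
# [[10 11 12]
#  [ 7  8  9]
#  [ 4  5  6]]
```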
Searchsorted and Binning
For sorted data, np.searchsorted() efficiently finds insertion positions, enabling fast lookups and binning without explicit loops.
import numpy as np
# Sorted array of values
sorted_values = np.array([10, 20, 30, 40, 50])
# Find where new values would be inserted to maintain order
new_values = np.array([15, 35, 5, 50])
positions = np.searchsorted(sorted_values, new_values)
print(positions) # [1 3 0 4]
# Binning data into categories
data = np.array([12, 25, 38, 8, 45, 31, 19])
bins = np.array([10, 20, 30, 40, 50])
bin_indices = np.searchsorted(bins, data)
print(bin_indices) # [1 2 3 0 4 3 1]
# Categorize into labels
labels = np.array(['low', 'medium', 'high', 'very_high', 'extreme'])
data_categories = labels[bin_indices]
print(data_categories)
# ['medium' 'high' 'very_high' 'low' 'extreme' 'very_high' 'medium']
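Values that fall exactly on a bin edge depend on the side argument of np.searchsorted, which defaults to 'left'. The sketch below compares both settings and, for increasing bins, shows that np.digitize with its default arguments matches side='right'. The array edge_values is made up for illustration.

```python
import numpy as np

bins = np.array([10, 20, 30, 40, 50])
edge_values = np.array([10, 20, 25, 50])   # illustrative values on and off the edges

# side='left' (default): an exact match is inserted before the equal element
left = np.searchsorted(bins, edge_values, side='left')
# side='right': an exact match is inserted after the equal element
right = np.searchsorted(bins, edge_values, side='right')
print(left)    # [0 1 2 4]
print(right)   # [1 2 2 5]

# For increasing bins, np.digitize (right=False) agrees with side='right'
print(np.digitize(edge_values, bins))      # [1 2 2 5]
```

Note that an index equal to len(bins) can appear (5 here), so a label array used for lookup needs one more entry than there are bin edges.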
Performance: Vectorized Filtering vs Loops
Boolean masking and fancy indexing run in compiled code; an equivalent Python loop is often 50–100x slower on large arrays, though the exact ratio depends on array size, dtype, and hardware:
import numpy as np
import timeit
data = np.random.randn(1000000)
# Vectorized (boolean mask)
def filter_vectorized(arr):
    return arr[arr > 0]
# Pure Python loop
def filter_loop(arr):
    result = []
    for x in arr:
        if x > 0:
            result.append(x)
    return np.array(result)
# timeit returns the total time for all runs, so divide to get per-iteration time
runs = 10
vectorized_time = timeit.timeit(lambda: filter_vectorized(data), number=runs) / runs
loop_time = timeit.timeit(lambda: filter_loop(data), number=runs) / runs
print(f"Vectorized: {vectorized_time:.4f}s per iteration")
print(f"Loop: {loop_time:.4f}s per iteration")
print(f"Speedup: {loop_time / vectorized_time:.1f}x")
# Often a 50-100x speedup; exact figures vary by machine
Key Takeaways
- Boolean masks filter arrays by condition; create them with comparison operators (>, <, ==) and combine them with &, |, ~.
- Fancy indexing uses integer arrays to select non-contiguous elements; allows replication and arbitrary ordering.
- Combine boolean masking and fancy indexing for complex data extraction: first filter, then select columns or reorder.
- np.searchsorted() efficiently finds insertion positions and bins data against sorted edges in O(log n) time per lookup.
- Boolean masking and fancy indexing execute at C speed, often delivering a 50–100x speedup over Python loops.
Frequently Asked Questions
Does boolean masking return a view or a copy?
Boolean indexing always returns a copy, not a view. The mask selects arbitrary elements that may not be contiguous, so copying is necessary. If you modify the result, the original is unaffected; assigning through the mask (arr[mask] = value), by contrast, modifies the original in place.
How do I create a boolean mask with multiple conditions?
Use the & (AND), | (OR), and ~ (NOT) operators: (arr > 5) & (arr < 10) or (arr == 0) | (arr == 1). Wrap each condition in parentheses, because & and | bind more tightly than comparison operators.
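A small sketch of combining conditions, including the function form np.logical_and and the error raised when Python's and keyword is used on arrays. The array values are made up for illustration.

```python
import numpy as np

arr = np.array([3, 6, 8, 12])

# Element-wise AND with parenthesized conditions
in_range = (arr > 5) & (arr < 10)
print(arr[in_range])                      # [6 8]

# np.logical_and is an equivalent function form
same = np.logical_and(arr > 5, arr < 10)
print(np.array_equal(in_range, same))     # True

# Python's `and` needs a single truth value for a whole array, which is ambiguous
try:
    (arr > 5) and (arr < 10)
except ValueError as exc:
    print("ValueError:", exc)
```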
Can I use fancy indexing to assign values?
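Yes: arr[indices] = new_values writes into the original array. If an index appears more than once, that element is written several times and keeps only one of the values (in practice the last one, but avoid relying on the order), so duplicate indices are best avoided in assignments.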
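Repeated indices are a common surprise with in-place arithmetic in particular: arr[idx] += 1 does not accumulate over duplicates. The sketch below contrasts it with np.add.at, which applies the operation once per occurrence. The arrays b and c are illustrative.

```python
import numpy as np

idx = np.array([0, 0, 2])   # index 0 appears twice

# Buffered in-place add: the duplicate index is incremented only once
b = np.zeros(4, dtype=int)
b[idx] += 1
print(b)          # [1 0 1 0]

# Unbuffered add: every occurrence of an index contributes
c = np.zeros(4, dtype=int)
np.add.at(c, idx, 1)
print(c)          # [2 0 1 0]
```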
What is the difference between arr[mask] and np.where(mask)?
arr[mask] returns a 1D array of matching elements. np.where(condition) returns a tuple of index arrays (one per dimension) marking where the condition is true. arr[np.where(condition)] gives the same result as arr[condition], but arr[condition] is simpler and faster.
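The difference is easiest to see side by side. This sketch also shows the three-argument form np.where(condition, x, y), which selects element-wise between two alternatives; the array values are illustrative.

```python
import numpy as np

arr = np.array([4, 9, 2, 7])
mask = arr > 5

print(arr[mask])                # [9 7]   the matching values
positions = np.where(mask)      # tuple with one index array per dimension
print(positions[0])             # [1 3]   where the matches are
print(np.where(mask, arr, 0))   # [0 9 0 7]   same shape, non-matches replaced
```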
How do I find the indices of non-zero elements?
Use np.nonzero() or np.argwhere(). np.nonzero(arr) returns a tuple of index arrays, one per dimension; np.argwhere(arr) returns a 2D array with one row of coordinates per non-zero element. For locating where a boolean condition holds, np.where(condition) is the most common choice.
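A brief comparison of the return formats on a small 2D array (the grid values are made up), with np.flatnonzero included for indices into the flattened array:

```python
import numpy as np

grid = np.array([[0, 5, 0],
                 [7, 0, 9]])

# np.nonzero: tuple of per-axis index arrays, directly usable for indexing
rows, cols = np.nonzero(grid)
print(rows, cols)          # [0 1 1] [1 0 2]
print(grid[rows, cols])    # [5 7 9]

# np.argwhere: one (row, col) coordinate pair per non-zero element
coords = np.argwhere(grid)
print(coords)
# [[0 1]
#  [1 0]
#  [1 2]]

# np.flatnonzero: positions in the flattened array
print(np.flatnonzero(grid))   # [1 3 5]
```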