Skip to main content

Sets and Dicts: Python's Fast Data Structures

Choosing the right data structure can accelerate your code by orders of magnitude. Python's sets and dictionaries are hash table implementations offering O(1) average-case lookups, making them vastly faster than lists for membership testing and deduplication. Understanding when and how to use them is a core optimization skill.

Sets vs Lists: Why O(1) Beats O(n)

A list stores items in order and searches linearly. A set stores items in a hash table and searches in constant time:

import timeit

# Create test data
numbers_list = list(range(1000000))
numbers_set = set(range(1000000))

# Lookup in list — O(n)
def search_list():
return 999999 in numbers_list

# Lookup in set — O(1)
def search_set():
return 999999 in numbers_set

time_list = timeit.timeit(search_list, number=100000)
time_set = timeit.timeit(search_set, number=100000)

print(f"List lookup: {time_list:.4f}s")
print(f"Set lookup: {time_set:.4f}s")
print(f"Set is {time_list/time_set:.0f}x faster")

Output:

List lookup: 45.2314s
Set lookup: 0.0023s
Set is 19671x faster

The set is nearly 20,000 times faster because it looks up the item in a hash table (one operation) instead of scanning all 1 million items. This difference dominates performance for large membership tests.

How Hash Tables Work

Sets and dicts implement hash tables. When you add an item, Python hashes it (converts it to a number) and uses that to find a bucket:

# Simplified hash table concept
my_set = {1, 2, 3, 'hello'}

# When you check membership
if 'hello' in my_set:
# Python computes hash('hello')
# Uses the hash to find the bucket
# Checks the bucket for 'hello' (very fast)
pass

Hash collisions (two items hashing to the same bucket) are handled automatically with chaining or open addressing. In practice, Python's hash table scales to millions of items with O(1) average performance.

Use Cases for Sets

Deduplication

Remove duplicates from a list in one operation:

# Remove duplicates preserving list — uses set
numbers = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]
unique = list(set(numbers))
print(unique) # [1, 2, 3, 4] (order not guaranteed)

# Remove duplicates preserving order
def remove_duplicates_ordered(items):
seen = set()
result = []
for item in items:
if item not in seen:
seen.add(item)
result.append(item)
return result

ordered_unique = remove_duplicates_ordered(numbers)
print(ordered_unique) # [1, 2, 3, 4] (original order preserved)

The set version is O(n) with a small constant factor. The naive approach (comparing each item to all previous items) is O(n²) and impractical for large lists.

Membership Testing in Loops

When you repeatedly check if items exist, use a set, not a list:

# BAD — O(n²) if many membership checks
allowed_ids_list = [1, 2, 3, 4, 5]
for user_id in range(1000000):
if user_id in allowed_ids_list: # O(n) each iteration
process_user(user_id)

# GOOD — O(n + m) where n = len(allowed_ids), m = range(1000000)
allowed_ids_set = {1, 2, 3, 4, 5}
for user_id in range(1000000):
if user_id in allowed_ids_set: # O(1) each iteration
process_user(user_id)

Converting to a set adds O(n) one-time cost, then saves O(n) per lookup. With millions of iterations, this is a massive win.

Set Operations

Sets support efficient mathematical operations:

team_a = {'Alice', 'Bob', 'Charlie'}
team_b = {'Bob', 'David', 'Eve'}

# Union — all people in either team — O(n + m)
all_people = team_a | team_b
print(all_people) # {'Alice', 'Bob', 'Charlie', 'David', 'Eve'}

# Intersection — people in both teams — O(min(n, m))
overlap = team_a & team_b
print(overlap) # {'Bob'}

# Difference — people in team_a but not team_b — O(n)
only_a = team_a - team_b
print(only_a) # {'Alice', 'Charlie'}

# Symmetric difference — people in either team but not both — O(n + m)
exclusive = team_a ^ team_b
print(exclusive) # {'Alice', 'Charlie', 'David', 'Eve'}

These operations are O(n) total (not O(n²)), making set operations fast even for large datasets.

Subset and Superset Checks

required_skills = {'Python', 'SQL', 'Linux'}
alice_skills = {'Python', 'SQL', 'Linux', 'Docker', 'Git'}

# Check if alice_skills contains all required_skills
if required_skills.issubset(alice_skills):
print("Alice qualifies") # True

# Check if alice_skills is a superset of required_skills
if alice_skills >= required_skills:
print("Alice qualifies") # True — >= operator

These are O(n) operations and cleaner than manual loops.

Dictionaries for Fast Lookups

Dictionaries (hash maps) are O(1) average for lookups, insertions, and deletions:

# BAD — O(n) to find user by id (linear search)
users_list = [
{'id': 1, 'name': 'Alice', 'role': 'admin'},
{'id': 2, 'name': 'Bob', 'role': 'user'},
{'id': 3, 'name': 'Charlie', 'role': 'user'},
]
for user in users_list:
if user['id'] == 2:
found_user = user
break

# GOOD — O(1) to find user by id
users_dict = {
1: {'name': 'Alice', 'role': 'admin'},
2: {'name': 'Bob', 'role': 'user'},
3: {'name': 'Charlie', 'role': 'user'},
}
found_user = users_dict[2] # instant lookup

Restructuring data as a dictionary with IDs as keys makes lookups instant. This pattern is ubiquitous in production code.

Counting with defaultdict and Counter

from collections import defaultdict, Counter

text = "the quick brown fox jumps over the lazy dog"
words = text.split()

# BAD — manual checking
word_count = {}
for word in words:
if word in word_count:
word_count[word] += 1
else:
word_count[word] = 1

# GOOD — defaultdict eliminates the check
word_count = defaultdict(int)
for word in words:
word_count[word] += 1

# BETTER — Counter is optimized for counting
word_count = Counter(words)
print(word_count.most_common(3)) # Most common 3 words

Counter is a specialized dict subclass for counting. It's faster and includes methods like most_common(). Both are O(n).

When to Use Lists

Lists are still appropriate for:

  • Iteration order matters: Lists maintain insertion order (in Python 3.7+, so do dicts, but lists are clearer).
  • Indexing: Access the nth item in O(1) with lists; sets require iteration.
  • Duplicate values: Sets store unique items; lists allow duplicates.
  • Small collections: Below ~100 items, the overhead difference is negligible.
# Use list when order or indexing matters
leaderboard = ['Alice', 'Bob', 'Charlie'] # indexed by position
top_scorer = leaderboard[0] # fast

# Use set for membership testing on large collections
banned_users = {'bad_user_1', 'bad_user_2', ...} # 100,000 items
if request.user_id not in banned_users: # fast O(1) check
process_request(request)

Performance Comparison Table

OperationListSetDict
Search/lookupO(n)O(1)O(1)
Add itemO(1) appendO(1)O(1)
Remove itemO(n)O(1)O(1)
IterateO(n)O(n)O(n)
Duplicate valuesYesNoNo (keys)
Maintains orderYesNoYes (3.7+)
IndexingYesNoNo
Union/intersectionManual O(n²)O(n)Manual O(n²)

Key Takeaways

  • Sets and dicts use hash tables offering O(1) average lookups versus O(n) for lists
  • Convert lists to sets before repeated membership tests in loops (converts loop complexity from O(n²) to O(n))
  • Use sets for deduplication, intersection, union, and difference operations
  • Use dicts to map keys to values for fast lookups by key
  • Counter and defaultdict optimize counting and default value handling
  • Lists remain appropriate for ordered iteration, indexing, and small collections

Frequently Asked Questions

Why is set lookup O(1) and not O(n)?

Hash tables compute a hash value from the item (constant time), use it as a direct index into an array, and retrieve the item in one operation. Even with collisions, the average chain length is O(1) when the load factor is balanced. This is why hash table performance is excellent on average.

What happens if I hash an unhashable type?

Sets and dicts require hashable items (immutable types like int, str, tuple). If you try to add a list or dict to a set, Python raises TypeError: unhashable type. Convert to tuples if you need to store mutable structures in sets.

Are there performance downsides to sets?

Sets use more memory than lists due to hash table overhead. A set of 1 million integers uses ~50 MB; a list uses ~8 MB. Choose sets for fast lookups, lists for memory efficiency and ordering. For most applications, speed wins over memory.

When do set operations become slower than O(1)?

Union, intersection, and other set operations are O(n) where n is the size of the result. Membership testing is O(1) because you're checking one item. If you're checking membership for billions of items sequentially, consider other structures like Bloom filters for space efficiency.

Can I sort a set?

Sets are unordered. Convert to a sorted list with sorted(my_set), which is O(n log n). If you need both fast lookups and order, use a dict with indexed keys, or maintain a separate sorted list alongside a set.

Further Reading