Skip to main content

Memory Layout & Performance: C vs Fortran

NumPy arrays are stored in memory either row-major (C order) or column-major (Fortran order), fundamentally affecting how fast operations execute. Row-major stores rows consecutively; column-major stores columns. When your code accesses array elements sequentially along the storage order, it keeps data in the CPU cache, running 5–10x faster than when it jumps randomly across memory. Understanding memory layout and organizing code to respect it is critical for writing high-performance numerical algorithms.

Row-Major vs Column-Major: Memory Representation

Consider a 3x4 matrix in memory. In row-major (C order), all elements of row 0 come first, then row 1, then row 2. In column-major (Fortran order), all elements of column 0 come first, then column 1, and so on.

import numpy as np

# Create a matrix
matrix = np.array([[1, 2, 3, 4],
[5, 6, 7, 8],
[9, 10, 11, 12]], order='C')

# C order (row-major): memory layout is [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
print("C order (row-major):")
print(f" Strides: {matrix.strides}") # (32, 8) — skip 4 float64s to get next row
print(f" Flags: {matrix.flags['C_CONTIGUOUS']}") # True (contiguous in C order)

# Same matrix in Fortran order
matrix_f = np.array([[1, 2, 3, 4],
[5, 6, 7, 8],
[9, 10, 11, 12]], order='F')

# F order (column-major): memory layout is [1, 5, 9, 2, 6, 10, 3, 7, 11, 4, 8, 12]
print("\nF order (column-major):")
print(f" Strides: {matrix_f.strides}") # (8, 24) — stride by 1 element along columns
print(f" Flags: {matrix_f.flags['F_CONTIGUOUS']}") # True (contiguous in F order)

# Check data addresses: addresses differ between layouts
print(f"\nData pointer (C): {matrix.data}")
print(f"Data pointer (F): {matrix_f.data}")

The strides attribute encodes memory layout: strides tell NumPy how many bytes to skip to reach the next element along each axis. In C order, the last axis has the smallest stride (1 element = 8 bytes for float64). In Fortran order, the first axis has the smallest stride.

Cache Locality and Access Patterns

Modern CPUs access memory in cache lines (typically 64 bytes). When you iterate along the storage order, consecutive elements are in the same cache line, making access very fast. When you jump across memory, you cause cache misses, forcing the CPU to reload from main memory—orders of magnitude slower.

import numpy as np
import timeit

# Create a large matrix
matrix = np.random.randn(10000, 10000).astype(np.float32)

# Row-major iteration (fast): visit elements in memory order
def iterate_rows_fast(mat):
total = 0
for i in range(mat.shape[0]):
for j in range(mat.shape[1]):
total += mat[i, j]
return total

# Column-major iteration on row-major data (slow): jumps across memory
def iterate_cols_slow(mat):
total = 0
for j in range(mat.shape[1]):
for i in range(mat.shape[0]):
total += mat[i, j]
return total

# Benchmark (NumPy's built-in sum is fastest, but this illustrates the principle)
# Note: In practice, use vectorized operations; this is educational only
print("Cache locality impact (1000 x 1000 matrix, reduced size for demo):")
matrix_demo = np.random.randn(1000, 1000).astype(np.float32)
fast_time = timeit.timeit(lambda: iterate_rows_fast(matrix_demo), number=5)
slow_time = timeit.timeit(lambda: iterate_cols_slow(matrix_demo), number=5)
print(f"Row iteration (cache-friendly): {fast_time:.4f}s per run")
print(f"Column iteration (cache-hostile): {slow_time:.4f}s per run")
print(f"Slowdown factor: {slow_time / fast_time:.1f}x")
# Expected: 3-10x slowdown for column iteration

Optimizing for Memory Layout

Write loops and reshape operations to respect memory order. Choose loop order and array orientation to keep access patterns sequential.

Example 1: Reshape and Transpose

import numpy as np
import timeit

data = np.random.randn(5000, 5000)

# Transpose: creates a view with swapped strides (memory layout changes, no copy)
transposed = data.T
print(f"Original C-contiguous: {data.flags['C_CONTIGUOUS']}")
print(f"Transposed C-contiguous: {transposed.flags['C_CONTIGUOUS']}")
# Expected: transposed is NOT C-contiguous; it's a view with swapped strides

# If you iterate the transposed array, you're jumping across memory!
# Solution: force a copy if you'll iterate it
transposed_copy = data.T.copy()
print(f"Transposed copy C-contiguous: {transposed_copy.flags['C_CONTIGUOUS']}")

Example 2: Choose Loop Order for BLAS Operations

NumPy delegates matrix operations to BLAS (Basic Linear Algebra Subprograms), which expect a specific memory order. Always match:

import numpy as np
import timeit

# Large matrices
a = np.random.randn(2000, 2000)
b = np.random.randn(2000, 2000)

# C order (row-major) — NumPy's default
c_order = a @ b # matrix multiplication
time_c = timeit.timeit(lambda: a @ b, number=5)

# Fortran order
a_f = np.asfortranarray(a)
b_f = np.asfortranarray(b)
time_f = timeit.timeit(lambda: a_f @ b_f, number=5)

print(f"C order matmul: {time_c:.4f}s")
print(f"F order matmul: {time_f:.4f}s")
# BLAS is optimized for both; timing is similar, but wrong order can degrade

# Memory order doesn't matter for vectorized NumPy operations,
# but it does for iterations and custom loops

Contiguity and Data Conversion

When you slice, transpose, or reshape arrays, NumPy may break contiguity—meaning the array's strides no longer represent a simple sequential block. Non-contiguous arrays are slower in loops and may require copies in certain operations.

import numpy as np

# Contiguous array
original = np.arange(12).reshape(3, 4)
print(f"Original C-contiguous: {original.flags['C_CONTIGUOUS']}")
print(f"Original F-contiguous: {original.flags['F_CONTIGUOUS']}")

# Slice breaks contiguity in one direction
sliced = original[::2, :] # every other row
print(f"\nSliced (rows) C-contiguous: {sliced.flags['C_CONTIGUOUS']}")
# False — stride is now 64 bytes (skip a row), not 32 (4 elements)

# Transpose view is not C-contiguous
transposed = original.T
print(f"\nTransposed C-contiguous: {transposed.flags['C_CONTIGUOUS']}")
print(f"Transposed F-contiguous: {transposed.flags['F_CONTIGUOUS']}")
# Transposed is F-contiguous (column-major view of row-major original)

# Force contiguity if needed
contiguous_copy = np.ascontiguousarray(transposed)
print(f"\nAscontiguousarray C-contiguous: {contiguous_copy.flags['C_CONTIGUOUS']}")

Memory Layout Performance Summary

OperationMemory OrderImpact
Row iterationC (row-major)Fast (cache-friendly)
Column iterationC (row-major)Slow (cache-hostile, 5–10x slower)
BLAS operationsEitherUsually similar; check documentation
SlicingEitherMay break contiguity
TransposeEitherCreates view with swapped strides
FFT operationsEitherC order slightly faster on CPU

Key Takeaways

  • Row-major (C) order stores rows sequentially; column-major (Fortran) stores columns sequentially.
  • Strides encode memory layout; accessing elements along smallest-stride axes is cache-efficient and 5–10x faster.
  • Transposing and slicing may break contiguity, slowing loops; use .copy() or np.ascontiguousarray() if you'll iterate non-contiguous data.
  • Choose loop order to respect memory layout: iterate rows in C order, columns in Fortran order.
  • Use np.asfortranarray() to convert to Fortran order when interfacing with Fortran libraries; use np.ascontiguousarray() to ensure C contiguity.

Frequently Asked Questions

How do I check if an array is contiguous?

Use .flags['C_CONTIGUOUS'] or .flags['F_CONTIGUOUS'] attributes. An array can be both (single-row/column), neither (strided), or one but not the other (transposed views).

Does NumPy automatically choose the best memory order for me?

No. NumPy defaults to C order for most operations. You must explicitly pass order='F' to constructors or use np.asfortranarray() to switch. BLAS libraries are order-agnostic and optimize appropriately.

Should I always copy transposed arrays before iterating?

Only if you'll iterate in a way that violates cache locality. If you're using vectorized NumPy operations, copying is unnecessary—NumPy itself handles memory access patterns efficiently.

Why is my array slicing slow?

Slicing with non-unit strides (e.g., arr[::2, :]) breaks contiguity and may force copies in operations. If the slice is non-contiguous and you iterate it, copy it first with .copy().

Can I mix C and Fortran order arrays in operations?

Yes, NumPy handles it transparently. However, mixing orders in a loop-heavy algorithm may degrade cache efficiency. Keep consistent order within performance-critical code sections.

Further Reading