Repo: https://github.com/Henry8r8w/Matrix-Multiplication-Benchmark. Two algorithms today, `naive` and `strassen`, under a single `make` build, with a helper script to sweep sizes. Directory layout: `include/`, `src/`, `scripts/`, `Makefile`.
What’s in the repo (right now)
- Algorithms: `naive` (classic triple loop) and `strassen` (recursive).
- Build: `make` → produces `matrix_benchmark`.
- Run (single size): `./matrix_benchmark <size> <iterations>`; Strassen prefers powers of two.
- Batch (sizes 64, 128, 256, 512): `./scripts/run_exp.sh` → writes `result.csv` + `experiment.log`.

I keep the surface area small and auditable. If you want to extend it (blocked kernels, SIMD, OpenMP), the interface stays stable.
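To make "the interface stays stable" concrete, here is the shape such an extension could take. This is a hypothetical sketch: the typedef and every name below are placeholders of mine, not code from the repo.

```c
/* One kernel signature (row-major, n-by-n, C zero-initialized) that the
   benchmark driver can time uniformly; a new kernel becomes one more row. */
typedef void (*gemm_fn)(int n, const double *A, const double *B, double *C);

void gemm_naive(int n, const double *A, const double *B, double *C);    /* placeholder */
void gemm_strassen(int n, const double *A, const double *B, double *C); /* placeholder */

static const struct { const char *name; gemm_fn fn; } kernels[] = {
    { "naive",    gemm_naive    },
    { "strassen", gemm_strassen },
    /* a blocked / SIMD / OpenMP variant would slot in here unchanged */
};
```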
How I measure (GFLOP/s, minimal but sound)
For $C = AB$ with $A, B \in \mathbb{R}^{n \times n}$, the dense algorithm performs $2n^3$ floating-point ops (exactly $2n^3 - n^2$ if you count $n - 1$ adds per dot product). I report:

$$\text{GF/s} = \frac{2n^3}{t \cdot 10^9} \qquad (\text{peak} = \text{best of } k \text{ trials})$$

The general $m \times k$ by $k \times n$ count is $2mnk$, so the square case reduces to $2n^3$.
Notes I stick to: a warm-up run; $k \ge 3$ trials; keep compiler flags in the log; and keep other apps closed so frequency scaling doesn't pollute results (bog-standard HPC hygiene).
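A minimal sketch of that procedure in C, assuming a kernel behind a `gemm_fn`-style function pointer and `clock_gettime` for timing (the repo's actual driver and function names may differ):

```c
#define _POSIX_C_SOURCE 199309L
#include <time.h>

/* Any n-by-n kernel (row-major, C zero-initialized) fits this shape. */
typedef void (*gemm_fn)(int n, const double *A, const double *B, double *C);

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

/* One warm-up call, then peak GF/s over k timed trials of C = A*B. */
static double bench_gflops(gemm_fn gemm, int n, int k,
                           const double *A, const double *B, double *C) {
    gemm(n, A, B, C);                                       /* warm-up */
    double best = 0.0;
    for (int t = 0; t < k; ++t) {
        double t0 = now_sec();
        gemm(n, A, B, C);
        double dt = now_sec() - t0;
        double gfs = 2.0 * (double)n * n * n / (dt * 1e9);  /* 2n^3 flops */
        if (gfs > best) best = gfs;
    }
    return best;
}
```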
Pseudocode (exact shape I map into C)
1) Naive (row-major, i–k–j order)

This order reuses `A[i,k]` across the inner `j` loop and streams `B[k,*]` contiguously.
# C = A * B, all n×n
for i in 0..n-1:
for k in 0..n-1:
aik = A[i,k]
for j in 0..n-1:
C[i,j] += aik * B[k,j]
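The same loop nest mapped to C, as a sketch assuming row-major `double` arrays and a zero-initialized `C` (the repo's real signature may differ):

```c
#include <stddef.h>

/* C += A * B, all n-by-n, row-major.
   i-k-j order: A[i][k] is hoisted into a scalar and row k of B streams contiguously. */
void gemm_ikj(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; ++i) {
        for (int k = 0; k < n; ++k) {
            const double aik = A[(size_t)i * n + k];
            for (int j = 0; j < n; ++j)
                C[(size_t)i * n + j] += aik * B[(size_t)k * n + j];
        }
    }
}
```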
2) Strassen (recursive; stop at a tuned base size)
Asymptotics improve from $\Theta(n^3)$ to $\Theta(n^{\log_2 7}) \approx \Theta(n^{2.807})$. Real performance depends on base cutoff, memory traffic, and how you manage temporaries.
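The exponent falls straight out of the recurrence: each level does seven half-size multiplies plus a constant number of $\Theta(n^2)$ matrix additions, so by the master theorem

$$T(n) = 7\,T\!\left(\tfrac{n}{2}\right) + \Theta(n^2) \;\Longrightarrow\; T(n) = \Theta\!\left(n^{\log_2 7}\right) \approx \Theta\!\left(n^{2.807}\right).$$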
# Multiply two n×n matrices; assume n even (pad if needed).
# Base case should call a well-optimized classical kernel (in my C it’s the same triple loop).
function STRASSEN(A, B, n):
if n <= b0: # b0 tuned experimentally
return BASE_GEMM(A, B) # same math as naive, possibly blocked later
# 1) Partition into quadrants (n/2 × n/2)
split A -> A11, A12, A21, A22
split B -> B11, B12, B21, B22
# 2) Seven products (recurse)
M1 = STRASSEN(A11 + A22, B11 + B22, n/2)
M2 = STRASSEN(A21 + A22, B11, n/2)
M3 = STRASSEN(A11, B12 - B22, n/2)
M4 = STRASSEN(A22, B21 - B11, n/2)
M5 = STRASSEN(A11 + A12, B22, n/2)
M6 = STRASSEN(A21 - A11, B11 + B12, n/2)
M7 = STRASSEN(A12 - A22, B21 + B22, n/2)
# 3) Combine into C’s quadrants
C11 = M1 + M4 - M5 + M7
C12 = M3 + M5
C21 = M2 + M4
C22 = M1 - M2 + M3 + M6
return combine(C11, C12, C21, C22)
Implementation notes I follow in C:
- Power-of-two sizes keep the recursion simple (the repo calls this out explicitly); pad, or fall back to the base kernel, when the size is odd.
- Workspace reuse: allocate temporaries once per level; pass submatrix views (pointer + leading dimension), not copies (see the sketch after this list).
- Base cutoff $b_0$: tune on the target CPU (range 64–256 is common in practice).
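For the submatrix-view point above: a view is just a base pointer plus a leading dimension, so Strassen's quadrants and temporaries are addressed in place. A minimal sketch (the helper name and layout comments are my assumptions, not the repo's code):

```c
#include <stddef.h>

/* View: element (i, j) of an n-by-n block lives at p[i * ld + j].
   For a row-major matrix stored with leading dimension ld, the quadrants
   share that ld and need no copies:
     A11 = A                A12 = A + n/2
     A21 = A + (n/2) * ld   A22 = A + (n/2) * ld + n/2                   */
static void mat_add(int n,
                    const double *X, int ldx,
                    const double *Y, int ldy,
                    double *Z, int ldz) {
    /* Z = X + Y on n-by-n views; Strassen's subtractions are the same loop
       with a minus sign, writing into workspace allocated once per level. */
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            Z[(size_t)i * ldz + j] = X[(size_t)i * ldx + j] + Y[(size_t)i * ldy + j];
}
```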
Build → run
# build
make
# single size (example)
./matrix_benchmark 256 3
# sweep a few sizes and log results (CSV + log file)
./scripts/run_exp.sh
All of this is in the repo's README and scripts; I keep the interface stable.