Repo: https://github.com/Henry8r8w/Matrix-Multiplication-Benchmark. Two algorithms today—naive and strassen—under a single make build, with a helper script to sweep sizes. Directory layout: include/, src/, scripts/, Makefile.


What’s in the repo (right now)

  • Algorithms: naive (classic triple loop) and strassen (recursive).
  • Build: make → produces matrix_benchmark.
  • Run (single size): ./matrix_benchmark <size> <iterations>; Strassen prefers powers of two.
  • Batch (sizes 64, 128, 256, 512): ./scripts/run_exp.sh → writes result.csv + experiment.log.

I keep the surface area small and auditable. If you want to extend it (blocked kernels, SIMD, OpenMP), the interface stays stable.

How I measure (GFLOP/s, minimal but sound)

For $C = AB$ with $A, B \in \mathbb{R}^{n\times n}$, the dense algorithm performs $2n^3$ floating-point ops (exactly $2n^3 - n^2$ if you count $n-1$ adds per dot product). I report:

$$\text{GFLOP/s} = \frac{2n^3}{t \cdot 10^9} \qquad (\text{peak} = \text{best of } k \text{ trials})$$

The general $m \times k$ by $k \times n$ count is $2mnk$, so the square case reduces to $2n^3$.

Notes I stick to: a warm-up run; $k \ge 3$ trials; keep compiler flags in the log; and keep other apps closed so frequency scaling doesn’t pollute results (bog-standard HPC hygiene).
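
A minimal harness sketch of that policy in C, assuming a kernel with the signature below (gemm_naive is my placeholder name and clock_gettime my timer choice; neither is taken from the repo):

#include <string.h>
#include <time.h>

/* Assumed kernel signature; the repo's actual entry point may differ. */
void gemm_naive(const double *A, const double *B, double *C, int n);

/* Best-of-k GFLOP/s: one warm-up run, then k timed trials. */
double bench_gflops(const double *A, const double *B, double *C, int n, int k) {
    const double flops = 2.0 * n * n * n;          /* the 2n^3 count above */
    double best = 0.0;

    memset(C, 0, (size_t)n * n * sizeof(double));
    gemm_naive(A, B, C, n);                        /* warm-up: caches, clocks */

    for (int t = 0; t < k; t++) {
        memset(C, 0, (size_t)n * n * sizeof(double));  /* C accumulates via += */
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        gemm_naive(A, B, C, n);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double sec = (double)(t1.tv_sec - t0.tv_sec)
                   + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
        double gfs = flops / (sec * 1e9);
        if (gfs > best) best = gfs;                /* report the peak trial */
    }
    return best;
}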


Pseudocode (exact shape I map into C)

1) Naive (row-major, i–k–j order)

This order reuses A[i,k] across the inner j loop and streams B[k,*] contiguously.

# C = A * B, all n×n
for i in 0..n-1:
  for k in 0..n-1:
    aik = A[i,k]
    for j in 0..n-1:
      C[i,j] += aik * B[k,j]
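
In C this shape maps almost line for line. A sketch, assuming row-major double arrays and a caller-zeroed C (gemm_naive is my name for it, not necessarily the repo's):

/* C = A*B for row-major n×n doubles; C must be zeroed by the caller. */
void gemm_naive(const double *A, const double *B, double *C, int n) {
    for (int i = 0; i < n; i++) {
        for (int k = 0; k < n; k++) {
            double aik = A[i*n + k];             /* reused across the whole j loop */
            for (int j = 0; j < n; j++) {
                C[i*n + j] += aik * B[k*n + j];  /* streams row k of B contiguously */
            }
        }
    }
}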

2) Strassen (recursive; stop at a tuned base size)

Asymptotics improve from $\Theta(n^3)$ to $\Theta(n^{\log_2 7}) \approx \Theta(n^{2.807})$. Real performance depends on base cutoff, memory traffic, and how you manage temporaries.
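
The exponent falls out of the recurrence: seven half-size products plus $\Theta(n^2)$ work for the additions, which the master theorem resolves as

$$T(n) = 7\,T\!\left(\tfrac{n}{2}\right) + \Theta(n^2) \;\implies\; T(n) = \Theta\!\left(n^{\log_2 7}\right) \approx \Theta(n^{2.807})$$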

# Multiply two n×n matrices; assume n even (pad if needed).
# Base case should call a well-optimized classical kernel (in my C it’s the same triple loop).
function STRASSEN(A, B, n):
  if n <= b0:                    # b0 tuned experimentally
    return BASE_GEMM(A, B, n)    # same math as naive, possibly blocked later

  # 1) Partition into quadrants (n/2 × n/2)
  split A -> A11, A12, A21, A22
  split B -> B11, B12, B21, B22

  # 2) Seven products (recurse)
  M1 = STRASSEN(A11 + A22, B11 + B22, n/2)
  M2 = STRASSEN(A21 + A22, B11,       n/2)
  M3 = STRASSEN(A11,       B12 - B22, n/2)
  M4 = STRASSEN(A22,       B21 - B11, n/2)
  M5 = STRASSEN(A11 + A12, B22,       n/2)
  M6 = STRASSEN(A21 - A11, B11 + B12, n/2)
  M7 = STRASSEN(A12 - A22, B21 + B22, n/2)

  # 3) Combine into C’s quadrants
  C11 = M1 + M4 - M5 + M7
  C12 = M3 + M5
  C21 = M2 + M4
  C22 = M1 - M2 + M3 + M6
  return combine(C11, C12, C21, C22)
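
As a quick sanity check on the combine step, expanding one quadrant recovers the classical block product:

$$C_{12} = M_3 + M_5 = A_{11}(B_{12} - B_{22}) + (A_{11} + A_{12})B_{22} = A_{11}B_{12} + A_{12}B_{22}$$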

Implementation notes I follow in C:

  • Power-of-two sizes: keep the recursion simple (the repo calls this out explicitly). Pad or switch to the base kernel when sizes are uneven.
  • Workspace reuse: allocate temporaries once per level; pass submatrix views (pointer + leading dimension), not copies (see the sketch after this list).
  • Base cutoff $b_0$: tune on the target CPU (a cutoff in the 64–256 range is common in practice).
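
A minimal sketch of what I mean by views, assuming row-major storage; the View type and helper names are illustrative, not the repo's:

#include <stddef.h>

/* A submatrix view: a base pointer plus the leading dimension (row stride)
   of the full parent matrix. No data is copied. */
typedef struct {
    double *p;   /* top-left element of the submatrix */
    int ld;      /* leading dimension of the parent allocation */
} View;

/* Quadrant (qi, qj) of an n×n view; each quadrant is n/2 × n/2. */
static View quadrant(View m, int n, int qi, int qj) {
    int h = n / 2;
    return (View){ m.p + (size_t)qi * h * m.ld + (size_t)qj * h, m.ld };
}

/* Element (i, j) of a view: index through the parent's stride. */
static double *at(View m, int i, int j) {
    return &m.p[(size_t)i * m.ld + j];
}

Sums like A11 + A22 then write through at() into per-level workspace, so the recursion never allocates fresh quadrants.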

Build → run

# build
make

# single size (example)
./matrix_benchmark 256 3

# sweep a few sizes and log results (CSV + log file)
./scripts/run_exp.sh

All of this is in the repo’s README and scripts; I keep the interface stable.