
Algorithms Foundations

Open Problems in Matrix Computation

The unsolved questions in numerical linear algebra: the true exponent of matrix multiplication, practical fast algorithms, sparse matrix multiplication, randomized methods, and why these matter for scaling ML.

Advanced · Tier 3 · Current · Supporting · ~45 min

Why This Matters

Matrix computation is the bottleneck of ML training and inference. The largest language models spend the vast majority of their FLOPs on matrix multiplications. Any theoretical breakthrough in matrix algorithms would have immediate practical implications, yet several fundamental questions remain open after decades of research.

This page catalogs the major open problems, states what is known, and explains why each matters for ML.

Problem 1: The True Exponent of Matrix Multiplication

Status: Open since 1969.

Definition

Matrix Multiplication Exponent

$\omega = \inf\{\alpha : \text{two } n \times n \text{ matrices can be multiplied in } O(n^\alpha) \text{ operations}\}$

Known bounds: $2 \leq \omega \leq 2.371552$.

The lower bound $\omega \geq 2$ is trivial (you must at least read the input and write the output). The upper bound $\omega \leq 2.371552$ is due to Williams, Xu, Xu, and Zhou (2024), refining the laser-method line of work of Alman and Williams. The gap between 2 and 2.37 remains.

What would resolve it: Either a proof that $\omega > 2$ (a super-quadratic lower bound for matrix multiplication) or an algorithm achieving $O(n^{2+\epsilon})$ for arbitrarily small $\epsilon$.

Barriers: All current upper bound techniques use the "laser method" and its extensions. Ambainis, Filmus, and Le Gall (2015) proved a formal barrier: any algorithm derived purely from the Coppersmith-Winograd tensor via the laser method is bounded below by an exponent strictly above 2. The "group-theoretic" approach proposed by Cohn and Umans is one route around this, but no concrete algorithm using it has matched the laser-method bounds. New ideas are needed.

Machine-discovered improvements: DeepMind's AlphaTensor (Fawzi et al., Nature 2022) used reinforcement learning to find new low-rank decompositions of the matrix multiplication tensor. The system rediscovered Strassen's scheme and improved several specific block-size cases, including a $4 \times 5$ by $5 \times 5$ multiplication using 76 multiplications. Follow-up systems (AlphaEvolve, 2024-2025) have continued to find domain-specific improvements. None of these have improved the asymptotic exponent $\omega$, but they have expanded the catalogue of practical schemes for fixed block sizes.
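To make "low-rank decompositions of the matrix multiplication tensor" concrete, here is a minimal numpy sketch (illustrative, not AlphaTensor's code): it builds the $2 \times 2$ matrix-multiplication tensor and checks that Strassen's classical scheme is a rank-7 decomposition of it. AlphaTensor searches for analogous factor matrices with fewer rows for larger block shapes.

```python
import numpy as np

# Matrix-multiplication tensor for 2x2 matrices: T[a, b, c] = 1 iff the
# product A_a * B_b contributes to entry C_c of C = A @ B, with entries
# flattened row-major so that index 2*i + j stands for the (i, j) entry.
T = np.zeros((4, 4, 4))
for i in range(2):
    for j in range(2):
        for k in range(2):
            T[2 * i + j, 2 * j + k, 2 * i + k] = 1

# Strassen's rank-7 scheme: row r of U, V, W gives the linear combinations
# of A-entries, B-entries, and C-entries used by the r-th product M_r.
U = np.array([[ 1, 0, 0, 1],   # M1 = (A11 + A22)(B11 + B22)
              [ 0, 0, 1, 1],   # M2 = (A21 + A22) B11
              [ 1, 0, 0, 0],   # M3 = A11 (B12 - B22)
              [ 0, 0, 0, 1],   # M4 = A22 (B21 - B11)
              [ 1, 1, 0, 0],   # M5 = (A11 + A12) B22
              [-1, 0, 1, 0],   # M6 = (A21 - A11)(B11 + B12)
              [ 0, 1, 0,-1]])  # M7 = (A12 - A22)(B21 + B22)
V = np.array([[ 1, 0, 0, 1],
              [ 1, 0, 0, 0],
              [ 0, 1, 0,-1],
              [-1, 0, 1, 0],
              [ 0, 0, 0, 1],
              [ 1, 1, 0, 0],
              [ 0, 0, 1, 1]])
W = np.array([[ 1, 0, 0, 1],   # C11 = M1 + M4 - M5 + M7, C22 = M1 - M2 + ...
              [ 0, 0, 1,-1],
              [ 0, 1, 0, 1],
              [ 1, 0, 1, 0],
              [-1, 1, 0, 0],
              [ 0, 0, 0, 1],
              [ 1, 0, 0, 0]])

# Sum of 7 rank-one terms reproduces the tensor exactly: rank <= 7,
# versus the 8 products used by the naive 2x2 block algorithm.
T_hat = np.einsum('ra,rb,rc->abc', U, V, W)
assert np.array_equal(T_hat, T)
```

Finding a decomposition with fewer rank-one terms for a given block shape is exactly the search problem AlphaTensor was trained on; the rank of the full tensor family determines $\omega$.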

Why it matters for ML: If $\omega = 2$, then in principle the forward pass of a linear layer with $d$ input and $d$ output dimensions, applied to a batch of $d$ tokens (a $d \times d$ by $d \times d$ product), could cost $O(d^2)$ instead of $O(d^3)$ operations. For a transformer with hidden dimension $d = 12288$ (GPT-3 scale), that is an asymptotic speedup by a factor of roughly $d = 12288$.

Problem 2: A Practical $O(n^{2+\epsilon})$ Algorithm

Status: Open. No known algorithm with exponent below 2.807 (Strassen) is practical.

Proposition

The Practicality Barrier

Statement

All known matrix multiplication algorithms with exponent $\omega < 2.807$ have constants and memory access patterns that make them slower than the naive $O(n^3)$ algorithm for matrices of sizes encountered in practice (up to roughly $n = 10^5$).

Intuition

The theoretical algorithms achieve lower exponents by doing more additions and subtractions on cleverly chosen linear combinations of matrix entries. This requires irregular memory access. Modern hardware (GPUs, tensor cores) achieves peak throughput only for regular, predictable access patterns. The theoretical savings in arithmetic are overwhelmed by the practical cost of memory irregularity.

Proof Sketch

This is an empirical observation, not a mathematical theorem. Benchmarks consistently show that highly optimized GEMM implementations (using tiling, SIMD, tensor cores) outperform Strassen-based implementations for matrices up to $n \approx 10^4$ on current hardware.
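A rough way to probe this on your own machine is to time the vendor-tuned BLAS matmul against a recursive Strassen built on top of it. The sketch below is illustrative only: the matrix size and recursion cutoff are arbitrary choices, and where (or whether) Strassen wins depends heavily on the BLAS library, threading, and cache sizes.

```python
import time
import numpy as np

def strassen(A, B, cutoff=512):
    """Recursive Strassen multiply, falling back to BLAS below `cutoff`.
    Assumes square matrices whose dimension is a power of two."""
    n = A.shape[0]
    if n <= cutoff:
        return A @ B
    m = n // 2
    A11, A12, A21, A22 = A[:m, :m], A[:m, m:], A[m:, :m], A[m:, m:]
    B11, B12, B21, B22 = B[:m, :m], B[:m, m:], B[m:, :m], B[m:, m:]
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    return np.block([[M1 + M4 - M5 + M7, M3 + M5],
                     [M2 + M4,           M1 - M2 + M3 + M6]])

n = 4096
rng = np.random.default_rng(0)
A, B = rng.standard_normal((n, n)), rng.standard_normal((n, n))

t0 = time.perf_counter(); C_blas = A @ B;          t_blas = time.perf_counter() - t0
t0 = time.perf_counter(); C_str  = strassen(A, B); t_str  = time.perf_counter() - t0

rel_err = np.linalg.norm(C_str - C_blas) / np.linalg.norm(C_blas)
print(f"BLAS: {t_blas:.2f}s  Strassen: {t_str:.2f}s  rel. error: {rel_err:.1e}")
```

On most current hardware the BLAS call wins at these sizes despite doing roughly 30% more arithmetic, which is the practicality barrier in miniature; the extra temporaries and additions also cost a few digits of accuracy.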

Why It Matters

The disconnect between theoretical complexity and practical performance is one of the most important open issues in numerical computing. Closing this gap would require either hardware that supports irregular access patterns efficiently or algorithms that achieve low exponents with regular access patterns.

Failure Mode

This "barrier" is hardware-dependent and may not persist as architectures evolve. Some research explores hardware co-design where the algorithm and chip are designed together. Also, for specific structured matrices (sparse, low-rank, Toeplitz), specialized algorithms already beat naive O(n3)O(n^3).

Problem 3: Sparse Matrix Multiplication

Given sparse matrices $A$ and $B$ with $\text{nnz}(A)$ and $\text{nnz}(B)$ nonzero entries respectively, can we multiply them faster?

Known results:

  • If the output $C = AB$ has $\text{nnz}(C)$ nonzeros, we need at least $\Omega(\text{nnz}(A) + \text{nnz}(B) + \text{nnz}(C))$ time
  • Current algorithms are far from this bound for general sparsity patterns
  • For specific sparsity structures (banded, block-sparse), specialized algorithms exist

Why it matters for ML: Sparse attention, sparse weight matrices (structured pruning), and graph neural networks all involve sparse matrix operations. Faster sparse matrix multiplication would directly speed up these methods.
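A quick way to see why the output term $\text{nnz}(C)$ belongs in the lower bound is that sparse-times-sparse products are often much denser than either input. The scipy sketch below uses arbitrary sizes and densities purely for illustration.

```python
import scipy.sparse as sp

n, density = 10_000, 1e-3

# Two random sparse matrices in CSR format (~n^2 * density nonzeros each).
A = sp.random(n, n, density=density, format="csr", random_state=0)
B = sp.random(n, n, density=density, format="csr", random_state=1)

# Sparse-sparse product (SpGEMM); cost and output size depend on how the
# sparsity patterns interact, not just on nnz(A) and nnz(B).
C = A @ B

print("nnz(A) =", A.nnz)
print("nnz(B) =", B.nnz)
print("nnz(C) =", C.nnz)  # typically several times larger than nnz(A) + nnz(B)
```

With these parameters each of the roughly $10^5$ nonzeros of $A$ meets about ten nonzeros in the matching row of $B$, so $\text{nnz}(C)$ lands near $10^6$: the output alone dominates the input sizes, and any algorithm must pay for writing it.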

Problem 4: Optimal Eigenvalue Computation

Status: The optimal complexity of computing all eigenvalues of an $n \times n$ matrix is not known precisely.

Known results:

  • The QR algorithm computes all eigenvalues in $O(n^3)$ operations (after reduction to Hessenberg form, each iteration costs $O(n^2)$, and typically $O(n)$ iterations are needed)
  • For the top $k$ eigenvalues, Krylov methods (Lanczos, Arnoldi) cost $O(k \cdot \text{nnz} \cdot \text{iterations})$ where $\text{nnz}$ is the number of nonzeros
  • Can we compute all eigenvalues in $O(n^\omega)$? This is open.

Connection to ML: PCA requires the top $k$ eigenvalues and eigenvectors of $X^T X$. Spectral clustering requires eigenvalues of the graph Laplacian. The cost of these operations limits the scalability of spectral methods.
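The two regimes above can be contrasted directly. This is a hedged sketch on a synthetic covariance-style matrix (the size $n = 2000$ and $k = 10$ are arbitrary): a dense LAPACK solve for the full spectrum versus an ARPACK/Lanczos solve for the top $k$ via scipy.

```python
import time
import numpy as np
from scipy.sparse.linalg import eigsh

rng = np.random.default_rng(0)
n, k = 2000, 10
X = rng.standard_normal((n, n))
S = X @ X.T / n                     # symmetric positive semidefinite test matrix

t0 = time.perf_counter()
all_eigs = np.linalg.eigvalsh(S)    # dense O(n^3) path: all n eigenvalues
t_full = time.perf_counter() - t0

t0 = time.perf_counter()
top_k, _ = eigsh(S, k=k, which="LA")  # Lanczos: only the k largest eigenvalues
t_topk = time.perf_counter() - t0

print(f"full spectrum: {t_full:.2f}s   top-{k} via Lanczos: {t_topk:.3f}s")
assert np.allclose(np.sort(top_k), all_eigs[-k:], atol=1e-6)
```

For matrices of PCA scale the Lanczos path is typically orders of magnitude cheaper, which is why spectral ML methods almost always compute a partial, not full, eigendecomposition.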

Problem 5: Randomized Numerical Linear Algebra

Can we get "good enough" answers with random projections?

Theorem

Randomized SVD Approximation

Statement

The randomized SVD algorithm (Halko, Martinsson, Tropp, 2011) produces a rank-$k$ approximation $\tilde{A}$ satisfying:

$$\mathbb{E}\big[\|A - \tilde{A}\|_F\big] \;\leq\; \left(1 + \frac{k}{p-1}\right)^{1/2} \left(\sum_{j>k} \sigma_j^2\right)^{1/2}$$

in $O(mn \log k + (m+n)k^2)$ time, where $p$ is the oversampling parameter and $\sigma_j$ are the singular values of $A$.

Intuition

Project the matrix onto a random low-dimensional subspace, then compute exact SVD in the reduced space. The random projection preserves the top singular directions with high probability if the oversampling parameter $p$ is large enough (typically $p = 5$ or $10$ suffices).
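A minimal numpy sketch of this prototype (following the structure of the Halko-Martinsson-Tropp range finder; it omits power iterations and other refinements, and the test matrix with geometric singular-value decay is synthetic):

```python
import numpy as np

def randomized_svd(A, k, p=10, seed=None):
    """Rank-k randomized SVD with oversampling p (basic HMT-style prototype)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Omega = rng.standard_normal((n, k + p))    # random test matrix
    Y = A @ Omega                              # sample the range of A
    Q, _ = np.linalg.qr(Y)                     # orthonormal basis, m x (k+p)
    B = Q.T @ A                                # small (k+p) x n problem
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Ub                                 # lift back to the original space
    return U[:, :k], s[:k], Vt[:k, :]

# Synthetic matrix with rapidly decaying singular values sigma_j = 2^{-j}.
rng = np.random.default_rng(0)
m, n, k = 2000, 1000, 20
U0, _ = np.linalg.qr(rng.standard_normal((m, n)))
V0, _ = np.linalg.qr(rng.standard_normal((n, n)))
sigma = 2.0 ** -np.arange(n)
A = (U0 * sigma) @ V0.T

U, s, Vt = randomized_svd(A, k, seed=1)
err = np.linalg.norm(A - (U * s) @ Vt, "fro")
tail = np.sqrt(np.sum(sigma[k:] ** 2))         # best possible rank-k error
print(f"randomized error {err:.3e}  vs  optimal tail {tail:.3e}")
```

With a fast-decaying spectrum the randomized error sits within a small factor of the optimal tail, as the theorem promises; with a flat spectrum the same code runs but the gap widens, which is the failure mode discussed below.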

Proof Sketch

The key step is showing that a random Gaussian matrix $\Omega \in \mathbb{R}^{n \times (k+p)}$ yields a subspace $\text{range}(A\Omega)$ that captures most of the energy in the top $k$ singular directions. This follows from concentration inequalities for random projections.

Why It Matters

For $m, n \gg k$ (which is the typical case in ML), randomized SVD is dramatically faster than deterministic $O(mn \min(m,n))$ SVD. It makes PCA on million-dimensional data feasible.

Failure Mode

The guarantee is in expectation and involves the tail singular values. If the spectrum does not decay (all singular values are similar), the approximation quality degrades. Also, the algorithm requires two passes over the data (or one pass with a slight accuracy loss), which matters for out-of-core computation.

Open questions in randomized NLA:

  • What are the optimal tradeoffs between number of random samples and approximation quality?
  • Can single-pass algorithms match multi-pass accuracy?
  • How robust are randomized methods to adversarial inputs?

Problem 6: Communication Complexity of Matrix Operations

Modern hardware is limited not by arithmetic (FLOPS are cheap) but by data movement (memory bandwidth is expensive). The communication complexity of matrix multiplication (how much data must be moved between levels of memory hierarchy) is partially understood.

Known: For dense matrix multiplication, the communication lower bound is $\Omega(n^3 / \sqrt{M})$ words moved for cache size $M$ (Hong & Kung, 1981). Tuned GEMM implementations achieve this bound.

Open: What are the communication-optimal algorithms for structured matrix operations (sparse, banded, low-rank)?

Why it matters: On modern GPUs, arithmetic throughput exceeds memory bandwidth by 10x or more. Algorithms that minimize data movement, even at the cost of extra arithmetic, can be faster in practice.
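The mechanism behind the Hong-Kung bound is tiling: choose a block size $b$ so that a few $b \times b$ tiles fit in fast memory, and reuse each loaded tile $b$ times. The pure-Python sketch below only illustrates the loop structure (real kernels do this in CUDA or C with hardware-specific tile sizes); the block size 64 is an arbitrary stand-in for roughly $\sqrt{M/3}$, and it assumes the matrix dimension is divisible by the block size.

```python
import numpy as np

def blocked_matmul(A, B, b=64):
    """Tiled matrix multiply: the loop order maximizes reuse of loaded tiles.

    Each b x b tile of A and B is read O(n / b) times, so roughly O(n^3 / b)
    words move between slow and fast memory, matching the Hong-Kung bound
    when b ~ sqrt(M). Illustrative only; use a tuned BLAS for real work.
    """
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, b):            # tile row of C
        for j in range(0, n, b):        # tile column of C
            acc = np.zeros((b, b))      # this C tile stays resident ("in cache")
            for k in range(0, n, b):    # stream tiles of A and B through it
                acc += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
            C[i:i+b, j:j+b] = acc
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((512, 512))
B = rng.standard_normal((512, 512))
assert np.allclose(blocked_matmul(A, B), A @ B)
```

The arithmetic count is identical to the naive triple loop; only the order of memory accesses changes, which is exactly why communication-optimal scheduling can matter more than shaving FLOPs.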

Problem 7: Operator-Fused Matrix Computation in ML

The most consequential matrix-algorithm work for ML in 2022-2026 has not been about lowering $\omega$. It has been about fusing matrix multiplications with their surrounding nonlinearities to avoid materializing intermediate results.

FlashAttention (Dao et al., 2022; FlashAttention-2, 2023; FlashAttention-3, 2024) computes the attention output $\text{softmax}(QK^\top / \sqrt{d})\,V$ without ever materializing the $s \times s$ attention matrix. The key trick is a tiled, online softmax computation that streams blocks of $Q$, $K$, and $V$ through SRAM. The asymptotic FLOP count is unchanged, but the memory traffic drops by an order of magnitude, which is the actual bottleneck on modern accelerators.
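The heart of the trick is the online-softmax recurrence. Here is a simplified numpy sketch (single head, non-causal, tiled only over keys; the real FlashAttention kernels implement this recurrence inside fused GPU kernels with tiling over both queries and keys):

```python
import numpy as np

def attention_naive(Q, K, V):
    """Reference attention that materializes the full s x s score matrix."""
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def attention_online(Q, K, V, block=128):
    """Streams K/V in blocks; never forms the s x s attention matrix."""
    s, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q)              # running (unnormalized) output
    m = np.full(s, -np.inf)             # running row-wise max of scores
    l = np.zeros(s)                     # running softmax denominator
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start+block], V[start:start+block]
        Sb = Q @ Kb.T * scale           # scores against this key block only
        m_new = np.maximum(m, Sb.max(axis=-1))
        alpha = np.exp(m - m_new)       # rescale previously accumulated sums
        Pb = np.exp(Sb - m_new[:, None])
        l = l * alpha + Pb.sum(axis=-1)
        out = out * alpha[:, None] + Pb @ Vb
        m = m_new
    return out / l[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((1024, 64)) for _ in range(3))
assert np.allclose(attention_online(Q, K, V), attention_naive(Q, K, V))
```

The streaming version does the same arithmetic as the naive one but only ever holds an $s \times \text{block}$ slice of scores, which is why the win shows up in memory traffic rather than in FLOP counts.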

Open questions in operator fusion:

  • What is the right algebraic framework for systematically deriving fused kernels from unfused operator graphs? Triton, Pallas, and ThunderKittens are practical solutions; a clean theory is missing.
  • For low-rank, sparse, or block-structured attention variants, what are the optimal fused implementations?
  • Can fast matrix multiplication algorithms (Strassen-style) be fused with surrounding operations, or does the irregular access pattern preclude fusion?

Why it matters: For LLM inference and training in 2026, FlashAttention-class kernels deliver larger end-to-end speedups than any improvement in $\omega$ would. The research frontier on practical matrix-algorithm work for ML lives here.

The Bigger Picture

These problems share a common theme: the gap between what is theoretically possible and what is practically achievable in matrix computation. For ML, the practical algorithms (dense GEMM, cuBLAS, FlashAttention) already squeeze most of the performance from hardware. But the theoretical questions remain:

  • Can we do better in principle?
  • Can theoretical improvements be made practical?
  • What are the fundamental limits?

Common Confusions

Watch Out

Fast matrix multiplication would not automatically speed up all of ML

Even if $\omega = 2$, the constant factor matters. An $O(n^2)$ algorithm with constant $10^6$ is slower than an $O(n^3)$ algorithm with constant $1$ for all $n < 10^6$. The practical benefit depends on whether the fast algorithm can be made cache-friendly and parallelizable.

Watch Out

Randomized NLA is approximate, not heuristic

Randomized SVD and randomized matrix multiplication have formal guarantees. They are not "just random guessing." The approximation error is bounded with high probability, and the bounds are often near-optimal.

Summary

  • $\omega \in [2, 2.371552]$: the true exponent of matrix multiplication is unknown
  • No sub-Strassen algorithm is practical on current hardware
  • Sparse matrix multiplication lacks tight complexity bounds for general sparsity
  • Optimal eigenvalue computation complexity is open
  • Randomized methods provide near-optimal approximations in reduced time
  • Communication complexity (data movement) often matters more than arithmetic complexity

Exercises

Exercise · Advanced

Problem

Suppose you have a matrix $A \in \mathbb{R}^{10^6 \times 10^3}$ and want to compute the top 10 principal components. Compare the cost of: (a) exact SVD via $A^T A$, (b) randomized SVD with oversampling $p = 10$.

Exercise · Research

Problem

Explain why finding that $\omega = 2$ would not immediately make training large language models 1000x faster, even though it would reduce the asymptotic complexity of matrix multiplication from $O(n^3)$ to $O(n^2)$.

References

Canonical:

  • Golub & Van Loan, Matrix Computations (2013), comprehensive treatment of algorithms
  • Halko, Martinsson, Tropp, "Finding Structure with Randomness" (2011), SIAM Review. The randomized SVD bound used above.
  • Hong & Kung, "I/O Complexity: The Red-Blue Pebble Game" (STOC 1981). The Ω(n3/M)\Omega(n^3 / \sqrt{M}) communication lower bound.
  • Bürgisser, Clausen, Shokrollahi, Algebraic Complexity Theory (Springer 1997). Foundational treatment of $\omega$ and tensor rank.

Current:

  • Alman & Williams, "A Refined Laser Method and Faster Matrix Multiplication" (SODA 2024). Current best ω2.371552\omega \leq 2.371552.
  • Ambainis, Filmus, Le Gall, "Fast Matrix Multiplication: Limitations of the Laser Method" (STOC 2015). Laser-method barrier.
  • Fawzi et al., "Discovering Faster Matrix Multiplication Algorithms with Reinforcement Learning" (Nature 2022). AlphaTensor.
  • Dao, Fu, Ermon, Rudra & Ré, "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness" (NeurIPS 2022).
  • Dao, "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning" (2023).
  • Shah et al., "FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-Precision" (NeurIPS 2024).
  • Woodruff, "Sketching as a Tool for Numerical Linear Algebra" (2014).
  • Martinsson & Tropp, "Randomized Numerical Linear Algebra" (2020), Acta Numerica.


Last reviewed: April 26, 2026
