Algorithms Foundations
Open Problems in Matrix Computation
The unsolved questions in numerical linear algebra: the true exponent of matrix multiplication, practical fast algorithms, sparse matrix multiplication, randomized methods, and why these matter for scaling ML.
Why This Matters
Matrix computation is the bottleneck of ML training and inference. The largest language models spend the vast majority of their training and inference FLOPs on matrix multiplications. Any theoretical breakthrough in matrix algorithms would have immediate practical implications, yet several fundamental questions remain open after decades of research.
This page catalogs the major open problems, states what is known, and explains why each matters for ML.
Problem 1: The True Exponent of Matrix Multiplication
Status: Open since 1969.
Matrix Multiplication Exponent
ω = inf { c : two n×n matrices can be multiplied using O(n^c) arithmetic operations }.
Known bounds: 2 ≤ ω ≤ 2.371.
The lower bound is trivial (you must at least read the output). The upper bound comes from Alman and Williams (2024). The gap between 2 and 2.371 remains.
What would resolve it: Either a proof that ω > 2 (a super-quadratic lower bound for matrix multiplication) or an algorithm achieving O(n^(2+ε)) for arbitrarily small ε > 0.
Barriers: All current upper bound techniques use the "laser method" and its extensions. Ambainis, Filmus, and Le Gall (2015) proved a formal barrier: any algorithm derived purely from the Coppersmith-Winograd tensor via the laser method is bounded below by an exponent strictly above 2. The "group-theoretic" approach proposed by Cohn and Umans is one route around this, but no concrete algorithm using it has matched the laser-method bounds. New ideas are needed.
Machine-discovered improvements: DeepMind's AlphaTensor (Fawzi et al., Nature 2022) used reinforcement learning to find new low-rank decompositions of the matrix multiplication tensor. The system rediscovered Strassen's scheme and improved several specific block-size cases, including a 4×5 by 5×5 multiplication using 76 multiplications. Follow-up systems (AlphaEvolve, 2024-2025) have continued to find domain-specific improvements. None of these have improved the asymptotic exponent ω, but they have expanded the catalogue of practical schemes for fixed block sizes.
Why it matters for ML: If ω = 2, then in principle a product of two n×n matrices (e.g., a linear layer's weight matrix applied to a batch of n activations) could cost O(n^2) instead of the naive O(n^3). For a transformer with hidden dimension n ≈ 10^4 (GPT-4 scale), that exponent difference is worth a factor of up to roughly 10^4.
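Strassen's scheme, the starting point for every sub-cubic algorithm, fits in a few lines of NumPy. The sketch below applies one level of the recursion (7 block multiplies instead of 8, plus 18 block additions) to square matrices of even dimension; it is illustrative, not a tuned implementation.

```python
import numpy as np

def strassen_once(A, B):
    """One level of Strassen's recursion: 7 block multiplications
    instead of 8, at the cost of 18 extra block additions."""
    n = A.shape[0]          # assumes square matrices with even n
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    M1 = (A11 + A22) @ (B11 + B22)
    M2 = (A21 + A22) @ B11
    M3 = A11 @ (B12 - B22)
    M4 = A22 @ (B21 - B11)
    M5 = (A11 + A12) @ B22
    M6 = (A21 - A11) @ (B11 + B12)
    M7 = (A12 - A22) @ (B21 + B22)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C
```

Recursing on each of the 7 block products (instead of calling `@`) yields the O(n^log2(7)) ≈ O(n^2.807) algorithm.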
Problem 2: A Practical Fast Algorithm
Status: Open. No known algorithm with exponent below 2.807 (Strassen) is practical.
The Practicality Barrier
Statement
All known matrix multiplication algorithms with exponent below Strassen's 2.807 have constants and memory access patterns that make them slower than well-tuned implementations of the naive algorithm for matrices of sizes encountered in practice (dimensions up to roughly 10^4).
Intuition
The theoretical algorithms achieve lower exponents by doing more additions and subtractions on cleverly chosen linear combinations of matrix entries. This requires irregular memory access. Modern hardware (GPUs, tensor cores) achieves peak throughput only for regular, predictable access patterns. The theoretical savings in arithmetic are overwhelmed by the practical cost of memory irregularity.
Proof Sketch
This is an empirical observation, not a mathematical theorem. Benchmarks consistently show that highly optimized GEMM implementations (using tiling, SIMD, tensor cores) outperform Strassen-based implementations for matrices with dimensions up to roughly 10^4 on current hardware.
Why It Matters
The disconnect between theoretical complexity and practical performance is one of the most important open issues in numerical computing. Closing this gap would require either hardware that supports irregular access patterns efficiently or algorithms that achieve low exponents with regular access patterns.
Failure Mode
This "barrier" is hardware-dependent and may not persist as architectures evolve. Some research explores hardware co-design where the algorithm and chip are designed together. Also, for specific structured matrices (sparse, low-rank, Toeplitz), specialized algorithms already beat the naive O(n^3) algorithm.
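The constant-factor problem can be made concrete by counting arithmetic operations under a simple model (the cutoff value below is an illustrative choice, and the model ignores memory traffic entirely, which in practice makes things worse for Strassen):

```python
def naive_ops(n):
    """Arithmetic ops for the schoolbook algorithm:
    n^3 multiplications plus n^2(n-1) additions."""
    return 2 * n**3 - n**2

def strassen_ops(n, cutoff=64):
    """Arithmetic ops for Strassen recursion down to a cutoff,
    below which the schoolbook algorithm takes over.
    One level: 7 recursive multiplies + 18 additions of (n/2)^2 blocks."""
    if n <= cutoff:
        return naive_ops(n)
    return 7 * strassen_ops(n // 2, cutoff) + 18 * (n // 2) ** 2
```

Under this count, even at n = 4096 Strassen saves only about a factor of two in arithmetic, a margin that irregular memory access on real hardware readily erases.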
Problem 3: Sparse Matrix Multiplication
Given sparse matrices A and B with nnz(A) and nnz(B) nonzero entries respectively, can we multiply them faster?
Known results:
- If the output C = AB has nnz(C) nonzeros, we need at least Ω(nnz(C)) time just to write it down
- Current algorithms are far from this bound for general sparsity patterns
- For specific sparsity structures (banded, block-sparse), specialized algorithms exist
Why it matters for ML: Sparse attention, sparse weight matrices (structured pruning), and graph neural networks all involve sparse matrix operations. Faster sparse matrix multiplication would directly speed up these methods.
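Most practical sparse-times-sparse kernels follow Gustavson's row-by-row formulation, which does work proportional to the actual multiplications rather than n^3. A minimal sketch using dict-of-rows storage for readability (real implementations use compressed CSR arrays):

```python
def spgemm(A_rows, B_rows):
    """Gustavson's row-wise sparse matrix multiply: row i of C is
    sum over k of A[i, k] * (row k of B).  Inputs map
    row index -> {column: value}.  Total work is proportional to
    sum over nonzeros A[i, k] of nnz(row k of B), not n^3."""
    C_rows = {}
    for i, a_row in A_rows.items():
        acc = {}                      # sparse accumulator for row i of C
        for k, a_ik in a_row.items():
            for j, b_kj in B_rows.get(k, {}).items():
                acc[j] = acc.get(j, 0.0) + a_ik * b_kj
        if acc:
            C_rows[i] = acc
    return C_rows
```

The sparse accumulator `acc` is the part that is hard to make fast on GPUs: its size and access pattern depend on the input's sparsity structure, which is exactly why general SpGEMM remains far from the lower bound.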
Problem 4: Optimal Eigenvalue Computation
Status: The optimal complexity of computing all eigenvalues of an n×n matrix is not known precisely.
Known results:
- The QR algorithm computes all eigenvalues in O(n^3) operations (O(n^2) per iteration after reduction to Hessenberg form, with O(n) iterations typically needed)
- For the top k eigenvalues, Krylov methods (Lanczos, Arnoldi) cost O(k · nnz(A)), where nnz(A) is the number of nonzeros
- Can we compute all eigenvalues in O(n^2) time? This is open.
Connection to ML: PCA requires the top k eigenvalues of the covariance matrix X^T X. Spectral clustering requires the smallest eigenvalues of the graph Laplacian. The cost of these operations limits the scalability of spectral methods.
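The Krylov-style methods mentioned above touch the matrix only through matrix-vector products. A minimal sketch of the idea, using block power (subspace) iteration rather than full Lanczos for brevity; it assumes a symmetric A whose top-k eigenvalues are the largest in magnitude:

```python
import numpy as np

def top_k_eigs(A, k, iters=200, seed=0):
    """Approximate top-k eigenvalues of a symmetric matrix A by
    subspace iteration.  Each step costs one A @ Q product
    (O(k * nnz(A)) for sparse A) plus a QR re-orthogonalization."""
    rng = np.random.default_rng(seed)
    Q = np.linalg.qr(rng.standard_normal((A.shape[0], k)))[0]
    for _ in range(iters):
        Q = np.linalg.qr(A @ Q)[0]
    # Rayleigh-Ritz step: eigenvalues of the small k x k projection
    return np.linalg.eigvalsh(Q.T @ A @ Q)
```

Lanczos replaces the dense k-dimensional block with a three-term recurrence, cutting the per-iteration cost further; the matrix-vector-only access pattern is what makes these methods viable at graph and covariance-matrix scale.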
Problem 5: Randomized Numerical Linear Algebra
Can we get "good enough" answers with random projections?
Randomized SVD Approximation
Statement
The randomized SVD algorithm (Halko, Martinsson, Tropp, 2011) produces a rank-k approximation Â of an m×n matrix A satisfying:
E ‖A − Â‖_F ≤ (1 + k/(p−1))^(1/2) · (Σ_{j>k} σ_j^2)^(1/2)
in O(mn(k+p)) time, where p is the oversampling parameter and σ_j are the singular values of A.
Intuition
Project the matrix onto a random low-dimensional subspace, then compute exact SVD in the reduced space. The random projection preserves the top singular directions with high probability if the oversampling parameter is large enough (typically p = 5 or p = 10 suffices).
Proof Sketch
The key step is showing that a random Gaussian matrix yields a subspace that captures most of the energy in the top singular directions. This follows from concentration inequalities for random projections.
Why It Matters
For k ≪ min(m, n) (which is the typical case in ML), randomized SVD is dramatically faster than deterministic SVD. It makes PCA on million-dimensional data feasible.
Failure Mode
The guarantee is in expectation and involves the tail singular values. If the spectrum does not decay (all singular values are similar), the approximation quality degrades. Also, the algorithm requires two passes over the data (or one pass with a slight accuracy loss), which matters for out-of-core computation.
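The procedure itself is only a few lines. A minimal sketch of the basic two-pass version following Halko-Martinsson-Tropp, without the power-iteration refinement they recommend for slowly decaying spectra:

```python
import numpy as np

def randomized_svd(A, k, p=10, seed=0):
    """Rank-k approximate SVD via random projection (Halko et al., 2011).
    Pass 1 samples the range of A; pass 2 projects and solves exactly."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Omega = rng.standard_normal((n, k + p))   # Gaussian test matrix
    Q = np.linalg.qr(A @ Omega)[0]            # orthonormal range basis (pass 1)
    B = Q.T @ A                               # small (k+p) x n matrix (pass 2)
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]
```

The two passes over A are visible as the two products `A @ Omega` and `Q.T @ A`; single-pass variants estimate both from one sketch at some cost in accuracy, which is exactly the open tradeoff listed below.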
Open questions in randomized NLA:
- What are the optimal tradeoffs between number of random samples and approximation quality?
- Can single-pass algorithms match multi-pass accuracy?
- How robust are randomized methods to adversarial inputs?
Problem 6: Communication Complexity of Matrix Operations
Modern hardware is limited not by arithmetic (FLOPS are cheap) but by data movement (memory bandwidth is expensive). The communication complexity of matrix multiplication (how much data must be moved between levels of memory hierarchy) is partially understood.
Known: For dense matrix multiplication, the communication lower bound is Ω(n^3 / √M) words moved for fast-memory (cache) size M (Hong-Kung, 1981). Tuned GEMM implementations achieve this bound.
Open: What are the communication-optimal algorithms for structured matrix operations (sparse, banded, low-rank)?
Why it matters: On modern GPUs, arithmetic throughput exceeds memory bandwidth by 10x or more. Algorithms that minimize data movement, even at the cost of extra arithmetic, can be faster in practice.
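Blocking (tiling) is the standard way to approach the Hong-Kung bound: pick a block size b so that three b×b tiles fit in fast memory, and reuse each loaded tile across b multiply-add sweeps. A schematic version (b = 64 is an illustrative choice, not a tuned value):

```python
import numpy as np

def blocked_matmul(A, B, b=64):
    """Cache-blocked multiply of square matrices.  Each b x b tile of
    A and B is loaded once per (i, j, k) block triple and reused for
    b^3 multiply-adds, moving O(n^3 / b) words instead of O(n^3)."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, b):
        for j in range(0, n, b):
            for k in range(0, n, b):
                C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
    return C
```

Real GEMM kernels add further levels of blocking (registers, L1, L2, shared memory) with the same logic; note the extra arithmetic is zero here, only the schedule of data movement changes.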
Problem 7: Operator-Fused Matrix Computation in ML
The most consequential matrix-algorithm work for ML in 2022-2026 has not been about lowering ω. It has been about fusing matrix multiplications with their surrounding nonlinearities to avoid materializing intermediate results.
FlashAttention (Dao et al., 2022; FlashAttention-2, 2023; FlashAttention-3, 2024) computes the attention output softmax(QK^T / √d)·V without ever materializing the n×n attention matrix. The key trick is a tiled, online softmax computation that streams blocks of Q, K, and V through SRAM. The asymptotic FLOP count is unchanged, but the memory traffic drops by an order of magnitude, which is the actual bottleneck on modern accelerators.
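The heart of the trick is the online softmax: the output for a query can be accumulated block by block while carrying only a running maximum and normalizer, so the full score vector never exists. A NumPy sketch of the single-query case (illustrative of the rescaling identity, not the actual fused CUDA kernel):

```python
import numpy as np

def streaming_attention(q, K, V, block=64):
    """Attention output softmax(q K^T / sqrt(d)) V for one query q,
    computed over blocks of K and V with the online-softmax rescaling
    used by FlashAttention.  Only O(block + d) working memory."""
    d = q.shape[0]
    m = -np.inf                       # running max of scores seen so far
    l = 0.0                           # running softmax normalizer
    acc = np.zeros_like(V[0])         # running unnormalized output
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start+block], V[start:start+block]
        s = Kb @ q / np.sqrt(d)       # scores for this block only
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)     # rescale previous accumulator
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ Vb
        m = m_new
    return acc / l
```

The real kernel runs this per tile of queries, fused with the two matmuls inside SRAM; the backward pass recomputes the scores rather than storing them.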
Open questions in operator fusion:
- What is the right algebraic framework for systematically deriving fused kernels from unfused operator graphs? Triton, Pallas, and ThunderKittens are practical solutions; a clean theory is missing.
- For low-rank, sparse, or block-structured attention variants, what are the optimal fused implementations?
- Can fast matrix multiplication algorithms (Strassen-style) be fused with surrounding operations, or does the irregular access pattern preclude fusion?
Why it matters: For LLM inference and training in 2026, FlashAttention-class kernels deliver larger end-to-end speedups than any improvement in ω would. The research frontier on practical matrix-algorithm work for ML lives here.
The Bigger Picture
These problems share a common theme: the gap between what is theoretically possible and what is practically achievable in matrix computation. For ML, the practical algorithms (dense GEMM, cuBLAS, FlashAttention) already squeeze most of the performance from hardware. But the theoretical questions remain:
- Can we do better in principle?
- Can theoretical improvements be made practical?
- What are the fundamental limits?
Common Confusions
Fast matrix multiplication would not automatically speed up all of ML
Even if ω = 2, the constant factor matters. An algorithm that runs in 10^6 · n^2 operations is slower than one that runs in n^3 operations for all n < 10^6. The practical benefit depends on whether the fast algorithm can be made cache-friendly and parallelizable.
Randomized NLA is approximate, not heuristic
Randomized SVD and randomized matrix multiplication have formal guarantees. They are not "just random guessing." The approximation error is bounded with high probability, and the bounds are often near-optimal.
Summary
- 2 ≤ ω ≤ 2.371: the true exponent of matrix multiplication is unknown
- No sub-Strassen algorithm is practical on current hardware
- Sparse matrix multiplication lacks tight complexity bounds for general sparsity
- Optimal eigenvalue computation complexity is open
- Randomized methods provide near-optimal approximations in reduced time
- Communication complexity (data movement) often matters more than arithmetic complexity
Exercises
Problem
Suppose you have an m × n data matrix X with m ≫ n and want to compute the top 10 principal components. Compare the cost of: (a) exact SVD via the covariance matrix X^T X, (b) randomized SVD with oversampling parameter p.
Problem
Explain why finding ω = 2 would not immediately make training large language models 1000x faster, even though it would reduce the asymptotic complexity of matrix multiplication from O(n^3) to O(n^2).
References
Canonical:
- Golub & Van Loan, Matrix Computations (2013), comprehensive treatment of algorithms
- Halko, Martinsson, Tropp, "Finding Structure with Randomness" (2011), SIAM Review. The randomized SVD bound used above.
- Hong & Kung, "I/O Complexity: The Red-Blue Pebble Game" (STOC 1981). The communication lower bound.
- Bürgisser, Clausen, Shokrollahi, Algebraic Complexity Theory (Springer 1997). Foundational treatment of ω and tensor rank.
Current:
- Alman & Williams, "A Refined Laser Method and Faster Matrix Multiplication" (SODA 2024). Current best upper bound on ω.
- Ambainis, Filmus, Le Gall, "Fast Matrix Multiplication: Limitations of the Laser Method" (STOC 2015). Laser-method barrier.
- Fawzi et al., "Discovering Faster Matrix Multiplication Algorithms with Reinforcement Learning" (Nature 2022). AlphaTensor.
- Dao, Fu, Ermon, Rudra, Re, "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness" (NeurIPS 2022).
- Dao, "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning" (2023).
- Shah et al., "FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-Precision" (NeurIPS 2024).
- Woodruff, "Sketching as a Tool for Numerical Linear Algebra" (2014).
- Martinsson & Tropp, "Randomized Numerical Linear Algebra" (2020), Acta Numerica.
Next Topics
- Unsolved problems in computer science: the broader landscape including P vs NP
Last reviewed: April 26, 2026