Numerical Optimization
Bayesian Optimization for Hyperparameters
Hyperparameter tuning as black-box optimization with a Gaussian process surrogate. Acquisition functions (EI, UCB, PI), sample efficiency for expensive evaluations, TPE as an alternative, and why grid search is wasteful.
Why This Matters
Training a neural network takes hours or days. Evaluating a single hyperparameter configuration requires a full training run. You cannot afford thousands of evaluations. Bayesian optimization uses a probabilistic model of the objective function to choose which configurations to evaluate next, achieving good results in far fewer evaluations than grid or random search.
The key insight: after each evaluation, you have information about the objective surface. Use that information to decide where to look next, rather than searching blindly.
Problem Setup
Hyperparameter Optimization as Black-Box Optimization
Let $f(\lambda)$ be the objective function (e.g., validation loss as a function of hyperparameters $\lambda$). We want to find:
$\lambda^* = \arg\min_{\lambda \in \Lambda} f(\lambda)$
We can only evaluate $f$ at chosen points. Each evaluation is expensive (a full training run). We do not have access to gradients $\nabla f$. The function may be noisy (validation loss varies across random seeds).
The Bayesian optimization loop:
- Fit a surrogate model to all observations $\{(\lambda_i, y_i)\}_{i=1}^{n}$
- Use the surrogate to compute an acquisition function $\alpha(\lambda)$ and choose $\lambda_{n+1} = \arg\max_\lambda \alpha(\lambda)$
- Evaluate $f$ at $\lambda_{n+1}$ to obtain $y_{n+1}$
- Add $(\lambda_{n+1}, y_{n+1})$ to the observations. Repeat.
The most important practical distinction is that there are two optimization problems in the loop. The outer problem is the expensive one you really care about: train a model and read off validation loss. The inner problem is cheap: optimize the acquisition function to decide which expensive run is still worth paying for.
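The loop can be sketched in a few lines. This is a toy illustration, assuming a 1-D search space, a synthetic stand-in objective, scikit-learn's GP implementation, and Expected Improvement as the acquisition function; the function and all constants are invented for the example:

```python
# Minimal Bayesian optimization loop sketch (minimization).
# The "expensive" objective is a cheap synthetic function here.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(x):
    # Stand-in for "train a model, return validation loss".
    return np.sin(3 * x) + 0.1 * x ** 2

def expected_improvement(mu, sigma, f_best):
    # EI for minimization; zero wherever the posterior is certain.
    sigma = np.maximum(sigma, 1e-12)
    z = (f_best - mu) / sigma
    return (f_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

rng = np.random.default_rng(0)
grid = np.linspace(-2, 2, 401).reshape(-1, 1)   # cheap inner search
X = rng.uniform(-2, 2, size=(3, 1))             # small initial design
y = objective(X).ravel()

for _ in range(10):
    # Outer loop: one expensive evaluation per iteration.
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(X, y)
    mu, sigma = gp.predict(grid, return_std=True)
    # Inner loop: maximize the (cheap) acquisition over the grid.
    x_next = grid[np.argmax(expected_improvement(mu, sigma, y.min()))]
    X = np.vstack([X, x_next])
    y = np.append(y, objective(x_next)[0])

print(f"best loss after {len(y)} evaluations: {y.min():.3f}")
```

Note how the inner `argmax` over a dense grid is essentially free compared with one call to the real objective; in higher dimensions the inner problem is itself optimized with a gradient method or random restarts.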
The Surrogate Model: Gaussian Process
A Gaussian process provides a posterior distribution over $f$ given observations. After observing $n$ evaluations, the GP posterior gives a mean prediction $\mu_n(\lambda)$ and uncertainty $\sigma_n^2(\lambda)$ at any point $\lambda$:
$f(\lambda) \mid \mathcal{D}_n \sim \mathcal{N}\big(\mu_n(\lambda), \sigma_n^2(\lambda)\big)$
The mean $\mu_n(\lambda)$ is the best estimate of $f(\lambda)$. The variance $\sigma_n^2(\lambda)$ is high where we have few observations (uncertain regions) and low near observed points.
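This confident-near-data, uncertain-far-from-data behavior is easy to see numerically. A quick illustration with toy data and an assumed RBF kernel whose hyperparameters are frozen (`optimizer=None`) so the output is deterministic:

```python
# GP posterior uncertainty: small near observations, large far away.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

X_obs = np.array([[0.0], [1.0], [2.0]])
y_obs = np.array([1.0, 0.2, 0.5])

# Fixed kernel hyperparameters (optimizer=None) for a deterministic demo.
gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5),
                              optimizer=None).fit(X_obs, y_obs)

mu_near, sd_near = gp.predict(np.array([[1.0]]), return_std=True)
mu_far, sd_far = gp.predict(np.array([[5.0]]), return_std=True)

print(f"posterior std at an observed point: {sd_near[0]:.4f}")  # near zero
print(f"posterior std far from the data:   {sd_far[0]:.4f}")    # near prior scale
```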
Figure: Bayesian optimization pays for one more expensive run only where it still learns something.
The surrogate model estimates validation loss across the search space. The acquisition rule then asks a second question: where is the combination of low predicted loss and unresolved uncertainty still worth another full training run?
Expected Improvement (EI): spends budget where the mean already looks promising and the uncertainty is still wide enough to pay for another run.
Upper Confidence Bound (UCB): moves farther into uncertain regions as the exploration weight grows, which is why it comes with regret theory but also a tuning knob.
Probability of Improvement (PI): asks only whether improvement is likely, so it often hugs the incumbent and can get greedy too early.
Why this beats grid search
The point is not that the GP is magically correct. The point is that the optimizer keeps carrying forward information from earlier runs instead of spending the budget on regions that already look clearly bad.
Where the method breaks
If the search space is high-dimensional, heavily categorical, or dominated by random-seed noise, the surrogate can become less informative than a strong random-search baseline.
Acquisition Functions
The acquisition function balances exploitation (evaluate where $\mu_n(\lambda)$ is low, i.e., the current best guess) with exploration (evaluate where $\sigma_n(\lambda)$ is high, i.e., uncertain regions that might contain the optimum).
Expected Improvement Closed Form
Statement
The Expected Improvement acquisition function has a closed-form expression under a GP posterior. For minimization with incumbent $f_{\text{best}} = \min_i y_i$:
$\mathrm{EI}(\lambda) = \big(f_{\text{best}} - \mu(\lambda)\big)\,\Phi(z) + \sigma(\lambda)\,\phi(z)$
where $z = \big(f_{\text{best}} - \mu(\lambda)\big)/\sigma(\lambda)$, and $\Phi$, $\phi$ are the standard normal CDF and PDF respectively. When $\sigma(\lambda) = 0$, $\mathrm{EI}(\lambda) = 0$.
Intuition
EI asks: "In expectation, how much better than the current best would an evaluation at $\lambda$ be?" The first term rewards points predicted to be good ($\mu(\lambda) < f_{\text{best}}$). The second term rewards uncertainty ($\sigma(\lambda)$ large), because uncertain points might be much better than predicted. Both exploitation and exploration emerge from a single principled criterion.
Proof Sketch
$\mathrm{EI}(\lambda) = \mathbb{E}\big[\max(f_{\text{best}} - f(\lambda), 0)\big] = \int_{-\infty}^{f_{\text{best}}} (f_{\text{best}} - y)\,\mathcal{N}(y;\mu,\sigma^2)\,dy$. Substitute $u = (y - \mu)/\sigma$ and split into two standard Gaussian integrals. The result follows from the identities $\int_{-\infty}^{z} \phi(u)\,du = \Phi(z)$ and $\int_{-\infty}^{z} u\,\phi(u)\,du = -\phi(z)$.
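The closed form can be sanity-checked against Monte Carlo sampling from the posterior at a single point; the posterior mean, standard deviation, and incumbent below are arbitrary numbers chosen for the check:

```python
# Verify the EI closed form (minimization convention) by Monte Carlo.
import numpy as np
from scipy.stats import norm

mu, sigma, f_best = 0.3, 0.5, 0.2   # arbitrary posterior at some lambda

# Closed form: EI = (f_best - mu) * Phi(z) + sigma * phi(z).
z = (f_best - mu) / sigma
ei_closed = (f_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

# Monte Carlo: E[max(f_best - f, 0)] with f ~ N(mu, sigma^2).
rng = np.random.default_rng(0)
samples = rng.normal(mu, sigma, size=1_000_000)
ei_mc = np.maximum(f_best - samples, 0).mean()

print(f"closed form: {ei_closed:.4f}, Monte Carlo: {ei_mc:.4f}")
```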
Why It Matters
EI is the most widely used acquisition function in practice. Its closed form makes it cheap to evaluate and optimize. It naturally balances exploration and exploitation without requiring a tuning parameter (unlike UCB).
Failure Mode
EI can be myopic: it optimizes the expected improvement for the next single evaluation, not for a budget of remaining evaluations. It can also get stuck in local optima of the acquisition function if the GP posterior is multimodal.
Upper Confidence Bound (UCB): choose $\lambda_{n+1} = \arg\min_\lambda \big[\mu_n(\lambda) - \beta_n^{1/2}\sigma_n(\lambda)\big]$ (a lower confidence bound in the minimization convention), where $\beta_n$ is a parameter controlling exploration. Larger $\beta_n$ means more exploration.
Probability of Improvement (PI): choose the point most likely to improve on the incumbent: $\mathrm{PI}(\lambda) = P\big(f(\lambda) < f_{\text{best}}\big) = \Phi(z)$, with the same $z$ as EI. PI is purely exploitative and tends to over-exploit in practice.
In real hyperparameter tuning, the winner is often not the acquisition function with the prettiest formula. It is the one whose behavior matches your budget and noise level. EI is a strong default because it usually balances "likely good" against "still unresolved." UCB is easier to analyze theoretically because the exploration weight is explicit. PI is easy to compute, but it is usually the first one to get stuck hugging the current incumbent.
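A small numerical comparison makes the behavioral difference concrete. Two hypothetical candidates are scored under all three rules (minimization convention, incumbent at 0; all numbers are invented for illustration): candidate A has a marginally better mean and almost no uncertainty, candidate B has a worse mean but wide uncertainty.

```python
# Score two candidates under PI, EI, and a confidence-bound rule.
import numpy as np
from scipy.stats import norm

f_best = 0.0
mu = np.array([-0.05, 0.10])    # A: slightly beats incumbent; B: worse mean
sigma = np.array([0.01, 0.50])  # A: nearly resolved; B: very uncertain

z = (f_best - mu) / sigma
pi = norm.cdf(z)                                        # prob. of improvement
ei = (f_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)  # expected improvement
lcb = mu - 2.0 * sigma          # lower confidence bound, exploration weight 2

print("PI picks :", "A" if pi[0] > pi[1] else "B")   # A: improvement near-certain
print("EI picks :", "A" if ei[0] > ei[1] else "B")   # B: large upside potential
print("UCB picks:", "A" if lcb[0] < lcb[1] else "B") # B: optimistic bound lower
```

PI prefers the tiny, near-certain improvement at A; EI and the confidence-bound rule both pay for the unresolved upside at B. This is the incumbent-hugging behavior described above, in two lines of arithmetic.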
Regret Bounds
GP-UCB Cumulative Regret Bound
Statement
Let $f$ be a function in the RKHS of kernel $k$ with RKHS norm at most $B$. Running GP-UCB with a suitable confidence schedule $\beta_t$ for $T$ rounds, the cumulative regret satisfies, with probability at least $1 - \delta$:
$R_T = \sum_{t=1}^{T} \big(f(\lambda_t) - f(\lambda^*)\big) = \mathcal{O}\big(\sqrt{T\,\beta_T\,\gamma_T}\big)$
where $\gamma_T$ is the maximum information gain after $T$ observations, which depends on the kernel. For the squared exponential kernel in $d$ dimensions, $\gamma_T = \mathcal{O}\big((\log T)^{d+1}\big)$.
Intuition
The regret grows sublinearly in $T$: the average regret per evaluation goes to zero. The rate depends on the kernel's information gain $\gamma_T$, which measures how quickly the GP learns about $f$. Smoother functions (smaller $\gamma_T$) are easier to optimize.
Proof Sketch
At each round, the UCB acquisition function ensures that $\mu_{t-1}(\lambda_t) - \beta_t^{1/2}\sigma_{t-1}(\lambda_t) \le f(\lambda^*)$ with high probability. The regret at round $t$ is bounded by $2\beta_t^{1/2}\sigma_{t-1}(\lambda_t)$. Summing and applying Cauchy-Schwarz gives $R_T \le \sqrt{T \sum_{t=1}^{T} 4\beta_t \sigma_{t-1}^2(\lambda_t)}$. The sum of predictive variances is bounded by the information gain $\gamma_T$ (up to constants).
Why It Matters
This provides a theoretical guarantee for Bayesian optimization. It shows that GP-UCB finds near-optimal hyperparameters with a number of evaluations that depends on the smoothness of the objective (via $\gamma_T$) rather than the dimensionality of the search space (which grid search depends on exponentially).
Failure Mode
The bound requires $f$ to lie in the RKHS of the chosen kernel. If the kernel is misspecified (e.g., using a smooth SE kernel for a non-smooth objective), the bound does not hold. The bound also grows with input dimension through $\gamma_T$, making Bayesian optimization less efficient for high-dimensional search spaces (roughly $d \gtrsim 20$).
Tree-Structured Parzen Estimator (TPE)
TPE (Bergstra et al., 2011) models $p(\lambda \mid y)$ instead of $p(y \mid \lambda)$. It splits observed hyperparameters into "good" ($y < y^*$, the best quantile) and "bad" ($y \ge y^*$) groups, fits kernel density estimators $\ell(\lambda)$ and $g(\lambda)$ to each, and maximizes the ratio $\ell(\lambda)/g(\lambda)$.
Advantage over GP: TPE handles categorical and conditional hyperparameters naturally (e.g., "use Adam" vs "use SGD", where Adam has a $\beta_2$ parameter that SGD does not).
Disadvantage: TPE lacks the theoretical guarantees of GP-UCB. The density estimation can be poor with few observations.
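The good/bad split-and-ratio idea fits in a few lines. This is a rough sketch in one continuous dimension using SciPy's Gaussian KDE; the quantile threshold, the synthetic noisy objective, and the default bandwidths are assumptions, not the full Bergstra et al. algorithm:

```python
# Sketch of the TPE proposal rule: maximize l(lambda) / g(lambda).
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
lam = rng.uniform(0, 1, size=60)                       # past hyperparameters
loss = (lam - 0.3) ** 2 + 0.05 * rng.normal(size=60)   # noisy objective, optimum near 0.3

cut = np.quantile(loss, 0.25)          # "good" = best 25% of observed losses
l = gaussian_kde(lam[loss <= cut])     # density of good configurations
g = gaussian_kde(lam[loss > cut])      # density of bad configurations

candidates = np.linspace(0, 1, 201)
lam_next = candidates[np.argmax(l(candidates) / g(candidates))]
print(f"TPE proposes lambda = {lam_next:.2f}")   # should land near 0.3
```

For categorical dimensions the densities become per-category frequency estimates, which is why the conditional/categorical case is so much more natural here than with a GP kernel.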
Why Grid Search is Wasteful
Grid search evaluates a regular grid over the hyperparameter space. For $d$ hyperparameters with $k$ values each, it requires $k^d$ evaluations. With $d = 5$ and $k = 10$, this is $10^5 = 100{,}000$ evaluations.
Bergstra & Bengio (2012) showed that random search outperforms grid search in the same budget because: (1) in most problems, only a few hyperparameters matter, and (2) grid search wastes evaluations exploring unimportant dimensions. Random search projects uniformly onto each dimension, giving better coverage of the important ones.
Bayesian optimization further improves on random search by using past results to guide future evaluations.
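The coverage half of the Bergstra & Bengio argument is easy to verify directly: count the distinct values each dimension receives under the two schemes at a small toy budget.

```python
# Distinct values per dimension: grid reuses each axis value many
# times, random search gives every evaluation a fresh coordinate.
import itertools
import numpy as np

d, k, budget = 3, 3, 27
grid = list(itertools.product(np.linspace(0, 1, k), repeat=d))   # 27 points
rng = np.random.default_rng(0)
rand = rng.uniform(0, 1, size=(budget, d))                       # 27 points

for dim in range(d):
    n_grid = len({pt[dim] for pt in grid})
    n_rand = len(set(rand[:, dim]))
    print(f"dim {dim}: grid sees {n_grid} distinct values, random sees {n_rand}")
```

With the same budget of 27 evaluations, grid search probes only 3 distinct values per axis, while random search probes 27. If only one of the three dimensions actually matters, random search has explored it nine times more finely.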
Common Confusions
Bayesian optimization is not Bayesian hyperparameter inference
Bayesian optimization uses a Bayesian model (the GP) as a tool for optimization. It does not compute a posterior over hyperparameters. The GP models the objective function $f$, not the hyperparameter distribution. The output is a point estimate $\hat{\lambda}$, not a posterior.
GP-UCB and EI are not equivalent
Both balance exploration and exploitation, but differently. UCB has an explicit exploration parameter $\beta$ that must be set. EI has no explicit parameter but can be myopic. They lead to different evaluation sequences and different theoretical guarantees.
Canonical Examples
Tuning learning rate and weight decay
Objective: validation loss of a ResNet-50 as a function of learning rate (log scale) and weight decay (log scale). A GP with Matern-5/2 kernel over the log-transformed space. After 20 evaluations (each a full ImageNet training run), Bayesian optimization typically finds a configuration within 0.5% of the best found by 200 random search evaluations.
Summary
- Bayesian optimization uses a GP surrogate to model the objective function
- Acquisition functions (EI, UCB, PI) decide where to evaluate next
- EI has a closed form and balances exploration/exploitation without tuning
- GP-UCB has sublinear regret guarantees depending on kernel smoothness
- TPE handles categorical hyperparameters but lacks theoretical guarantees
- Grid search is exponential in dimension; random search is a better baseline
- Bayesian optimization is most valuable when each evaluation is expensive
Exercises
Problem
You have budget for 30 evaluations of a function with 3 continuous hyperparameters. Compare the number of distinct values each hyperparameter gets under grid search ($k = 3$ values each, giving $3^3 = 27$ evaluations) versus random search (30 random points). Which gives better coverage per dimension?
Problem
Derive the Expected Improvement when $\sigma(\lambda) \to 0$ and when $\mu(\lambda) = f_{\text{best}}$. Interpret both cases.
Problem
You are tuning a large language model with a search space containing optimizer choice, learning-rate schedule, weight decay, warmup length, batch size, and several categorical architecture knobs. Seed-to-seed noise is large. Why can a careful random-search baseline still beat Bayesian optimization here even if the BO implementation is mathematically correct?
References
Canonical and technical:
- Jones, Schonlau, and Welch, "Efficient Global Optimization of Expensive Black-Box Functions" (1998)
- Srinivas et al., "Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design" (2010), Sections 2-5
- Bergstra and Bengio, "Random Search for Hyper-Parameter Optimization" (2012)
Practice and synthesis:
- Bergstra et al., "Algorithms for Hyper-Parameter Optimization" (TPE, 2011)
- Snoek, Larochelle, and Adams, "Practical Bayesian Optimization of Machine Learning Algorithms" (2012)
- Shahriari et al., "Taking the Human Out of the Loop: A Review of Bayesian Optimization" (2016), Sections 2-5
- Frazier, "A Tutorial on Bayesian Optimization" (2018), Sections 1-3
Next Topics
- Gaussian Processes Regression for the posterior formulas and kernel fitting underneath the surrogate
- Multi-Armed Bandits Theory for the exploration-versus-exploitation logic behind regret bounds
- Model Evaluation Best Practices for the train/validation/test discipline that keeps hyperparameter search honest
Last reviewed: April 23, 2026