Methodology
Convex Tinkering
Taleb's concept applied to ML research: designing small experiments with bounded downside and unbounded upside, and the precise conditions (convex payoff, nonzero variance, ex-post selection) under which this can dominate scale-first approaches.
Why This Matters
Most ML research budgets are spent on large runs that either confirm a hypothesis or waste compute. A 1000-GPU training run that fails teaches you one bit of information at enormous cost. A set of 100 small experiments on 10 GPUs each teaches you far more per dollar, and any single success can be scaled up.
Convex tinkering is the principle that you should design experiments where the downside is bounded (small cost if it fails) and the upside is unbounded (large payoff if it succeeds). This is not vague advice. It has a precise mathematical formulation rooted in the theory of convex functions and Jensen's inequality.
Mental Model
Consider two research strategies:
Strategy A (scale-first): Spend $1M training one large model. If the hypothesis is right, you get a strong result. If it is wrong, you lose $1M and learn only that this specific configuration failed.
Strategy B (tinker-first): Spend $10K each on 100 small experiments. Most will fail. But you learn something from every experiment, and the few successes can be scaled up. Your total downside is the same ($1M), but your expected information gain is far higher.
Strategy B can dominate Strategy A whenever the payoff function is convex in the space of experimental configurations and the experiments genuinely produce variance to exploit and the researcher can recognize the winners ex post. None of these conditions is automatic: if the payoff is concave in the relevant region, if outcomes are nearly deterministic, or if you cannot tell winners from losers, the strict ordering can flip.
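A minimal Monte Carlo sketch of this comparison (the success probabilities and trial counts below are illustrative assumptions, not calibrated estimates):

```python
import random

# Toy model -- all numbers are illustrative assumptions:
# Strategy A: one large run that succeeds with probability 0.3.
# Strategy B: 100 small runs at 1/100 the cost each, each succeeding with probability 0.05.
def simulate(trials=100_000, p_big=0.3, p_small=0.05, n_small=100):
    big_hits, small_hits = 0, 0
    for _ in range(trials):
        big_hits += random.random() < p_big
        small_hits += any(random.random() < p_small for _ in range(n_small))
    return big_hits / trials, small_hits / trials

p_a, p_b = simulate()
print(f"P(at least one success)  A: {p_a:.3f}   B: {p_b:.3f}")
# With these assumptions B holds at least one scalable success ~99% of the time
# (1 - 0.95**100 is about 0.994) versus 0.3 for the single bet, at equal total cost.
```

The point is not the specific numbers but the shape: as long as each small trial retains a nonzero chance of a scalable success, the portfolio's failure probability decays geometrically in the number of trials.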
Formal Setup and Notation
Convex Payoff Function
A payoff function $f : \mathcal{X} \to \mathbb{R}$ is convex if and only if for all $x, y \in \mathcal{X}$ and $\lambda \in [0, 1]$:
$$f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y).$$
In the context of ML experiments, $\mathcal{X}$ is the space of experimental configurations and $f(x)$ is the information value or research payoff of running experiment $x$. Convexity means that diversified experiments yield higher expected payoff than a single concentrated bet.
Optionality
An experiment has optionality when you can choose to act on its result or ignore it. Formally, the payoff of an experiment with uncertain outcome $X$ is:
$$\Pi(X) = \max(X - K, 0),$$
where $K$ is the threshold for a result to be useful. This is the payoff of a call option. By Jensen's inequality, if $\Pi$ is convex (which it is), then diversifying across many experiments with independent outcomes increases $\mathbb{E}[\Pi]$.
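A short numerical check of the call-option structure (the normal distribution and threshold below are arbitrary choices for illustration): holding the mean outcome fixed and increasing its variance raises the expected option payoff, exactly as Jensen's inequality predicts for the convex function $\max(\cdot - K, 0)$.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 1.0  # usefulness threshold (illustrative)

def option_value(sigma, n=1_000_000):
    """Monte Carlo estimate of E[max(X - K, 0)] for X ~ Normal(1, sigma)."""
    x = rng.normal(loc=1.0, scale=sigma, size=n)
    return np.maximum(x - K, 0.0).mean()

for sigma in [0.0, 0.5, 1.0, 2.0]:
    print(f"sigma={sigma:3.1f}   E[max(X-K,0)] ≈ {option_value(sigma):.3f}   "
          f"max(E[X]-K, 0) = {max(1.0 - K, 0.0):.3f}")
# The payoff at the mean stays 0 while the expected option payoff grows with sigma:
# a convex payoff turns outcome variance into expected value.
```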
Core Definitions
The downside of an experiment is the maximum you can lose: the compute cost, the researcher time, the opportunity cost of not running something else.
The upside is the maximum you can gain: a publishable result, a new capability, an insight that redirects the research program.
Call an experiment convex when its upside is much larger than its downside. A small ablation study costs a few GPU-hours and might reveal that a widely-used technique is unnecessary, saving thousands of GPU-hours for everyone. That is a convex bet.
Call an experiment concave when its downside is large relative to its upside. Training a model at full scale to match a known benchmark, when the outcome is either "we match" or "we do not match," is a concave bet. The information gain per dollar is low.
Main Theorems
Jensen's Inequality for Average Research Payoffs
Statement
Let $f$ be convex, and let $X_1, \dots, X_n$ be independent random variables representing experiment outcomes with $\mathbb{E}[X_i] = \mu$. Then the expected average payoff satisfies:
$$\frac{1}{n} \sum_{i=1}^{n} \mathbb{E}[f(X_i)] \;\ge\; f(\mu).$$
Running $n$ small experiments and averaging their payoffs yields at least as high an expected payoff as the payoff evaluated at the average configuration. When $f$ is strictly convex and $\mathrm{Var}(X_i) > 0$, the inequality is strict: diversified convex exposure is strictly better than a single bet at the mean.
Intuition
This result is about AVERAGES, not maxima. When the payoff function is convex, variance of the input amplifies the expected output. High-variance outcomes occasionally produce large successes, and the convex payoff function rewards these successes more than it penalizes the failures. This is the opposite of risk aversion: under a convex payoff, you want more variance (more diverse experiments), not less.
Proof Sketch
By convexity of $f$ and Jensen's inequality, $\mathbb{E}[f(X_i)] \ge f(\mathbb{E}[X_i]) = f(\mu)$ for each $i$. Averaging over $i$ preserves the bound. The inequality is strict whenever $f$ is strictly convex and $X_i$ has nonzero variance.
Why It Matters
This is the mathematical reason why a portfolio of small experiments with convex payoffs yields at least the payoff of a single experiment at the mean configuration. The result is about diversified convex exposure via averaging. The separate gain from optionality (taking the best of $n$ trials) requires a stronger argument via order statistics, handled in the proposition below.
Failure Mode
The inequality reverses when $f$ is concave. If there are large fixed costs (e.g., building infrastructure that only pays off at scale), then concentrating resources on one large project may dominate. The convexity assumption must be checked for each research context.
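A quick numerical illustration of both the theorem and its failure mode (the exponential outcome distribution and the two payoff functions are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=1_000_000)   # simulated experiment outcomes, mean 1

payoffs = {
    "convex  f(t) = t^2    ": lambda t: t**2,    # Jensen gain: E[f(X)] > f(E[X])
    "concave f(t) = sqrt(t)": np.sqrt,           # inequality reverses
}
for name, f in payoffs.items():
    print(f"{name}  E[f(X)] ≈ {f(x).mean():.3f}   f(E[X]) ≈ {f(x.mean()):.3f}")
# convex:  E[X^2]     ≈ 2.000 > 1.0 = (E[X])^2
# concave: E[sqrt(X)] ≈ 0.886 < 1.0 = sqrt(E[X])
```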
Optionality: Expected Best of n Experiments
Statement
Let $f$ be convex and non-decreasing, and let $X_1, \dots, X_n$ be i.i.d. with $\mathbb{E}[X_i] = \mu$. Let $X_{(n)} = \max_i X_i$ denote the sample maximum. Then:
$$\mathbb{E}[f(X_{(n)})] \;\ge\; \mathbb{E}[f(X_1)] \;\ge\; f(\mu).$$
The first inequality holds because $X_{(n)} \ge X_1$ pointwise and $f$ is non-decreasing. The second is Jensen's inequality. Provided $f$ is strictly convex (or strictly increasing on the support of $X_1$) and $X_1$ has nonzero variance, when the researcher keeps only the best of $n$ trials the expected payoff $\mathbb{E}[f(X_{(n)})]$ strictly exceeds both the average-payoff bound $\mathbb{E}[f(X_1)]$ and the single-bet payoff $f(\mu)$. Without those strictness conditions the inequalities can be tight.
Intuition
Optionality is distinct from averaging. Averaging $n$ draws of a convex payoff beats $f(\mu)$ by Jensen. Taking the MAX of $n$ draws adds a second layer of gain driven by order statistics: the right tail of $n$ i.i.d. draws grows with $n$. The rate of growth of $\mathbb{E}[X_{(n)}]$ depends on the tail of $X$. For light tails (sub-Gaussian) it grows like $\sqrt{\log n}$. For heavy (regularly varying) tails it grows polynomially in $n$ (see de Haan and Ferreira, 2006, for formal rates). Convex tinkering combines both: averaging to guarantee the Jensen gain, and optionality to capture the max.
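A small simulation of the tail-dependent growth rates (the Gaussian and Pareto choices are illustrative; the asymptotic rates quoted in the comments are the standard extreme-value results, see de Haan and Ferreira, 2006):

```python
import numpy as np

rng = np.random.default_rng(2)

def expected_max(sampler, n, reps=5_000):
    """Monte Carlo estimate of E[max of n i.i.d. draws]."""
    return sampler(size=(reps, n)).max(axis=1).mean()

gaussian = lambda size: rng.normal(size=size)                # light (sub-Gaussian) tail
pareto   = lambda size: 1.0 + rng.pareto(a=2.0, size=size)   # heavy tail, alpha = 2

for n in [1, 10, 100, 1000]:
    print(f"n={n:5d}   Gaussian E[max] ≈ {expected_max(gaussian, n):5.2f}   "
          f"Pareto(alpha=2) E[max] ≈ {expected_max(pareto, n):7.2f}")
# The Gaussian maximum grows roughly like sqrt(2 log n); the Pareto(alpha=2)
# maximum grows roughly like sqrt(n). Heavy tails make optionality far more valuable.
```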
Proof Sketch
Order statistics: $X_{(n)} \ge X_i$ for every $i$, so $f(X_{(n)}) \ge f(X_i)$ when $f$ is non-decreasing. Taking expectations gives the first inequality. Jensen's inequality applied to $\mathbb{E}[f(X_1)] \ge f(\mu)$ gives the second.
Why It Matters
This is the mathematical reason why researchers who can discard failed trials and scale only the winners dominate researchers who must commit to one bet up front. The gap between $\mathbb{E}[f(X_{(n)})]$ and $\mathbb{E}[f(X_1)]$ is the quantitative value of optionality. See multi-armed bandits for the adaptive-selection version of this idea.
Failure Mode
The optionality gain requires that you can actually identify the winner ex post. If experiments return noisy signals and you cannot reliably rank the true $X_{(n)}$ above the other draws, the realized gain shrinks. For heavy-tailed $X$ without finite mean, even the formal expectation $\mathbb{E}[f(X_{(n)})]$ may not exist, though median-based versions still apply.
Proof Ideas and Templates Used
The proof uses Jensen's inequality, which is the foundational result connecting convexity and expectation. The same inequality appears in information theory (entropy is concave), optimization (convex relaxation gives lower bounds), and finance (option pricing).
Practical Applications
NanoGPT-Style Speedruns
Training a 124M parameter GPT-2 reproduction in minutes rather than days is a convex tinker. The cost is small (one GPU for a few hours). Each experiment tests a specific hypothesis: does this learning rate schedule help? Does this architecture change matter? The community has learned more per compute-dollar from nanoGPT speedruns than from any single large training run.
Ablation Studies as Optionality
Every ablation study is a convex bet. Removing a component costs one training run. If the component turns out to be unnecessary, the savings on all future runs are enormous. If it turns out to be necessary, you learn that cheaply.
Hyperparameter Sweeps on Small Models
Tuning hyperparameters on a 10M parameter model and transferring to a 1B parameter model is a convex strategy. The cost of the sweep is small. If the hyperparameters transfer (as muP predicts), the payoff is large. If they do not transfer, you lose only the sweep cost.
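A minimal sketch of this workflow. The `train_and_eval` function is a hypothetical placeholder for your own training loop, and the transfer assumption (optimal learning rate roughly invariant across width under muP) is the claim from Yang et al. (2022), which should be verified in your own setup:

```python
import numpy as np

def train_and_eval(width: int, lr: float) -> float:
    """Hypothetical placeholder: train a model of this width at this LR, return val loss."""
    raise NotImplementedError("plug in a small-model training run here")

def cheap_lr_sweep(small_width=256, lrs=np.logspace(-4, -1, 13)):
    # Every trial is bounded: one small-model run, killed at its budget.
    results = {float(lr): train_and_eval(small_width, lr) for lr in lrs}
    return min(results, key=results.get)   # keep only the winner (the option)

# best_lr = cheap_lr_sweep()
# final = train_and_eval(width=4096, lr=best_lr)   # scale only the validated option
```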
Canonical Examples
Why scale-first is fragile
Consider a lab that spends $10M training a 100B parameter model on a specific dataset mix. If the dataset mix is suboptimal (which they cannot know in advance), the entire run is wasted. A convex alternative: spend $100K on 100 different dataset mixes at 1B scale. The best mix is likely near-optimal for larger scale. Total cost is $10M + $100K (the large run plus the tinkering), but the probability of the large run succeeding is much higher.
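Illustrative arithmetic for this example (the 10% per-mix success probability is an assumption chosen for the sketch, not a measured quantity):

```python
# If any single dataset mix has probability p of being near-optimal, and mixes are
# tried independently at small scale, the chance of finding at least one good mix
# grows geometrically with the number of mixes tried.
p_single = 0.10
n_mixes = 100
p_at_least_one = 1 - (1 - p_single) ** n_mixes
print(f"P(at least one near-optimal mix in {n_mixes} trials) = {p_at_least_one:.5f}")
# ≈ 0.99997, versus 0.10 for committing blind to a single mix.
```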
Common Confusions
Convex tinkering is not the same as random search
Random search explores uniformly over a space. Convex tinkering is adaptive: each experiment is designed based on what you learned from previous ones. The key property is bounded downside, not randomness. A carefully designed small experiment is a convex tinker. A random large experiment is not.
Bounded downside does not mean low risk
A portfolio of convex tinkers has bounded downside per experiment but can still fail to produce any useful result. The claim is not that tinkering eliminates risk. The claim is that tinkering converts downside risk into upside variance, which is favorable under convex payoffs.
Some research genuinely requires scale
Certain phenomena only emerge at scale (in-context learning, chain-of-thought reasoning). For studying these phenomena, you need large models. The convex tinkering principle does not say "never train large models." It says "do the cheap experiments first to maximize the probability that the expensive experiment succeeds."
Misreadings of Convex Tinkering
It means chaos and randomness are good
Wrong. Taleb's point is not that disorder is inherently valuable. The point is that under a convex exposure, volatility can help rather than hurt. Without the convexity (bounded downside, open-ended upside), randomness is just noise. Convex tinkering requires structure: cap the cost per trial, increase the number of trials, and retain the option to scale winners. That is disciplined experimentation, not celebration of chaos.
It means do lots of random things with no plan
Wrong. Taleb explicitly contrasts convex trial-and-error with undirected flailing. The 1/N dispersion logic is: lower the cost per trial, increase the number of trials, and minimize the chance of missing upside. Each trial should test something specific. The absence of a grand forecast is not the absence of local hypotheses.
It means passive waiting for luck
Wrong. Convex tinkering is active. You run experiments, observe results, adapt, and scale what works. The passivity critique applies to lottery-ticket thinking ("I will try one thing and hope it works"). Convex tinkering is the opposite: many cheap active experiments designed so that any single success can be amplified.
It is a synonym for entrepreneurship or content creation
Not necessarily. Convex tinkering applies only when the payoff distribution is meaningfully asymmetric and downside is controlled. Many entrepreneurial activities have concave payoffs (large fixed costs, small marginal gains) or uncontrolled downside (betting the company on one product). The convexity condition must be checked, not assumed.
When Convex Tinkering Fails: The Discipline It Requires
Convex tinkering is not leaderboard overfitting. The convexity comes from cheap downside and retained upside. If each trial creates hidden maintenance cost, false confidence from selection bias, reputational risk, or noisy winner-selection that does not survive contact with fresh data, the payoff is no longer convex — it just looks convex until the bill arrives.
Three statistical failure modes a careful operator must guard against:
- Jensen convexity holds: $\frac{1}{n}\sum_{i=1}^{n} \mathbb{E}[f(X_i)] \ge f(\mu)$. A cheap, real gain when the payoff is convex.
- Optionality holds: $\mathbb{E}[f(X_{(n)})] \ge \mathbb{E}[f(X_1)]$. The right tail of $n$ trials grows with $n$.
- Winner's curse / selection bias also holds: if each trial reports a noisy score $\hat{Y}_i = f(X_i) + \varepsilon_i$, the selected winner's observed score $\max_i \hat{Y}_i$ is biased upward relative to that configuration's true payoff, because the noise that pushed the apparent winner up is rarely re-paid on fresh data.
The first two are the gain. The third is the trap. They all live in the same expectation, which is why "many trials, keep the best" is worth doing only when the validated payoff (held-out, replicated, deployed) survives. Without an out-of-sample check, "convex tinkering" collapses into noise mining.
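A small simulation of the selection-bias trap (the signal and noise scales are arbitrary illustrative choices): the trial that looks best ex post looks better than it truly is, which is exactly what an independent validation run is meant to catch.

```python
import numpy as np

rng = np.random.default_rng(3)
n_trials, noise_sd = 100, 0.5

true_payoff = rng.normal(0.0, 0.2, size=n_trials)                     # what each trial is really worth
observed    = true_payoff + rng.normal(0.0, noise_sd, size=n_trials)  # what the run reports

winner = np.argmax(observed)                  # selected ex post on the observed score
print(f"apparent winner's observed score : {observed[winner]:+.3f}")
print(f"apparent winner's true payoff    : {true_payoff[winner]:+.3f}")
print(f"best true payoff in the pool     : {true_payoff.max():+.3f}")
# The observed score of the selected winner overstates its true payoff: part of what
# made it "win" was noise, and that noise is not re-paid on fresh data.
```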
The operational discipline:
- Cheap downside cap: write the dollar / GPU-hour / human-time budget per trial before running it, and kill at the cap.
- Pre-registered kill criteria: define what would make a trial a failure before the data come in. "The result was interesting" is not a kill criterion; it is the absence of one.
- Independent validation of winners: the apparent winner from trials is biased upward; re-run on a held-out task / dataset / seed before scaling.
- Scale only the validated option: scaling an unvalidated winner is convex tinkering's single most common failure mode.
The Bayesian-optimization analogue formalizes the same idea: the acquisition function caps each evaluation cost, the surrogate model reasons about posterior uncertainty before committing, and the search terminates when the regret bound says further trials are unlikely to pay off. Convex tinkering without those guardrails is just expensive hope.
Connection to Bayesian Optimization
Bayesian optimization formalizes one version of convex tinkering. The acquisition function (e.g., expected improvement, upper confidence bound) balances exploration (trying uncertain configurations) and exploitation (refining known-good configurations). This is precisely the strategy of exploring where uncertainty is high, which is the core of convex tinkering. The key difference: Bayesian optimization assumes a smooth objective with a known kernel. Convex tinkering as a research philosophy does not assume you can model the objective function.
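A minimal sketch of that analogy using scikit-learn's GP regressor and an expected-improvement acquisition. The one-dimensional toy objective stands in for a bounded-cost experiment; nothing here is specific to any particular ML workload:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(4)
objective = lambda x: np.sin(3 * x) + 0.3 * x**2        # toy objective to minimize
candidates = np.linspace(-3, 3, 400).reshape(-1, 1)

X = rng.uniform(-3, 3, size=(3, 1))                     # a few cheap initial trials
y = objective(X).ravel()

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
for _ in range(15):                                     # each iteration = one capped-cost trial
    gp.fit(X, y)
    mu, std = gp.predict(candidates, return_std=True)
    imp = y.min() - mu                                  # expected improvement (minimization)
    z = np.divide(imp, std, out=np.zeros_like(imp), where=std > 0)
    ei = imp * norm.cdf(z) + std * norm.pdf(z)
    ei[std <= 1e-12] = 0.0                              # no value in re-querying a known point
    x_next = candidates[np.argmax(ei)]                  # spend the next trial where upside is largest
    X = np.vstack([X, x_next])
    y = np.append(y, objective(x_next).item())

print(f"best value found: {y.min():.3f} at x = {X[np.argmin(y)].item():.3f}")
```

The acquisition step is the formal version of convex tinkering's rule of thumb: each evaluation has a capped cost, and the next trial goes where the modeled upside, not the modeled mean, is largest.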
Exercises
Problem
You have a budget of 100 GPU-hours. You can either (a) run one experiment for 100 GPU-hours or (b) run 10 experiments for 10 GPU-hours each. Under what conditions on the payoff function does strategy (b) dominate strategy (a) in expectation?
Problem
A lab is deciding between training one 70B model or seven 10B models with different architectures. Assume training cost scales linearly with parameters, that the 70B model achieves state-of-the-art on the target benchmark with probability $p$, and that each 10B model achieves it independently with probability $q$. For which values of $p$ and $q$ does the seven-model strategy have a higher probability of at least one state-of-the-art result?
Further directions
- Thompson sampling as Bayesian convex tinkering
- Bayesian optimization developed more fully (GP-UCB, EI, acquisition functions)
- Lottery ticket hypothesis connection
- "Failure fraction" metric: what fraction of research attempts should fail for optimal learning
- Anti-patterns: when NOT to convex-tinker (fixed-cost settings, high integration costs)
- Scaling-law and extrapolation discussion (Kaplan, Chinchilla): when small-scale results transfer
References
Canonical:
- Dixit and Pindyck, Investment Under Uncertainty (1994), Chapters 2, 5. Real options and the value of waiting under irreversibility.
- Lattimore and Szepesvari, Bandit Algorithms (2020), Chapters 6-8. Adaptive experimentation, regret bounds, and the bandit generalization of convex tinkering. See also multi-armed bandits.
- Box, Hunter, Hunter, Statistics for Experimenters (2005), Chapters 8-13. Sequential and factorial experimental design.
- de Haan and Ferreira, Extreme Value Theory: An Introduction (2006), Chapters 1-2. Formal rates of growth of $\mathbb{E}[X_{(n)}]$ as a function of $n$ via the Fisher-Tippett-Gnedenko theorem. This is the precise extreme-value backing for the optionality proposition.
Editorial framing (scope-limited):
- Taleb, Antifragile: Things That Gain from Disorder (2012), Chapter 12. Informal exposition of convex payoff exposure. Treat as motivating metaphor, not as mathematical foundation.
- Taleb, The Black Swan (2007), Chapter 17. Optionality under fat-tailed uncertainty. Claims are informal.
Current:
- Karpathy, nanoGPT speedrun experiments (2023-2024), GitHub repository
- Yang et al., Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer (muP, 2022)
Last reviewed: April 26, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
- Common Inequalities (layer 0A · tier 1)
- Non-Probability Sampling (layer 2 · tier 1)
Derived topics
- Bounded Rationality (layer 2 · tier 1)
- Fat Tails and Heavy-Tailed Distributions (layer 2 · tier 1)
- Kelly Criterion (layer 2 · tier 2)