
LLM Construction

Neural Architecture Search

Automating network architecture design: search spaces, search strategies (RL, evolutionary, differentiable), performance estimation via weight sharing, and the gap between NAS hype and practical gains.

Advanced · Tier 3 · Current · Supporting · ~50 min

Why This Matters

Human-designed architectures (ResNet, Transformer) work well, but there is no guarantee they are optimal for a given task and compute budget. Neural Architecture Search (NAS) attempts to automate architecture design by searching over a structured space of possible networks.

NAS produced EfficientNet, which achieved state-of-the-art ImageNet accuracy at lower compute than hand-designed alternatives. However, NAS is also one of the most over-hyped areas of ML: the search cost can be enormous, the search spaces are heavily constrained by human priors, and many "NAS-found" architectures differ only marginally from hand-designed ones.

Formal Setup

A NAS problem consists of three components.

Definition

Search Space

The search space $\mathcal{A}$ is the set of all architectures the search can consider. Typically parameterized as a directed acyclic graph where nodes are feature maps and edges are operations (convolution, pooling, skip connection). The space is finite but combinatorially large.

Definition

Search Strategy

The search strategy selects which architectures to evaluate from $\mathcal{A}$. Common strategies: reinforcement learning (a controller generates architectures, reward is validation accuracy), evolutionary algorithms (a population of architectures with mutation and selection), gradient-based optimization (DARTS).

Definition

Performance Estimation Strategy

The performance estimation strategy approximates the true validation performance of a candidate architecture without training it fully from scratch. Methods: training for fewer epochs, weight sharing across architectures (supernets), learning curve extrapolation.

Search Strategies

Reinforcement Learning (Zoph and Le, 2017)

A recurrent neural network (the "controller") generates architecture descriptions token by token. Each generated architecture is trained to convergence, and the validation accuracy serves as the reward signal. The controller is updated with REINFORCE.
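Below is a minimal sketch of the controller loop, assuming a toy per-position softmax policy over operation tokens and a synthetic `train_and_evaluate` reward standing in for actually training each child network; none of the names or numbers come from the original implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_TOKENS, SEQ_LEN, LR = 5, 8, 0.05       # toy operation vocabulary and description length

logits = np.zeros((SEQ_LEN, NUM_TOKENS))   # "controller": an independent softmax per position

def sample_architecture():
    """Sample one architecture description (a token sequence) from the current policy."""
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    tokens = [rng.choice(NUM_TOKENS, p=p) for p in probs]
    return tokens, probs

def train_and_evaluate(tokens):
    """Placeholder reward: stands in for training the child network and measuring val accuracy."""
    return -abs(sum(tokens) - 2 * SEQ_LEN) / 10.0   # synthetic reward, for illustration only

baseline = 0.0
for step in range(200):
    tokens, probs = sample_architecture()
    reward = train_and_evaluate(tokens)
    baseline = 0.9 * baseline + 0.1 * reward        # moving-average baseline to reduce variance
    advantage = reward - baseline
    for i, tok in enumerate(tokens):                # REINFORCE: d/dlogits log pi = one_hot - probs
        grad = -probs[i]
        grad[tok] += 1.0
        logits[i] += LR * advantage * grad
```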

The original NAS paper used 800 GPUs for 28 days. This established NAS as a concept but also demonstrated its impracticality at scale.

Evolutionary Methods

Maintain a population of architectures. At each step: select a parent, mutate it (add/remove a layer, change an operation), train the child, and add it to the population if it improves upon the weakest member. AmoebaNet (Real et al., 2019) showed evolutionary NAS matches RL-based NAS at lower cost.
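A schematic evolutionary loop follows, with a made-up fitness function in place of training the child; selection and removal use a simple tournament rather than AmoebaNet's exact aging scheme.

```python
import random

OPS = ["conv3x3", "conv5x5", "maxpool", "skip"]

def random_architecture(n_layers=6):
    return [random.choice(OPS) for _ in range(n_layers)]

def mutate(arch):
    """Single-edit mutation: change the operation at one random position."""
    child = list(arch)
    child[random.randrange(len(child))] = random.choice(OPS)
    return child

def evaluate(arch):
    """Placeholder for training the child and returning validation accuracy (synthetic here)."""
    return sum(op != "skip" for op in arch) / len(arch) + random.gauss(0, 0.02)

population = [(a, evaluate(a)) for a in (random_architecture() for _ in range(20))]

for step in range(100):
    parent, _ = max(random.sample(population, 5), key=lambda m: m[1])   # tournament selection
    child = mutate(parent)
    population.append((child, evaluate(child)))
    population.remove(min(population, key=lambda m: m[1]))              # drop the weakest member

best_arch, best_fitness = max(population, key=lambda m: m[1])
```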

Differentiable Architecture Search (DARTS)

Proposition

DARTS Continuous Relaxation

Statement

The standard DARTS cell has 4 intermediate nodes. During search, the cell is a dense DAG: each intermediate node $i$ is connected to every preceding node (the two cell inputs plus all earlier intermediate nodes), so the search supergraph has $2 + 3 + 4 + 5 = 14$ candidate edges, and on each candidate edge DARTS maintains a mixture over operations. After search, at discretization time, each intermediate node keeps the two highest-scoring incoming edges (and on each kept edge, the single highest-scoring non-zero operation), producing the standard 8-edge final cell. The statement and the $\alpha$-relaxation below describe the search-time supergraph; the 8-edge count refers to the discrete cell that gets retrained for evaluation.

Let $\mathcal{O} = \{o_1, \ldots, o_K\}$ be the set of candidate operations for each edge. DARTS replaces the discrete choice with a continuous mixture:

$$\bar{o}(x) = \sum_{k=1}^{K} \frac{\exp(\alpha_k)}{\sum_{j=1}^{K} \exp(\alpha_j)} \, o_k(x)$$

where $\alpha_k$ are architecture parameters. The bilevel optimization is:

$$\min_\alpha \mathcal{L}_{\text{val}}(w^*(\alpha), \alpha) \quad \text{s.t.} \quad w^*(\alpha) = \arg\min_w \mathcal{L}_{\text{train}}(w, \alpha)$$

After optimization, the final architecture is obtained by selecting $\arg\max_k \alpha_k$ for each edge. This discretization step introduces a gap between the search-time performance of the mixture $\bar{o}$ and the deployed performance of the discrete architecture. The gap is not a minor detail: it is a core failure mode of the method, and closing it is the subject of follow-up work.
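A minimal PyTorch-style sketch of a single mixed edge is shown below, with a reduced operation set for brevity; it is not the DARTS codebase, but it illustrates the softmax mixture over candidate operations and the argmax discretization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """One DARTS edge: a softmax(alpha)-weighted sum over candidate operations."""
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                          nn.BatchNorm2d(channels)),        # stand-in for a 3x3 conv candidate
            nn.MaxPool2d(3, stride=1, padding=1),            # 3x3 max pool candidate
            nn.Identity(),                                    # skip connection candidate
        ])
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))  # architecture parameters for this edge

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

edge = MixedOp(channels=16)
mixed_out = edge(torch.randn(2, 16, 32, 32))     # search-time output uses the full mixture
chosen_op = edge.alpha.argmax().item()           # discretization: keep only the highest-scoring op
```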

Intuition

Instead of searching over discrete architectures (combinatorial), DARTS relaxes the problem to a continuous optimization over mixing weights. You jointly train the network weights $w$ and the architecture parameters $\alpha$ using gradient descent. On CIFAR-10, this cuts the search from thousands of GPU-days to roughly a single GPU-day.

Proof Sketch

The relaxation is valid because the softmax-weighted sum approaches a hard selection as the $\alpha$ values diverge. In practice, the bilevel optimization is approximated: alternate one step of the $w$ update (on training data) with one step of the $\alpha$ update (on validation data). Liu et al. (2019) showed this approximation works empirically but can be unstable.
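The alternating approximation can be written as a short loop. The following is a self-contained sketch on a toy supernet with synthetic batches; the model, data, and learning rates are placeholders, not the original setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySupernet(nn.Module):
    """Toy supernet: k candidate linear 'ops' mixed by softmax(alpha)."""
    def __init__(self, d=16, k=3, classes=10):
        super().__init__()
        self.ops = nn.ModuleList([nn.Linear(d, classes) for _ in range(k)])
        self.alpha = nn.Parameter(torch.zeros(k))
    def forward(self, x):
        w = F.softmax(self.alpha, dim=0)
        return sum(wi * op(x) for wi, op in zip(w, self.ops))

net = TinySupernet()
w_opt = torch.optim.SGD([p for n, p in net.named_parameters() if n != "alpha"], lr=0.05)
a_opt = torch.optim.Adam([net.alpha], lr=3e-3)

for step in range(100):
    x_tr, y_tr = torch.randn(32, 16), torch.randint(0, 10, (32,))   # synthetic training batch
    x_va, y_va = torch.randn(32, 16), torch.randint(0, 10, (32,))   # synthetic validation batch

    a_opt.zero_grad()                                # one alpha step on validation data
    F.cross_entropy(net(x_va), y_va).backward()
    a_opt.step()

    w_opt.zero_grad()                                # one weight step on training data
    F.cross_entropy(net(x_tr), y_tr).backward()
    w_opt.step()
```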

Why It Matters

DARTS reduced NAS cost on CIFAR-10 from thousands of GPU-days (the Zoph and Le RL controller) to roughly a single GPU-day. This made CIFAR-10 NAS accessible to researchers without large compute budgets and established differentiable NAS as the dominant paradigm. ImageNet-scale search with DARTS-family methods costs substantially more, and most reported ImageNet results transfer a cell found on CIFAR-10 rather than searching directly.

Failure Mode

DARTS suffers from collapse: the search often converges to architectures dominated by skip connections and parameter-free operations because these are easy to optimize. The bilevel approximation (one-step unrolling) introduces bias, and the discretization gap noted in the statement compounds the problem. Several follow-up works (DARTS+, FairDARTS, SDARTS) address collapse by regularizing the architecture parameters or sharpening the softmax during search.

Weight Sharing and Supernets

Training every candidate architecture from scratch is prohibitively expensive. Weight sharing trains a single large network (the supernet or one-shot model) that contains all candidate architectures as subgraphs. To evaluate a candidate, extract its subgraph and use the shared weights.

The assumption: a subnetwork's performance with shared weights correlates with its performance when trained independently. This assumption often fails. The ranking of architectures under shared weights can differ substantially from their ranking after independent training. This is the main weakness of one-shot NAS.
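The following toy example illustrates the mechanism: one supernet is trained with randomly sampled paths, and candidates are then scored by activating their path with the shared weights, with no per-candidate training. All modules and data are stand-ins.

```python
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

class OneShotLayer(nn.Module):
    """One supernet layer holding several candidate ops with shared weights."""
    def __init__(self, d, k=3):
        super().__init__()
        self.candidates = nn.ModuleList([nn.Linear(d, d) for _ in range(k)])
    def forward(self, x, choice):
        return torch.relu(self.candidates[choice](x))

class Supernet(nn.Module):
    def __init__(self, d=16, layers=4, k=3, classes=10):
        super().__init__()
        self.layers = nn.ModuleList([OneShotLayer(d, k) for _ in range(layers)])
        self.head = nn.Linear(d, classes)
    def forward(self, x, arch):                      # arch = one op index per layer
        for layer, choice in zip(self.layers, arch):
            x = layer(x, choice)
        return self.head(x)

net = Supernet()
x, y = torch.randn(64, 16), torch.randint(0, 10, (64,))

opt = torch.optim.SGD(net.parameters(), lr=0.05)
for step in range(200):                              # single-path supernet training
    arch = [random.randrange(3) for _ in range(4)]
    opt.zero_grad()
    F.cross_entropy(net(x, arch), y).backward()
    opt.step()

with torch.no_grad():                                # candidate "evaluation" is just a forward pass
    candidates = [[random.randrange(3) for _ in range(4)] for _ in range(20)]
    shared_weight_scores = {tuple(a): (net(x, a).argmax(1) == y).float().mean().item()
                            for a in candidates}
```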

EfficientNet: A NAS Success Story

Tan and Le (2019) used NAS to search over a mobile-sized architecture space, finding EfficientNet-B0. They then applied a compound scaling rule (scale depth, width, and resolution together with fixed ratios) to produce EfficientNet-B1 through B7. EfficientNet-B7 matched the best ImageNet accuracy at the time with 8.4x fewer parameters than the previous state of the art.
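The compound scaling rule itself is simple to state in code. In the sketch below, the $\alpha$, $\beta$, $\gamma$ coefficients are the values reported by Tan and Le, while the baseline depth, width, and resolution numbers are illustrative placeholders rather than EfficientNet-B0's exact configuration.

```python
# Compound scaling: grow depth, width, and resolution together with one coefficient phi,
# under the constraint alpha * beta**2 * gamma**2 ~= 2 so that FLOPs roughly double per step.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15   # coefficients reported in the EfficientNet paper

def compound_scale(base_depth, base_width, base_resolution, phi):
    depth = round(base_depth * ALPHA ** phi)            # more layers
    width = round(base_width * BETA ** phi)             # more channels per layer
    resolution = round(base_resolution * GAMMA ** phi)  # larger input images
    return depth, width, resolution

for phi in range(4):                                     # phi = 0 is the base model
    print(phi, compound_scale(base_depth=18, base_width=32, base_resolution=224, phi=phi))
```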

The success was partly NAS and partly the scaling rule. Disentangling the contribution of the search from the contribution of the scaling methodology is difficult.

Honest Assessment of NAS

What NAS does well:

  • Finds good architectures within a constrained search space
  • Removes some human bias in architecture design
  • Compound scaling (from EfficientNet) is a genuine contribution

What NAS does poorly:

  • The search space itself is designed by humans, baking in strong priors
  • Search cost can exceed the cost of training the final model many times over
  • Weight sharing introduces ranking errors
  • Many NAS papers compare against weak baselines or use different training recipes
  • For LLMs, the Transformer architecture has held up across scales; NAS has not produced a replacement that wins on matched compute

Reproducibility and the Search-Phase Critique

Yu, Sciuto, Jaggi, Musat, and Salzmann ("Evaluating the Search Phase of Neural Architecture Search", ICLR 2020) show that for several prominent NAS methods, random search with early stopping is competitive with the reported search strategies on the same search spaces. They also document reproducibility issues: results vary substantially across seeds, and many papers report a single run. Li and Talwalkar (UAI 2020) make a similar point. The practical takeaway is that claimed gains should be benchmarked against random search on the same search space with the same training recipe.
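A random-search baseline over the same space is easy to implement, which is part of the critique's force. The sketch below is schematic: `cheap_estimate` is a placeholder for a truncated-training evaluation, not a real scoring function.

```python
import random

OPS = ["sep_conv_3x3", "sep_conv_5x5", "dil_conv_3x3", "max_pool_3x3", "skip_connect"]

def sample_architecture(num_edges=8):
    """Draw a uniformly random architecture from the same discrete space the NAS method searches."""
    return tuple(random.choice(OPS) for _ in range(num_edges))

def cheap_estimate(arch):
    """Placeholder for an early-stopped evaluation (e.g. a few epochs, then read val accuracy)."""
    return random.random()

# Random search with early stopping: score many samples cheaply, fully train only the best few.
candidates = sorted((sample_architecture() for _ in range(100)), key=cheap_estimate, reverse=True)
finalists = candidates[:5]   # retrain these with the full recipe and compare against the NAS result
```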

NAS Benchmarks

Tabular benchmarks isolate the search strategy from the training pipeline by precomputing the validation accuracy of every architecture in a fixed space.

  • NAS-Bench-101 (Ying, Klein, Christiansen, Real, Murphy, Hutter, 2019, ICML): 423k CNN cells on CIFAR-10 with full training curves.
  • NAS-Bench-201 (Dong, Yang, 2020, ICLR): 15,625 architectures on CIFAR-10, CIFAR-100, and ImageNet-16-120, useful for cross-dataset generalization of search algorithms.
  • NAS-Bench-301 (Siems, Zimmer, Zela, Lukasik, Keuper, Hutter, 2020): surrogate benchmark for the DARTS search space, which is too large to enumerate exhaustively.

These benchmarks turn a NAS experiment from a multi-GPU-week run into a lookup, which is what made the Yu et al. reproducibility audit possible.
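Schematically, a tabular benchmark turns evaluation into a dictionary lookup; the table and encoding below are toy stand-ins, not the actual NAS-Bench-101/201 APIs.

```python
# Toy tabular benchmark: every architecture's validation accuracy is precomputed,
# so any search strategy can be audited without training anything.
benchmark_table = {
    ("conv3x3", "skip", "conv1x1"): 0.912,       # arch encoding -> precomputed val accuracy
    ("conv3x3", "conv3x3", "maxpool"): 0.905,
    ("skip", "skip", "conv1x1"): 0.861,
}

def evaluate(arch):
    return benchmark_table[arch]                  # a lookup replaces a multi-GPU-day training run

best_arch = max(benchmark_table, key=evaluate)
```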

Zero-Cost Proxies

Zero-cost proxies score a randomly initialized network from a single minibatch, without any training. Mellor, Turner, Storkey, and Crowley ("Neural Architecture Search without Training", ICML 2021) introduce NASWOT, which scores a candidate at initialization by how distinct the binary ReLU activation patterns of the minibatch inputs are. Abdelfattah, Mehrotra, Dudziak, and Lane ("Zero-Cost Proxies for Lightweight NAS", ICLR 2021) evaluate a battery of such proxies (grad-norm, snip, grasp, synflow, fisher, jacob-cov) on NAS-Bench-101 and NAS-Bench-201. The proxies correlate imperfectly with trained accuracy but can prune search spaces by orders of magnitude at negligible cost. TE-NAS (Chen, Gong, Wang, 2021) combines NTK condition number and linear-region count as a training-free ranking signal.
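As a concrete illustration, here is one simple zero-cost proxy (a gradient-norm score at initialization) applied to toy candidate networks; it is a sketch in the spirit of the grad-norm proxy above, not the papers' exact implementations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def grad_norm_proxy(model, x, y):
    """Score an untrained network by the total gradient norm from a single minibatch."""
    model.zero_grad()
    F.cross_entropy(model(x), y).backward()
    return sum(p.grad.norm().item() for p in model.parameters() if p.grad is not None)

x, y = torch.randn(32, 3072), torch.randint(0, 10, (32,))     # one minibatch of random "images"
candidates = [nn.Sequential(nn.Linear(3072, w), nn.ReLU(), nn.Linear(w, 10))
              for w in (64, 256, 1024)]                        # toy candidates differing in width
scores = [grad_norm_proxy(m, x, y) for m in candidates]        # rank candidates with zero training
```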

Hardware-Aware NAS

Search objectives that only optimize accuracy ignore deployment constraints. Hardware-aware NAS adds latency or energy terms.

  • MobileNet-V3 (Howard, Sandler, Chu, Chen, Chen, Tan, Wang, Zhu, Pang, Vasudevan, Le, Adam, 2019): uses platform-aware NAS plus NetAdapt-style fine-tuning, targeting on-device inference latency rather than FLOPs.
  • FBNet (Wu, Dai, Zhang, Wang, Sun, Wu, Tian, Vajda, Jia, Keutzer, 2019): differentiable hardware-aware NAS with a latency lookup table used as a regularizer on α\alpha.
  • ProxylessNAS (Cai, Zhu, Han, 2019, ICLR): removes the proxy task (small dataset, shallow cells) by searching directly on the target task and hardware using path-level binarization to fit one supernet in memory.
  • Once-for-All (Cai, Gan, Wang, Zhang, Han, 2020, ICLR): trains a single supernet once, then extracts specialized subnetworks for different latency budgets without retraining, amortizing search cost across many deployment targets.
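An FBNet-style latency penalty can be sketched as follows: the expected latency of a mixed edge is the softmax-weighted sum of per-operation latencies from a device lookup table, added to the task loss as a differentiable regularizer. The latency values and the weighting coefficient below are made-up placeholders.

```python
import torch
import torch.nn.functional as F

latency_table = torch.tensor([1.8, 3.1, 0.4, 0.1])   # ms per op (measured on the target device)

alpha = torch.zeros(4, requires_grad=True)            # architecture parameters for one edge

def expected_latency(alpha):
    """Differentiable latency estimate: softmax(alpha)-weighted sum of table entries."""
    return (F.softmax(alpha, dim=0) * latency_table).sum()

def search_objective(task_loss, alpha, lam=0.1):
    """Accuracy term plus a differentiable latency penalty."""
    return task_loss + lam * expected_latency(alpha)

loss = search_objective(task_loss=torch.tensor(2.3), alpha=alpha)   # constant task loss as placeholder
loss.backward()                                        # alpha.grad here reflects only the latency term
```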

Common Confusions

Watch Out

NAS searches architectures, not hyperparameters

NAS operates over the structure of the network (number of layers, operation types, connectivity). Hyperparameter optimization (learning rate, batch size, weight decay) is a separate problem. Some frameworks combine both, but the distinction matters for understanding what NAS actually automates.

Watch Out

DARTS is not truly differentiable over architectures

DARTS makes the relaxed problem differentiable, but the final architecture is obtained by discretizing (argmax). The discretization gap means the relaxed optimum may not correspond to a good discrete architecture. This is the source of the skip-connection collapse problem.

Canonical Examples

Example

DARTS search space

Consider a standard DARTS cell with 4 intermediate nodes. Each intermediate node receives input from exactly 2 predecessors (chosen from prior intermediate nodes and the two cell inputs), giving 8 edges per cell after discretization. For each edge, there are 7 candidate operations: $3 \times 3$ separable conv, $5 \times 5$ separable conv, $3 \times 3$ dilated conv, $5 \times 5$ dilated conv, $3 \times 3$ max pool, $3 \times 3$ average pool, and skip connection (plus a zero op that is discarded at selection). With 8 retained edges each carrying 7 choices, the discrete search space is on the order of $7^{8} \approx 5.8 \times 10^{6}$ per-cell configurations, before accounting for the predecessor selection step. DARTS explores this space with a continuous $\alpha$ parameter per candidate operation per edge.
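The count in this example can be checked directly; the snippet below just spells out the arithmetic with the operation list named above.

```python
ops = ["sep_conv_3x3", "sep_conv_5x5", "dil_conv_3x3", "dil_conv_5x5",
       "max_pool_3x3", "avg_pool_3x3", "skip_connect"]   # the zero op is discarded at selection
edges_after_discretization = 8
print(len(ops) ** edges_after_discretization)            # 5764801, i.e. about 5.8e6 per-cell assignments
```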

Exercises

Exercise · Core

Problem

In DARTS, why is the architecture optimized on validation data while network weights are optimized on training data? What would go wrong if both used training data?

Exercise · Advanced

Problem

A one-shot NAS evaluates 1000 candidate architectures using shared weights from a supernet. The Kendall rank correlation between shared-weight accuracy and independently-trained accuracy is $\tau = 0.3$. Is this sufficient for NAS to find a good architecture? Justify quantitatively.

References

Canonical:

  • Zoph & Le, "Neural Architecture Search with Reinforcement Learning" (ICLR 2017)
  • Liu, Simonyan, Yang, "DARTS: Differentiable Architecture Search" (ICLR 2019), Sections 2-4
  • Tan & Le, "EfficientNet: Rethinking Model Scaling for CNNs" (ICML 2019)
  • Elsken, Metzen, Hutter, "Neural Architecture Search: A Survey" (JMLR 2019), Sections 2-5

Benchmarks and reproducibility:

  • Ying, Klein, Christiansen, Real, Murphy, Hutter, "NAS-Bench-101: Towards Reproducible Neural Architecture Search" (ICML 2019)
  • Dong, Yang, "NAS-Bench-201: Extending the Scope of Reproducible Neural Architecture Search" (ICLR 2020)
  • Siems, Zimmer, Zela, Lukasik, Keuper, Hutter, "NAS-Bench-301 and the Case for Surrogate Benchmarks for Neural Architecture Search" (2020)
  • Yu, Sciuto, Jaggi, Musat, Salzmann, "Evaluating the Search Phase of Neural Architecture Search" (ICLR 2020)
  • Li & Talwalkar, "Random Search and Reproducibility for Neural Architecture Search" (UAI 2020)

Zero-cost proxies:

  • Mellor, Turner, Storkey, Crowley, "Neural Architecture Search without Training" (ICML 2021)
  • Abdelfattah, Mehrotra, Dudziak, Lane, "Zero-Cost Proxies for Lightweight NAS" (ICLR 2021)
  • Chen, Gong, Wang, "Neural Architecture Search on ImageNet in Four GPU Hours: A Theoretically Inspired Perspective" (TE-NAS, ICLR 2021)

Hardware-aware NAS and deployment:

  • Cai, Zhu, Han, "ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware" (ICLR 2019)
  • Wu, Dai, Zhang, Wang, Sun, Wu, Tian, Vajda, Jia, Keutzer, "FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable Neural Architecture Search" (CVPR 2019)
  • Howard et al., "Searching for MobileNetV3" (ICCV 2019)
  • Cai, Gan, Wang, Zhang, Han, "Once for All: Train One Network and Specialize it for Efficient Deployment" (ICLR 2020)

Next Topics

The ideas from NAS connect to broader AutoML and efficient model design.

Last reviewed: April 26, 2026
