

Adversarial Machine Learning

Attacks across the ML lifecycle: evasion (adversarial examples), poisoning (corrupt training data), model extraction (steal via queries), privacy leakage (membership inference), and LLM jailbreaks. Why defenses are hard.

Advanced · Tier 2 · Current · Supporting · ~55 min

Why This Matters

ML models are deployed in safety-critical systems: autonomous vehicles, medical diagnosis, content moderation, fraud detection. Adversarial machine learning studies how attackers can manipulate these systems and what defenders can do about it.

The field began with a startling discovery: adding tiny, imperceptible perturbations to images can cause state-of-the-art classifiers to misclassify with high confidence. Since then, the threat model has expanded far beyond image perturbations to encompass the entire ML lifecycle.

NIST published the Adversarial Machine Learning taxonomy (NIST AI 100-2) to standardize how we think about these threats. Understanding this taxonomy is essential for anyone deploying ML in production.

Mental Model

Think of adversarial ML as a game between an attacker and a defender. The attacker has some knowledge of the model (white-box, black-box, or gray-box) and some capability (perturb inputs, corrupt training data, query the API). The defender wants the model to behave correctly despite the attacker's efforts.

The key insight: standard ML optimizes for average-case performance. Security requires worst-case guarantees. This gap is why adversarial ML is hard.

Formal Setup and Notation

Let $f_\theta: \mathcal{X} \to \mathcal{Y}$ be a classifier with parameters $\theta$. Let $\ell$ be the loss function. Let $(x, y)$ be a correctly classified input-label pair.

Definition

Adversarial Example

An adversarial example is a perturbed input $x' = x + \delta$ such that:

$$f_\theta(x') \neq y \quad \text{and} \quad \|\delta\|_p \leq \epsilon$$

The perturbation $\delta$ is bounded in some $\ell_p$ norm (typically $\ell_\infty$ or $\ell_2$) so that $x'$ is perceptually similar to $x$.

Definition

Adversarial Risk

The adversarial risk (or robust risk) of a classifier $f_\theta$ is:

$$R_{\text{adv}}(\theta) = \mathbb{E}_{(x,y) \sim \mathcal{D}} \left[ \max_{\|\delta\|_p \leq \epsilon} \ell(f_\theta(x + \delta), y) \right]$$

This replaces the standard expectation with a worst-case inner maximization over the perturbation set.

Definition

Threat Model

A threat model specifies: (1) the attacker's goal (misclassification, targeted misclassification, denial of service), (2) the attacker's knowledge (white-box, black-box, gray-box), (3) the attacker's capability (what they can modify and by how much).

NIST AML Taxonomy

The NIST taxonomy organizes attacks by the ML lifecycle stage they target:

Evasion attacks (inference time): modify inputs to fool a deployed model. This includes adversarial examples for classifiers and jailbreaks for LLMs.

Poisoning attacks (training time): inject malicious data into the training set to degrade model performance or plant backdoors.

Privacy attacks (post-deployment): extract sensitive information about the training data or the model itself.

Abuse attacks (misuse): use the model for purposes it was not designed for, such as generating harmful content.

Evasion Attacks in Detail

Definition

Fast Gradient Sign Method (FGSM)

The simplest evasion attack. Compute the gradient of the loss with respect to the input and step in the sign direction:

$$x' = x + \epsilon \cdot \text{sign}(\nabla_x \ell(f_\theta(x), y))$$

This is a single-step, $\ell_\infty$-bounded attack. It is fast but weak compared to iterative methods.
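A minimal PyTorch sketch of this update, assuming a differentiable classifier `model`, cross-entropy loss, and inputs scaled to $[0,1]$ (the function name and interface are illustrative):

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon):
    """Single-step FGSM: x' = x + eps * sign(grad_x loss), clipped to [0, 1]."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    # Step in the sign of the input gradient, then clip back to the valid pixel range.
    x_adv = x_adv + epsilon * x_adv.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()
```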

Definition

Projected Gradient Descent (PGD) Attack

An iterative refinement of FGSM (Madry et al. 2018). Starting from $x_0 = x$ (often with a random initialization inside the $\epsilon$-ball), repeat:

$$x_{t+1} = \Pi_{\mathcal{B}_\epsilon(x) \cap \mathcal{X}} \left( x_t + \alpha \cdot \text{sign}(\nabla_{x_t} \ell(f_\theta(x_t), y)) \right)$$

where the projection $\Pi$ is onto the intersection of the $\epsilon$-ball around $x$ and the valid input domain $\mathcal{X}$ (e.g., $[0,1]^d$ for pixel values). In practice this is implemented as two clips: clip to $[x - \epsilon, x + \epsilon]$ for $\ell_\infty$, then clip to $[0,1]^d$. Skipping the domain clip produces out-of-distribution inputs that do not correspond to valid images. PGD with random restarts is considered a strong first-order attack.
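A sketch of the PGD loop with the two clips described above; the step-size heuristic, number of steps, and random start are common illustrative choices rather than anything prescribed here:

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, epsilon, alpha=None, steps=40):
    """L-infinity PGD with random start: iterate signed-gradient steps,
    projecting onto the eps-ball around x intersected with [0, 1]."""
    alpha = alpha or 2.5 * epsilon / steps                        # common heuristic step size
    x_adv = x + torch.empty_like(x).uniform_(-epsilon, epsilon)   # random start in the eps-ball
    x_adv = x_adv.clamp(0.0, 1.0)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv + alpha * grad.sign()
        # Projection = two clips: the eps-ball around x, then the input domain.
        x_adv = torch.min(torch.max(x_adv, x - epsilon), x + epsilon)
        x_adv = x_adv.clamp(0.0, 1.0)
    return x_adv.detach()
```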

The Carlini-Wagner (C&W) attack reformulates adversarial example generation as an optimization problem with a different objective. It often finds smaller perturbations than PGD but is more computationally expensive.

Transferability: adversarial examples generated for one model often fool other models trained on similar data. This enables black-box attacks: generate adversarial examples on a surrogate model and apply them to the target.

Poisoning Attacks

Data poisoning corrupts the training set. The attacker injects or modifies a fraction of training examples to either degrade overall accuracy or cause specific targeted misclassifications.

Backdoor attacks (trojan attacks) plant a trigger pattern. The model behaves normally on clean inputs but misclassifies any input containing the trigger. For example, a small patch in the corner of an image could cause the model to always predict a target class.

The poisoning rate needed can be surprisingly small. In some settings, poisoning less than 1% of the training data suffices to plant a reliable backdoor.
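A toy sketch of trigger-based poisoning, assuming NumPy image arrays shaped (N, C, H, W) with values in $[0,1]$; the 3×3 white corner patch, 1% rate, and target class are illustrative assumptions:

```python
import numpy as np

def poison_with_trigger(images, labels, target_class, rate=0.01, rng=None):
    """Stamp a small white patch in the corner of a fraction of images and
    relabel them as the target class: a minimal backdoor-poisoning sketch."""
    if rng is None:
        rng = np.random.default_rng(0)
    images, labels = images.copy(), labels.copy()
    n_poison = int(rate * len(images))
    idx = rng.choice(len(images), size=n_poison, replace=False)
    images[idx, ..., -3:, -3:] = 1.0   # 3x3 trigger patch in the bottom-right corner
    labels[idx] = target_class         # flip labels to the attacker's target class
    return images, labels
```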

Model Extraction and Privacy

Model extraction (model stealing): the attacker queries a black-box model API and uses the responses to train a functionally equivalent copy. This threatens intellectual property and also enables white-box attacks on the copy.
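A minimal sketch of the extraction loop, assuming a hypothetical `query_api` callable that returns the black-box model's class scores; the surrogate architecture, query set, and optimizer are arbitrary illustrative choices:

```python
import torch
import torch.nn.functional as F

def extract_surrogate(query_api, surrogate, queries, epochs=10, lr=1e-3):
    """Model-extraction sketch: label a query set with the black-box API's
    predictions, then fit a local surrogate to imitate those predictions."""
    with torch.no_grad():
        pseudo_labels = query_api(queries).argmax(dim=1)   # API's predicted classes
    opt = torch.optim.Adam(surrogate.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = F.cross_entropy(surrogate(queries), pseudo_labels)
        loss.backward()
        opt.step()
    return surrogate
```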

Membership inference: given a model and a data point, determine whether that point was in the training set. This is a privacy violation. Overfit models are particularly vulnerable because they memorize training data.
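A sketch of the simplest loss-threshold baseline for membership inference (one common heuristic among many); the model, loss, and score convention are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def membership_score(model, x, y):
    """Loss-threshold membership inference baseline: lower per-example loss
    suggests the point was more likely seen during training."""
    loss = F.cross_entropy(model(x), y, reduction="none")
    return -loss   # higher score = more likely a training-set member

# Usage sketch: pick a threshold on a calibration split, then predict
# member / non-member by comparing membership_score to that threshold.
```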

Training data extraction: for large language models, carefully crafted prompts can cause the model to regurgitate memorized training data, including personally identifiable information.

LLM Attack Surfaces

Four distinct attack surfaces exist for LLMs, often conflated in practice:

Jailbreaks: evasion attacks against safety training. The user crafts prompts that bypass RLHF-installed refusals. Categories: role-playing attacks (DAN-style persona priming), encoding attacks (base64, low-resource languages, cipher tricks), multi-turn attacks (gradually escalating), and optimization-based attacks such as GCG (Zou, Wang, Kolter, Fredrikson 2023), which does gradient-based search over adversarial suffixes in token space and transfers across aligned models.

Prompt injection: untrusted content ingested by the LLM (web pages, emails, tool outputs) contains instructions the model follows. The adversary is not the user but a third party whose text reaches the model's context window. Indirect prompt injection (Greshake et al. 2023) is a system-design vulnerability, not a model-weights vulnerability: the LLM has no way to distinguish trusted user instructions from untrusted tool output.

Training-data poisoning: corruption of pretraining or fine-tuning data. Carlini et al. 2023 show that poisoning web-scale datasets is practical at low cost, because popular pretraining crawls have mutable or buyable provenance (expired domains, stale Wikipedia snapshots).

Training-data extraction: large models memorize and can be prompted to regurgitate training-set content, including PII and copyrighted text.

These surfaces require different defenses. Jailbreak-hardening does nothing against prompt injection; input filters do nothing against poisoning.

Main Theorems

Proposition

Robustness-Accuracy Tradeoff (pedagogical toy)

Statement

This is a pedagogical toy in the spirit of Tsipras, Santurkar, Engstrom, Turner, Madry (2019). Consider a balanced two-class Gaussian mixture on $\mathbb{R}^d$ with class conditionals $\mathcal{N}(+\mu, I)$ and $\mathcal{N}(-\mu, I)$, and an $\ell_2$-bounded adversary with budget $\epsilon < \|\mu\|$. The Bayes-optimal standard classifier is the linear rule $\mathrm{sign}(\mu^\top x)$ with clean accuracy $\Phi(\|\mu\|)$. The Bayes-optimal $\ell_2$-robust classifier within this linear family is the same direction but with a shifted threshold, yielding clean accuracy:

$$\text{acc}_{\text{std}}^{\text{robust}} = \Phi\left(\|\mu\| - \epsilon\right)$$

where $\Phi$ is the standard normal CDF. The robustness-accuracy gap is $\Phi(\|\mu\|) - \Phi(\|\mu\| - \epsilon) > 0$ whenever $\epsilon > 0$.

Tsipras et al. (2019) establish a stronger separation in a different toy distribution (Bernoulli-Gaussian features): there, any classifier with non-trivial $\ell_\infty$-robust accuracy must have standard accuracy bounded away from $1$, and robust learning requires sample complexity $\Omega(\sqrt{d})$ worse than standard learning (Schmidt et al. 2018).

Intuition

Robustness requires the classifier to maintain correct predictions even when inputs are pushed toward the decision boundary. This shrinks the effective margin by $\epsilon$ in the worst-case direction. In this toy, the shift is exactly $\epsilon$ along $\hat{\mu}$, so the clean-accuracy drop is $\Phi(\|\mu\|) - \Phi(\|\mu\| - \epsilon)$.
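A quick numeric check of this drop, using illustrative values $\|\mu\| = 2$ and $\epsilon = 0.5$ (not taken from the text):

```python
from scipy.stats import norm

mu_norm, eps = 2.0, 0.5                      # illustrative toy parameters
acc_std = norm.cdf(mu_norm)                  # clean accuracy of the standard classifier
acc_robust_clean = norm.cdf(mu_norm - eps)   # clean accuracy of the robust classifier
print(acc_std, acc_robust_clean, acc_std - acc_robust_clean)
# roughly 0.977, 0.933, 0.044: the robust rule gives up about 4 points of clean accuracy
```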

Proof Sketch

By symmetry the Bayes-optimal standard classifier is $\mathrm{sign}(\mu^\top x)$ with threshold at zero. An $\ell_2$-robust classifier must classify correctly on the $\epsilon$-expansion of each class region. For linear classifiers $\mathrm{sign}(\mu^\top x - b)$, robustness to $\|\delta\|_2 \leq \epsilon$ means the signed distance to the hyperplane must exceed $\epsilon$ on training points of each class. The Bayes-optimal robust linear classifier shifts the threshold by $\epsilon$ away from the class being defended; averaging the two shifted error probabilities gives $\Phi(\|\mu\| - \epsilon)$.

Why It Matters

This toy illustrates why proposed defenses against adversarial examples typically pay a clean-accuracy cost. The tradeoff is not an artifact of bad defense design; in this distributional family it is baked into the geometry. Practitioners must decide how much clean accuracy they are willing to sacrifice for robustness.

Failure Mode

The statement is for a specific Gaussian mixture with identity covariance and linear decision rules. Real data distributions can have different tradeoff curves, and non-linear classifiers on structured manifolds can in principle be both robust and accurate. Randomized smoothing (Cohen, Rosenfeld, Kolter 2019) gives provably certified $\ell_2$-robust classifiers without assuming this toy setup. Whether there exist models that are simultaneously robust and accurate remains an open question for real data.

Defenses

Adversarial training augments the training set with adversarial examples. The objective becomes:

$$\min_\theta \mathbb{E}_{(x,y)} \left[ \max_{\|\delta\|_p \leq \epsilon} \ell(f_\theta(x + \delta), y) \right]$$

This is the most effective empirical defense but is computationally expensive (requires running PGD at every training step) and incurs the accuracy tradeoff described above.
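A sketch of a single Madry-style training step that reuses the PGD sketch from above as the inner maximizer; the attack strength, number of inner steps, and optimizer handling are illustrative assumptions:

```python
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, x, y, epsilon):
    """One adversarial-training step: solve the inner max approximately with
    PGD, then take an outer gradient step on the adversarial batch."""
    model.eval()
    x_adv = pgd_attack(model, x, y, epsilon, steps=10)   # inner maximization
    model.train()
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)              # outer minimization
    loss.backward()
    optimizer.step()
    return loss.item()
```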

Certified defenses provide provable guarantees that no perturbation within the threat model can change the prediction. Randomized smoothing (Cohen, Rosenfeld, Kolter 2019) is the most scalable certified defense. Define the smoothed classifier $g(x) = \arg\max_c \mathbb{P}_{\eta \sim \mathcal{N}(0, \sigma^2 I)}[f(x + \eta) = c]$. If the top class $A$ has lower-confidence bound $\underline{p_A}$ and the runner-up $B$ has upper-confidence bound $\overline{p_B}$, then $g$ is certifiably $\ell_2$-robust at $x$ with radius:

$$R = \frac{\sigma}{2}\left( \Phi^{-1}(\underline{p_A}) - \Phi^{-1}(\overline{p_B}) \right)$$

where $\Phi^{-1}$ is the standard-normal quantile function. The certificate is distribution-free in $f$; only the noise level $\sigma$ and the sampled class margins matter.
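A sketch of the radius computation as stated above, taking the confidence bounds $\underline{p_A}$ and $\overline{p_B}$ as already-estimated inputs (how they are estimated from Monte Carlo samples of $f(x + \eta)$ is omitted here):

```python
from scipy.stats import norm

def certified_l2_radius(p_a_lower, p_b_upper, sigma):
    """Certified L2 radius from randomized smoothing:
    R = (sigma / 2) * (Phi^-1(p_A_lower) - Phi^-1(p_B_upper)).
    Only valid when the top-class lower bound exceeds the runner-up upper bound."""
    if p_a_lower <= p_b_upper:
        return 0.0   # abstain: no certificate at this input
    return 0.5 * sigma * (norm.ppf(p_a_lower) - norm.ppf(p_b_upper))

# Example: p_A >= 0.9, p_B <= 0.1, sigma = 0.5 gives R ~ 0.64
print(certified_l2_radius(0.9, 0.1, 0.5))
```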

Input preprocessing (denoising, compression) and detection methods try to filter adversarial inputs before classification. These have repeatedly been broken by adaptive attacks.

Common Confusions

Watch Out

Adversarial robustness is not the same as input validation

Input validation checks that inputs are well-formed (correct format, in expected range). Adversarial robustness handles inputs that look legitimate but are designed to cause misclassification. A valid JPEG image can be an adversarial example.

Watch Out

Defense evaluation requires adaptive attacks

Many published defenses were broken because they were only tested against non-adaptive attacks (attacks not designed for the specific defense). A defense must be evaluated against an attacker who knows the defense mechanism and adapts accordingly. Obfuscated gradients, for example, can make gradient-based attacks fail without providing real robustness.

Summary

  • Empirically, most high-capacity ML models without adversarial training are vulnerable to $\ell_p$ perturbations. Randomized smoothing (Cohen-Rosenfeld-Kolter 2019) provides certified robustness for some models and threat models
  • The NIST taxonomy: evasion, poisoning, privacy, abuse
  • PGD attack: iterative FGSM with projection, the standard strong attack
  • Robustness-accuracy tradeoff: fundamental, not an artifact of bad defenses
  • Adversarial training: effective but expensive and accuracy-reducing
  • Always evaluate defenses against adaptive attackers
  • LLM jailbreaks are the evasion attack analog for language models

Exercises

ExerciseCore

Problem

Compute the FGSM perturbation for a linear classifier $f(x) = w^\top x + b$ with squared loss $\ell(f(x), y) = (f(x) - y)^2$. What direction does the perturbation point in, and why does this make geometric sense?

ExerciseAdvanced

Problem

Explain why black-box adversarial attacks work despite the attacker having no access to the target model's gradients. What property of adversarial examples enables this?

ExerciseResearch

Problem

The robustness-accuracy tradeoff theorem shows a fundamental tension in the Gaussian mixture setting. Could additional unlabeled data help reduce this tradeoff? Discuss the theoretical arguments and empirical evidence.

References

Foundations (evasion and adversarial examples):

  • Szegedy, Zaremba, Sutskever, Bruna, Erhan, Goodfellow, Fergus, "Intriguing Properties of Neural Networks" (ICLR 2014). Original observation that small imperceptible perturbations flip classifier predictions.
  • Goodfellow, Shlens, Szegedy, "Explaining and Harnessing Adversarial Examples" (ICLR 2015). FGSM and the linear explanation.
  • Madry, Makelov, Schmidt, Tsipras, Vladu, "Towards Deep Learning Models Resistant to Adversarial Attacks" (ICLR 2018). PGD attack and min-max adversarial training.
  • Tsipras, Santurkar, Engstrom, Turner, Madry, "Robustness May Be at Odds with Accuracy" (ICLR 2019). Toy-distribution analysis of the robustness-accuracy tradeoff.
  • Schmidt, Santurkar, Tsipras, Talwar, Madry, "Adversarially Robust Generalization Requires More Data" (NeurIPS 2018). Sample-complexity separation.

Evaluation and adaptive attacks:

  • Athalye, Carlini, Wagner, "Obfuscated Gradients Give a False Sense of Security" (ICML 2018). Methodology for breaking defenses that rely on masked gradients.
  • Carlini, Athalye, Papernot, Brendel, Rauber, Tsipras, Goodfellow, Madry, Kurakin, "On Evaluating Adversarial Robustness" (arXiv 2019). Checklist for honest robustness evaluation.
  • Croce, Hein, "Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks" (ICML 2020). AutoAttack.
  • Tramèr, Carlini, Brendel, Madry, "On Adaptive Attacks to Adversarial Example Defenses" (NeurIPS 2020). Thirteen defenses broken by adaptive attacks.
  • Croce et al., "RobustBench: a standardized adversarial robustness benchmark" (NeurIPS Datasets 2021).

Certified robustness:

  • Cohen, Rosenfeld, Kolter, "Certified Adversarial Robustness via Randomized Smoothing" (ICML 2019). The $R = (\sigma/2)(\Phi^{-1}(\underline{p_A}) - \Phi^{-1}(\overline{p_B}))$ $\ell_2$ certificate.

LLM attack surfaces:

  • Zou, Wang, Kolter, Fredrikson, "Universal and Transferable Adversarial Attacks on Aligned Language Models" (arXiv:2307.15043, 2023). GCG: gradient-based adversarial suffixes.
  • Greshake, Abdelnabi, Mishra, Endres, Holz, Fritz, "Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection" (arXiv:2302.12173, 2023).
  • Anil, Durmus, Panickssery, Sharma, et al., "Many-Shot Jailbreaking" (Anthropic, 2024). Extends LLM attack surface to long-context settings.
  • Carlini, Jagielski, Choquette-Choo, Paleka, Pearce, Anderson, Terzis, Thomas, Tramèr, "Poisoning Web-Scale Training Datasets is Practical" (arXiv:2302.10149, 2023).

Standards and taxonomy:

  • NIST AI 100-2 E2023, "Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations" (2024).


Last reviewed: April 18, 2026
