

Support Vector Machines

Maximum-margin classifiers via convex optimization: hard margin, soft margin with slack variables, hinge loss, the dual formulation, and the kernel trick.

Core · Tier 1 · Stable · Core spine · ~70 min

Why This Matters

Maximum-margin hyperplane $w^\top x + b = 0$ separating two classes. Filled markers are support vectors (points on the margin gutters with $\alpha_i > 0$). One soft-margin violator is shown.

[Figure: SVM anatomy. Annotations: margin $= 2/\|w\|$; maximizing the margin is equivalent to minimizing $\|w\|^2$ subject to the label constraints; by KKT complementary slackness, $\alpha_i > 0$ only when $y_i(w \cdot x_i + b) = 1$.]

Support vector machines were the dominant classification method from the mid-1990s through the early 2010s, and for good reason: they have a clean mathematical formulation rooted in convex optimization, come with strong generalization guarantees via margin theory, and extend to nonlinear classification through the kernel trick. Even in the deep learning era, SVMs remain the clearest example of how optimization geometry connects to statistical generalization.

Mental Model

You have two classes of points in $\mathbb{R}^d$. You want to find a separating hyperplane that is as far as possible from both classes. The "margin" is the width of the gap between the classes. Maximizing the margin is equivalent to minimizing $\|w\|$ subject to all points being correctly classified. This is a convex quadratic program, and its dual reveals that the solution depends only on dot products between data points — opening the door to kernels.

Hard Margin SVM

Definition

Separating Hyperplane

A hyperplane in $\mathbb{R}^d$ defined by weight vector $w \in \mathbb{R}^d$ and bias $b \in \mathbb{R}$. A point $x_i$ with label $y_i \in \{-1, +1\}$ is correctly classified if and only if $y_i(w^T x_i + b) > 0$.

Definition

Functional Margin

The functional margin of a point $(x_i, y_i)$ is $y_i(w^T x_i + b)$. Correct classification means positive functional margin. The geometric margin is the functional margin divided by $\|w\|$, giving the actual Euclidean distance from the point to the hyperplane.

Definition

Hard Margin SVM

For linearly separable data $\{(x_i, y_i)\}_{i=1}^n$, the hard margin SVM solves:

$$\min_{w, b} \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(w^T x_i + b) \geq 1 \;\; \forall i$$

The margin (the distance between the two class boundaries) is $\frac{2}{\|w\|}$. Maximizing this margin is equivalent to minimizing $\|w\|^2$.
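
To see the hard-margin program in action, here is a minimal sketch, assuming scikit-learn and NumPy are available. The dataset and the very large $C$ (used to approximate the hard margin numerically) are illustrative choices, not part of the formulation above.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated blobs, so a separating hyperplane with positive margin exists.
X, y = make_blobs(n_samples=40, centers=2, cluster_std=0.6, random_state=0)
y = 2 * y - 1  # relabel to {-1, +1}

# A very large C approximates the hard-margin SVM (violations effectively forbidden).
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_.ravel()          # weight vector w
b = clf.intercept_[0]          # bias b
print("w =", w, " b =", b)
print("geometric margin 2/||w|| =", 2.0 / np.linalg.norm(w))
# Every training point should satisfy y_i (w . x_i + b) >= 1, up to solver tolerance.
print("min functional margin:", np.min(y * (X @ w + b)))
```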

The Dual Formulation

Theorem

Hard Margin SVM Dual Problem

Statement

The Lagrangian dual of the hard margin SVM is:

$$\max_{\alpha} \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j$$

subject to $\alpha_i \geq 0$ for all $i$ and $\sum_i \alpha_i y_i = 0$.

At optimality, $w^* = \sum_i \alpha_i^* y_i x_i$, and $\alpha_i^* > 0$ only for support vectors: points lying exactly on the margin boundary.

Intuition

The dual says: find weights $\alpha_i$ on training points such that the resulting classifier $w = \sum_i \alpha_i y_i x_i$ maximizes the margin. Most $\alpha_i$ are zero. Only the points closest to the decision boundary (the support vectors) have nonzero weight, so the classifier depends on only a few critical training examples.

Proof Sketch

Form the Lagrangian $L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i[y_i(w^T x_i + b) - 1]$. Set $\nabla_w L = 0$ to get $w = \sum_i \alpha_i y_i x_i$. Set $\partial L / \partial b = 0$ to get $\sum_i \alpha_i y_i = 0$. Substitute back into $L$ to eliminate $w$ and $b$, yielding the dual. Strong duality holds because the primal is convex and Slater's condition is satisfied (for separable data a strictly feasible point exists: scale up any separating $w, b$).

Why It Matters

The dual formulation is essential for two reasons: (1) the data appears only through dot products $x_i^T x_j$, enabling the kernel trick, and (2) the dual variables directly identify the support vectors.

Failure Mode

The hard margin SVM has no solution if the data is not linearly separable. Even a single mislabeled point or outlier makes the primal infeasible. This motivates the soft margin formulation.
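
A small numerical check of the dual picture, again assuming scikit-learn is available: reconstruct $w^* = \sum_i \alpha_i^* y_i x_i$ from the fitted dual coefficients and confirm that only a handful of points carry nonzero weight. SVC stores $y_i \alpha_i$ for the support vectors in `dual_coef_`.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=60, centers=2, cluster_std=0.8, random_state=1)
clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard margin

# dual_coef_ holds y_i * alpha_i for the support vectors only (alpha_i = 0 elsewhere),
# so dual_coef_ @ support_vectors_ is exactly sum_i alpha_i y_i x_i.
w_from_dual = clf.dual_coef_ @ clf.support_vectors_

print("primal w:             ", clf.coef_.ravel())
print("sum_i alpha_i y_i x_i:", w_from_dual.ravel())
print("support vectors:", len(clf.support_), "of", len(X), "training points")
```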

Lagrange Multipliers and KKT — Quick Recap

The SVM is the canonical setting for seeing Lagrange multipliers and the KKT conditions in action. What follows is a short, self-contained refresher, since these concepts do not yet have standalone topic pages.

A constrained convex problem $\min_x f(x)$ s.t. $g_i(x) \le 0$ has Lagrangian $L(x, \alpha) = f(x) + \sum_i \alpha_i g_i(x)$ with $\alpha_i \ge 0$. The dual function $g(\alpha) = \min_x L(x, \alpha)$ is concave, and its supremum over $\alpha \ge 0$ equals the primal optimal value $f^*$ under regularity (Slater's condition: a strictly feasible point exists). The four KKT conditions characterize that joint optimum:

  1. Stationarity (gradient of the Lagrangian vanishes): $\nabla_x f(x^*) + \sum_i \alpha_i^* \nabla g_i(x^*) = 0$
  2. Primal feasibility: $g_i(x^*) \le 0$ for every $i$
  3. Dual feasibility: $\alpha_i^* \ge 0$ for every $i$
  4. Complementary slackness: $\alpha_i^* g_i(x^*) = 0$ for every $i$: each constraint is either tight ($g_i(x^*) = 0$) or its multiplier is zero ($\alpha_i^* = 0$).

For the hard-margin SVM with constraints $g_i(w, b) = 1 - y_i(w^\top x_i + b) \le 0$, the KKT conditions specialize to:

  1. Stationarity: $w^* = \sum_i \alpha_i^* y_i x_i$ (and $\sum_i \alpha_i^* y_i = 0$ from $\partial L / \partial b$).
  2. Primal feasibility: $y_i(w^{*\top} x_i + b^*) \ge 1$ for every training point.
  3. Dual feasibility: $\alpha_i^* \ge 0$.
  4. Complementary slackness: $\alpha_i^*[y_i(w^{*\top} x_i + b^*) - 1] = 0$ — either $\alpha_i^* = 0$ (the point is not a support vector) or $y_i(w^{*\top} x_i + b^*) = 1$ (the point sits exactly on the margin gutter). No middle ground.

The dual-variable interpretation is what makes SVM both interpretable and sparse: only the support vectors contribute to $w^*$. The complementary-slackness identity is the operational rule that picks them out.
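
The complementary-slackness rule can be verified numerically. A sketch under the same assumptions as before (scikit-learn, a separable toy dataset, large $C$ approximating the hard margin):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=50, centers=2, cluster_std=0.6, random_state=2)
y = 2 * y - 1

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard margin
w, b = clf.coef_.ravel(), clf.intercept_[0]

# Recover alpha_i: dual_coef_ stores y_i * alpha_i for support vectors; alpha_i = 0 otherwise.
alpha = np.zeros(len(X))
alpha[clf.support_] = np.abs(clf.dual_coef_.ravel())

# Complementary slackness: alpha_i * (y_i (w . x_i + b) - 1) should vanish for every point.
slack_product = alpha * (y * (X @ w + b) - 1.0)
print("max |alpha_i * (y_i (w.x_i + b) - 1)| =", np.max(np.abs(slack_product)))

# Points strictly off the gutter (functional margin > 1) must carry alpha_i = 0.
off_gutter = y * (X @ w + b) > 1.0 + 1e-3
print("largest alpha among off-gutter points:", alpha[off_gutter].max())
```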

Soft Margin SVM and Hinge Loss

Definition

Soft Margin SVM

When data is not linearly separable, introduce slack variables $\xi_i \geq 0$:

$$\min_{w, b, \xi} \frac{1}{2}\|w\|^2 + C\sum_{i=1}^n \xi_i \quad \text{s.t.} \quad y_i(w^T x_i + b) \geq 1 - \xi_i, \;\; \xi_i \geq 0$$

The parameter $C > 0$ trades off margin width against margin violations. Large $C$ penalizes violations heavily (close to hard margin); small $C$ allows more violations for a wider margin.
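
A quick illustration of the $C$ tradeoff on overlapping data, assuming scikit-learn is available (the dataset and the specific $C$ values are arbitrary illustrative choices):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping blobs, so some margin violations are unavoidable.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=3)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_.ravel()
    print(f"C={C:>6}: margin 2/||w|| = {2.0 / np.linalg.norm(w):.3f}, "
          f"support vectors = {len(clf.support_)}, "
          f"train accuracy = {clf.score(X, y):.3f}")
```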

Definition

Hinge Loss

The soft margin SVM is equivalent to minimizing the hinge loss:

$$\min_{w, b} \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^n \max(0, 1 - y_i(w^T x_i + b))$$

where $\lambda = 1/(nC)$. The hinge loss is convex, non-differentiable at $y \cdot f(x) = 1$, and zero for correctly classified points with margin $\geq 1$. This is a regularized ERM problem with hinge loss.
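
The equivalence can be checked directly: minimize the regularized hinge objective with plain subgradient descent using $\lambda = 1/(nC)$ and compare against the soft-margin QP solution. This is a rough sketch, not a production solver; the step-size schedule and iteration count are ad hoc choices, and the bias is left unregularized to match the SVM formulation.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=150, centers=2, cluster_std=1.2, random_state=4)
y = 2 * y - 1
n, d = X.shape
C = 1.0
lam = 1.0 / (n * C)                           # lambda = 1/(nC)

def hinge_objective(w, b):
    """(lam/2)||w||^2 + (1/n) sum_i max(0, 1 - y_i (w . x_i + b))"""
    return 0.5 * lam * (w @ w) + np.mean(np.maximum(0.0, 1.0 - y * (X @ w + b)))

# Reference solution from the soft-margin QP solver.
svc = SVC(kernel="linear", C=C).fit(X, y)
w_svc, b_svc = svc.coef_.ravel(), svc.intercept_[0]

# Plain subgradient descent on the regularized hinge objective.
w, b = np.zeros(d), 0.0
for t in range(1, 20001):
    active = y * (X @ w + b) < 1              # points inside the hinge region
    grad_w = lam * w - (y[active, None] * X[active]).sum(axis=0) / n
    grad_b = -y[active].sum() / n             # the bias is not regularized
    eta = min(0.1, 1.0 / (lam * t))           # capped 1/(lam*t) step size (ad hoc)
    w, b = w - eta * grad_w, b - eta * grad_b

# If the descent has converged, the two objective values should nearly coincide.
print("objective, subgradient descent:", hinge_objective(w, b))
print("objective, soft-margin SVC:    ", hinge_objective(w_svc, b_svc))
print("prediction agreement:", np.mean(np.sign(X @ w + b) == svc.predict(X)))
```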

The Kernel Trick

Since the dual depends on the data only through dot products $x_i^T x_j$, we can replace these with a kernel function $k(x_i, x_j) = \phi(x_i)^T \phi(x_j)$ where $\phi$ maps to a (potentially infinite-dimensional) feature space. The classifier becomes:

$$f(x) = \sum_{i=1}^n \alpha_i y_i k(x_i, x) + b$$

We never compute $\phi(x)$ explicitly. We only evaluate the kernel. Four canonical kernels cover the bulk of practice:

| Kernel | Formula | Feature-space dimension | Where it earns its keep |
| --- | --- | --- | --- |
| Linear | $k(x, z) = x^\top z$ | $d$ (no lift) | High-dimensional sparse data (text, bag-of-words); when interpretability of $w$ matters. |
| Polynomial (degree $p$) | $k(x, z) = (x^\top z + c)^p$ | $\binom{d + p}{p}$ | Explicit interaction terms up to degree $p$; small $d$ with structured nonlinearity. |
| Gaussian RBF | $k(x, z) = \exp(-\lVert x - z \rVert^2 / (2\sigma^2))$ | infinite | Default for most problems; smooth boundaries; bandwidth $\sigma$ is the dominant tuning knob. |
| Sigmoid (tanh) | $k(x, z) = \tanh(\kappa\, x^\top z + c)$ | not p.s.d. for all $\kappa, c$ | Historical neural-network analogy; rarely best in practice and not always Mercer. |

The RBF kernel maps to an infinite-dimensional feature space yet computes in $O(d)$ per pair, which is what made the kernel trick a practical revolution. The kernels and RKHS page gives the full Mercer / reproducing-kernel theory underneath.
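
To make the kernelized classifier concrete, the sketch below (assuming scikit-learn is available) fits an RBF SVM and then rebuilds $f(x) = \sum_i \alpha_i y_i k(x_i, x) + b$ by hand from kernel evaluations alone, with no explicit feature map. Note that scikit-learn parametrizes the RBF kernel by $\gamma = 1/(2\sigma^2)$.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=5)

sigma = 0.5
gamma = 1.0 / (2.0 * sigma**2)    # scikit-learn's RBF: k(x, z) = exp(-gamma ||x - z||^2)
clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X, y)

# Rebuild f(x) = sum_i alpha_i y_i k(x_i, x) + b from kernel evaluations only.
X_test = X[:5]
K = rbf_kernel(X_test, clf.support_vectors_, gamma=gamma)
f_manual = K @ clf.dual_coef_.ravel() + clf.intercept_[0]

print("manual decision values:", f_manual)
print("SVC decision_function :", clf.decision_function(X_test))
```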

Sensitivity to $C$ and $\sigma$

SVM performance is famously parameter-sensitive. Two knobs dominate, and miscalibrating either can flip the model from underfit to overfit:

  • $C$ (soft-margin penalty). Large $C$ heavily penalizes margin violations; the model approaches the hard-margin SVM, narrowing the margin and overfitting outliers. Small $C$ tolerates violations, widening the margin and behaving like a heavily regularized classifier (potentially underfitting). The mapping $\lambda = 1/(nC)$ ties $C$ to the regularization strength of the equivalent hinge-loss-plus-$\ell_2$ ERM.
  • $\sigma$ (RBF bandwidth). Large $\sigma$ makes $k(x, z)$ vary slowly with distance, producing smooth, low-capacity decision functions (underfit). Small $\sigma$ makes $k(x, z)$ near-zero for any pair of distinct points, producing a near-nearest-neighbor classifier with high capacity (overfit; the algorithmic stability bound becomes loose).

The standard discipline is grid search over $\log C$ and $\log \sigma$ with cross-validation, evaluated on a hold-out set. The classical sweet spot for RBF SVMs on a normalized dataset is $C \in [10^{-1}, 10^{2}]$ and $\sigma \approx$ the median pairwise distance (the "median heuristic"). When the median heuristic disagrees sharply with the cross-validated optimum, the data has multi-scale structure that a single-bandwidth RBF cannot capture; that is when polynomial or composite kernels earn their place.
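
A sketch of that discipline, assuming scikit-learn and SciPy are available; the grid bounds and dataset are illustrative, and the $\gamma$ grid is centered on the median-heuristic value:

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.25, random_state=6)
X = StandardScaler().fit_transform(X)            # normalize before touching C or sigma

# Median heuristic: sigma = median pairwise distance, so gamma = 1/(2 sigma^2).
sigma_med = np.median(pdist(X))
gamma_med = 1.0 / (2.0 * sigma_med**2)

param_grid = {
    "C": np.logspace(-1, 2, 7),                  # grid over log C
    "gamma": gamma_med * np.logspace(-2, 2, 9),  # grid over log sigma, centered on the heuristic
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)

print("median-heuristic gamma:", gamma_med)
print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", search.best_score_)
```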

Bousquet and Elisseeff (2002, Theorem 22) gave SVM an explicit stability bound $\beta = L^2 / (2\mu n)$, where $\mu$ is the strong-convexity constant of the regularizer (here $2/C$) and $L$ is the Lipschitz constant of the loss. This is one of the cleanest examples of a stability-based generalization bound in the canonical-algorithm setting, and the rate $1/(\mu n)$ scales linearly with $C$.

Theorem

Representer Theorem for SVMs

Statement

For any regularized risk minimization problem of the form $\min_{f \in \mathcal{H}_k} \frac{\lambda}{2}\|f\|_{\mathcal{H}_k}^2 + \frac{1}{n}\sum_i \ell(y_i, f(x_i))$, the minimizer has the representation $f^*(x) = \sum_{i=1}^n \alpha_i k(x_i, x)$.

Intuition

You never need to search over the full (infinite-dimensional) RKHS. The optimal function is always a finite linear combination of kernel evaluations at the training points. This is why the kernel trick is computationally feasible.

Proof Sketch

Decompose $f = f_\parallel + f_\perp$ where $f_\parallel$ lies in $\text{span}\{k(x_i, \cdot)\}$ and $f_\perp$ is orthogonal. Then $f(x_i) = f_\parallel(x_i)$ for all training points (by the reproducing property), but $\|f\|^2 = \|f_\parallel\|^2 + \|f_\perp\|^2 \geq \|f_\parallel\|^2$. So $f_\perp$ only increases the regularizer without affecting the loss.

Why It Matters

The representer theorem reduces an infinite-dimensional optimization to a finite-dimensional one (finding $n$ coefficients $\alpha_i$), making kernel methods computationally tractable.

Failure Mode

The kernel matrix $K_{ij} = k(x_i, x_j)$ is $n \times n$. For large $n$, storing and inverting this matrix becomes prohibitive ($O(n^2)$ memory, $O(n^3)$ solve time). This is why SVMs do not scale as well as SGD-based methods to very large datasets.
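
Both points, the finite-dimensional representation and the $n \times n$ bottleneck, show up in a few lines of kernel ridge regression, the simplest representer-theorem instance. A sketch using NumPy and scikit-learn's kernel helper; it uses the common normalization $\sum_i (y_i - f(x_i))^2 + \lambda \|f\|^2$, under which the coefficients solve $(K + \lambda I)\alpha = y$:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(7)
n = 300
X = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(n)

lam, gamma = 1.0, 0.5
K = rbf_kernel(X, X, gamma=gamma)      # the n x n kernel matrix: O(n^2) memory

# Representer theorem: f(x) = sum_i alpha_i k(x_i, x). For the squared-loss objective
# sum_i (y_i - f(x_i))^2 + lam * ||f||^2, the coefficients solve (K + lam I) alpha = y.
alpha = np.linalg.solve(K + lam * np.eye(n), y)   # O(n^3) solve: the scaling bottleneck

X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
f_test = rbf_kernel(X_test, X, gamma=gamma) @ alpha
print("fitted f at test points:", f_test)
print("true sin at test points:", np.sin(X_test).ravel())
```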

Connection to Regularization

The soft margin SVM is equivalent to regularized ERM with hinge loss:

$$\min_w \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^n \ell_{\text{hinge}}(y_i, w^T x_i)$$

This is the same framework as ridge regression (squared loss + L2 regularization) or regularized logistic regression (logistic loss + L2). The only difference is the loss function. The hinge loss is special because it creates sparsity in the dual: most $\alpha_i = 0$, giving the "support vector" property.
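
One way to see the dual sparsity, assuming scikit-learn is available: fit a soft-margin SVM, discard every point with $\alpha_i = 0$, and refit. The solution should be unchanged up to solver tolerance, because only the support vectors enter the KKT conditions.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.5, random_state=8)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
sv = clf.support_                                   # indices of the support vectors

# Dual sparsity: dropping every non-support point leaves the solution unchanged.
clf_sv_only = SVC(kernel="linear", C=1.0).fit(X[sv], y[sv])

print("support vectors:", len(sv), "of", len(X), "points")
print("w, b (all data):         ", clf.coef_.ravel(), clf.intercept_[0])
print("w, b (support vecs only):", clf_sv_only.coef_.ravel(), clf_sv_only.intercept_[0])
```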

Canonical Examples

Example

Hard margin in 2D

Consider two classes: $\{(1,1), (2,2)\}$ with label $+1$ and $\{(-1,-1), (-2,-2)\}$ with label $-1$. The optimal separating hyperplane passes through the origin with $w \propto (1,1)$; the minimum-norm solution is $w^* = (1/2, 1/2)$, $b^* = 0$. The margin is $\frac{2}{\|w^*\|} = 2\sqrt{2}$, and the support vectors are $(1,1)$ and $(-1,-1)$, the closest points to the boundary from each class.
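
The example can be reproduced numerically; a sketch assuming scikit-learn, with a large $C$ approximating the hard margin:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 2], [-1, -1], [-2, -2]], dtype=float)
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)        # large C ~ hard margin

w, b = clf.coef_.ravel(), clf.intercept_[0]
print("w =", w, " b =", b)                          # expect w ~ (0.5, 0.5), b ~ 0
print("margin 2/||w|| =", 2.0 / np.linalg.norm(w))  # expect 2*sqrt(2) ~ 2.83
print("support vectors:")
print(clf.support_vectors_)                         # expect (1, 1) and (-1, -1)
```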

Common Confusions

Watch Out

SVMs do not maximize accuracy; they maximize margin

An SVM does not directly maximize classification accuracy on the training set. Many hyperplanes may achieve 100% training accuracy on separable data. The SVM picks the one with the largest margin. The theoretical justification is that large margin implies small generalization error (via VC dimension bounds on the margin classifier class).

Watch Out

The kernel trick is not specific to SVMs

Any algorithm whose computation depends only on dot products between data points can be kernelized. This includes kernel PCA, kernel ridge regression, and kernel k-means. SVMs popularized the kernel trick, but it is a general principle.

Why SVMs Were Dominant Pre-Deep-Learning

From roughly 1995 to 2012, SVMs were the go-to method for classification:

  • Convex optimization: a single global optimum, no local minima
  • Strong theory: margin-based generalization bounds
  • Kernels: nonlinear classification without manual feature engineering
  • Sparsity: the solution depends on few support vectors

Deep learning overtook SVMs because neural networks scale better to massive datasets (SGD is $O(n)$ per epoch vs $O(n^2)$-$O(n^3)$ for kernel SVMs) and learn hierarchical features automatically.

Summary

  • Hard margin SVM: minimize $\|w\|^2$ subject to $y_i(w^T x_i + b) \geq 1$
  • Margin $= 2/\|w\|$; maximizing the margin = minimizing $\|w\|$
  • The dual depends on the data only through dot products → kernel trick
  • Support vectors: points with $\alpha_i > 0$, lying on the margin
  • Soft margin: slack variables $\xi_i$; the parameter $C$ controls the tradeoff
  • Hinge loss $\max(0, 1 - y f(x))$: a convex surrogate for the 0-1 loss

Exercises

ExerciseCore

Problem

Derive the dual of the hard margin SVM. Start from the primal $\min_{w,b} \frac{1}{2}\|w\|^2$ subject to $y_i(w^T x_i + b) \geq 1$, introduce Lagrange multipliers $\alpha_i \geq 0$, and eliminate $w$ and $b$.

ExerciseAdvanced

Problem

Show that for the soft margin SVM, the dual variables satisfy $0 \leq \alpha_i \leq C$. What does $\alpha_i = C$ mean geometrically?


References

Canonical:

  • Vapnik, The Nature of Statistical Learning Theory (1995), Chapters 5 and 10
  • Cristianini & Shawe-Taylor, An Introduction to Support Vector Machines (2000)
  • Cortes & Vapnik, "Support-Vector Networks" (Machine Learning, 1995)
  • Boser, Guyon, Vapnik, "A Training Algorithm for Optimal Margin Classifiers" (COLT, 1992)
  • Schölkopf & Smola, Learning with Kernels (2002), Chapters 2 and 7

Current:

  • Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapters 15-16
  • Boyd & Vandenberghe, Convex Optimization (2004), Chapter 5
  • Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapter 12
  • Steinwart & Christmann, Support Vector Machines (2008), Chapters 1 and 2


Last reviewed: May 3, 2026
