
ML Methods

Data Preprocessing and Feature Engineering

Standardization, scaling, encoding, imputation, and feature selection. Why most algorithms assume centered, scaled inputs and what breaks when you skip preprocessing.


Why This Matters

Standard preprocessing pipeline: each step fixes a specific assumption violation.

[Figure: pipeline diagram. Raw data (mixed types, missing values) → Impute (fill missing via median/KNN; skipping lets NaN propagate through the gradient) → Scale (z-score or min-max to [0,1]; skipping lets large features dominate the loss) → Encode (one-hot or ordinal; skipping treats categories as ordinal) → Select (drop low-MI features) → Model (features ready for training). All steps are fit on the train set only, then applied to both train and test.]

Raw data almost never satisfies the assumptions that ML algorithms make. Gradient-based methods assume features are on similar scales. Distance-based methods assume features contribute equally to distance. Tree methods are more robust, but still benefit from clean inputs. Skipping preprocessing is one of the most common causes of poor model performance. Preprocessing is not optional; it is part of the modeling pipeline.

Mental Model

Preprocessing transforms raw features into a form that algorithms can work with efficiently. The three main goals: (1) put features on comparable scales so no single feature dominates, (2) encode non-numeric data as numbers, and (3) handle missing values without introducing bias.

Scaling Methods

Definition

Standardization (Z-score Normalization)

Given feature values $x_1, \ldots, x_n$, standardization transforms each value to:

$$z_i = \frac{x_i - \bar{x}}{s}$$

where $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$ is the sample mean and $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2}$ is the sample standard deviation. The result has mean 0 and standard deviation 1.
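As a concrete sketch, standardization is a few lines of plain Python (the `standardize` helper and its sample values are illustrative, not from any particular library):

```python
def standardize(xs):
    """Z-score each value: subtract the sample mean, divide by the sample std (ddof=1)."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)  # sample variance
    return [(x - mean) / var ** 0.5 for x in xs]

z = standardize([2.0, 4.0, 6.0, 8.0])
# z has mean 0 and sample standard deviation 1
```

Library implementations may divide by the population standard deviation (ddof=0) instead; the effect on optimization conditioning is the same.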

Definition

Min-Max Scaling

Transform feature values to the range $[0, 1]$:

$$z_i = \frac{x_i - x_{\min}}{x_{\max} - x_{\min}}$$

where $x_{\min}$ and $x_{\max}$ are the observed minimum and maximum. Sensitive to outliers: a single extreme value compresses all other values into a narrow range.
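A minimal sketch of min-max scaling, with made-up numbers illustrating the outlier sensitivity just described:

```python
def min_max(xs):
    """Scale values linearly to [0, 1] using the observed min and max."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

clean = min_max([10, 20, 30, 40])           # evenly spread across [0, 1]
with_outlier = min_max([10, 20, 30, 1000])  # the outlier squeezes the first three near 0
```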

When to use which. Standardization is the default for gradient-based methods (linear regression, logistic regression, neural networks, SVMs). Min-max scaling is used when features must be bounded (e.g., pixel values in $[0, 1]$ for image models). Standardization is more robust to outliers because $s$ absorbs some of their effect, while min-max scaling is dominated by extremes.

Log Transform and Power Transforms

For right-skewed features (income, population, word frequency), a log transform $z_i = \log(x_i + c)$ compresses the long tail and makes the distribution more symmetric. The constant $c$ (often 1) handles zeros. This is not cosmetic: many models perform better with approximately symmetric features because the gradient landscape becomes better conditioned.
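A small illustration (the income values are made up) of how the log transform compresses a long right tail:

```python
import math

def log_transform(xs, c=1.0):
    """z = log(x + c); the shift c handles zeros."""
    return [math.log(x + c) for x in xs]

incomes = [0, 30_000, 60_000, 1_200_000]
z = log_transform(incomes)
# the 40x multiplicative gap between 30k and 1.2M becomes an additive gap of about 3.7
```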

Box-Cox transform (Box and Cox, 1964): A parametric family of power transforms

$$z_i(\lambda) = \begin{cases} (x_i^\lambda - 1)/\lambda & \lambda \neq 0 \\ \log x_i & \lambda = 0 \end{cases}$$

that includes log ($\lambda = 0$), square root ($\lambda = 0.5$), and identity ($\lambda = 1$) as special cases. Requires strictly positive inputs. $\lambda$ is estimated by maximum likelihood.

Yeo-Johnson transform (Yeo and Johnson, 2000): Extends Box-Cox to allow zero and negative values:

$$z_i(\lambda) = \begin{cases} ((x_i + 1)^\lambda - 1)/\lambda & x_i \geq 0,\ \lambda \neq 0 \\ \log(x_i + 1) & x_i \geq 0,\ \lambda = 0 \\ -\left((-x_i + 1)^{2-\lambda} - 1\right)/(2-\lambda) & x_i < 0,\ \lambda \neq 2 \\ -\log(-x_i + 1) & x_i < 0,\ \lambda = 2 \end{cases}$$

Preferred over Box-Cox when features can be zero or negative.
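The piecewise definition translates directly into code. This sketch applies the transform for a fixed $\lambda$; estimating $\lambda$ by maximum likelihood (as, e.g., scikit-learn's `PowerTransformer` does) is omitted:

```python
import math

def yeo_johnson(x, lam):
    """Yeo-Johnson transform of a single value for a given lambda."""
    if x >= 0:
        if abs(lam) > 1e-12:
            return ((x + 1.0) ** lam - 1.0) / lam
        return math.log(x + 1.0)
    if abs(lam - 2.0) > 1e-12:
        return -(((-x + 1.0) ** (2.0 - lam)) - 1.0) / (2.0 - lam)
    return -math.log(-x + 1.0)

# lambda = 1 recovers the identity on both the positive and negative branches
yeo_johnson(3.0, 1.0), yeo_johnson(-2.0, 1.0)  # (3.0, -2.0)
```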

Encoding Categorical Variables

Definition

One-Hot Encoding

For a categorical feature with $K$ categories, create $K$ binary indicator columns:

$$\text{onehot}(x) = e_k \in \{0, 1\}^K$$

where $e_k$ is the $k$-th standard basis vector. For categories $\{\text{red}, \text{green}, \text{blue}\}$, "red" becomes $[1, 0, 0]$.

One-hot encoding introduces $K - 1$ degrees of freedom (the $K$-th column is linearly dependent on the others). For high-cardinality features ($K > 100$), one-hot encoding creates very sparse, high-dimensional representations.
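A one-line-per-step sketch of one-hot encoding (the category list is illustrative):

```python
def one_hot(value, categories):
    """Return the k-th standard basis vector for value's index in categories."""
    vec = [0] * len(categories)
    vec[categories.index(value)] = 1
    return vec

colors = ["red", "green", "blue"]
one_hot("red", colors)  # [1, 0, 0]
```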

Target encoding (mean encoding): Replace each category with the mean of the target variable within that category. Formally, for category $c$ in feature $j$:

$$\hat{x}_{j,c} = \frac{\sum_{i: x_{ij}=c} y_i}{|\{i: x_{ij}=c\}|}$$

For low-count categories, the raw mean is noisy. A James-Stein-style shrinkage toward the global mean is standard:

$$\hat{x}_{j,c}^{\text{shrunk}} = \frac{n_c \bar{y}_c + \alpha \bar{y}_{\text{global}}}{n_c + \alpha}$$

where $n_c$ is the count of category $c$, $\bar{y}_c$ is its empirical mean, $\bar{y}_{\text{global}}$ is the overall mean, and $\alpha$ is a smoothing parameter. Leakage warning: target encoding must be computed using training-fold targets only. Computing it on the full dataset before splitting leaks test-set label information into the training features.
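A sketch of shrunken target encoding; in a real pipeline this map must be built per training fold, and the function name and toy data here are illustrative:

```python
from collections import defaultdict

def target_encode(categories, targets, alpha=10.0):
    """Map each category c to (n_c * mean_c + alpha * global_mean) / (n_c + alpha)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    global_mean = sum(targets) / len(targets)
    # sums[c] equals n_c * mean_c, so the shrinkage formula simplifies:
    return {c: (sums[c] + alpha * global_mean) / (counts[c] + alpha) for c in counts}

enc = target_encode(["a", "a", "b"], [1.0, 1.0, 0.0], alpha=1.0)
# "b" has a single observation, so it is pulled strongly toward the global mean 2/3
```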

Feature hashing (hashing trick): For extreme-cardinality features ($K > 10^4$), map category strings to a fixed-dimensional vector using a hash function. Formally (Weinberger et al., 2009, ICML), for feature $j$ and category $c$, the hashed representation adds $+1$ or $-1$ at position $h(c) \bmod d$, with the sign given by an independent signed hash $\xi(c) \in \{-1, +1\}$. Collisions introduce noise, but the bias is bounded. Used in spam filtering, recommendation systems, and any setting where the category space changes dynamically.
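A toy version of the signed hashing trick; `md5` stands in for the bucket/sign hash pair, whereas production implementations (e.g., scikit-learn's `FeatureHasher`) use a fast non-cryptographic hash:

```python
import hashlib

def hash_features(categories, d=8):
    """Signed hashing trick: category c adds xi(c) in {-1, +1} at position h(c) mod d."""
    vec = [0] * d
    for c in categories:
        digest = hashlib.md5(c.encode("utf-8")).digest()
        idx = digest[0] % d                      # bucket hash h(c)
        sign = 1 if digest[1] % 2 == 0 else -1   # sign hash xi(c)
        vec[idx] += sign
    return vec
```

The output dimension $d$ is fixed up front, so new categories never grow the feature space; they simply hash into existing buckets.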

Feature crosses. For two categorical features $A$ (with $K_A$ levels) and $B$ (with $K_B$ levels), the cross $A \times B$ creates $K_A \times K_B$ new binary features representing all combinations. This lets linear models capture interactions. Example: "day of week" crossed with "hour" captures that Monday-8am differs from Saturday-8am.
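A cross of two categorical features is just a one-hot encoding over all $K_A \times K_B$ combinations; a minimal sketch with made-up categories:

```python
def feature_cross(a, cats_a, b, cats_b):
    """One-hot encode the pair (a, b) over every combination of the two feature levels."""
    combos = [(x, y) for x in cats_a for y in cats_b]
    vec = [0] * len(combos)
    vec[combos.index((a, b))] = 1
    return vec

days = ["mon", "sat"]
hours = ["8am", "9pm"]
feature_cross("mon", days, "8am", hours)  # [1, 0, 0, 0] over 2 x 2 = 4 combinations
```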

TF-IDF for text features. For a term $t$ in document $d$, given $N$ total documents and $\text{df}(t)$ documents containing term $t$:

$$\text{tfidf}(t, d) = \text{tf}(t, d) \cdot \log\!\left(\frac{N}{\text{df}(t)}\right)$$

where $\text{tf}(t, d)$ is the within-document term frequency (often $1 + \log(\text{count}(t,d))$ in log-normalized form). The inverse document frequency $\log(N/\text{df}(t))$ down-weights terms that appear in nearly all documents (stop words) and up-weights rare discriminative terms.
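The formula maps directly onto a short implementation over tokenized documents (raw-count tf; the two-document corpus is illustrative):

```python
import math

def tfidf(docs):
    """tfidf(t, d) = tf(t, d) * log(N / df(t)) for a list of tokenized documents."""
    N = len(docs)
    df = {}
    for doc in docs:
        for t in set(doc):
            df[t] = df.get(t, 0) + 1
    result = []
    for doc in docs:
        tf = {}
        for t in doc:
            tf[t] = tf.get(t, 0) + 1
        result.append({t: count * math.log(N / df[t]) for t, count in tf.items()})
    return result

scores = tfidf([["the", "cat"], ["the", "dog"]])
# "the" appears in every document, so its idf = log(2/2) = 0 and its score vanishes
```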

Missing Value Imputation

Missing data taxonomy (Rubin, 1976, Biometrika 63:581-592):

  • MCAR (Missing Completely At Random): Missingness is independent of both observed and unobserved values. Example: sensor randomly fails. Complete-case analysis (dropping missing rows) is unbiased under MCAR but reduces sample size.
  • MAR (Missing At Random): Missingness depends on observed values but not on the missing value itself, conditional on the observed data. Example: income is more likely to be missing for younger respondents, but conditional on age, missingness is random. Ignorable for likelihood-based inference under MAR.
  • MNAR (Missing Not At Random): Missingness depends on the missing value itself. Example: high earners are less likely to disclose income. Requires modeling the missingness mechanism; naive imputation is biased.

Standard approaches:

  1. Mean/median imputation. Replace missing values with the feature mean (or median for skewed data). Simple and fast. Biased: it underestimates variance and distorts correlations between features. Valid only under MCAR.

  2. Model-based imputation. Train a model (e.g., k-NN, random forest) to predict missing values from observed features. Preserves correlations better than mean imputation, but adds complexity and can overfit.

  3. Multiple Imputation by Chained Equations (MICE). Van Buuren and Groothuis-Oudshoorn (2011, Journal of Statistical Software 45:1-67). For each feature with missing values, fit a regression model on the other features, draw imputed values from the posterior predictive distribution, and iterate. This produces $m$ complete datasets; analyses are run on each and pooled using Rubin's combining rules. MICE is the gold standard under MAR: it preserves correlations between features, and the pooled estimates come with valid, asymptotically efficient uncertainty quantification.

  4. Indicator augmentation. Add a binary "is missing" indicator feature alongside the imputed value. This lets the model learn that missingness itself carries information (data is often not missing at random).
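A minimal sketch combining approaches 1 and 4: median imputation plus an is-missing indicator (pure Python; a real pipeline would use a library imputer fit on the training split):

```python
def impute_median_with_indicator(xs):
    """Replace None with the median of observed values; also return an is-missing flag."""
    observed = sorted(x for x in xs if x is not None)
    n = len(observed)
    median = (observed[n // 2] if n % 2 == 1
              else (observed[n // 2 - 1] + observed[n // 2]) / 2)
    imputed = [median if x is None else x for x in xs]
    indicator = [1 if x is None else 0 for x in xs]
    return imputed, indicator

vals, flags = impute_median_with_indicator([1.0, None, 3.0, 100.0])
# median of the observed values [1, 3, 100] is 3, so the gap is filled with 3.0
```

Note how the median ignores the outlier 100, which is exactly why it is preferred over the mean for skewed features.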

Feature Selection

Three categories:

Filter methods. Rank features by a univariate statistic and keep the top $k$. Common statistics: Pearson correlation with the target, mutual information $I(X_j; Y)$, or the ANOVA F-statistic. Fast, but ignores feature interactions.

Wrapper methods. Evaluate subsets of features by training and testing a model. Forward selection adds features one at a time; backward elimination removes them one at a time. Computationally expensive: $O(2^p)$ subsets for $p$ features in the worst case.

Embedded methods. Feature selection happens during model training. L1 regularization (Lasso) drives coefficients exactly to zero, performing automatic feature selection. The regularization parameter $\lambda$ controls sparsity.
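The mechanism behind Lasso's embedded selection is the soft-thresholding (proximal) operator: coefficients smaller in magnitude than the threshold are set exactly to zero. A one-function sketch:

```python
def soft_threshold(w, lam):
    """Lasso proximal step: shrink w toward 0 by lam; small coefficients become exactly 0."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0

[soft_threshold(w, 1.0) for w in (3.0, -3.0, 0.5)]  # [2.0, -2.0, 0.0]
```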

Main Theorems

Proposition

Standardization Improves Gradient Descent Conditioning

Statement

For linear regression $\hat{y} = w^T x$ with MSE loss, the condition number of the Hessian $H = X^T X / n$ determines the convergence rate. If features have variances $\sigma_1^2, \ldots, \sigma_p^2$ and are uncorrelated, then:

$$\kappa(H) = \frac{\sigma_{\max}^2}{\sigma_{\min}^2}$$

After standardization (all $\sigma_j = 1$), $\kappa(H) = 1$ for uncorrelated features. Gradient descent converges in one step for $\kappa = 1$, versus $O(\kappa \log(1/\epsilon))$ steps for condition number $\kappa$.

Intuition

Unstandardized features create an elongated loss landscape. The gradient points toward the minimum along the short axis but barely moves along the long axis. Standardization makes the landscape more spherical, so the gradient points directly toward the minimum.

Proof Sketch

For linear regression with MSE, the Hessian is $H = X^T X / n$. If the columns of $X$ are uncorrelated with variances $\sigma_j^2$, then $H$ is diagonal with entries $\sigma_j^2$. The condition number is $\max_j \sigma_j^2 / \min_j \sigma_j^2$. After standardization, all diagonal entries are 1, so $\kappa = 1$. The convergence rate of gradient descent on a quadratic is $((\kappa - 1)/(\kappa + 1))^t$, which is zero at $\kappa = 1$.
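To make the proposition concrete, here is the condition number for two uncorrelated features with illustrative standard deviations of roughly 20 (an age-like feature) and 140,000 (an income-like feature); both figures are assumptions for illustration only:

```python
def condition_number(variances):
    """kappa(H) for uncorrelated features: largest variance over smallest."""
    return max(variances) / min(variances)

raw = condition_number([20.0**2, 140_000.0**2])  # kappa = (140000 / 20)^2 = 4.9e7
scaled = condition_number([1.0, 1.0])            # after standardization: kappa = 1
```

A condition number near $5 \times 10^7$ makes plain gradient descent crawl; standardization collapses it to 1.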

Why It Matters

This explains the common advice to "always standardize your features." It is not a heuristic; it is a direct consequence of optimization theory. Features on different scales create ill-conditioned problems that gradient descent solves slowly or fails to solve at all.

Failure Mode

Standardization helps when features are uncorrelated. If features are highly correlated, the Hessian has small eigenvalues regardless of scaling, and standardization alone does not fix the conditioning. You also need decorrelation (e.g., PCA whitening) or regularization.

Common Confusions

Watch Out

Preprocessing must be fit on training data only

A common data leakage bug: computing the mean and standard deviation on the entire dataset (including test data) before splitting. The test set statistics leak into the training pipeline. Always fit preprocessing parameters (mean, std, min, max) on the training set only, then apply the same transformation to test data. This applies to target encoding as well: computing target means on the full dataset before splitting leaks test labels.
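The fit/transform split in code: parameters come from the training set only, and the identical transformation is applied to every split. This is a minimal pure-Python sketch mirroring the fit/transform pattern used by libraries such as scikit-learn:

```python
def fit_scaler(train):
    """Estimate standardization parameters on the training split only."""
    n = len(train)
    mean = sum(train) / n
    var = sum((x - mean) ** 2 for x in train) / (n - 1)
    return mean, var ** 0.5

def transform(xs, mean, std):
    """Apply the *training* mean and std to any split (train, validation, or test)."""
    return [(x - mean) / std for x in xs]

mean, std = fit_scaler([1.0, 2.0, 3.0, 4.0])  # fit: train only
test_z = transform([10.0], mean, std)         # never re-fit on test data
```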

Watch Out

Tree-based models do not need feature scaling

Decision trees split on thresholds within each feature independently. The scale of a feature does not affect where the optimal split is. Random forests and gradient boosting inherit this property. However, trees still benefit from imputation and encoding of categoricals.

Watch Out

More features is not always better

Adding irrelevant features increases dimensionality without improving signal. In high dimensions, distance-based methods suffer from the curse of dimensionality (all points become equidistant). Feature selection or regularization is necessary to remove noise features.

End-to-End Preprocessing Pipeline

The order of preprocessing steps matters. A common pipeline for tabular data:

Step 1: Split first. Partition data into train/validation/test before any preprocessing. This is non-negotiable.

Step 2: Inspect and clean. On the training set only: identify outliers, check for impossible values (negative ages, dates in the future), and verify data types. Remove or cap extreme outliers. Document every cleaning decision.

Step 3: Handle missing values. On the training set: compute imputation statistics (mean, median, or fit a k-NN imputer or MICE). Apply the same imputation to validation and test sets. If missingness exceeds 50% for a feature, consider dropping it. Add binary "is missing" indicators for features where missingness may carry signal. Classify missingness as MCAR/MAR/MNAR where possible to choose the appropriate method.

Step 4: Encode categoricals. For low-cardinality features ($K < 20$): one-hot encoding. For high-cardinality features ($K > 100$): target encoding (using only training-set statistics to avoid leakage), feature hashing, or learned embeddings. For ordinal features (e.g., "low/medium/high"): integer encoding preserving the natural order.

Step 5: Transform numerics. Apply log transforms or Yeo-Johnson transforms to right-skewed features. Then standardize all numeric features using the training-set mean and standard deviation. Apply the same transformation (same $\mu$ and $\sigma$) to validation and test sets.

Step 6: Feature engineering. Create interaction features (polynomial features, feature crosses) if the model cannot learn interactions (e.g., linear models). For time-series features, compute rolling statistics (mean, standard deviation over a window), ensuring the window only uses past data.

Step 7: Feature selection. Remove features with near-zero variance. Remove one of each pair of highly correlated features ($|r| > 0.95$). Optionally, use L1 regularization or mutual information to select the most informative features.

Example

Preprocessing pipeline for house price prediction

Raw features: square footage (numeric, right-skewed), number of bedrooms (numeric, integer), neighborhood (categorical, 45 levels), year built (numeric), has pool (binary), listing description (text).

Pipeline applied to training set:

  1. Log-transform square footage: $\log(\text{sqft})$ reduces skew from 2.3 to 0.1
  2. One-hot encode neighborhood (45 binary features)
  3. Standardize numeric features (sqft, bedrooms, year) to zero mean, unit variance
  4. Impute 3% missing "year built" values with the training median (1985)
  5. Add a binary "year_built_missing" indicator
  6. Create interaction: $\log(\text{sqft}) \times \text{bedrooms}$ (captures that the value of extra bedrooms depends on house size)
  7. Extract TF-IDF features from the listing description using $\text{tfidf}(t,d) = \text{tf}(t,d) \cdot \log(N/\text{df}(t))$ (top 100 terms)

Total: 45 (neighborhood) + 5 (numeric) + 1 (missing indicator) + 1 (interaction) + 100 (text) = 152 features from 6 raw features. The pipeline is fit on training data and applied identically to test data.

Summary

  • Standardization (zero mean, unit variance) is the default for gradient-based methods
  • Min-max scaling for bounded features; log, Box-Cox, or Yeo-Johnson for skewed features
  • One-hot encoding for low-cardinality categoricals; target encoding (with leakage guard) or feature hashing for high-cardinality
  • Feature crosses for linear-model interactions
  • Impute missing values: MCAR/MAR/MNAR taxonomy determines method; MICE is gold standard under MAR
  • TF-IDF = term frequency times log inverse document frequency
  • Feature selection: filters are fast, wrappers are thorough, L1 regularization is embedded
  • Always fit preprocessing on training data only to avoid leakage
  • Preprocessing is not optional: it directly affects optimization convergence and model quality

Exercises

Exercise (Core)

Problem

A dataset has two features: age (range 18-90) and income (range 20000-500000). You train a linear regression with gradient descent and find it converges slowly. Estimate the condition number of the Hessian and explain why standardization helps.

Exercise (Advanced)

Problem

You have a feature with 30% missing values. Compare the bias introduced by mean imputation versus median imputation when the feature distribution is right-skewed with a long tail. Which imputation method is more robust, and why?

References

Canonical:

  • Hastie, Tibshirani & Friedman, The Elements of Statistical Learning (2009), Chapter 3.3 (subset selection and feature importance)
  • Kuhn & Johnson, Feature Engineering and Selection (2019), Chapters 5-8
  • Zheng & Casari, Feature Engineering for Machine Learning (2018)

Missing data:

  • Rubin, "Inference and Missing Data" (1976), Biometrika 63:581-592 (MCAR/MAR/MNAR taxonomy)
  • van Buuren & Groothuis-Oudshoorn, "mice: Multivariate Imputation by Chained Equations in R" (2011), Journal of Statistical Software 45:1-67 (MICE gold standard)

Feature encoding and hashing:

  • Weinberger et al., "Feature Hashing for Large Scale Multitask Learning" (2009), ICML (hashing trick)

Power transforms:

  • Box & Cox, "An Analysis of Transformations" (1964), Journal of the Royal Statistical Society, Series B, 26:211-252
  • Yeo & Johnson, "A New Family of Power Transformations to Improve Normality or Symmetry" (2000), Biometrika 87:954-959

Last reviewed: April 18, 2026
