ML Methods
Data Preprocessing and Feature Engineering
Standardization, scaling, encoding, imputation, and feature selection. Why most algorithms assume centered, scaled inputs and what breaks when you skip preprocessing.
Prerequisites
Why This Matters
Standard preprocessing pipeline: each step fixes a specific assumption violation
Raw data almost never satisfies the assumptions that ML algorithms make. Gradient-based methods assume features are on similar scales. Distance-based methods assume features contribute equally to distance. Tree methods are more robust, but still benefit from clean inputs. Skipping preprocessing is one of the most common causes of poor model performance. Preprocessing is not optional; it is part of the modeling pipeline.
Mental Model
Preprocessing transforms raw features into a form that algorithms can work with efficiently. The three main goals: (1) put features on comparable scales so no single feature dominates, (2) encode non-numeric data as numbers, and (3) handle missing values without introducing bias.
Scaling Methods
Standardization (Z-score Normalization)
Given feature values $x_1, \dots, x_n$, standardization transforms each value to:
$$z_i = \frac{x_i - \mu}{\sigma}$$
where $\mu$ is the sample mean and $\sigma$ is the sample standard deviation. The result has mean 0 and standard deviation 1.
Min-Max Scaling
Transform feature values to the range $[0, 1]$:
$$x'_i = \frac{x_i - x_{\min}}{x_{\max} - x_{\min}}$$
where $x_{\min}$ and $x_{\max}$ are the observed minimum and maximum. Sensitive to outliers: a single extreme value compresses all other values into a narrow range.
When to use which. Standardization is the default for gradient-based methods (linear regression, logistic regression, neural networks, SVMs). Min-max scaling is used when features must be bounded (e.g., pixel values in [0, 1] for image models). Standardization is more robust to outliers because $\sigma$ absorbs some of their effect, while min-max scaling is dominated by extremes.
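A minimal sketch of both scalers using scikit-learn, with synthetic age-like and income-like columns standing in for real data; the scaler parameters are fit on the training split only (the leakage point discussed later):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Hypothetical data: two features on very different scales.
rng = np.random.default_rng(0)
X_train = np.column_stack([rng.normal(40, 12, 500),        # age-like
                           rng.lognormal(11, 0.6, 500)])   # income-like
X_test = np.column_stack([rng.normal(40, 12, 100),
                          rng.lognormal(11, 0.6, 100)])

# Standardization: subtract training mean, divide by training std.
std = StandardScaler().fit(X_train)
X_train_std, X_test_std = std.transform(X_train), std.transform(X_test)

# Min-max scaling: map training min/max to [0, 1]; test values may fall outside.
mm = MinMaxScaler().fit(X_train)
X_train_mm, X_test_mm = mm.transform(X_train), mm.transform(X_test)
```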
Log Transform and Power Transforms
For right-skewed features (income, population, word frequency), a log transform $x \mapsto \log(x + c)$ compresses the long tail and makes the distribution more symmetric. The constant $c$ (often 1) handles zeros. This is not cosmetic: many models perform better with approximately symmetric features because the gradient landscape becomes better conditioned.
Box-Cox transform (Box and Cox, 1964): A parametric family of power transforms
$$x^{(\lambda)} = \begin{cases} \dfrac{x^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \log x, & \lambda = 0 \end{cases}$$
that includes log ($\lambda = 0$), square root ($\lambda = 1/2$), and identity ($\lambda = 1$) as special cases. Requires strictly positive inputs. $\lambda$ is estimated by maximum likelihood.
Yeo-Johnson transform (Yeo and Johnson, 2000): Extends Box-Cox to allow zero and negative values:
$$\psi(x, \lambda) = \begin{cases} \dfrac{(x + 1)^{\lambda} - 1}{\lambda}, & \lambda \neq 0,\; x \geq 0 \\ \log(x + 1), & \lambda = 0,\; x \geq 0 \\ -\dfrac{(1 - x)^{2 - \lambda} - 1}{2 - \lambda}, & \lambda \neq 2,\; x < 0 \\ -\log(1 - x), & \lambda = 2,\; x < 0 \end{cases}$$
Preferred over Box-Cox when features can be zero or negative.
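A short sketch using scikit-learn's PowerTransformer, which estimates $\lambda$ by maximum likelihood during fit; the lognormal input here is just a stand-in for any right-skewed feature:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Right-skewed feature (e.g., income); reshape to a single column.
x = np.random.default_rng(1).lognormal(mean=10, sigma=1, size=(1000, 1))

# Simple log transform with a +1 offset to handle zeros.
x_log = np.log1p(x)

# Box-Cox (strictly positive inputs) and Yeo-Johnson (any sign);
# lambda is estimated by maximum likelihood when fit is called.
bc = PowerTransformer(method="box-cox").fit(x)
yj = PowerTransformer(method="yeo-johnson").fit(x)
print("Box-Cox lambda:", bc.lambdas_, "Yeo-Johnson lambda:", yj.lambdas_)
```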
Encoding Categorical Variables
One-Hot Encoding
For a categorical feature with $K$ categories, create $K$ binary indicator columns:
$$x = c_k \;\mapsto\; e_k \in \{0, 1\}^K$$
where $e_k$ is the $k$-th standard basis vector. Category "red" in {red, green, blue} becomes $(1, 0, 0)$.
One-hot encoding introduces only $K - 1$ degrees of freedom (the $K$-th column is linearly dependent on the others). For high-cardinality features (large $K$), one-hot encoding creates very sparse, high-dimensional representations.
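A minimal one-hot sketch with scikit-learn (the sparse_output argument assumes scikit-learn 1.2 or newer); handle_unknown="ignore" maps categories unseen at fit time to the all-zeros vector:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

colors = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Columns are created per observed category; drop="first" would remove the
# linearly dependent column if a full-rank design matrix is needed.
enc = OneHotEncoder(handle_unknown="ignore", sparse_output=False).fit(colors)
print(enc.get_feature_names_out())   # e.g. ['color_blue' 'color_green' 'color_red']
print(enc.transform(colors))
```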
Target encoding (mean encoding): Replace each category with the mean of the target variable within that category. Formally, for category $c$ in feature $x$:
$$\text{enc}(c) = \bar{y}_c = \frac{1}{n_c} \sum_{i:\, x_i = c} y_i$$
For low-count categories, the raw mean is noisy. A James-Stein-style shrinkage toward the global mean is standard:
$$\text{enc}(c) = \frac{n_c \bar{y}_c + m \bar{y}}{n_c + m}$$
where $n_c$ is the count of category $c$, $\bar{y}_c$ is its empirical mean, $\bar{y}$ is the overall mean, and $m$ is a smoothing parameter. Leakage warning: target encoding must be computed using training-fold targets only. Computing it on the full dataset before splitting leaks test-set label information into the training features.
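A sketch of the smoothed encoding above as a small helper; the column names and the smoothing value $m$ are illustrative. The encoding is fit on the training frame only, and unseen test categories fall back to the global training mean:

```python
import pandas as pd

def smoothed_target_encode(train, col, target, m=20.0):
    """Shrinkage target encoding fit on training data only."""
    global_mean = train[target].mean()
    stats = train.groupby(col)[target].agg(["mean", "count"])
    # Shrink low-count category means toward the global mean.
    encoding = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
    return encoding, global_mean

train = pd.DataFrame({"city": ["a", "a", "b", "c", "c", "c"],
                      "price": [10, 12, 30, 20, 22, 21]})
test = pd.DataFrame({"city": ["a", "b", "d"]})  # "d" is unseen at training time

encoding, global_mean = smoothed_target_encode(train, "city", "price")
# Unseen categories fall back to the global training mean.
test["city_enc"] = test["city"].map(encoding).fillna(global_mean)
```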
Feature hashing (hashing trick): For extreme-cardinality features, map category strings to a fixed $d$-dimensional vector using a hash function $h$. Formally (Weinberger et al., 2009, ICML), for a category $c$, the hashed representation assigns value $+1$ or $-1$ to position $h(c) \in \{1, \dots, d\}$ based on a signed hash $\xi(c) \in \{-1, +1\}$. Collisions introduce noise but the bias is bounded. Used in spam filtering, recommendation systems, and any setting where the category space changes dynamically.
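A minimal hashing sketch with scikit-learn's FeatureHasher; the output dimension of 16 and the user-ID strings are arbitrary choices for illustration:

```python
from sklearn.feature_extraction import FeatureHasher

# Hash string categories into a fixed 16-dimensional signed representation.
hasher = FeatureHasher(n_features=16, input_type="string")
X = hasher.transform([["user_12345"], ["user_99999"], ["user_12345"]])
print(X.toarray().shape)  # (3, 16); identical strings always hash to the same slot
```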
Feature crosses. For two categorical features $x_1$ (with $K_1$ levels) and $x_2$ (with $K_2$ levels), the cross $x_1 \times x_2$ creates $K_1 K_2$ new binary features representing all combinations. This lets linear models capture interactions. Example: "day of week" crossed with "hour" captures that Monday-8am differs from Saturday-8am.
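One simple way to build such a cross, assuming day and hour columns in a pandas DataFrame: concatenate the two values into one combined category, then one-hot encode it:

```python
import pandas as pd

df = pd.DataFrame({"day": ["Mon", "Sat"], "hour": [8, 8]})
# Combine the two categoricals into a single crossed feature, then one-hot encode.
df["day_x_hour"] = df["day"] + "_" + df["hour"].astype(str)
crossed = pd.get_dummies(df["day_x_hour"])   # columns Mon_8, Sat_8, ...
```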
TF-IDF for text features. For a term $t$ in document $d$, given $N$ total documents and $n_t$ documents containing term $t$:
$$\text{tf-idf}(t, d) = \text{tf}(t, d) \cdot \log\frac{N}{n_t}$$
where $\text{tf}(t, d)$ is the within-document term frequency (often in log-normalized form). The inverse document frequency down-weights terms that appear in nearly all documents (stop words) and up-weights rare discriminative terms.
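A minimal TF-IDF sketch with scikit-learn; note that TfidfVectorizer uses a smoothed variant of the idf formula above by default, and sublinear_tf=True gives the log-normalized term frequency:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["charming house with large garden",
        "large modern house near park",
        "modern condo with city view"]

# max_features keeps only the most frequent vocabulary terms.
vec = TfidfVectorizer(sublinear_tf=True, max_features=100)
X = vec.fit_transform(docs)          # sparse (3, vocab_size) matrix
print(vec.get_feature_names_out())
```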
Missing Value Imputation
Missing data taxonomy (Rubin, 1976, Biometrika 63:581-592):
- MCAR (Missing Completely At Random): Missingness is independent of both observed and unobserved values. Example: sensor randomly fails. Complete-case analysis (dropping missing rows) is unbiased under MCAR but reduces sample size.
- MAR (Missing At Random): Missingness depends on observed values but not on the missing value itself, conditional on the observed data. Example: income is more likely to be missing for younger respondents, but conditional on age, missingness is random. Ignorable for likelihood-based inference under MAR.
- MNAR (Missing Not At Random): Missingness depends on the missing value itself. Example: high earners are less likely to disclose income. Requires modeling the missingness mechanism; naive imputation is biased.
Standard approaches:
- Mean/median imputation. Replace missing values with the feature mean (or median for skewed data). Simple and fast. Biased: it underestimates variance and distorts correlations between features. Valid only under MCAR.
- Model-based imputation. Train a model (e.g., k-NN, random forest) to predict missing values from observed features. Preserves correlations better than mean imputation, but adds complexity and can overfit.
- Multiple Imputation by Chained Equations (MICE). Van Buuren and Groothuis-Oudshoorn (2011, Journal of Statistical Software 45:1-67). For each feature with missing values, fit a regression model on the other features, draw imputed values from the posterior predictive distribution, and iterate. Produces $m$ complete datasets; analyses on each are pooled using Rubin's combining rules. MICE is the gold standard under MAR: it preserves correlations, provides valid uncertainty estimates, and the pooled inference is asymptotically efficient.
- Indicator augmentation. Add a binary "is missing" indicator feature alongside the imputed value. This lets the model learn that missingness itself carries information (data is often not missing at random). A code sketch of several of these imputers follows the list.
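A minimal sketch of three of these approaches in scikit-learn on a toy array; IterativeImputer is a chained-equations imputer in the spirit of MICE, though it returns a single imputed dataset rather than $m$ of them:

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X_train = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan], [4.0, 5.0]])

# Mean imputation plus binary "is missing" indicator columns.
simple = SimpleImputer(strategy="mean", add_indicator=True).fit(X_train)

# k-NN imputation: fill each missing value from the nearest complete rows.
knn = KNNImputer(n_neighbors=2).fit(X_train)

# Chained-equations imputation; sample_posterior draws from the predictive distribution.
mice_like = IterativeImputer(random_state=0, sample_posterior=True).fit(X_train)

for imputer in (simple, knn, mice_like):
    print(imputer.transform(X_train))
```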
Feature Selection
Three categories:
Filter methods. Rank features by a univariate statistic and keep the top $k$. Common statistics: Pearson correlation with the target, mutual information $I(x_j; y)$, or the ANOVA F-statistic. Fast, but ignores feature interactions.
Wrapper methods. Evaluate subsets of features by training and testing a model. Forward selection adds features one at a time. Backward elimination removes features one at a time. Computationally expensive: $2^p$ subsets for $p$ features in the worst case.
Embedded methods. Feature selection happens during model training. L1 regularization (Lasso) drives coefficients to zero, performing automatic feature selection. The regularization parameter $\lambda$ controls sparsity.
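A small sketch contrasting a filter method with an embedded method on synthetic data; the choice of k=5 and the mutual-information scorer are illustrative:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# Filter: keep the k features with highest mutual information with the target.
filt = SelectKBest(mutual_info_regression, k=5).fit(X, y)
print("filter picks:", np.flatnonzero(filt.get_support()))

# Embedded: L1 regularization drives uninformative coefficients to exactly zero.
lasso = LassoCV(cv=5).fit(StandardScaler().fit_transform(X), y)
print("lasso keeps:", np.flatnonzero(lasso.coef_ != 0))
```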
Main Theorems
Standardization Improves Gradient Descent Conditioning
Statement
For linear regression with MSE loss, the condition number $\kappa$ of the Hessian determines the convergence rate. If features have variances $\sigma_1^2, \dots, \sigma_p^2$ and are uncorrelated, then:
$$\kappa(H) = \frac{\max_j \sigma_j^2}{\min_j \sigma_j^2}$$
After standardization (all $\sigma_j^2 = 1$), $\kappa(H) = 1$ for uncorrelated features. Gradient descent converges in one step for $\kappa = 1$, versus $O(\kappa \log(1/\epsilon))$ steps to reach accuracy $\epsilon$ for condition number $\kappa$.
Intuition
Unstandardized features create an elongated loss landscape. The gradient points toward the minimum along the short axis but barely moves along the long axis. Standardization makes the landscape more spherical, so the gradient points directly toward the minimum.
Proof Sketch
For linear regression with MSE, the Hessian is $H = \frac{1}{n} X^\top X$. If the columns of $X$ are uncorrelated with variances $\sigma_1^2, \dots, \sigma_p^2$, then $H$ is diagonal with entries $\sigma_j^2$. The condition number is $\kappa = \max_j \sigma_j^2 / \min_j \sigma_j^2$. After standardization, all diagonal entries are 1, so $\kappa = 1$. The convergence rate of gradient descent on a quadratic is $\left(\frac{\kappa - 1}{\kappa + 1}\right)$ per iteration (with the optimal step size), which is zero at $\kappa = 1$.
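A quick numerical check of the claim, assuming two uncorrelated synthetic features on very different scales (the intercept column is ignored for simplicity):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
# Two uncorrelated features on very different scales (e.g., age vs income).
X = np.column_stack([rng.normal(50, 20, n), rng.normal(100_000, 40_000, n)])

def cond(X):
    H = X.T @ X / len(X)            # Hessian of the MSE loss (ignoring the intercept)
    return np.linalg.cond(H)

X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(f"condition number raw:          {cond(X):.2e}")   # enormous
print(f"condition number standardized: {cond(X_std):.2f}")  # close to 1
```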
Why It Matters
This explains the common advice to "always standardize your features." It is not a heuristic; it is a direct consequence of optimization theory. Features on different scales create ill-conditioned problems that gradient descent solves slowly or fails to solve at all.
Failure Mode
Standardization helps when features are uncorrelated. If features are highly correlated, the Hessian has small eigenvalues regardless of scaling, and standardization alone does not fix the conditioning. You also need decorrelation (e.g., PCA whitening) or regularization.
Common Confusions
Preprocessing must be fit on training data only
A common data leakage bug: computing the mean and standard deviation on the entire dataset (including test data) before splitting. The test set statistics leak into the training pipeline. Always fit preprocessing parameters (mean, std, min, max) on the training set only, then apply the same transformation to test data. This applies to target encoding as well: computing target means on the full dataset before splitting leaks test labels.
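A minimal leakage-safe sketch: wrapping the scaler and model in a scikit-learn Pipeline means cross-validation refits the scaler inside each training fold, and the held-out data only ever sees transform:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The scaler's mean and std are recomputed on each training fold only.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
print(cross_val_score(pipe, X_train, y_train, cv=5).mean())

pipe.fit(X_train, y_train)          # fit on train, then evaluate on untouched test data
print(pipe.score(X_test, y_test))
```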
Tree-based models do not need feature scaling
Decision trees split on thresholds within each feature independently. The scale of a feature does not affect where the optimal split is. Random forests and gradient boosting inherit this property. However, trees still benefit from imputation and encoding of categoricals.
More features is not always better
Adding irrelevant features increases dimensionality without improving signal. In high dimensions, distance-based methods suffer from the curse of dimensionality (all points become equidistant). Feature selection or regularization is necessary to remove noise features.
End-to-End Preprocessing Pipeline
The order of preprocessing steps matters. A common pipeline for tabular data:
Step 1: Split first. Partition data into train/validation/test before any preprocessing. This is non-negotiable.
Step 2: Inspect and clean. On the training set only: identify outliers, check for impossible values (negative ages, dates in the future), and verify data types. Remove or cap extreme outliers. Document every cleaning decision.
Step 3: Handle missing values. On the training set: compute imputation statistics (mean, median, or fit a k-NN imputer or MICE). Apply the same imputation to validation and test sets. If missingness exceeds 50% for a feature, consider dropping it. Add binary "is missing" indicators for features where missingness may carry signal. Classify missingness as MCAR/MAR/MNAR where possible to choose the appropriate method.
Step 4: Encode categoricals. For low-cardinality features: one-hot encoding. For high-cardinality features: target encoding (using only training set statistics to avoid leakage), feature hashing, or learned embeddings. For ordinal features (e.g., "low/medium/high"): integer encoding preserving the natural order.
Step 5: Transform numerics. Apply log transforms or Yeo-Johnson transforms to right-skewed features. Then standardize all numeric features using training set mean and standard deviation. Apply the same transformation (same and ) to validation and test sets.
Step 6: Feature engineering. Create interaction features (polynomial features, feature crosses) if the model cannot learn interactions (e.g., linear models). For time-series features, compute rolling statistics (mean, standard deviation over a window), ensuring the window only uses past data.
Step 7: Feature selection. Remove features with near-zero variance. Remove one of each pair of highly correlated features. Optionally, use L1 regularization or mutual information to select the most informative features.
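A sketch of how these steps might be wired together with scikit-learn's Pipeline and ColumnTransformer; the column names are hypothetical, and the ordering mirrors Steps 3-5 (impute, transform skew, standardize, encode):

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PowerTransformer

numeric = ["sqft", "bedrooms", "year_built"]        # hypothetical column names
categorical = ["neighborhood"]

numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    # Yeo-Johnson handles skew and standardizes the output by default.
    ("transform", PowerTransformer(method="yeo-johnson")),
])
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric_pipe, numeric),
    ("cat", categorical_pipe, categorical),
])

# Fit the whole chain on the training split only; transform test data with it.
model = Pipeline([("prep", preprocess), ("reg", Ridge())])
# model.fit(X_train, y_train); model.score(X_test, y_test)
```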
Preprocessing pipeline for house price prediction
Raw features: square footage (numeric, right-skewed), number of bedrooms (numeric, integer), neighborhood (categorical, 45 levels), year built (numeric), has pool (binary), listing description (text).
Pipeline applied to training set:
- Log-transform square footage: reduces skew from 2.3 to 0.1
- One-hot encode neighborhood (45 binary features)
- Standardize numeric features (sqft, bedrooms, year) to zero mean, unit variance
- Impute 3% missing "year built" values with training median (1985)
- Add binary "year_built_missing" indicator
- Create interaction: sqft $\times$ bedrooms (captures that the value of extra bedrooms depends on house size)
- Extract TF-IDF features from the listing description, keeping the top 100 terms
Total: 45 (neighborhood) + 5 (numeric) + 1 (missing indicator) + 1 (interaction) + 100 (text) = 152 features from 6 raw features. The pipeline is fit on training data and applied identically to test data.
Summary
- Standardization (zero mean, unit variance) is the default for gradient-based methods
- Min-max scaling for bounded features; log, Box-Cox, or Yeo-Johnson for skewed features
- One-hot encoding for low-cardinality categoricals; target encoding (with leakage guard) or feature hashing for high-cardinality
- Feature crosses for linear-model interactions
- Impute missing values: MCAR/MAR/MNAR taxonomy determines method; MICE is gold standard under MAR
- TF-IDF = term frequency times log inverse document frequency
- Feature selection: filters are fast, wrappers are thorough, L1 regularization is embedded
- Always fit preprocessing on training data only to avoid leakage
- Preprocessing is not optional: it directly affects optimization convergence and model quality
Exercises
Problem
A dataset has two features: age (range 18-90) and income (range 20000-500000). You train a linear regression with gradient descent and find it converges slowly. Estimate the condition number of the Hessian and explain why standardization helps.
Problem
You have a feature with 30% missing values. Compare the bias introduced by mean imputation versus median imputation when the feature distribution is right-skewed with a long tail. Which imputation method is more robust, and why?
References
Canonical:
- Hastie, Tibshirani & Friedman, The Elements of Statistical Learning (2009), Chapter 3.3 (subset selection and feature importance)
- Kuhn & Johnson, Feature Engineering and Selection (2019), Chapters 5-8
- Zheng & Casari, Feature Engineering for Machine Learning (2018)
Missing data:
- Rubin, "Inference and Missing Data" (1976), Biometrika 63:581-592 (MCAR/MAR/MNAR taxonomy)
- van Buuren & Groothuis-Oudshoorn, "mice: Multivariate Imputation by Chained Equations in R" (2011), Journal of Statistical Software 45:1-67 (MICE gold standard)
Feature encoding and hashing:
- Weinberger et al., "Feature Hashing for Large Scale Multitask Learning" (2009), ICML (hashing trick)
Power transforms:
- Box & Cox, "An Analysis of Transformations" (1964), Journal of the Royal Statistical Society, Series B, 26:211-252
- Yeo & Johnson, "A New Family of Power Transformations to Improve Normality or Symmetry" (2000), Biometrika 87:954-959
Last reviewed: April 18, 2026