Methodology
ML Project Lifecycle
The full ML project workflow from problem definition through deployment and monitoring. Why most projects fail at data quality, not model architecture. Cross-functional requirements and MLOps basics.
Why This Matters
Most ML tutorials skip straight to model.fit(). In practice, model training is roughly 10-20% of the work. The rest is problem definition, data wrangling, evaluation design, deployment, and monitoring. Projects fail when practitioners treat ML as a modeling problem instead of an engineering problem with a modeling component.
Mental Model
An ML project is a pipeline with nine stages. Each stage can fail, and failures in early stages (especially data) propagate forward and corrupt everything downstream.
The Nine Stages
1. Problem Definition
Before writing any code: what decision does the model support? What is the cost of a wrong prediction? What is the baseline (human performance, simple heuristic, existing system)?
A classification model that achieves 95% accuracy is useless if the business requires 99.9% or if a simple rule already gets 94%.
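The baseline comparison above can be made concrete. This is an illustrative sketch (the labels are invented): score a trivial majority-class baseline before trusting any model's accuracy number.

```python
# Hedged sketch: always score a trivial baseline before celebrating a
# model's accuracy. The label counts below are invented for illustration.
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most common label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

# 94% of examples are class 0, so "always predict 0" already scores 0.94;
# a model at 95% beats this baseline by only one point.
labels = [0] * 94 + [1] * 6
print(majority_baseline_accuracy(labels))  # 0.94
```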
2. Data Collection
Where does the data come from? How is it labeled? What are the selection biases? Is the labeling process reliable?
Common failure modes: labels are noisy (human annotators disagree), the data distribution shifts between collection and deployment, or the dataset is too small for the chosen model class.
3. Exploratory Data Analysis
Look at the data before modeling. Summary statistics, distributions, correlations, class balance, missing value patterns. This step prevents you from training a model on garbage.
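The checks above can be sketched in a few lines. This minimal example uses only the standard library (a real project would typically use pandas), and the toy records and column names are invented.

```python
# Minimal EDA sketch: missing-value rates and class balance, computed
# with the standard library. Records and column names are invented.
import statistics
from collections import Counter

rows = [
    {"age": 34, "income": 52000, "label": 0},
    {"age": 41, "income": None, "label": 0},
    {"age": 29, "income": 48000, "label": 1},
    {"age": None, "income": 61000, "label": 0},
]

def column(name):
    return [r[name] for r in rows]

def missing_rate(values):
    return sum(v is None for v in values) / len(values)

def class_balance(labels):
    total = len(labels)
    return {k: c / total for k, c in Counter(labels).items()}

incomes = [v for v in column("income") if v is not None]
print("income mean:", statistics.mean(incomes))
print("income missing:", missing_rate(column("income")))   # 0.25
print("class balance:", class_balance(column("label")))    # {0: 0.75, 1: 0.25}
```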
4. Feature Engineering
Transform raw data into features the model can use. Domain knowledge matters more than model sophistication here. A good feature set with logistic regression often beats a neural network on raw features.
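As a small illustration of domain-driven features, the sketch below turns a raw timestamp and amount into features a linear model can exploit. The field names and the particular transforms are invented for this example.

```python
# Illustrative feature engineering: raw record -> model-ready features.
# Field names and transforms are invented, not from any specific system.
from datetime import datetime

def engineer(raw):
    """Map a raw transaction record to simple derived features."""
    ts = datetime.fromisoformat(raw["timestamp"])
    return {
        "hour_of_day": ts.hour,                      # time-of-day pattern
        "is_weekend": int(ts.weekday() >= 5),        # weekday vs weekend
        "amount_log_bucket": int(raw["amount"]).bit_length(),  # coarse log scale
    }

# 2024-03-16 is a Saturday; 950 has a 10-bit binary representation.
print(engineer({"timestamp": "2024-03-16T23:15:00", "amount": 950}))
# {'hour_of_day': 23, 'is_weekend': 1, 'amount_log_bucket': 10}
```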
5. Model Selection
Choose a model class appropriate for the problem. Considerations: data size, feature type (tabular, image, text), latency requirements, interpretability needs. For tabular data with fewer than 10K rows, gradient boosting usually beats deep learning.
6. Training
Fit the model. This includes hyperparameter tuning, regularization choices, and convergence monitoring. Use a validation set to select hyperparameters. Never tune on the test set.
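The split discipline can be shown with a toy example: hyperparameters are chosen on the validation set only, and the test set is scored exactly once at the end. The 1-D threshold "model" and candidate grid below are invented stand-ins for a real model and hyperparameter search.

```python
# Sketch of train/validation/test discipline. The toy threshold "model"
# stands in for hyperparameters; noise-free data makes it deterministic.
import random

random.seed(0)
# Label is 1 exactly when the feature exceeds 0.6.
data = [(x, int(x > 0.6)) for x in (random.random() for _ in range(300))]
train, val, test = data[:200], data[200:250], data[250:]
# (train is unused here only because this toy "model" has no fitted
# parameters; a real model would be fit on it.)

def accuracy(split, threshold):
    return sum((x > threshold) == bool(y) for x, y in split) / len(split)

# The search touches ONLY the validation set.
candidates = [0.3, 0.5, 0.6, 0.7]
best = max(candidates, key=lambda t: accuracy(val, t))

# The test set is evaluated once, after the threshold is frozen.
print("chosen threshold:", best)
print("test accuracy:", accuracy(test, best))
```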
7. Evaluation
Measure performance on a held-out test set. Use metrics aligned with the business objective. Accuracy is rarely the right metric; precision, recall, F1, or calibration error are usually more informative.
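Precision, recall, and F1 are simple to compute by hand, which makes the accuracy trap easy to demonstrate. In this hedged sketch the counts are invented: 1% positives, and a model that never predicts the positive class.

```python
# Computing precision, recall, and F1 from scratch to show why accuracy
# misleads on imbalanced data. Counts are invented for illustration.

def precision_recall_f1(y_true, y_pred, positive=1):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# 1% positives; "always predict negative" is 99% accurate but useless.
y_true = [1] * 1 + [0] * 99
y_pred = [0] * 100
print(precision_recall_f1(y_true, y_pred))  # (0.0, 0.0, 0.0)
```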
8. Deployment
Serve the model in production. This involves model serialization, API design, latency optimization, and infrastructure provisioning. The gap between a Jupyter notebook and a production system is large.
9. Monitoring
Track model performance after deployment. Data distributions shift over time. A model trained on 2023 data may degrade on 2025 data. Detect drift, retrain on schedule, and maintain fallback systems.
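One common drift check is the population stability index (PSI) between a reference sample and recent production data. The sketch below is illustrative: the bucket edges, sample sizes, and the 0.2 alert threshold are conventional choices, not universal rules.

```python
# Illustrative drift detection via the population stability index (PSI).
# Bucket edges and the 0.2 alert threshold are conventional, not universal.
import math
import random

def psi(expected, actual, edges):
    """PSI between a reference sample and a production sample."""
    def proportions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[sum(v > e for e in edges)] += 1
        # Tiny floor keeps log() finite on empty buckets.
        return [max(c / len(values), 1e-6) for c in counts]
    p, q = proportions(expected), proportions(actual)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))

random.seed(1)
reference = [random.gauss(0.0, 1.0) for _ in range(5000)]
production = [random.gauss(1.0, 1.0) for _ in range(5000)]  # mean shifted by 1
edges = [-1.0, -0.5, 0.0, 0.5, 1.0]

score = psi(reference, production, edges)
print("PSI:", round(score, 3), "-> drift" if score > 0.2 else "-> stable")
```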
Why Most Projects Fail at Data Quality
Label Noise as a Practical Bottleneck
Statement
If both training and evaluation labels are corrupted at rate ε by the same labeling process, then the measured test accuracy of any classifier is upper-bounded by 1 − ε relative to the noisy labels: even a perfect classifier of the clean distribution disagrees with the noisy labels on an ε fraction of inputs. Under the standard practice of treating noisy labels as ground truth, increasing model capacity or sample size cannot push observed test error below ε.
Intuition
If 10% of your held-out labels are wrong and you score against them, you cannot get below 10% disagreement with the held-out set even with the optimal classifier of the underlying truth. Improving label quality on both train and test typically dominates architectural changes.
Proof Sketch
Let f* be the Bayes classifier of the clean distribution. Under random label corruption at rate ε, f* disagrees with the noisy label on an ε fraction of examples by construction. Any other classifier h has clean-population error err(h) ≥ err(f*), and its measured noisy-test error is at least ε under standard noise models (symmetric, or class-conditional with flip rates summing to less than one).
Why It Matters
This is the practical reason most ML projects should spend a large share of effort on data quality. Switching architectures rarely matters if your evaluation labels themselves are 85% accurate.
Failure Mode
This is not an information-theoretic floor on what is achievable on the clean distribution. Under symmetric or known class-conditional noise, unbiased risk estimators and noise-corrected losses (Natarajan et al. 2013; Patrini et al. 2017) can asymptotically recover the clean Bayes classifier, so the clean-distribution error can be much less than ε. Under instance-dependent noise the picture is harder, and the observable bottleneck above is the right intuition. The bound is also specifically about accuracy; calibration and ranking metrics can behave differently.
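The accuracy ceiling is easy to verify by simulation. In this sketch (parameters invented: ε = 0.1, symmetric flips), even an oracle that outputs the true clean label cannot score above roughly 1 − ε against corrupted test labels.

```python
# Simulating the label-noise ceiling: an oracle for the clean labels is
# scored against labels flipped at rate eps. Parameters are illustrative.
import random

random.seed(42)
eps = 0.1
n = 100_000

clean = [random.randint(0, 1) for _ in range(n)]
# Symmetric corruption: each label flipped independently with prob eps.
noisy = [1 - y if random.random() < eps else y for y in clean]

oracle_preds = clean  # a perfect classifier of the clean distribution
measured = sum(p == y for p, y in zip(oracle_preds, noisy)) / n
print(f"measured accuracy vs noisy labels: {measured:.3f}")  # ~0.900
```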
Cross-Functional Requirements
An ML system must satisfy requirements beyond accuracy:
- Latency: prediction time per request. Real-time applications need millisecond-scale latencies; batch applications can tolerate minutes.
- Throughput: predictions per second. Scales with hardware and batching.
- Cost: compute cost per prediction. Larger models cost more to serve.
- Fairness: performance across demographic groups. A model with 95% overall accuracy but 70% accuracy on a minority group may be unacceptable.
- Privacy: does the model leak training data? Differential privacy and federated learning address this.
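The fairness requirement above translates into a routine check: report accuracy per group, not just overall. This sketch uses invented group names and counts; overall accuracy near 93% hides a 25-point gap.

```python
# Sketch of a per-group accuracy report; overall accuracy can hide a
# large gap between groups. Group names and counts are invented.
from collections import defaultdict

def accuracy_by_group(records):
    """records: list of (group, y_true, y_pred) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, y, p in records:
        total[group] += 1
        correct[group] += int(y == p)
    return {g: correct[g] / total[g] for g in total}

records = (
    [("majority", 1, 1)] * 95 + [("majority", 0, 1)] * 5
    + [("minority", 1, 1)] * 7 + [("minority", 0, 1)] * 3
)
print(accuracy_by_group(records))  # {'majority': 0.95, 'minority': 0.7}
```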
MLOps Basics
MLOps
MLOps is the set of practices for deploying and maintaining ML models in production reliably and efficiently. It extends DevOps principles to ML systems, adding version control for data and models, experiment tracking, automated retraining, and model monitoring.
Key MLOps components:
- CI/CD for models: automated testing of model quality on every code change. Tests include data validation, model performance regression tests, and integration tests.
- Model registry: versioned storage of trained models with metadata (training data version, hyperparameters, metrics). Enables rollback.
- A/B testing: serve the new model to a fraction of traffic and compare against the current model on production metrics.
- Feature stores: centralized computation and serving of features, ensuring consistency between training and inference.
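One piece of the A/B testing component above is deterministic traffic splitting, commonly done by hashing a stable identifier so each user always sees the same model. The function name, 10% rollout fraction, and user-id scheme below are invented for illustration.

```python
# Toy deterministic A/B split by user-id hash: stable assignment per user,
# approximately the requested rollout fraction overall. Names are invented.
import hashlib

def assign_variant(user_id: str, rollout_fraction: float = 0.10) -> str:
    """Stable assignment: the same user always gets the same variant."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 10_000 / 10_000
    return "candidate" if bucket < rollout_fraction else "current"

assignments = [assign_variant(f"user-{i}") for i in range(10_000)]
share = assignments.count("candidate") / len(assignments)
print(f"candidate share: {share:.3f}")  # close to 0.10
```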
Common Confusions
ML projects are not software projects with a model inside
Standard software is deterministic: given the same input, it produces the same output. ML systems are stochastic, data-dependent, and degrade silently. Testing, deployment, and monitoring all require structurally different approaches.
More data is not always better
More noisy data can hurt. More data from a different distribution than your deployment target hurts. Data quality (correct labels, representative distribution) matters more than data quantity beyond a sufficient threshold.
Summary
- Problem definition and data quality determine the ceiling; model choice determines how close you get to it
- Evaluation must use metrics aligned with the actual business objective
- Deployment and monitoring are where most engineering effort goes in production systems
- The full lifecycle is iterative: monitoring reveals problems that send you back to data collection or feature engineering
Exercises
Problem
You are building a fraud detection system. The dataset has 1% fraud cases and 99% legitimate transactions. A model that always predicts "legitimate" achieves 99% accuracy. Why is this model useless, and what metric should you use instead?
Problem
Your model achieves 92% accuracy on the test set, but after deployment, production accuracy drops to 84% within three months. List three possible causes and describe how you would diagnose each.
References
Canonical:
- Sculley et al., "Hidden Technical Debt in Machine Learning Systems" (2015), NeurIPS
- Polyzotis et al., "Data Management Challenges in Production Machine Learning" (2017), SIGMOD
Current:
- Google, "Rules of ML" (2023), section on ML system design
- Huyen, Designing Machine Learning Systems (2022), Chapters 1-3
- Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapters 11-14
Next Topics
- Train-test split and data leakage: preventing information contamination
- Exploratory data analysis: understanding your data before modeling
- Experiment tracking and tooling: managing ML experiments systematically
Last reviewed: April 26, 2026