ML System Fundamentals

TL;DR

Machine learning systems are software systems whose behavior depends on code, data, features, labels, model artifacts, and feedback loops. The core design problem is not "train a model"; it is keeping training, serving, monitoring, and retraining aligned as the world changes. Treat data and model artifacts as versioned production dependencies with the same rigor as code.


The ML System Boundary

Traditional services usually ship code and configuration. ML systems ship a decision function produced by a pipeline.

What separates the two is the feedback loop: a deployed model shapes the data it will later be trained on. A recommender changes what users see, which changes what they click, which changes future training data. A fraud model blocks transactions, which changes the observed label distribution. A ranking model shifts traffic toward items it already believes are good.

This section covers classic predictive ML (tabular, ranking, vision, fraud). For LLM-based systems (agents, RAG, inference serving, evaluation), see LLM Systems. The lifecycle discipline here (versioning, monitoring, rollback) applies to both, but the serving economics diverge sharply (see LLM Infrastructure).


Production ML Components

Component | Owns | Common failure
Data ingestion | Raw events and source freshness | Missing partitions, duplicate events, schema drift
Feature pipeline | Transformations used by training and serving | Training-serving skew, stale features
Training pipeline | Dataset, algorithm, hyperparameters, evaluation | Non-reproducible model, leakage
Model registry | Artifact versions and promotion state | Wrong model deployed, missing lineage
Serving layer | Online or batch prediction | Latency spike, resource exhaustion
Monitoring | Data, prediction, quality, and business signals | Silent degradation
Human review | Approval for risky model changes | Optimizing proxy metrics that hurt users

ML System Control Planes

Production ML systems have several control planes. Treating all of them as "the model" hides the actual blast radius.

Plane | Controls | Example incident when weak
Data plane | Ingestion, labels, joins, backfills, retention | Model learns from duplicated events or leaked future labels
Feature plane | Feature definitions, online/offline parity, freshness | Online model sees stale or semantically changed features
Model plane | Artifacts, registry state, runtime, rollback | Wrong artifact or incompatible runtime reaches production
Decision plane | Thresholds, policies, fallbacks, human review | Better AUC creates worse user actions because policy stayed old
Experiment plane | Assignment, exposure logging, metrics, guardrails | Team ships a model because of biased or broken experiment data
Governance plane | Risk tier, ownership, audit, approval, retirement | High-impact model runs with no owner or appeal path

If a design review cannot say which plane owns a change, the system is not ready for production.


Training vs Serving

Training optimizes quality over large historical datasets. Serving optimizes latency and reliability under live traffic. The hard part is making sure both paths compute the same meaning for the same feature names.


Problem-to-Architecture Matrix

Different ML problems need different system shapes.

Problem | Typical architecture | Latency pressure | Main risk
Fraud / abuse decision | Online model + feature store + rules fallback + review queue | High | False positives and delayed labels
Recommendation feed | Candidate generation + ranker + re-ranker + exploration logs | High | Feedback loops and objective mismatch
Search ranking | Retrieval + ranking + interleaving/A-B tests | High | Position bias and stale indexes
Churn / lifecycle prediction | Batch scoring + campaign system | Low | Stale segments and weak causal attribution
Forecasting | Batch/streaming pipeline + planning workflow | Medium | Backtest leakage and seasonality shifts
Content moderation | Multi-stage classifier + policy thresholds + human review | Medium | Irreversible action and policy drift
Anomaly detection | Streaming features + online scoring + alert routing | Medium | Alert fatigue and baseline drift

The architecture should follow the decision loop. A fraud system needs fast features and review controls; a churn model needs reproducible batch scoring and causal measurement; a recommender needs exposure logs and exploration.


Core Design Decisions

Batch, Online, or Streaming Prediction

Mode | Use when | Avoid when
Batch prediction | Results can be precomputed, latency budget is hours | Decisions depend on fresh request context
Online prediction | User-facing decision must be made now | Model is too slow or too costly per request
Streaming prediction | Continuous event decisions or near-real-time scoring | State handling and exactly-once guarantees are immature
Hybrid | Candidate generation can be offline, final ranking online | Ownership between offline and online teams is unclear

Model as Library vs Service

Deployment | Strength | Weakness
Embedded library | Lowest latency, simple local call | Harder to update independently
Shared model service | Centralized rollout and observability | Adds network hop and service dependency
Batch scoring job | Cheap and controllable | Stale predictions
Edge model | Works near device/user | Model updates and observability are hard

Rules, ML, or LLM

Approach | Use when | Watch out for
Rules | Logic is explicit, stable, and explainable | Rule explosion and hidden ordering bugs
Classic ML | Many examples exist and prediction target is measurable | Data drift, leakage, and proxy metrics
LLM | Task needs language reasoning or flexible generation | Cost, nondeterminism, prompt injection, evaluation difficulty
Hybrid | Rules define safety boundaries; ML ranks or scores inside them | Ownership between policy and model can blur

Do not use ML to hide unclear product policy. First define the action, fallback, and acceptable failure mode.
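
As a rough illustration of the hybrid row above, the sketch below puts an explicit rule in front of a model score and names the fallback action up front. The names (Decision, decide, account_blocked, review_threshold) and the three actions are hypothetical, chosen only for this example; a real system would take them from its own product policy.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Decision:
    action: str  # "allow", "review", or "block" (illustrative action set)
    source: str  # which layer produced the action: "rule", "model", "fallback"


def decide(
    amount: float,
    account_blocked: bool,
    score_fn: Callable[[float], float],
    review_threshold: float = 0.7,  # assumed value, not a recommendation
) -> Decision:
    # Rule layer: explicit policy that holds regardless of the model.
    if account_blocked:
        return Decision("block", "rule")
    # Model layer: scores only inside the boundary the rules leave open.
    try:
        score = score_fn(amount)
    except Exception:
        # Fallback: the agreed failure mode in this sketch is manual review.
        return Decision("review", "fallback")
    if score >= review_threshold:
        return Decision("review", "model")
    return Decision("allow", "model")
```

Recording the source alongside the action makes it possible to tell later whether a bad outcome came from policy, the model, or the fallback path.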


Failure Modes

Training-Serving Skew

Training uses one transformation and serving uses another. Offline evaluation looks good, but production quality drops.

Mitigations:

  • Use shared feature definitions.
  • Test training and serving feature parity.
  • Log serving features for replay.
  • Compare online feature values against offline recomputation.
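
A minimal sketch of the parity check described above, assuming serving logs and offline recomputation are both available as dictionaries keyed by request ID. The feature days_since_signup, the function names, and the tolerance are illustrative assumptions, not part of any specific feature store.

```python
import math


def days_since_signup(signup_ts: float, now_ts: float) -> float:
    """Shared feature definition, imported by both training and serving code."""
    return max(0.0, (now_ts - signup_ts) / 86400.0)


def parity_report(logged, recomputed, rel_tol=1e-6):
    """Compare logged serving features against offline recomputation.

    Both arguments map request_id -> {feature_name: value}.
    Returns a list of (request_id, feature_name, served, offline) mismatches.
    """
    mismatches = []
    for request_id, served in logged.items():
        offline = recomputed.get(request_id)
        if offline is None:
            # The offline pipeline produced no row for this request at all.
            mismatches.append((request_id, None, None, None))
            continue
        for name, served_value in served.items():
            offline_value = offline.get(name)
            if offline_value is None or not math.isclose(
                served_value, offline_value, rel_tol=rel_tol, abs_tol=1e-9
            ):
                mismatches.append((request_id, name, served_value, offline_value))
    return mismatches
```

Running a report like this on a sample of daily traffic is one way to turn "test feature parity" into a number that can be alerted on.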

Data Leakage

Training data includes information unavailable at prediction time. This often appears through timestamps, joins, or labels written back into source tables.

Mitigations:

  • Build point-in-time correct datasets.
  • Separate event time from processing time.
  • Review every feature for availability at decision time.
  • Run leakage tests against suspicious high-performing features.
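
Point-in-time correctness can be sketched as a lookup that only ever sees feature values recorded at or before the decision time. The data layout below (per-entity lists of (event_time, value) pairs sorted by event time) and the function names are assumptions made for the example; whether the boundary should be inclusive depends on how a given system timestamps events.

```python
from bisect import bisect_right


def point_in_time_value(history, decision_time):
    """Return the latest value with event_time <= decision_time, else None.

    history: list of (event_time, value) sorted by event_time ascending.
    """
    times = [t for t, _ in history]
    idx = bisect_right(times, decision_time)
    return history[idx - 1][1] if idx > 0 else None


def build_training_rows(labels, feature_history):
    """Join each label to the feature value known at its decision time.

    labels: list of (entity_id, decision_time, label).
    feature_history: dict entity_id -> sorted list of (event_time, value).
    """
    rows = []
    for entity_id, decision_time, label in labels:
        value = point_in_time_value(feature_history.get(entity_id, []), decision_time)
        rows.append(
            {
                "entity_id": entity_id,
                "decision_time": decision_time,
                "feature": value,
                "label": label,
            }
        )
    return rows
```

A naive join on entity ID alone would attach the newest feature value to every label, which is exactly how future information leaks into training data.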

Silent Model Degradation

The service stays up, latency is fine, and errors are low, but predictions become worse because the world changed.

Mitigations:

  • Monitor input distributions and prediction distributions.
  • Track delayed labels when available.
  • Tie model metrics to business and user impact metrics.
  • Keep rollback and champion/challenger paths available.
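
One common way to monitor input or prediction distributions is the population stability index (PSI), which compares binned frequencies between a baseline sample and a recent sample. The sketch below is one possible implementation with fixed bin edges; the function name, the eps smoothing constant, and any alert threshold a team sets on the result are assumptions to tune per feature.

```python
import math


def psi(expected, actual, edges, eps=1e-6):
    """Population stability index between two samples over fixed bin edges.

    edges must be sorted ascending; they split the axis into len(edges) + 1 bins.
    """

    def fractions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            # Bin index = number of edges at or below v.
            counts[sum(1 for e in edges if v >= e)] += 1
        total = max(len(values), 1)
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / total, eps) for c in counts]

    p = fractions(expected)
    q = fractions(actual)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))
```

Identical distributions give a PSI of zero, and the value grows as mass moves between bins. It detects that inputs changed, not that quality dropped, so it complements delayed-label tracking and does not replace it.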

Feedback Loops

The model influences future training data. Recommenders, ads, search, abuse detection, and marketplace systems are especially exposed.

Mitigations:

  • Preserve exploration traffic.
  • Log candidates that were not shown.
  • Separate observational metrics from causal experiments.
  • Evaluate on holdout traffic not fully controlled by the current model.
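
The first two mitigations can be sketched together with epsilon-greedy exploration that logs what was shown, what was not shown, and the probability of the choice. Epsilon-greedy is only one exploration scheme, picked here for brevity; the function name and the record fields are hypothetical.

```python
import random


def choose_with_exploration(candidates, scores, epsilon, rng):
    """Pick one candidate and return a log record for later analysis.

    With probability epsilon the choice is uniform over candidates;
    otherwise the highest-scoring candidate is shown.
    Assumes candidates are unique.
    """
    best = max(candidates, key=lambda c: scores[c])
    explored = rng.random() < epsilon
    shown = rng.choice(candidates) if explored else best
    # Probability that this policy shows `shown`, needed for unbiased
    # off-policy estimates such as inverse propensity weighting.
    propensity = epsilon / len(candidates)
    if shown == best:
        propensity += 1.0 - epsilon
    return {
        "shown": shown,
        "not_shown": [c for c in candidates if c != shown],
        "explored": explored,
        "propensity": propensity,
    }
```

For example, choose_with_exploration(["a", "b", "c"], {"a": 0.2, "b": 0.9, "c": 0.1}, 0.05, random.Random(7)) returns a record whose not_shown list keeps the losing candidates available for future training and evaluation.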

Proxy Objective Mismatch

The model optimizes a metric that is easy to label but not the outcome the system actually needs.

Examples:

  • Optimizing click-through rate increases low-quality clickbait.
  • Optimizing fraud recall blocks too many legitimate users.
  • Optimizing watch time reduces long-term satisfaction.

Mitigations:

  • Define a metric hierarchy: primary, guardrail, diagnostic, slice.
  • Review top false positives and false negatives, not just aggregate metrics.
  • Promote models through online experiments when user impact matters.
  • Keep business policy outside the model when it must be reviewed explicitly.
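
A metric hierarchy becomes enforceable when promotion is gated on it. The sketch below assumes every metric is "higher is better" and that guardrails are expressed as a maximum tolerated drop; the metric names in the usage note and the function name are illustrative only.

```python
def promotion_decision(candidate, baseline, primary, guardrails, min_gain=0.0):
    """Decide whether a candidate model may replace the baseline.

    candidate, baseline: dict metric_name -> value (higher is better).
    guardrails: dict metric_name -> maximum tolerated drop versus baseline.
    Returns (promote, reasons), where reasons lists every failed check.
    """
    reasons = []
    gain = candidate[primary] - baseline[primary]
    if gain <= min_gain:
        reasons.append(f"primary metric {primary} gain {gain:+.4f} is not above {min_gain}")
    for name, max_drop in guardrails.items():
        drop = baseline[name] - candidate[name]
        if drop > max_drop:
            reasons.append(f"guardrail {name} dropped by {drop:.4f} (limit {max_drop})")
    return (not reasons, reasons)
```

With a baseline of {"ctr": 0.10, "retention_7d": 0.40} and a candidate of {"ctr": 0.12, "retention_7d": 0.35}, a guardrail of {"retention_7d": 0.01} blocks promotion even though the primary metric improved, which is the clickbait scenario from the examples above.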

Operational Metrics

Layer | Metrics
Data | Freshness, completeness, null rate, duplicate rate, schema changes
Features | Online/offline skew, feature freshness, value distribution drift
Training | Pipeline duration, failure rate, data version, artifact hash, reproducibility
Evaluation | Precision/recall, calibration, loss, fairness slices, business metric deltas
Serving | p50/p95/p99 latency, error rate, timeout rate, CPU/GPU utilization, queue depth
Model behavior | Prediction distribution, confidence distribution, drift score, rejection rate
Business | Conversion, fraud loss, retention, revenue, user complaints, manual review load

Architecture Review Checklist

  • Is every training dataset reproducible from versioned code and data snapshots?
  • Are features point-in-time correct?
  • Are online and offline features defined from the same contract?
  • Can a model be rolled back without rolling back application code?
  • Can production predictions be traced to model version, feature values, and request context?
  • Are quality metrics monitored after deployment, not only before deployment?
  • Is there a plan for delayed labels and missing labels?
  • Does the system have a safe exploration path to avoid self-confirming feedback loops?
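
The traceability question in the checklist can be made concrete with a log record written for every prediction. The sketch below is one possible shape; the class name, the field names, and the version string in the usage note are hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PredictionTrace:
    request_id: str
    model_version: str
    features: dict
    prediction: float

    def feature_hash(self) -> str:
        # Sorting keys makes the hash independent of dict insertion order.
        payload = json.dumps(self.features, sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def to_log_line(self) -> str:
        record = asdict(self)
        record["feature_hash"] = self.feature_hash()
        return json.dumps(record, sort_keys=True)
```

A call such as PredictionTrace("req-1", "fraud-v12", {"amount": 42.0}, 0.83).to_log_line() yields one JSON line that ties the prediction to a model version and the exact feature values, which is what replay and incident review need.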

Maturity Model

Level | Characteristics | Risk
0. Notebook model | Manual data pulls, ad hoc evaluation, manual deploy | Not reproducible
1. Scheduled training | Pipeline exists, but weak lineage and manual promotion | Bad data can train quietly
2. Registered models | Artifacts, metrics, and owners in a registry | Serving and feature parity may still drift
3. Controlled rollout | Shadow/canary, rollback, monitoring, feature contracts | Delayed labels still require process
4. Governed decision system | Risk tiering, audit logs, human controls, retirement | Higher process cost

Most teams should not jump from Level 0 to Level 4. Move one risk boundary at a time: reproducibility, then registry, then controlled rollout, then governance.
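
To make the registry step less abstract, here is a toy in-memory registry in which promotion and rollback only move a pointer, so application code never has to change. It is a sketch under simplifying assumptions (single process, no persistence, no approval workflow), and the class, its methods, and the URIs in the usage note are invented for illustration, not taken from any real registry product.

```python
class ModelRegistry:
    def __init__(self):
        self._artifacts = {}  # version -> lineage metadata
        self._history = []    # promoted versions, oldest first

    def register(self, version, artifact_uri, data_snapshot):
        if version in self._artifacts:
            raise ValueError(f"version {version!r} is already registered")
        # Lineage: which artifact, trained from which data snapshot.
        self._artifacts[version] = {
            "artifact_uri": artifact_uri,
            "data_snapshot": data_snapshot,
        }

    def promote(self, version):
        if version not in self._artifacts:
            raise KeyError(f"unknown version {version!r}")
        self._history.append(version)

    def rollback(self):
        if len(self._history) < 2:
            raise RuntimeError("no previous production version to roll back to")
        self._history.pop()
        return self._history[-1]

    @property
    def production(self):
        return self._history[-1] if self._history else None
```

After registering and promoting "v1" and then "v2" (for example with a hypothetical URI such as "s3://example-bucket/models/v2"), calling rollback() returns "v1" and the production property points at it again, while the serving layer keeps resolving whatever version is currently promoted.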


When to Use ML

Use ML when the decision boundary is hard to express as rules, enough labeled or behavioral data exists, and small errors are acceptable or reviewable.

Do not use ML when deterministic rules are sufficient, the cost of wrong decisions is unacceptable without human review, the data distribution is unstable with no monitoring plan, or the organization cannot own the lifecycle after launch.


Key Takeaways

  1. The model is only one artifact in an ML system.
  2. Data, features, labels, and feedback loops are production dependencies.
  3. Training and serving must be designed together.
  4. Offline evaluation is necessary but not sufficient.
  5. Monitoring model behavior matters as much as monitoring service uptime.
  6. Rollback, lineage, and reproducibility are core reliability features.

A practical reference for distributed system design. Released under the MIT License.