Model Monitoring

TL;DR

Model monitoring detects whether a model is still useful, not just whether the prediction service is up. Monitor four layers: data quality, feature and prediction drift, model quality, and business impact. The hardest part is label delay: the real outcome may arrive hours, days, or months after the prediction.

This is the model-quality complement to infrastructure observability: pair it with Metrics & Monitoring and Alerting, express degradation budgets as SLOs & Error Budgets, and wire the action path into Incident Management. For LLM output quality specifically, see LLM Evaluation.


Why Service Monitoring Is Not Enough

HTTP 200 responses can hide bad predictions. A model can return quickly and consistently while harming conversion, missing fraud, or ranking low-quality content.


Monitoring Layers

Layer | Question | Example signals
Data quality | Is the input valid? | Null rate, schema changes, range checks
Feature freshness | Is the input current? | Last update age, lookup miss rate
Drift | Is production different from training? | Distribution distance, category shifts
Prediction behavior | Is the model behaving differently? | Score distribution, confidence, class mix
Quality | Is the model correct? | Precision, recall, calibration, loss
Business impact | Is the system outcome healthy? | Revenue, fraud loss, retention, complaints
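
As a rough illustration of the data quality layer, the sketch below computes a null rate and a range violation rate over a batch of logged inputs. The field names (`age`, `amount`) and the valid range are hypothetical; real checks would come from the model's feature schema.

```python
from typing import Any, Mapping, Sequence


def null_rate(rows: Sequence[Mapping[str, Any]], field: str) -> float:
    """Fraction of rows where `field` is absent or None."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(field) is None)
    return missing / len(rows)


def range_violation_rate(
    rows: Sequence[Mapping[str, Any]], field: str, low: float, high: float
) -> float:
    """Fraction of non-null values of `field` that fall outside [low, high]."""
    values = [row[field] for row in rows if row.get(field) is not None]
    if not values:
        return 0.0
    bad = sum(1 for value in values if not (low <= value <= high))
    return bad / len(values)


# Hypothetical batch of logged inputs.
rows = [
    {"age": 34, "amount": 12.5},
    {"age": None, "amount": 80.0},  # explicit null
    {"age": 212, "amount": 5.0},    # out of range
    {"amount": 9.0},                # field missing entirely
]

print(null_rate(rows, "age"))                      # 0.5
print(range_violation_rate(rows, "age", 0, 120))   # 0.5
```

Counting absent keys and explicit nulls together is a design choice here: both usually point to the same upstream problem, such as a broken join.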

Metric Timing Ladder

ML monitoring must handle signals that arrive at different speeds.

Timing | Signal | Use
Immediate | Latency, errors, feature misses, fallback rate | Availability and rollout safety
Near-real-time | Score distribution, class mix, drift, freshness | Detect behavior shifts
Short-delay proxy | Clicks, manual review rate, session actions | Early product signal
Delayed ground truth | Chargebacks, retention, loan default, abuse appeal | True quality decision
Periodic audit | Slice fairness, policy review, human review quality | Governance and long-term risk

Do not let fast proxy metrics permanently replace delayed ground truth. Use proxies to decide whether to pause a rollout, not necessarily whether to launch it fully.


Drift Types

Data Drift

The distribution of input features changes, even if the relationship between inputs and labels stays the same.

Example: a loan model trained on one geography starts receiving traffic from another region.

Concept Drift

The relationship between inputs and labels changes, so the same input now calls for a different prediction.

Example: fraud patterns change after attackers adapt to the current model.

Prediction Drift

The distribution of model outputs changes.

Example: a recommender starts assigning extremely high scores to a narrow item category.

Label Drift

The target distribution changes.

Example: the base rate of spam changes during an attack campaign.


Drift Detection Decision Matrix

Signal | Works for | Weakness
Null/default rate | Broken pipelines and missing joins | Does not catch semantic drift
Population Stability Index | Tabular feature distribution shifts | Sensitive to binning
KS test | Numeric feature distribution changes | High traffic makes tiny shifts significant
Category distribution delta | Enum/category changes | Long tail can be noisy
Embedding centroid shift | Vector representation drift | Hard to explain
Prediction distribution | Output behavior change | Cannot tell whether input or model caused it
Slice quality | Real user impact by segment | Needs labels and enough traffic

Drift is a smoke alarm, not a root cause. The response should be triage: source data, feature materialization, serving version, traffic mix, then true quality.
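
To make one row of the matrix concrete, here is a minimal sketch of the Population Stability Index for a single numeric feature, using the common form PSI = sum over bins of (actual - expected) * ln(actual / expected). The bin edges, the `eps` floor for empty bins, and the sample values are illustrative assumptions. Because PSI is sensitive to binning, the edges should be fixed from the training baseline rather than recomputed per window.

```python
import math
from bisect import bisect_right
from typing import List, Sequence


def bin_fractions(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Share of values per bin. `edges` are interior cut points, sorted ascending."""
    if not values:
        raise ValueError("values must not be empty")
    counts = [0] * (len(edges) + 1)
    for value in values:
        counts[bisect_right(edges, value)] += 1
    return [count / len(values) for count in counts]


def psi(
    baseline: Sequence[float],
    production: Sequence[float],
    edges: Sequence[float],
    eps: float = 1e-6,
) -> float:
    """Population Stability Index between a baseline and a production sample."""
    expected = bin_fractions(baseline, edges)
    actual = bin_fractions(production, edges)
    total = 0.0
    for e, a in zip(expected, actual):
        # Floor empty bins so the logarithm stays defined (an assumption, not a standard).
        e, a = max(e, eps), max(a, eps)
        total += (a - e) * math.log(a / e)
    return total


edges = [0.25, 0.5, 0.75]  # hypothetical cut points taken from training data
baseline = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
shifted = [0.6, 0.7, 0.8, 0.9, 0.9, 0.95, 0.7, 0.8]

print(psi(baseline, baseline, edges))  # 0.0: identical samples
print(psi(baseline, shifted, edges))   # positive: mass moved to the upper bins
```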


Label Delay

Some systems get labels quickly, such as click/no-click. Others wait weeks or months, such as fraud chargebacks or loan defaults. Monitoring must separate fast proxy metrics from delayed ground truth.
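
One way to handle delayed labels is to compute quality only over predictions that are old enough for their outcome to have arrived. The sketch below assumes a hypothetical record layout (`request_id`, `predicted_at`, `predicted_positive`) and a 30-day maturity window. It also assumes that a missing label after the window counts as a negative outcome, which fits chargeback-style labels but not every domain.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional


def mature_predictions(predictions: List[dict], now: datetime, window: timedelta) -> List[dict]:
    """Keep predictions made at or before `now - window`."""
    cutoff = now - window
    return [p for p in predictions if p["predicted_at"] <= cutoff]


def precision_on_mature(
    predictions: List[dict],
    labels: Dict[str, bool],
    now: datetime,
    window: timedelta,
) -> Optional[float]:
    """Precision over mature positive predictions, or None if there are none.

    Assumption: no label after the maturity window means a negative outcome.
    """
    flagged = [p for p in mature_predictions(predictions, now, window) if p["predicted_positive"]]
    if not flagged:
        return None
    true_positives = sum(1 for p in flagged if labels.get(p["request_id"], False))
    return true_positives / len(flagged)


now = datetime(2024, 6, 30)
window = timedelta(days=30)
predictions = [
    {"request_id": "r1", "predicted_at": datetime(2024, 5, 1), "predicted_positive": True},
    {"request_id": "r2", "predicted_at": datetime(2024, 5, 10), "predicted_positive": True},
    {"request_id": "r3", "predicted_at": datetime(2024, 6, 20), "predicted_positive": True},  # immature
    {"request_id": "r4", "predicted_at": datetime(2024, 5, 5), "predicted_positive": False},
]
labels = {"r1": True, "r3": True}  # outcomes joined later by request ID

print(precision_on_mature(predictions, labels, now, window))  # 0.5
```

The immature prediction `r3` is excluded even though its label already arrived. Including early labels would bias the metric toward outcomes that resolve quickly.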


Triage Flow

The first question is operational: did the system start serving something different? Only then move to model quality.
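
The ordering described in the drift section (source data, feature materialization, serving version, traffic mix, then true quality) can be sketched as a list of checks run in sequence, stopping at the first failure. The step names and the check callables are hypothetical placeholders for whatever probes a team actually has.

```python
from typing import Callable, Dict, Optional

# Operational questions first, model quality last.
TRIAGE_ORDER = [
    "source_data",
    "feature_materialization",
    "serving_version",
    "traffic_mix",
    "true_quality",
]


def first_failing_step(checks: Dict[str, Callable[[], bool]]) -> Optional[str]:
    """Run checks in triage order and return the first step whose check fails.

    Each check returns True when that step looks healthy. Steps without a
    registered check are skipped. Returns None when every check passes.
    """
    for step in TRIAGE_ORDER:
        check = checks.get(step)
        if check is not None and not check():
            return step
    return None


# Hypothetical probes: a new serving version went out and quality also dropped.
checks = {
    "source_data": lambda: True,
    "feature_materialization": lambda: True,
    "serving_version": lambda: False,
    "traffic_mix": lambda: True,
    "true_quality": lambda: False,
}

print(first_failing_step(checks))  # serving_version
```

Stopping at the first failure keeps the investigation on the cheapest explanation: a quality drop that coincides with a serving change is examined as a serving problem first.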


Prediction Logging

A prediction log should include:

  • Request ID and timestamp.
  • Entity IDs.
  • Model name and version.
  • Feature vector or feature references.
  • Prediction and confidence.
  • Routing path: champion, canary, shadow, fallback.
  • Latency and error metadata.
  • Experiment assignment.
  • Later joined label and outcome timestamp.

Do not log raw sensitive data unless policy allows it. Prefer references, hashed IDs, or approved feature values.
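
A minimal sketch of such a record, assuming a flat JSON log line. All field names, the salt handling, and the truncated SHA-256 hash are illustrative assumptions. A salted hash is pseudonymization rather than anonymization, so it still needs to pass the applicable data policy.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Dict, Optional


def hash_id(raw_id: str, salt: str) -> str:
    """Salted, truncated SHA-256 so the raw entity ID never reaches the log."""
    return hashlib.sha256((salt + raw_id).encode("utf-8")).hexdigest()[:16]


@dataclass
class PredictionLogRecord:
    request_id: str
    timestamp: str                  # ISO 8601, UTC
    entity_id_hash: str
    model_name: str
    model_version: str
    feature_refs: Dict[str, str]    # references to feature values, not raw data
    prediction: float
    confidence: float
    routing_path: str               # champion | canary | shadow | fallback
    latency_ms: float
    error: Optional[str] = None
    experiment_arm: Optional[str] = None
    label: Optional[int] = None             # joined later
    outcome_timestamp: Optional[str] = None  # joined later


record = PredictionLogRecord(
    request_id="req-001",
    timestamp="2024-06-30T12:00:00+00:00",
    entity_id_hash=hash_id("user-42", salt="example-salt"),
    model_name="fraud-scorer",
    model_version="v17",
    feature_refs={"txn_features": "feature-store-key-123"},
    prediction=0.91,
    confidence=0.91,
    routing_path="canary",
    latency_ms=23.4,
    experiment_arm="treatment",
)

print(json.dumps(asdict(record)))
```

The label and outcome timestamp start as `None` and are filled in by a later join on `request_id`, which keeps the write path independent of label delay.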


Alerting Strategy

Not every drift alert should page someone. Use severity based on user impact and confidence.

Alert | Page? | Response
Serving error rate high | Yes | Restore availability
Feature freshness SLO missed for critical model | Yes | Fail over or disable model
Prediction distribution shifts sharply | Usually no | Investigate during business hours unless tied to impact
Delayed quality metric drops below guardrail | Sometimes | Roll back or reduce traffic
Business KPI regression in canary | Yes for critical flows | Stop rollout
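
The table above can be encoded as a small routing function so that paging decisions are explicit and reviewable. The alert kinds, the `critical` and `user_impact` flags, and the reading of "Sometimes" as "page only for critical models" are assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    kind: str
    critical: bool = False      # affects a critical model or flow
    user_impact: bool = False   # tied to measured user or business impact


def should_page(alert: Alert) -> bool:
    """Return True if the alert should page; otherwise route it to a ticket queue."""
    if alert.kind == "serving_error_rate":
        return True
    if alert.kind == "feature_freshness_slo":
        return alert.critical
    if alert.kind == "prediction_shift":
        # "Usually no": page only when the shift is tied to impact.
        return alert.user_impact
    if alert.kind == "delayed_quality_drop":
        # "Sometimes": assumed here to mean critical models only.
        return alert.critical
    if alert.kind == "canary_kpi_regression":
        return alert.critical
    return False  # unknown alert kinds never page by default


print(should_page(Alert("serving_error_rate")))                   # True
print(should_page(Alert("prediction_shift")))                     # False
print(should_page(Alert("prediction_shift", user_impact=True)))   # True
```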

Monitoring Pipeline

Monitoring should produce action, not just charts. Every alert needs an owner and a playbook.


Model Quality SLOs

Model SLOs should define both service health and decision quality.

SLO | Example
Availability | 99.9% of prediction requests return within policy
Latency | p99 prediction latency below 100 ms
Freshness | Critical features updated within 120 seconds
Fallback | Rules fallback below 1% outside incidents
Quality | False positive rate below agreed threshold on mature labels
Slice guardrail | No critical segment falls below minimum precision/recall
Review load | Manual review queue p95 age below target

Quality SLOs often lag. Pair them with immediate guardrails so bad rollouts can be paused before final labels arrive.
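
The immediate guardrails can be checked mechanically during a rollout. The sketch below uses the example targets from the table for availability, latency, freshness, and fallback rate. The metric names and the observed values are hypothetical.

```python
import operator
from typing import Dict, List

# (comparison, target): the SLO is met when `observed <comparison> target` holds.
IMMEDIATE_SLOS = {
    "availability": (">=", 0.999),
    "p99_latency_ms": ("<", 100),
    "feature_age_s": ("<=", 120),
    "fallback_rate": ("<", 0.01),
}

COMPARATORS = {">=": operator.ge, "<": operator.lt, "<=": operator.le}


def violated_slos(observed: Dict[str, float]) -> List[str]:
    """Names of SLOs whose observed value misses the target, sorted alphabetically.

    Metrics absent from `observed` are skipped rather than treated as violations.
    """
    violations = []
    for name, (comparison, target) in IMMEDIATE_SLOS.items():
        if name in observed and not COMPARATORS[comparison](observed[name], target):
            violations.append(name)
    return sorted(violations)


# Hypothetical canary measurements.
observed = {
    "availability": 0.9995,
    "p99_latency_ms": 140,
    "feature_age_s": 45,
    "fallback_rate": 0.02,
}

violations = violated_slos(observed)
print(violations)        # ['fallback_rate', 'p99_latency_ms']
print(bool(violations))  # True: pause the rollout
```

Skipping absent metrics is a deliberate simplification. A stricter variant would treat a missing metric as a violation, since silence can itself indicate a broken pipeline.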


Common Failure Modes

Proxy Metric Trap

The proxy metric improves while the real user or business outcome degrades.

Mitigation: track guardrails, run experiments, and avoid promoting models on a single metric.

Hidden Slice Regression

Overall quality is stable, but one segment degrades.

Mitigation: monitor important slices: geography, device, language, tenant, risk bucket, new users, and protected classes where legally appropriate.
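
A minimal sketch of per-slice quality, assuming labeled records with hypothetical `predicted`, `label`, and `geo` fields. In this made-up data the aggregate numbers look acceptable while one slice has much lower recall.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def precision_recall_by_slice(records: List[dict], slice_key: str) -> Dict[str, Dict[str, Optional[float]]]:
    """Precision and recall per slice; a metric is None when its denominator is zero."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for record in records:
        c = counts[record[slice_key]]
        if record["predicted"] and record["label"]:
            c["tp"] += 1
        elif record["predicted"]:
            c["fp"] += 1
        elif record["label"]:
            c["fn"] += 1
    result = {}
    for name, c in counts.items():
        flagged = c["tp"] + c["fp"]
        positives = c["tp"] + c["fn"]
        result[name] = {
            "precision": c["tp"] / flagged if flagged else None,
            "recall": c["tp"] / positives if positives else None,
        }
    return result


def rec(geo: str, predicted: bool, label: bool) -> dict:
    return {"geo": geo, "predicted": predicted, "label": label}


records = [
    rec("US", True, True), rec("US", True, True), rec("US", False, False), rec("US", True, False),
    rec("BR", True, True), rec("BR", False, True), rec("BR", False, True), rec("BR", False, False),
]

by_geo = precision_recall_by_slice(records, "geo")
print(by_geo["US"])  # precision 2/3, recall 1.0
print(by_geo["BR"])  # precision 1.0, recall 1/3
```

Aggregate recall over these eight records is 3/5, which hides that the `BR` slice misses two of its three positives.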

Monitoring Training Data Instead of Production Data

The dashboard shows clean offline validation data, not the live request distribution.

Mitigation: monitor production prediction logs and compare them with training baselines.
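
As a sketch of that comparison for a categorical feature, the code below computes the per-category share change between a training baseline and production logs, plus the total variation distance as a single summary number. The `platform` values are hypothetical, and total variation is one reasonable choice of distance, not the only one.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def category_shares(values: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Share of each category in the sample."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def category_delta(baseline: Sequence[Hashable], production: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Production share minus baseline share, including new and vanished categories."""
    base = category_shares(baseline)
    prod = category_shares(production)
    return {c: prod.get(c, 0.0) - base.get(c, 0.0) for c in set(base) | set(prod)}


def total_variation(baseline: Sequence[Hashable], production: Sequence[Hashable]) -> float:
    """Half the sum of absolute share changes; 0 means identical, 1 means disjoint."""
    return 0.5 * sum(abs(d) for d in category_delta(baseline, production).values())


training_platform = ["ios"] * 5 + ["android"] * 5
production_platform = ["ios"] * 4 + ["android"] * 4 + ["web"] * 2  # "web" is new

delta = category_delta(training_platform, production_platform)
print(sorted(delta, key=delta.get, reverse=True)[0])  # web: the largest share gain
print(round(total_variation(training_platform, production_platform), 6))  # 0.2
```

Reporting the per-category delta alongside the summary distance matters because a category that never appeared in training is often more actionable than the overall number.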

Alert Fatigue

Drift metrics are noisy, so alerts built on them fire constantly.

Mitigation: tune thresholds by actionability, use burn-rate style alerts for severe degradation, and route low-confidence alerts to review queues.
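
A burn-rate style alert can be sketched as follows, assuming a quality SLO expressed as a target fraction of good predictions. Burn rate is the observed bad fraction divided by the error budget (1 minus the target). Requiring both a long and a short window to exceed the threshold is one common way to cut noise. The windows, the event counts, and the threshold of 2.0 are illustrative assumptions.

```python
from typing import Tuple

Window = Tuple[int, int]  # (bad_events, total_events)


def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if total_events == 0:
        return 0.0
    error_budget = 1.0 - slo_target
    return (bad_events / total_events) / error_budget


def should_alert(long_window: Window, short_window: Window, slo_target: float, threshold: float) -> bool:
    """Alert only if the budget burns fast over the long window and is still burning now."""
    return (
        burn_rate(*long_window, slo_target) >= threshold
        and burn_rate(*short_window, slo_target) >= threshold
    )


SLO_TARGET = 0.99   # hypothetical: 99% of mature-labeled predictions are correct
THRESHOLD = 2.0     # hypothetical paging threshold

# Sustained degradation: both windows burn well above threshold.
print(should_alert((300, 10_000), (40, 1_000), SLO_TARGET, THRESHOLD))  # True

# Already recovered: the long window is still elevated, the short window is not.
print(should_alert((300, 10_000), (5, 1_000), SLO_TARGET, THRESHOLD))   # False
```

The short-window condition is what stops the alert from firing after the problem has cleared, which is a frequent source of alert fatigue.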


Operational Metrics

Category | Metrics
Data quality | Missing rate, invalid enum rate, range violation rate
Freshness | Feature age, ingestion lag, materialization lag
Drift | PSI, KL divergence, Wasserstein distance, category distribution delta
Prediction | Score histogram, class ratio, calibration buckets
Quality | Precision, recall, AUC, loss, false positive rate, false negative rate
Slices | Quality by segment, traffic share by segment
Operations | Alert volume, time to detect, time to rollback, retrain frequency
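
As an example of the calibration buckets metric, the sketch below groups predictions into equal-width score buckets and compares the mean predicted score with the observed positive rate in each. It assumes scores in [0, 1] and binary labels. The bucket count and the sample data are made up for illustration.

```python
from typing import Dict, List, Sequence


def calibration_buckets(scores: Sequence[float], labels: Sequence[int], n_buckets: int = 5) -> List[Dict[str, float]]:
    """Per-bucket count, mean score, and observed positive rate; empty buckets are omitted."""
    buckets = [{"count": 0, "score_sum": 0.0, "positives": 0} for _ in range(n_buckets)]
    for score, label in zip(scores, labels):
        # A score of exactly 1.0 goes into the last bucket.
        index = min(int(score * n_buckets), n_buckets - 1)
        buckets[index]["count"] += 1
        buckets[index]["score_sum"] += score
        buckets[index]["positives"] += label
    rows = []
    for index, bucket in enumerate(buckets):
        if bucket["count"] == 0:
            continue
        rows.append({
            "bucket": index,
            "count": bucket["count"],
            "mean_score": bucket["score_sum"] / bucket["count"],
            "observed_rate": bucket["positives"] / bucket["count"],
        })
    return rows


scores = [0.10, 0.15, 0.90, 0.95, 0.85, 0.92]
labels = [0, 0, 1, 0, 1, 0]

for row in calibration_buckets(scores, labels, n_buckets=2):
    print(row)
# The upper bucket has a mean score near 0.9 but an observed rate of 0.5,
# which suggests the model is overconfident on high scores.
```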

Key Takeaways

  1. A healthy service can serve bad predictions.
  2. Monitor production input and prediction distributions, not just offline validation.
  3. Label delay determines how quickly quality can be measured.
  4. Slice monitoring catches regressions hidden by aggregate metrics.
  5. Monitoring must connect to rollback, retraining, or investigation workflows.


A practical reference for distributed system design. Released under the MIT License.