Model Serving

TL;DR

Model serving turns trained artifacts into production predictions. The design space is shaped by latency, throughput, model size, hardware, feature freshness, rollout safety, and observability. A good serving system can load model versions, route traffic, batch requests, enforce timeouts, explain failures, and roll back independently from application code.

Serving is a production service, so the general patterns apply directly: capacity planning for the latency budget below; retries, timeouts, and hedging for tail control; deployment strategies for canary and blue-green rollouts; and autoscaling for capacity. LLM serving adds its own regime (continuous batching, KV caches, and the prefill/decode split), which is covered in LLM Infrastructure.


Serving Modes

| Mode | Latency | Throughput | Use case |
| --- | --- | --- | --- |
| Batch scoring | Minutes to hours | Very high | Daily recommendations, churn scores |
| Online synchronous | Milliseconds to seconds | Medium | Fraud, ranking, personalization |
| Online asynchronous | Seconds to minutes | High | Enrichment, review queues |
| Streaming inference | Milliseconds to seconds per event | High | Abuse detection, anomaly detection |
| Edge inference | Local | Device-bound | Offline apps, privacy-sensitive features |

Serving Topologies

| Topology | Best for | Trade-off |
| --- | --- | --- |
| Embedded model library | Ultra-low latency, simple models | Hard to update and observe centrally |
| Sidecar predictor | Service-local latency with separate model process | More deployment complexity |
| Central prediction service | Shared rollout, logging, governance | Network hop and shared dependency |
| Model mesh / multi-model server | Many models with common runtime | Noisy neighbors and routing complexity |
| Async scoring queue | Non-blocking enrichment | Delayed decisions and queue semantics |
| Batch scoring | Cheap high-throughput predictions | Staleness and invalidation |

The topology should follow the decision criticality. Fraud authorization usually needs online synchronous serving with strict fallback; daily marketing scores usually belong in batch.


Online Serving Architecture

The prediction log is essential. It should capture request metadata, model version, feature values or references, prediction, latency, and later label joins.
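
As a rough sketch of what such a log entry might look like, the record below carries the fields listed above and shows a delayed label being joined back by `request_id`. The class name, the `join_labels` helper, the `fs://` feature references, and all example values are hypothetical, not part of any specific logging system.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class PredictionLogRecord:
    request_id: str              # join key for delayed labels
    entity_id: str
    model_version: str
    feature_refs: dict           # references (or values) used for this prediction
    score: float
    latency_ms: float
    label: Optional[int] = None  # filled in later, once the outcome is known


def join_labels(records, labels_by_request_id):
    """Attach delayed labels to logged predictions by request_id."""
    for record in records:
        if record.request_id in labels_by_request_id:
            record.label = labels_by_request_id[record.request_id]
    return records


records = [
    PredictionLogRecord("req-1", "user-7", "fraud-v12", {"txn": "fs://txn/user-7"}, 0.91, 38.5),
    PredictionLogRecord("req-2", "user-9", "fraud-v12", {"txn": "fs://txn/user-9"}, 0.08, 41.0),
]
join_labels(records, {"req-1": 1})  # only req-1 has a label so far

# One JSON line per prediction is a common, easily queried log shape.
log_line = json.dumps(asdict(records[0]), sort_keys=True)
```

Logging feature references instead of full values keeps records small, but it only works if the referenced values can still be reconstructed at analysis time.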


Prediction API Contract

The model server interface should be stable even as artifacts change.

```yaml
request:
  request_id: string
  entity_id: string
  model_name: string
  feature_refs: object
  context: object
response:
  model_version: string
  policy_version: string
  score: number
  decision: string
  confidence: number
  explanations_ref: string
  fallback_used: boolean
```

The response should include the model and policy version so downstream logs can reconstruct the decision. Returning only a score makes incident analysis painful.
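
One way to keep the contract stable in code is to mirror it as typed records. The sketch below copies the field names from the contract above; the `decide` policy, its threshold, the decision labels, and the confidence formula are illustrative assumptions only.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PredictionRequest:
    request_id: str
    entity_id: str
    model_name: str
    feature_refs: dict = field(default_factory=dict)
    context: dict = field(default_factory=dict)


@dataclass(frozen=True)
class PredictionResponse:
    model_version: str
    policy_version: str
    score: float
    decision: str
    confidence: float
    explanations_ref: str
    fallback_used: bool


def decide(score: float, threshold: float = 0.8) -> str:
    """Hypothetical policy: threshold and labels are illustrative."""
    return "review" if score >= threshold else "approve"


def build_response(score, model_version, policy_version, fallback_used=False):
    """Always stamp the model and policy version onto the response."""
    return PredictionResponse(
        model_version=model_version,
        policy_version=policy_version,
        score=score,
        decision=decide(score),
        confidence=abs(score - 0.5) * 2,  # illustrative, not a calibrated confidence
        explanations_ref="",
        fallback_used=fallback_used,
    )


request = PredictionRequest("req-1", "user-7", "fraud")
response = build_response(0.91, "fraud-v12", "policy-3")
```

Because the versions are set in one place, a caller cannot receive a score without the metadata needed to reconstruct the decision later.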


Latency Budget

```text
Total p99 budget: 100 ms

Network ingress       10 ms
Auth/routing           5 ms
Feature lookup        25 ms
Model inference       40 ms
Post-processing       10 ms
Logging/egress        10 ms
```

If feature lookup consumes the whole budget, optimizing the model will not fix the user experience. Budget each step before choosing serving hardware.
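
A budget like this can be checked mechanically. The sketch below encodes the numbers above as a simple per-step allocation and reports which steps overshoot; the step keys, the `over_budget_steps` helper, and the measured values are made up for illustration.

```python
BUDGET_MS = {
    "network_ingress": 10,
    "auth_routing": 5,
    "feature_lookup": 25,
    "model_inference": 40,
    "post_processing": 10,
    "logging_egress": 10,
}
TOTAL_P99_BUDGET_MS = 100


def over_budget_steps(measured_p99_ms, budget_ms=BUDGET_MS):
    """Return {step: overshoot_ms} for steps whose measured p99 exceeds its budget."""
    return {
        step: measured_p99_ms[step] - limit
        for step, limit in budget_ms.items()
        if measured_p99_ms.get(step, 0) > limit
    }


# The per-step allocations should add up to the total budget.
assert sum(BUDGET_MS.values()) == TOTAL_P99_BUDGET_MS

# Hypothetical measurements: a faster model does not help while lookup is slow.
measured = dict(BUDGET_MS, feature_lookup=60, model_inference=30)
print(over_budget_steps(measured))  # {'feature_lookup': 35}
```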


Feature Fetch Patterns

| Pattern | Use when | Risk |
| --- | --- | --- |
| Gateway fetches features | Need central logging and fallback | Gateway becomes a bottleneck |
| Model server fetches features | Model owns feature set | Hidden dependency fanout |
| Caller provides features | Caller already has context | Training-serving skew across callers |
| Precomputed feature vector | Tight p99 budget | Stale values |
| Two-pass fetch | Cheap features first, expensive only for likely positives | Complex logic and biased logs |

Feature fetch is often more fragile than inference. Treat the feature store as a dependency with its own SLO, timeout, and fallback.
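
A minimal sketch of the timeout-plus-fallback idea, using only the standard library. The `slow_store` and `fast_store` functions stand in for a real feature store client, and the feature name, default values, and timeout are assumptions chosen for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

_executor = ThreadPoolExecutor(max_workers=4)


def fetch_with_fallback(fetch_fn, entity_id, timeout_s, fallback_features):
    """Return (features, fallback_used). Falls back on timeout or fetch error."""
    future = _executor.submit(fetch_fn, entity_id)
    try:
        return future.result(timeout=timeout_s), False
    except FutureTimeout:
        future.cancel()  # best effort; a call that is already running is not interrupted
        return dict(fallback_features), True
    except Exception:
        return dict(fallback_features), True


def slow_store(entity_id):  # stand-in for a feature store having a bad day
    time.sleep(0.5)
    return {"txn_count_7d": 42}


def fast_store(entity_id):  # stand-in for a healthy feature store
    return {"txn_count_7d": 42}


DEFAULTS = {"txn_count_7d": 0}
print(fetch_with_fallback(fast_store, "user-7", 0.1, DEFAULTS))  # ({'txn_count_7d': 42}, False)
print(fetch_with_fallback(slow_store, "user-7", 0.1, DEFAULTS))  # ({'txn_count_7d': 0}, True)
```

Returning the `fallback_used` flag alongside the features lets the caller log it, so fallback rate can be tracked as a feature-store signal separate from model health.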


Model Versioning and Routing

Common routing policies:

  • Champion/challenger: compare the production model against a candidate.
  • Canary: send a small share of live traffic to the candidate and watch guardrails.
  • Shadow: run the candidate without affecting the response.
  • Segment routing: send a model to a region, tenant, device class, or risk tier.
  • Fallback: route to a simpler model or rules when the primary path fails.
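
Two of these policies, canary and fallback, can be sketched in a few lines. This is an illustrative sketch assuming routing by a stable hash of the entity ID; the function names and the bucket count are hypothetical choices, not a specific router's API.

```python
import hashlib

BUCKETS = 10_000


def bucket(request_key: str) -> int:
    """Stable bucket in [0, BUCKETS) so the same key always routes the same way."""
    digest = hashlib.sha256(request_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % BUCKETS


def route(entity_id: str, canary_fraction: float, champion: str, candidate: str) -> str:
    """Canary routing: send roughly canary_fraction of entities to the candidate."""
    return candidate if bucket(entity_id) < canary_fraction * BUCKETS else champion


def predict_with_fallback(primary, fallback, features):
    """Fallback routing: use the simpler path when the primary raises."""
    try:
        return primary(features), False
    except Exception:
        return fallback(features), True


def failing_model(features):
    raise RuntimeError("primary model unavailable")


def rules(features):
    return 1.0 if features["amount"] > 1000 else 0.0


chosen = route("user-7", 0.05, champion="fraud-v12", candidate="fraud-v13")
score, fallback_used = predict_with_fallback(failing_model, rules, {"amount": 2500})
```

Hashing the entity ID rather than picking randomly per request keeps each entity on one model version, which makes canary metrics easier to attribute.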

Batching

Batching improves throughput but can increase tail latency.

| Strategy | Strength | Risk |
| --- | --- | --- |
| No batching | Predictable latency | Low hardware utilization |
| Fixed batch | Simple capacity planning | Waits for batch to fill |
| Dynamic batching | Better utilization under variable load | More complex p99 behavior |
| Continuous batching | High GPU utilization for large models | Scheduler complexity |

Use batching when the model is compute-heavy and requests can wait briefly. Avoid it for extremely tight latency budgets unless the serving framework gives strong p99 controls.
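
To make the trade-off concrete, the sketch below simulates dynamic batching over a list of arrival times: a batch is dispatched when it is full or when a maximum wait has elapsed. It is a simplified offline model, not a serving framework's scheduler, and the function names and parameters are assumptions.

```python
def dynamic_batches(arrivals_ms, max_batch_size, max_wait_ms):
    """Group sorted arrival times into batches.

    A batch opens when its first request arrives and is dispatched when it
    is full or when max_wait_ms has elapsed, whichever comes first.
    Returns a list of (dispatch_time_ms, [arrival_ms, ...]).
    """
    batches, current = [], []
    for t in arrivals_ms:
        if current and t > current[0] + max_wait_ms:
            # The wait deadline passed before this request arrived.
            batches.append((current[0] + max_wait_ms, current))
            current = []
        current.append(t)
        if len(current) == max_batch_size:
            batches.append((t, current))  # a full batch dispatches immediately
            current = []
    if current:
        batches.append((current[0] + max_wait_ms, current))
    return batches


def worst_queue_wait(batches):
    """Longest time any request waited before its batch was dispatched."""
    return max(dispatch - min(items) for dispatch, items in batches)


arrivals = [0, 1, 2, 3, 20, 21, 50]
batches = dynamic_batches(arrivals, max_batch_size=4, max_wait_ms=5)
print(batches)                    # [(3, [0, 1, 2, 3]), (25, [20, 21]), (55, [50])]
print(worst_queue_wait(batches))  # 5
```

In this model, `max_wait_ms` is a hard cap on the queueing delay that batching adds, which is the kind of p99 control the paragraph above refers to. The lone request at 50 ms pays the full wait even though no other request joins it.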


Capacity Planning

Start with a simple estimate:

```text
required_workers =
  peak_qps * p99_inference_seconds / target_utilization
```

Then add headroom for:

  • Feature-store latency spikes.
  • Model load time and rolling deploy capacity.
  • Canary/shadow traffic.
  • Batch size variance.
  • Accelerator memory fragmentation.
  • Regional failover.

For GPU-backed serving, memory often limits capacity before raw compute does. Track maximum resident model memory, activation memory, and concurrent batch memory separately.
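
A worked instance of the estimate, assuming each worker handles one request at a time. The `headroom_fraction` parameter is a hypothetical knob that lumps the headroom items listed above into a single multiplier, and the traffic numbers are invented.

```python
import math


def required_workers(peak_qps, p99_inference_seconds, target_utilization, headroom_fraction=0.0):
    """Workers needed to serve peak_qps at the target utilization.

    Using p99 rather than mean inference time is deliberately conservative.
    headroom_fraction is a hypothetical multiplier for spikes, deploys,
    canary/shadow traffic, and failover.
    """
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    base = peak_qps * p99_inference_seconds / target_utilization
    return math.ceil(base * (1 + headroom_fraction))


# 2000 QPS at 40 ms p99 inference, targeting 60% utilization.
print(required_workers(2000, 0.040, 0.6))       # 134
print(required_workers(2000, 0.040, 0.6, 0.3))  # 174 with 30% headroom
```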


Autoscaling

Autoscale on serving-specific signals, not only CPU:

  • Request rate.
  • Queue depth.
  • Inference latency.
  • GPU utilization and memory.
  • Model load time.
  • Feature lookup latency.
  • Timeout rate.

Large models make scale-from-zero risky because cold start can take minutes. Keep warm capacity for latency-critical models.
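
One possible scaling rule, sketched below, lets each signal propose a replica count in proportion to how far it is from its target and takes the largest proposal. The signal names, targets, and limits are assumptions. The proportional rule is a heuristic, since signals such as latency do not scale linearly with replica count.

```python
import math


def desired_replicas(current, observed, targets, min_replicas=1, max_replicas=100):
    """Scale on the most stressed signal.

    observed and targets map signal name -> value. Each signal proposes
    ceil(current * observed / target) replicas and the largest proposal wins.
    min_replicas keeps warm capacity so latency-critical models never scale to zero.
    """
    proposals = [
        math.ceil(current * observed[name] / target)
        for name, target in targets.items()
        if name in observed
    ]
    wanted = max(proposals, default=current)
    return max(min_replicas, min(max_replicas, wanted))


targets = {"queue_depth_per_replica": 4, "p99_inference_ms": 40, "gpu_utilization": 0.7}
observed = {"queue_depth_per_replica": 10, "p99_inference_ms": 44, "gpu_utilization": 0.63}

# Queue depth is 2.5x its target, so it drives the decision: 8 -> 20 replicas.
print(desired_replicas(8, observed, targets, min_replicas=2))  # 20

# With no load at all, the warm floor still applies.
idle = {name: 0 for name in targets}
print(desired_replicas(8, idle, targets, min_replicas=2))      # 2
```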


Degradation Ladder

Define fallback behavior before incidents.

Each step should be explicit about user impact. A "safe default" for fraud may be manual review; a "safe default" for recommendations may be popular content.
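
A ladder can be expressed as an ordered list of handlers tried in turn. The rungs below (primary model, cached score, safe default) are hypothetical examples, not a prescribed ladder; the simulated timeout and cache contents exist only to exercise each rung.

```python
def run_ladder(rungs, request):
    """Try each rung in order; return (result, rung_name) from the first that succeeds.

    rungs is an ordered list of (name, callable). The final rung should be a
    safe default that cannot fail.
    """
    for name, handler in rungs:
        try:
            return handler(request), name
        except Exception:
            continue  # in a real service, log and count the failure here
    raise RuntimeError("no rung succeeded; the last rung must be a safe default")


def primary_model(request):
    raise TimeoutError("feature store timed out")  # simulate an incident


def cached_score(request):
    cache = {"user-7": 0.12}
    return cache[request["entity_id"]]  # raises KeyError on a cache miss


def safe_default(request):
    return "manual_review"  # the fraud-style safe default mentioned above


LADDER = [
    ("primary_model", primary_model),
    ("cached_score", cached_score),
    ("safe_default", safe_default),
]

print(run_ladder(LADDER, {"entity_id": "user-7"}))  # (0.12, 'cached_score')
print(run_ladder(LADDER, {"entity_id": "user-9"}))  # ('manual_review', 'safe_default')
```

Returning the rung name with the result makes the user impact of each degradation step visible in logs and metrics.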


Failure Modes

Model Load Failure

A new artifact cannot be loaded because of incompatible runtime, missing dependency, wrong tensor shape, or corrupt artifact.

Mitigation: validate artifacts before promotion, use a staged rollout, and keep the previous model loaded until the new model passes health checks.
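
The "keep the previous model loaded" part of this mitigation might look like the sketch below, where a candidate replaces the active model only after it loads and passes a health check. `ModelSlot`, `golden_check`, and the lambda models are hypothetical stand-ins for a real loader and smoke test.

```python
import threading


class ModelSlot:
    """Holds the active model; swaps only after the candidate passes a health check."""

    def __init__(self, model, version):
        self._lock = threading.Lock()
        self._model, self._version = model, version

    def predict(self, features):
        with self._lock:
            model, version = self._model, self._version
        return model(features), version

    def try_promote(self, load_fn, version, health_check):
        """Load and validate a candidate; keep the previous model on any failure."""
        try:
            candidate = load_fn()
            if not health_check(candidate):
                return False
        except Exception:
            return False
        with self._lock:
            self._model, self._version = candidate, version
        return True


def load_corrupt():
    raise ValueError("corrupt artifact")


def golden_check(model):
    # Hypothetical smoke test: a known input must produce a score in [0, 1].
    return 0.0 <= model({"amount": 10.0}) <= 1.0


slot = ModelSlot(lambda features: 0.1, "v1")
promoted_v2 = slot.try_promote(load_corrupt, "v2", golden_check)  # False, v1 stays active
after_failed_promote = slot.predict({})                           # (0.1, 'v1')
promoted_v3 = slot.try_promote(lambda: (lambda features: 0.3), "v3", golden_check)  # True
```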

Feature Fetch Timeout

The model server is healthy but upstream feature retrieval fails.

Mitigation: enforce strict timeouts, define fallback features, use cached features when safe, and measure feature-store availability separately.

Tail Latency Collapse

Average latency is fine, but p99 rises during bursts because queues grow faster than workers drain them.

Mitigation: queue limits, load shedding, admission control and backpressure, separate pools for expensive models, and capacity tests at expected burst size.
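
Queue limits and load shedding from this mitigation can be sketched with a bounded standard-library queue. The `AdmissionQueue` wrapper and its counters are illustrative names, and the depth of 3 is arbitrary.

```python
import queue


class AdmissionQueue:
    """Bounded queue that sheds load instead of letting wait time grow unbounded."""

    def __init__(self, max_depth):
        self._queue = queue.Queue(maxsize=max_depth)
        self.accepted = 0
        self.shed = 0

    def submit(self, request):
        """Return True if admitted, False if rejected (caller should fail fast or fall back)."""
        try:
            self._queue.put_nowait(request)
        except queue.Full:
            self.shed += 1
            return False
        self.accepted += 1
        return True

    def take(self):
        """Remove and return the oldest admitted request; raises queue.Empty if none."""
        return self._queue.get_nowait()


q = AdmissionQueue(max_depth=3)
results = [q.submit(i) for i in range(5)]  # burst of 5 with no worker draining
print(results, q.shed)  # [True, True, True, False, False] 2
```

Rejecting at admission keeps the wait time of accepted requests bounded by the queue depth, and the `shed` counter maps to the dropped-requests metric.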

Silent Wrong Model

The service deploys a valid model artifact that belongs to the wrong dataset, segment, or feature schema.

Mitigation: require model cards or metadata checks, schema compatibility gates, artifact hashes, and model-version logging on every prediction.
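
A promotion gate along these lines could combine the hash, segment, and schema checks. In this sketch the metadata layout (a dict with `sha256`, `segment`, and `feature_schema` keys) and the dtype strings are assumptions made for illustration, not a standard model-card format.

```python
import hashlib


def artifact_sha256(artifact_bytes: bytes) -> str:
    return hashlib.sha256(artifact_bytes).hexdigest()


def promotion_errors(artifact_bytes, metadata, serving_schema, expected_segment):
    """Return reasons to block promotion; an empty list means the gate passes.

    metadata is a hypothetical model-card dict with keys sha256, segment, and
    feature_schema (feature name -> dtype string).
    """
    errors = []
    if artifact_sha256(artifact_bytes) != metadata.get("sha256"):
        errors.append("artifact hash does not match metadata")
    if metadata.get("segment") != expected_segment:
        errors.append(f"model segment {metadata.get('segment')!r} != {expected_segment!r}")
    for name, dtype in metadata.get("feature_schema", {}).items():
        if serving_schema.get(name) != dtype:
            errors.append(
                f"feature {name!r}: model expects {dtype}, serving has {serving_schema.get(name)}"
            )
    return errors


artifact = b"model-weights"
card = {
    "sha256": artifact_sha256(artifact),
    "segment": "eu",
    "feature_schema": {"amount": "float32", "country": "string"},
}
serving = {"amount": "float32", "country": "string"}

ok = promotion_errors(artifact, card, serving, "eu")             # [] -> safe to promote
wrong_segment = promotion_errors(artifact, card, serving, "us")  # one error: segment mismatch
```

The gate catches an artifact that loads cleanly but was trained for another segment or schema, which a load-time health check alone would miss.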


Deployment Patterns

| Pattern | Use when | Watch out for |
| --- | --- | --- |
| Blue-green | Need fast rollback of whole model service | Double capacity |
| Canary | Want gradual live validation | Weak signal at low traffic |
| Shadow | Need to compare without user impact | Shadow feature load can still affect dependencies |
| Multi-armed bandit | Optimization objective is measurable quickly | Can exploit short-term proxy metrics |
| Rules fallback | Model can fail open or fail closed safely | Rule path may drift from model path |

Operational Metrics

| Layer | Metrics |
| --- | --- |
| Request | QPS, p50/p95/p99 latency, timeout rate, error rate |
| Queue | Queue depth, wait time, dropped requests |
| Model | Inference time, model load time, version, memory usage |
| Hardware | CPU/GPU utilization, GPU memory, accelerator errors |
| Features | Lookup latency, freshness, miss rate |
| Quality | Online guardrails, delayed labels, drift, calibration |

Key Takeaways

  1. Model serving is a production service with model-specific failure modes.
  2. Roll out model artifacts independently from application code.
  3. Prediction logs are required for monitoring, debugging, and retraining.
  4. Batching improves throughput but must be managed against p99 latency.
  5. Always design fallback behavior before deployment.

A practical reference for distributed system design. Released under the MIT License.