
Feature Stores

TL;DR

A feature store manages reusable ML features across offline training and online serving. Its job is not only storage; it is consistency. The hard requirements are point-in-time correctness, feature freshness, schema/version control, discoverability, ownership, and parity between training and serving.


The Problem

Without a feature store, every model team builds its own feature pipeline.

This creates duplicated logic, inconsistent definitions, and hard-to-debug production behavior. "User purchase count in last 7 days" might mean three different things across teams.


Feature Store Architecture

The same feature contract should drive both materialization paths: the offline path that builds training data and the online path that serves predictions. The storage engines can differ; the semantics should not.


Feature View Design

A feature view is the unit of ownership and materialization. Design it around entity, time, freshness, and use case.

| Dimension | Design question | Example |
| --- | --- | --- |
| Entity | What key is scored? | user_id, merchant_id, item_id, device_id |
| Time window | What history is summarized? | 10 minutes, 7 days, lifetime |
| Freshness | How old can the value be? | 30 seconds for fraud, 24 hours for churn |
| Source of truth | Which event/table owns the fact? | Payment events, login stream, item catalog |
| Materialization | How does it reach offline/online stores? | Batch, streaming, request-time |
| Default behavior | What happens on miss/null? | 0, unknown bucket, fail closed |

Bad feature views usually mix too many entities or hide time semantics in the name. risk_score is vague; user_failed_login_count_10m is reviewable.
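
As one way to make these dimensions reviewable, the sketch below captures them in a small Python dataclass. `FeatureViewSpec`, its field names, and the checks in `review_problems` are hypothetical and not taken from any feature store library; they only illustrate how a vague name or an inconsistent freshness budget could be flagged during review.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional, Union


@dataclass(frozen=True)
class FeatureViewSpec:
    """Hypothetical spec whose fields mirror the design dimensions above."""
    name: str
    entity: str                          # key that is scored, e.g. "user_id"
    window: Optional[timedelta]          # None means lifetime
    max_staleness: timedelta             # how old a served value may be
    source: str                          # event/table that owns the fact
    materialization: str                 # "batch", "streaming", "request_time"
    default: Union[int, float, str, None]

    def review_problems(self) -> list[str]:
        """Return human-readable issues; an empty list means nothing was flagged."""
        problems = []
        if self.materialization not in {"batch", "streaming", "request_time"}:
            problems.append(f"unknown materialization: {self.materialization}")
        # Assumed naming convention: "<entity without _id>_<what>_<window>".
        if not self.name.startswith(self.entity.removesuffix("_id") + "_"):
            problems.append("name should start with the entity it describes")
        if self.window is not None and self.max_staleness > self.window:
            problems.append("allowed staleness exceeds the aggregation window")
        return problems


good = FeatureViewSpec(
    name="user_failed_login_count_10m", entity="user_id",
    window=timedelta(minutes=10), max_staleness=timedelta(seconds=120),
    source="login stream", materialization="streaming", default=0,
)
vague = FeatureViewSpec(
    name="risk_score", entity="user_id", window=None,
    max_staleness=timedelta(hours=24), source="unknown",
    materialization="batch", default=None,
)
print(good.review_problems())
print(vague.review_problems())
```

The naming check encodes only one possible convention; a team would substitute its own rules.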


Offline vs Online Store

| Store | Optimized for | Typical systems | Main risk |
| --- | --- | --- | --- |
| Offline store | Historical joins, scans, backfills | BigQuery, Snowflake, Spark, Delta, Hive | Point-in-time bugs |
| Online store | Low-latency lookups | Redis, DynamoDB, Cassandra, RocksDB | Staleness, hot keys |
| Metadata store | Discovery and lineage | Catalog DB, registry service | Undocumented ownership |

The online store is a low-latency cache keyed by entity. Hot entities create the same hot-key and partitioning problems as in any read-heavy store. The materialization path that keeps the store fresh is typically a change-data-capture stream off the source events.


Materialization Patterns

| Pattern | Use when | Failure mode |
| --- | --- | --- |
| Batch materialization | Features tolerate hours of staleness | Late jobs create stale online values |
| Streaming materialization | Features need seconds/minutes freshness | Duplicates, out-of-order events, replay bugs |
| Request-time features | Feature depends on current request | Latency spikes and dependency fanout |
| Hybrid | Historical aggregates plus request context | Online/offline parity is harder |
| On-demand backfill | Recover after bug or add new feature | Expensive recomputation and version confusion |

Streaming materialization still needs idempotency. If the same event is replayed, counters and windows must not double count.
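
A minimal sketch of that idempotency requirement, assuming every event carries a unique `event_id` and an integer event timestamp in seconds. The class `IdempotentWindowCounter` is hypothetical, and the in-memory set stands in for what would need to be a bounded, durable deduplication store in a real stream processor.

```python
from collections import defaultdict


class IdempotentWindowCounter:
    """Counts events per (entity, tumbling window), ignoring replayed event IDs."""

    def __init__(self, window_seconds: int):
        self.window_seconds = window_seconds
        self._seen = set()               # event IDs already applied
        self._counts = defaultdict(int)  # (entity, window_start) -> count

    def apply(self, event_id: str, entity: str, event_ts: int) -> bool:
        """Apply one event; return False if it was a duplicate and was skipped."""
        if event_id in self._seen:
            return False
        self._seen.add(event_id)
        # Bucket by event time so out-of-order events land in the right window.
        window_start = event_ts - event_ts % self.window_seconds
        self._counts[(entity, window_start)] += 1
        return True

    def count(self, entity: str, window_start: int) -> int:
        return self._counts.get((entity, window_start), 0)


counter = IdempotentWindowCounter(window_seconds=600)
events = [
    ("e1", "u1", 605),
    ("e2", "u1", 610),
    ("e1", "u1", 605),   # replay of e1
    ("e3", "u1", 1205),
    ("e4", "u1", 599),   # arrives late, belongs to the earlier window
]
applied = [counter.apply(*event) for event in events]
print(applied, counter.count("u1", 600))
```

Deduplicating by event ID and bucketing by event time are two separate protections: the first handles replays, the second handles out-of-order arrival.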


Point-in-Time Correctness

Training must only use feature values that were available at the prediction time.

For example, for a prediction made at 10:05, a feature value that only became available at 10:10 could not have been served. A training join that uses it leaks future information.

Correct dataset construction needs:

  • Event timestamp: when the fact happened.
  • Ingestion timestamp: when the system received it.
  • Availability timestamp: when the feature could be served.
  • Entity key: the user, account, item, device, or session being scored.

Point-in-Time Join Rules

| Rule | Why |
| --- | --- |
| Join features using availability time, not processing completion time | Prevents future leakage |
| Store feature history, not only latest values | Enables training snapshots and replay |
| Include late-arriving events policy | Makes backtests match production |
| Version backfills | Distinguishes original production value from corrected historical value |
| Log online feature values | Allows parity checks and incident reconstruction |
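
The join rule can be sketched with the standard library alone. The example assumes feature history is stored per entity as `(available_at, value)` pairs sorted by availability time, with times expressed as minutes since midnight (10:00 is 600). The function names are illustrative, not a feature store API.

```python
from bisect import bisect_right


def point_in_time_lookup(history, prediction_ts, default=None):
    """Return the latest value whose availability time is <= prediction_ts.

    history: list of (available_at, value) tuples sorted by available_at.
    """
    times = [available_at for available_at, _ in history]
    idx = bisect_right(times, prediction_ts)
    return history[idx - 1][1] if idx else default


def build_training_rows(labels, feature_history, default=None):
    """labels: iterable of (entity, prediction_ts, label) tuples."""
    rows = []
    for entity, prediction_ts, label in labels:
        history = feature_history.get(entity, [])
        value = point_in_time_lookup(history, prediction_ts, default)
        rows.append((entity, prediction_ts, value, label))
    return rows


feature_history = {"u1": [(600, 3), (610, 7)]}   # values available at 10:00 and 10:10
labels = [("u1", 605, 1), ("u1", 610, 0), ("u2", 605, 0)]
rows = build_training_rows(labels, feature_history, default=0)
print(rows)
```

The 10:05 row receives the 10:00 value rather than the 10:10 one, and the entity with no history falls back to the declared default instead of borrowing a later value.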

Feature Freshness

Feature freshness is a service-level objective (see SLOs & Error Budgets). A fraud model might need second-level freshness; a churn model might tolerate daily updates.

| Feature type | Freshness need | Example |
| --- | --- | --- |
| Static | Rarely changes | User country, signup channel |
| Slowly changing | Hours to days | Account age bucket, historical spend |
| Near-real-time | Seconds to minutes | Failed login count, active session count |
| Request-time | Computed per request | Cart value, device fingerprint |

Freshness should be declared in metadata and monitored in production.
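
A rough sketch of such a check, assuming the declared SLOs and the time of the last successful materialization per feature are available as plain dictionaries. The feature names and the `freshness_report` helper are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def freshness_report(last_update_by_feature, slo_by_feature, now):
    """Return {feature: (lag, within_slo)} for every feature with a declared SLO.

    A feature that has never been materialized reports (None, False).
    """
    report = {}
    for feature, slo in slo_by_feature.items():
        last_update = last_update_by_feature.get(feature)
        if last_update is None:
            report[feature] = (None, False)
        else:
            lag = now - last_update
            report[feature] = (lag, lag <= slo)
    return report


now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
slos = {
    "user_failed_login_count_10m": timedelta(seconds=120),
    "user_historical_spend": timedelta(hours=24),
    "user_signup_channel": timedelta(days=7),
}
last_updates = {
    "user_failed_login_count_10m": now - timedelta(seconds=45),
    "user_historical_spend": now - timedelta(hours=30),
}
report = freshness_report(last_updates, slos, now)
for feature, (lag, ok) in report.items():
    print(feature, lag, "OK" if ok else "VIOLATION")
```

Iterating over the declared SLOs rather than over observed updates means a feature that silently stopped materializing still shows up as a violation.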


Schema Evolution

Feature changes are API changes for models.

| Change | Compatibility | Rollout |
| --- | --- | --- |
| Add optional feature | Usually compatible | Backfill offline, then expose online |
| Add required feature | Breaking for old serving path | Deploy feature before model uses it |
| Rename feature | Breaking | Dual-write old and new names during migration |
| Change type | Breaking | New versioned feature name |
| Change semantics | Breaking even if type matches | New version and owner approval |
| Change default | Risky | Evaluate slices where missingness is common |

If a feature's meaning changes, prefer a new feature name. Type compatibility does not imply semantic compatibility.
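
One way a change-review step could encode the table above is sketched here. `FeatureDef` and `classify_change` are hypothetical, and `semantics_version` is an assumed field that the owner bumps by hand whenever the meaning changes, since semantic drift cannot be detected from types alone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureDef:
    name: str
    dtype: str
    semantics_version: int   # assumed: bumped manually when the meaning changes
    default: object = None


def classify_change(old: FeatureDef, new: FeatureDef) -> str:
    """Classify a proposed change to an existing feature definition."""
    if old.name != new.name:
        return "breaking: rename"
    if old.dtype != new.dtype:
        return "breaking: type change"
    if old.semantics_version != new.semantics_version:
        return "breaking: semantics change"
    if old.default != new.default:
        return "risky: default change"
    return "compatible"


current = FeatureDef("user_purchase_count_7d", "int64", semantics_version=1, default=0)
retyped = FeatureDef("user_purchase_count_7d", "float64", semantics_version=1, default=0)
redefined = FeatureDef("user_purchase_count_7d", "int64", semantics_version=2, default=0)
new_default = FeatureDef("user_purchase_count_7d", "int64", semantics_version=1, default=-1)

for proposal in (retyped, redefined, new_default, current):
    print(classify_change(current, proposal))
```

The `redefined` case has an identical type and default, yet it is still classified as breaking.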


Feature Contracts

A feature contract should define:

  • Name and description.
  • Entity keys.
  • Value type and allowed range.
  • Owner and on-call contact.
  • Freshness SLO.
  • Offline source and online source.
  • Backfill behavior.
  • Null/default behavior.
  • Deprecation plan.

Example:

```yaml
name: user_failed_login_count_10m
entity: user_id
type: int64
freshness_slo: 120s
default: 0
owner: identity-risk
offline_source: warehouse.login_events
online_source: redis:user-risk
availability_timestamp: materialized_at
```
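
To show how a serving path might enforce such a contract, the sketch below mirrors part of the example as a Python dictionary (so no YAML parser is needed) and applies the null/default and type rules at read time. `REQUIRED_FIELDS`, `missing_fields`, and `resolve_value` are illustrative names; which fields are mandatory is an assumption, and only the `int64` type is handled.

```python
CONTRACT = {
    "name": "user_failed_login_count_10m",
    "entity": "user_id",
    "type": "int64",
    "freshness_slo": "120s",
    "default": 0,
    "owner": "identity-risk",
}

# Assumed minimum set of fields a contract must declare.
REQUIRED_FIELDS = {"name", "entity", "type", "freshness_slo", "default", "owner"}
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1


def missing_fields(contract: dict) -> list[str]:
    """Return the required fields that the contract does not declare."""
    return sorted(REQUIRED_FIELDS - contract.keys())


def resolve_value(contract: dict, raw):
    """Apply the contract's null/default and type rules to a looked-up value."""
    if raw is None:
        return contract["default"]          # miss or null: use declared default
    if contract["type"] == "int64":
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(raw, bool) or not isinstance(raw, int):
            raise TypeError(f"{contract['name']}: expected int64, got {type(raw).__name__}")
        if not INT64_MIN <= raw <= INT64_MAX:
            raise ValueError(f"{contract['name']}: {raw} is outside the int64 range")
        return raw
    raise NotImplementedError(f"unsupported type: {contract['type']}")


print(missing_fields(CONTRACT))
print(resolve_value(CONTRACT, None), resolve_value(CONTRACT, 4))
```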

Failure Modes

Training-Serving Skew

The offline query and online transformation drift apart.

Mitigation: generate both paths from one feature definition or run parity tests that replay logged online requests through the offline pipeline.
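
A parity test of the second kind could look roughly like this. It assumes each logged request records the entity, the request time, and the feature value that was served online, and that `offline_compute` is a callable that recomputes the value as of that time. Both the log layout and the function names are assumptions for the sketch.

```python
import math


def values_match(online, offline, tolerance):
    """Compare two numeric feature values, treating None as its own value."""
    if online is None or offline is None:
        return online is offline
    return math.isclose(online, offline, rel_tol=0.0, abs_tol=tolerance)


def parity_report(logged_requests, offline_compute, tolerance=1e-9):
    """Replay logged online requests through the offline path.

    Returns (mismatch_rate, mismatches), where each mismatch is
    (entity, ts, online_value, offline_value).
    """
    mismatches = []
    for request in logged_requests:
        offline_value = offline_compute(request["entity"], request["ts"])
        if not values_match(request["online_value"], offline_value, tolerance):
            mismatches.append(
                (request["entity"], request["ts"], request["online_value"], offline_value)
            )
    rate = len(mismatches) / len(logged_requests) if logged_requests else 0.0
    return rate, mismatches


# Stand-in for the offline pipeline: a precomputed table keyed by (entity, ts).
offline_table = {("u1", 605): 3, ("u2", 605): 5}
logged = [
    {"entity": "u1", "ts": 605, "online_value": 3},
    {"entity": "u2", "ts": 605, "online_value": 4},
]
rate, mismatches = parity_report(logged, lambda e, ts: offline_table.get((e, ts)))
print(rate, mismatches)
```

The mismatch rate is the kind of number that can be tracked as a parity delta over time, while the individual mismatches point at the entities to debug.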

Stale Online Features

The online store is available but no longer receiving updates.

Mitigation: monitor age of latest feature update per feature group and fail closed or route to a fallback model when freshness exceeds the budget.
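
The routing decision itself is small. This sketch assumes the serving layer knows the age in seconds of the latest update for the feature group (or `None` if it has never been updated); the route names are placeholders.

```python
def choose_route(feature_age_s, freshness_budget_s, has_fallback_model):
    """Pick a serving route based on feature age versus the freshness budget."""
    if feature_age_s is not None and feature_age_s <= freshness_budget_s:
        return "primary"
    # Stale or never-updated features: degrade explicitly instead of serving silently.
    return "fallback_model" if has_fallback_model else "fail_closed"


print(choose_route(45, 120, has_fallback_model=True))
print(choose_route(900, 120, has_fallback_model=True))
print(choose_route(900, 120, has_fallback_model=False))
```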

Hot Entities

Popular users, items, or merchants create hot keys in the online store.

Mitigation: cache local reads, split keys by time window, shard large entities, or precompute aggregate features for hot entities.
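
As an illustration of the first mitigation, here is a minimal read-through cache with a time-to-live, placed in front of a hypothetical `fetch_from_online_store` function. The clock is injectable so the behavior can be shown deterministically. Note that the TTL adds to feature staleness and therefore counts against the freshness budget.

```python
import time


class LocalTTLCache:
    """Process-local read-through cache; entries expire after ttl_seconds."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch          # callable that reads the online store
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}           # key -> (stored_at, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]          # fresh local hit, no store round trip
        value = self._fetch(key)
        self._entries[key] = (now, value)
        return value


store_reads = []

def fetch_from_online_store(key):
    """Stand-in for a real online store lookup; records each read."""
    store_reads.append(key)
    return {"merchant:42": 17}.get(key, 0)


fake_now = [0.0]
cache = LocalTTLCache(fetch_from_online_store, ttl_seconds=5, clock=lambda: fake_now[0])
first = cache.get("merchant:42")
second = cache.get("merchant:42")   # served locally
fake_now[0] = 6.0                   # advance past the TTL
third = cache.get("merchant:42")    # expired, reads the store again
print(first, second, third, len(store_reads))
```

This sketch has no size bound or eviction; a production cache would need both.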

Unowned Features

Models depend on features whose upstream team no longer maintains the source semantics.

Mitigation: require owner metadata, usage tracking, deprecation notices, and feature-level change review.


Build vs Buy Decision Matrix

| Situation | Prefer simple pipeline | Prefer feature store |
| --- | --- | --- |
| One offline model | Yes | Usually no |
| Many models share the same features | No | Yes |
| Online inference needs low-latency features | Maybe | Yes |
| Regulated decisions need lineage | No | Yes |
| Team lacks platform ownership | Yes | Only with managed service |
| Features change weekly across teams | No | Yes |

A feature store without ownership becomes another database. The platform must own contracts, metadata, freshness monitoring, and deprecation.


Operational Metrics

| Metric | Why it matters |
| --- | --- |
| Feature freshness lag | Detects broken materialization |
| Online lookup latency | Affects prediction p99 |
| Online lookup miss rate | Reveals keying or backfill gaps |
| Null/default rate | Detects source regressions |
| Offline/online parity delta | Detects skew |
| Feature usage count | Supports cleanup and ownership |
| Backfill duration | Determines recovery time after pipeline bugs |
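
Several of these metrics can be derived from a log of online lookups. The sketch assumes each log record has a `found` flag, the returned `value`, and a `latency_ms` field, which is an invented layout. The p99 uses the nearest-rank method.

```python
import math


def lookup_metrics(lookups):
    """Compute miss rate, null rate, and p99 latency from online lookup records."""
    total = len(lookups)
    if total == 0:
        return {"miss_rate": 0.0, "null_rate": 0.0, "p99_latency_ms": None}
    misses = sum(1 for rec in lookups if not rec["found"])
    # Null rate counts keys that exist in the store but carry no value.
    nulls = sum(1 for rec in lookups if rec["found"] and rec["value"] is None)
    latencies = sorted(rec["latency_ms"] for rec in lookups)
    p99_index = math.ceil(0.99 * total) - 1      # nearest-rank percentile
    return {
        "miss_rate": misses / total,
        "null_rate": nulls / total,
        "p99_latency_ms": latencies[p99_index],
    }


lookups = [
    {"found": True, "value": 1, "latency_ms": 2.0},
    {"found": True, "value": None, "latency_ms": 3.0},
    {"found": False, "value": None, "latency_ms": 1.0},
    {"found": True, "value": 5, "latency_ms": 10.0},
]
metrics = lookup_metrics(lookups)
print(metrics)
```

Keeping misses and nulls separate matters because they point at different problems: misses suggest keying or backfill gaps, while nulls suggest a source regression.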

When to Use a Feature Store

Use a feature store when multiple models share features, online/offline consistency matters, feature freshness is operationally important, or regulated decisions require lineage.

Avoid a full feature store for a single offline model with no serving path. A versioned dataset and clear pipeline may be enough.


Key Takeaways

  1. A feature store is a consistency system, not just a database.
  2. Point-in-time correctness prevents future data leakage.
  3. Feature freshness must be monitored like an SLO.
  4. Offline and online storage can differ, but semantics must match.
  5. Feature ownership and deprecation are production reliability concerns.


A practical reference for distributed system design. Released under the MIT License.