FinOps and Cost Engineering
TL;DR
Cloud moved infrastructure spend from a procurement decision to an engineering decision made implicitly, thousands of times a day, by whoever writes the code — FinOps is the discipline of making those decisions visible and deliberate. The practice: measure unit economics (cost per request, per tenant, per feature) rather than staring at the total bill; allocate spend via enforced tagging so every dollar has an owner; attack the levers in efficiency order — turn it off, right-size, tier the storage, mind the egress, then rate-optimize with commitments and spot; and wire cost into the engineering loop the way you wired latency: cost-diff estimates on infrastructure PRs, anomaly alerts within hours not at month-end, and per-tenant attribution feeding pricing decisions. Cost is just another operational signal — treat it with the same machinery as SLOs, and remember the goal is not minimal spend but maximal margin: unspent efficiency, like unspent error budget, is wasted velocity.
Unit Economics Beat Bill Watching
A monthly bill of $480K is uninterpretable — terrifying when revenue is flat, excellent when usage tripled. The signal lives in ratios:
    unit cost = spend attributable to a workload / units of value it produced
              = $ / 1K requests · $ / active tenant · $ / GB ingested ·
                $ / model training run · $ / 1M tokens served

Unit costs separate growth (bill up, unit cost flat: fine, that's success) from regression (unit cost up: something got less efficient) and make engineering trade-offs commensurable: "this cache tier costs $8K/month and cuts cost-per-request 22% while halving p99" is a sentence both finance and engineering can evaluate (Caching decisions are FinOps decisions). Pick 3–5 unit metrics that mirror your journey-level SLOs, trend them per service, and alert on their derivative.
Allocation: every dollar needs an owner
Unit economics require knowing which spend belongs to which workload — the unglamorous foundation:
- Tag at creation, enforce in CI: team, service, environment, tenant-tier on every resource, validated by IaC policy (untagged = blocked at plan time, not lamented at month-end). GitOps makes this enforceable because all resources flow through reviewed code.
- Shared platforms need metering: Kubernetes clusters, data platforms, and internal ML serving are one line on the bill but many consumers — allocate by requested resources (requests reserve capacity whether used or not; OpenCost-style allocation), and meter multi-tenant services per tenant (the same tenant-tagged metrics you built for noisy-neighbor analysis double as the cost meter; whale-tenant gross margin is a number your pricing team needs).
- Accept imperfection structurally: shared costs (NAT, observability, support plans) get a published split rule (proportional to direct spend is fine). An 85%-allocated bill with clear ownership beats a 100%-allocated one nobody trusts.
- Mind the feedback latency: billing data lags hours-to-a-day; anomaly detection on provider cost APIs + your own usage metrics catches the runaway training job today instead of on the invoice (Alerting: cost anomalies page the owning team, scaled by burn rate — a 10×-normal hourly burn is an incident).
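The allocation rules above compose into a short calculation. This is a sketch under stated assumptions: the function name, team names, and dollar figures are hypothetical, and the split rule applies shared costs in proportion to each team's direct plus metered spend, which is one reasonable reading of "proportional to direct spend".

```python
def allocate(direct: dict[str, float], cluster_cost: float,
             cpu_requests: dict[str, float], shared: float) -> dict[str, float]:
    """Split a bill across teams (illustrative rule set, not a standard).

    direct       -- spend already tagged to a team
    cluster_cost -- one shared Kubernetes line item
    cpu_requests -- requested CPU cores per team (requests, not usage)
    shared       -- costs with no natural owner (NAT, observability, support)
    """
    total_requests = sum(cpu_requests.values())
    teams = set(direct) | set(cpu_requests)
    # Step 1: the cluster line follows requested capacity.
    subtotal = {
        t: direct.get(t, 0.0)
        + cluster_cost * cpu_requests.get(t, 0.0) / total_requests
        for t in teams
    }
    # Step 2: shared costs follow the published proportional rule.
    base = sum(subtotal.values())
    return {t: round(s + shared * s / base, 2) for t, s in subtotal.items()}


bill = allocate(
    direct={"search": 30_000, "checkout": 10_000},
    cluster_cost=20_000,
    cpu_requests={"search": 120, "checkout": 40, "ml": 40},
    shared=8_000,
)
for team, cost in sorted(bill.items()):
    print(f"{team:10s} ${cost:,.2f}")
```

The sketch assumes at least one team has nonzero requests and nonzero spend; a production version would guard those divisions. Note that the hypothetical `ml` team has no tagged spend at all and still receives a share, because its requests reserve cluster capacity.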
The Levers, In Order
Efficiency before rates: optimizing the price of waste is still waste.
| # | Lever | Mechanism | Typical impact |
|---|---|---|---|
| 1 | Turn it off | Idle dev/staging nights+weekends, zombie resources (unattached volumes, idle LBs, forgotten snapshots), scale-to-zero for spiky internal tools | 10–30% of many bills pays for nothing |
| 2 | Right-size | Fit instances/requests to observed p95 usage, not founding-era guesses; one size class down ≈ −30–50% on that fleet | Continuous, automatable |
| 3 | Storage lifecycle | Hot → infrequent → archive policies; snapshot/log retention limits; compress + columnar (Parquet) | Storage grows monotonically unless told otherwise |
| 4 | Egress & topology | Cross-AZ and cross-region traffic, NAT processing, internet egress — the silent line items. Co-locate chatty services; cache at the edge (CDN); move compute to data, not data to compute | Often the most shocking audit finding |
| 5 | Commitments | Reserved/savings plans for the measured baseline (~60–80% coverage; review quarterly) | −30–60% on committed compute, zero code changes |
| 6 | Spot/preemptible | Interruption-tolerant work: batch, CI, stateless fleets with headroom, training with checkpoints — i.e., workloads you already made idempotent and resumable | −60–90% on eligible compute |
| 7 | Architecture | Tiered tenancy (pool the long tail), batch over per-event processing where latency allows, async over sync chains, ARM/efficiency silicon | The compounding, slow lever |
Two notes on the table. Commitments are a forecasting bet — commit to the floor you're sure of, cover spikes on-demand/spot; over-commitment converts the discount into lock-in. Spot is an architecture test: if interruption with a 2-minute warning breaks the workload, that fragility was already a reliability bug (Retries, checkpointing) that spot merely prices.
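The "commit to the floor" advice falls out of simple arithmetic. The sketch below assumes a simplified pricing model (a flat hourly commitment billed whether used or not, with overflow at the on-demand rate) and invented rates; real reserved-instance and savings-plan terms differ by provider.

```python
def total_cost(hourly_usage: list[float], committed: float,
               on_demand_rate: float, discount: float) -> float:
    """Cost of a usage profile under a fixed hourly commitment.

    Committed units are paid for every hour whether used or not;
    usage above the commitment is billed at the on-demand rate.
    """
    committed_rate = on_demand_rate * (1 - discount)
    total = 0.0
    for used in hourly_usage:
        overflow = max(0.0, used - committed)
        total += committed * committed_rate + overflow * on_demand_rate
    return total


# Hypothetical day: 12 hours at 100 instances, 12 hours at 40.
day = [100.0] * 12 + [40.0] * 12

for level in (0, 40, 70, 100):
    cost = total_cost(day, committed=level, on_demand_rate=1.0, discount=0.4)
    print(f"commit {level:3d} -> ${cost:,.0f}")
```

Under these assumptions the cheapest option is committing to the floor of 40, and committing to the peak of 100 costs more than the floor plus on-demand overflow. The general rule in this model is that a committed unit pays off only if it is used in more than `1 - discount` of the hours, here 60%.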
The LLM-era addendum
Token spend is the fastest-growing line on many 2026 bills, and it behaves like a utility: cost per solved task is the unit metric (LLM Evaluation), and the levers have their own ranking — prompt-cache hit rate first, model tiering second, batch-tier routing for async work, output-length discipline, then provider negotiation (LLM Infrastructure and Harness Engineering cover the mechanics). The FinOps Foundation's scope extension to SaaS/AI spend reflects the same shift: the bill you can engineer is no longer only the IaaS bill.
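Cost per solved task can be computed from a handful of inputs. This is a minimal sketch: the `ModelPrice` type and every price in it are hypothetical placeholders, not any provider's actual rates, and the model assumes failed attempts are simply retried at the same cost.

```python
from dataclasses import dataclass


@dataclass
class ModelPrice:
    """Hypothetical USD prices per 1M tokens."""
    input_per_m: float
    cached_input_per_m: float
    output_per_m: float


def cost_per_solved_task(price: ModelPrice, input_tokens: int,
                         output_tokens: int, cache_hit_rate: float,
                         solve_rate: float) -> float:
    """Expected spend per successfully solved task.

    cache_hit_rate -- fraction of input tokens served from the prompt cache
    solve_rate     -- fraction of attempts that pass evaluation
    """
    cached = input_tokens * cache_hit_rate
    uncached = input_tokens - cached
    per_attempt = (
        uncached * price.input_per_m
        + cached * price.cached_input_per_m
        + output_tokens * price.output_per_m
    ) / 1_000_000
    # Failed attempts still cost money, so divide by the solve rate.
    return per_attempt / solve_rate


price = ModelPrice(input_per_m=3.0, cached_input_per_m=0.3, output_per_m=15.0)
cold = cost_per_solved_task(price, 20_000, 1_000, cache_hit_rate=0.0, solve_rate=0.8)
warm = cost_per_solved_task(price, 20_000, 1_000, cache_hit_rate=0.8, solve_rate=0.8)
print(f"no cache: ${cold:.4f}  80% cache hits: ${warm:.4f}")
```

With these made-up prices, an 80% cache hit rate more than halves cost per solved task without touching the model or the prompt content, which is consistent with ranking cache hit rate first among the levers.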
Wiring Cost Into the Engineering Loop
The cultural failure mode is cost-as-quarterly-cleanup: a heroic audit, 25% savings, regrowth within two quarters. The fix is the same as for quality and reliability — move the signal to where decisions happen:
- Cost-diff on PRs: infrastructure changes show their monthly delta in review, exactly like a bundle-size or coverage check. A reviewer who sees "+$4,200/mo" asks questions a month-end report never provokes.
- Budgets as SLOs: each service gets a unit-cost target and an absolute guardrail; breaches open tickets through the normal incident/error-budget machinery, not a finance email thread. (And symmetrically — chronically under target with slipping latency SLOs means you over-optimized; spend it.)
- Showback before chargeback: publish per-team dashboards first (visibility changes behavior on its own); move to internal billing only where incentives genuinely need teeth — chargeback wars over allocation rules can cost more attention than they save money.
- Forecast architecture, not just trends: the big cost events are step functions — a new feature, a tenant 10× the median, a region addition (multi-region roughly doubles infrastructure as a planned line item, not a surprise).
- Make the efficient path the default path: golden IaC modules with lifecycle policies, autoscaling, and right-sized defaults baked in beat any amount of after-the-fact policing — platform engineering is where FinOps compounds.
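The "budgets as SLOs" rule, including its symmetric case, can be expressed as a small check that a scheduled job might run per service. The status names, the 20% slack, and the function signature are assumptions made for this sketch.

```python
from enum import Enum


class BudgetStatus(Enum):
    OK = "ok"
    UNIT_COST_BREACH = "unit_cost_breach"
    GUARDRAIL_BREACH = "guardrail_breach"
    OVER_OPTIMIZED = "over_optimized"


def evaluate_budget(unit_cost: float, unit_cost_target: float,
                    monthly_spend: float, spend_guardrail: float,
                    latency_slo_met: bool, slack: float = 0.2) -> BudgetStatus:
    """Classify a service against its unit-cost target and absolute guardrail.

    slack is an assumed margin: a service more than this fraction under
    target while missing its latency SLO is flagged as over-optimized.
    """
    if monthly_spend > spend_guardrail:
        return BudgetStatus.GUARDRAIL_BREACH
    if unit_cost > unit_cost_target:
        return BudgetStatus.UNIT_COST_BREACH
    if unit_cost < unit_cost_target * (1 - slack) and not latency_slo_met:
        return BudgetStatus.OVER_OPTIMIZED
    return BudgetStatus.OK


# Hypothetical service: well under its cost target but missing latency.
status = evaluate_budget(
    unit_cost=0.25, unit_cost_target=0.40,
    monthly_spend=50_000, spend_guardrail=80_000,
    latency_slo_met=False,
)
print(status.value)
```

Any status other than `OK` would open a ticket through the normal incident process. Treating `OVER_OPTIMIZED` as a first-class outcome keeps the check from rewarding savings that were bought with slipping SLOs.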
Checklist
- [ ] 3–5 unit-cost metrics defined, trended per service, alerted on derivative
- [ ] Tagging enforced at IaC plan time; shared platforms metered (K8s by requests, multi-tenant services by tenant)
- [ ] Cost anomaly detection on hours-latency data, paging the owning team
- [ ] Idle/zombie sweep automated; right-sizing recommendations applied on a cadence
- [ ] Storage lifecycle + retention policies on every bucket, log group, and snapshot chain
- [ ] Egress/cross-AZ topology reviewed; chatty services co-located
- [ ] Commitments cover the measured baseline only; coverage reviewed quarterly
- [ ] Spot adopted for interruption-tolerant tiers (and the interruption-tolerance actually tested)
- [ ] Cost-diff visible on infra PRs; unit-cost budgets wired to the incident process
- [ ] Per-tenant cost attribution feeding pricing/margin decisions; LLM cost-per-solved-task tracked
References
- FinOps Foundation Framework — phases (inform/optimize/operate), personas, and the FOCUS billing-data standard
- AWS Well-Architected: Cost Optimization Pillar — the lever catalog, provider-flavored
- The Frugal Architect — Werner Vogels; cost as a nonfunctional requirement
- OpenCost — Kubernetes cost allocation (the requests-vs-usage model)
- Infracost — cost diffs in pull requests