Migration Strategies: Strangler Fig and System Replacement

TL;DR

Replacing a running system is the most common shape of senior engineering work, and the rule is: never big-bang. Risk in a rewrite accumulates with time-without-feedback, so every viable strategy is a way of shipping the replacement in slices that carry production traffic early. The toolkit: strangler fig (put a routing seam in front, move one slice at a time, shrink the legacy until it's gone), branch by abstraction (the same move inside one codebase, behind an interface and a flag), and dual-run/shadow verification (run both, diff the outputs, let data — not confidence — authorize cutover; mind the side effects). Data ownership during the transition follows expand/contract with a single writer per phase. Rollback must equal a routing change until the explicitly named point of no return. And the part everyone skips: sunset discipline — a migration is done when the old system is deleted, not when the new one works; "97% migrated forever" is the default outcome unless someone owns the burn-down.

Why Big-Bang Rewrites Fail

The rewrite fantasy: freeze features, build clean for 18 months, switch over one weekend. The mechanics of why it fails are predictable enough to be a law:

Feedback starvation. Risk grows with every week the new system carries no production traffic. The legacy system is the spec — decades of edge cases, browser quirks, and "that one tenant's CSV format" live only in its code paths, and you discover them at cutover, all at once.
The moving target. The business won't actually freeze; the legacy keeps changing while you chase it, and the rewrite needs feature parity with a system that's still growing.
Second-system effect. The rewrite accumulates every deferred dream, arriving later and heavier.
One-way door. A weekend cutover has no incremental rollback — it's the largest possible change with the smallest possible undo.

Every pattern below is the same therapy applied at different layers: shrink the batch size of the migration until each step is boring (the same argument as CI/CD — batch size is the risk).

Strangler Fig

Named for the fig that grows around a tree until the tree is gone. Mechanics:

Establish the seam first — an API gateway, load balancer rule set, or facade service through which all traffic flows. The seam is the migration's control plane: per-route, per-tenant, per-percentage steering, instant rollback. No seam, no strangler.
Pick slices by risk × independence, not importance. First slice: something read-mostly, low-blast-radius, and weakly coupled (a reporting endpoint, a settings page) — its job is to prove the pipeline (seam, deploy, observability, rollback), not to deliver value. Then proceed by capability, saving the write-heavy, invariant-dense core for last, when the machinery is drilled.
Build an anti-corruption layer where new and legacy must interoperate mid-flight — explicit translation at the boundary, so the legacy's data model doesn't colonize the new system (the whole point was to escape it).
Intercept events, not just requests: if the legacy is fed by queues/batch jobs, the seam must cover those entry points too — the HTTP router everyone remembers; the nightly cron nobody does.
Run the migration as a product: a burn-down of capabilities with owners and dates, traffic-share dashboards per slice, and a standing rule that new features land in the new system only — otherwise the legacy keeps growing at one end while you strangle the other.

Branch by Abstraction

The strangler move when there's no network seam — replacing a subsystem inside a deployed codebase (an ORM, a storage layer, a rendering engine) without a long-lived branch:

Carve an interface over the old implementation; mechanically route all call sites through it (pure refactor, zero behavior change, shippable continuously).
Build the new implementation behind the same interface, in main, dark (trunk-based — the abstraction is the branch).
Switch per call site / per cohort with a feature flag; compare; ramp.
Delete the old implementation and then the abstraction itself if it no longer earns its keep.

The discipline this buys: the codebase compiles and ships at every commit during a months-long replacement, and "rollback" is a flag flip — versus the merge-from-hell of a six-month rewrite-v2 branch.

Dual-Run and Shadow Verification

Confidence should come from diffs, not vibes. Three intensities:

Mode	Mechanics	Catches
Shadow (dark) reads	Mirror traffic to the new system; discard its responses; diff against legacy's	Correctness gaps, perf under real load — at zero user risk
Dual-run compare (Scientist-style)	Run both in the request path, serve legacy, record mismatches	Same, plus exact per-request forensics; adds latency — sample it
Dual-write	Writes go to both stores during data migration	Keeps the new store warm and verifiable pre-cutover (expand/contract)

The cardinal rule: shadowed execution must not double side effects. Mirroring a request that sends email, charges a card, or emits events means doing it twice — the shadow path needs stubbed effectors, an isolated sink, or idempotency keys shared with the primary. (This is the same machinery as open-loop load testing — replayed production traffic with the side effects defused.)

Operate the diff like an SLO: mismatch rate per slice, triaged daily; the cutover gate is a sustained near-zero diff on real traffic (with known, documented benign diffs — timestamps, ordering — explicitly filtered, not hand-waved). GitHub's Scientist library named this practice; Shopify's storefront migration ran a verifier comparing old/new renderings pixel-for-request until parity was proven, then ramped.

The Data Plane During Migration

Code routes per-request; data has memory — this is where migrations actually get hard:

One writer per phase. At any moment, exactly one system owns writes for a given record; the other follows (dual-write from the owner, or CDC replication). Two masters during a migration is a conflict-resolution problem you volunteered for.
The standard sequence is the expand/contract playbook at system scale: new store shadows (backfill + ongoing sync) → verify (counts, checksums, reconciliation reports) → flip reads → flip writes (with reverse-sync running, so rollback stays possible) → retire reverse-sync only past the point of no return.
Name the point of no return explicitly (the moment new-system writes exist that the legacy cannot represent — the migration's pivot transaction). Before it, rollback is routing; after it, rollback is itself a migration. Schedule it consciously, late, and after a soak.
Tenant-by-tenant beats percentage ramps for stateful migrations (multi-tenancy): a tenant's whole dataset moves together, reconciliation has a clean boundary, and your design-partner tenants go first by agreement.

Sunset Discipline

Migrations don't fail at the start; they dissolve at the end — the team disbands at 95% and the company runs two systems forever (paying double infra, double on-call, double security surface, and the legacy now has no owner). Counter-measures:

Define done as deletion: traffic = 0, data archived per DR/retention policy, code deleted, infra torn down, runbooks/docs updated, dashboards removed. Not before.
The last 5% (the weird tenants, the cron from 2019, the endpoint only finance uses in Q4) needs an explicit decision per item: migrate, replace with something simpler, or formally kill with stakeholder sign-off. Unowned remainders are how "temporary" becomes architecture.
Make the legacy hostile to growth during the long tail: feature-freeze enforced in CI (changes to legacy paths require an exception label), deprecation warnings on its APIs with Sunset headers and per-caller metrics — you cannot turn off what you cannot enumerate the callers of.
Track and publish the burn-down monthly. What leadership sees gets finished.

When a rewrite is right

Strangling assumes the legacy can keep running and be sliced. Genuine rewrite territory: the platform is dying under you (EOL runtime, unmaintainable stack with no hiring pool), the core model is wrong in a way every slice would inherit, or the system is small enough that a rewrite is a quarter, not a year. Even then: build the new system to production quality on a subset of traffic first (one region, one tenant tier — a strangler at coarser grain), and keep the diffing harness. The pattern isn't optional; only the slice size is.

Checklist

[ ] Seam established before any slice moves; per-route/per-tenant steering + instant rollback proven
[ ] First slice chosen to derisk the machinery; new features frozen out of the legacy
[ ] Anti-corruption layer at every new↔legacy boundary; queue/cron entry points covered, not just HTTP
[ ] Shadow/dual-run diffing in place, side effects defused, mismatch rate tracked as the cutover gate
[ ] Data: single writer per phase, backfill + reconciliation, reverse-sync until the named point of no return
[ ] Rollback = routing change, rehearsed, until PONR; PONR scheduled deliberately after soak
[ ] Sunset = deletion, with burn-down, owners, legacy growth-freeze, and caller enumeration before turn-down

References

Strangler Fig Application and Branch by Abstraction — Fowler; the canonical write-ups
GitHub Scientist — dual-run experimentation as a library, with the side-effect caveats
Stripe: Online migrations at scale — the four-phase dual-write playbook
Shopify: How we rebuilt the storefront renderer — verifier-gated replacement at scale
Things You Should Never Do, Part I — Spolsky; the case against the big bang, still undefeated
Monolith to Microservices (Sam Newman) — the full pattern catalog around the strangler

Migration Strategies: Strangler Fig and System Replacement ​

TL;DR ​

Why Big-Bang Rewrites Fail ​

Strangler Fig ​

Branch by Abstraction ​

Dual-Run and Shadow Verification ​

The Data Plane During Migration ​

Sunset Discipline ​

When a rewrite is right ​

Checklist ​

References ​