Database Schema Migrations

TL;DR

Schema changes are deployments — the riskiest kind, because they're stateful, often lock-taking, and rarely reversible. Two rules generate every safe migration: (1) never couple a schema change and a code change in one step — every schema version must work with both the previous and next code version (N−1 compatibility), because deploys are rolling and rollbacks happen; (2) decompose every breaking change into expand → migrate → contract: add the new shape alongside the old, dual-write and backfill, switch reads, then remove the old shape only after verification. Use online-DDL mechanics (CREATE INDEX CONCURRENTLY, NOT VALID constraints, gh-ost) on large tables, run backfills as throttled idempotent batch jobs, lint migrations in CI, and plan to roll forward — down-migrations that drop data are a fiction.

Why Naive Migrations Cause Outages

ALTER TABLE looks innocent in development (1,000 rows) and takes a lock in production (1,000,000,000 rows). Two distinct failure classes:

Locking. DDL generally needs (briefly or for the duration) an exclusive lock. Even "instant" DDL in Postgres must acquire the ACCESS EXCLUSIVE lock — and if it queues behind one long-running query, every subsequent query queues behind it: a one-millisecond change causes a multi-minute outage. Always run DDL with lock_timeout (e.g., 2s) and retry, so a contended lock attempt fails fast instead of stalling the world.

Version skew. Deploys are rolling: for minutes (or, with canaries, hours) old code and new code run simultaneously against one schema — and a rollback can extend that to days. Rename a column in one step and the old pods crash instantly:

Hence the contract: schema change N must be compatible with code versions N−1 and N. Which immediately forbids, as single steps: renaming columns/tables, changing column types, adding NOT NULL to existing columns, dropping anything still referenced. All of those become multi-step dances.

Operation safety reference

Operation	Postgres	MySQL (8.0/InnoDB)
Add nullable column (no default rewrite)	✅ instant	✅ instant
Add column + volatile default	✅ (11+: no rewrite)	⚠️ often rewrite
Create index	⚠️ use `CONCURRENTLY`	✅ inplace (still I/O-heavy)
Add FK / CHECK constraint	⚠️ use `NOT VALID` + `VALIDATE`	⚠️ full scan
Add `NOT NULL`	⚠️ via `CHECK NOT VALID` route	⚠️ inplace, locks under load
Change column type / PK type	❌ expand-contract	❌ expand-contract or gh-ost
Rename column/table	❌ expand-contract (metadata-fast but breaks N−1 code)	same
Drop column	⚠️ only as the contract step	same

(Tools like strong_migrations and squawk encode exactly this table as CI lint rules — adopt one rather than relying on memory.)

Expand → Migrate → Contract

The universal recipe, shown for the canonical hard case — renaming full_name to display_name on a hot table:

Expand (schema): ALTER TABLE users ADD COLUMN display_name text; — additive, instant, invisible to v1 code.
Dual-write (code deploy): writes populate both columns; reads still use the old one. Crucially, this step is fully rollback-safe.
Backfill (data job): copy historical rows — see below. New writes are already correct via dual-write, so the backfill only chases history.
Verify: count mismatches (WHERE full_name IS DISTINCT FROM display_name), checksum samples, and ideally shadow-read — read both, compare, log diffs, serve the old value. Zero diffs for a representative window is the gate.
Switch reads (code deploy): read the new column; keep dual-writing so rollback to step 4 remains trivial.
Write new only (code deploy): stop touching the old column.
Contract (schema): drop the old column — days later, after you're certain no rollback will need it, and after checking nothing else reads it (views, reports, CDC consumers — the dropped column disappears from the change stream too).

Every step is individually deployable, individually reversible (until contract), and N−1 compatible. The same skeleton handles type changes (including the infamous int → bigint primary-key migration: new column, dual-write, backfill, swap with a brief rename transaction), table splits, and moving data between databases — only the dual-write/dual-read plumbing varies. Feature flags pair naturally with the read-switch step (Feature Flags): flip reads gradually, compare error rates, revert instantly.

Backfills at scale

A backfill is a batch job with a database in the blast radius. Requirements:

python

def backfill(batch_size=2000, max_replica_lag_s=5):
    last_id = checkpoint.load()                      # resumable
    while True:
        rows = db.execute("""
            UPDATE users SET display_name = full_name
            WHERE id > %s AND id <= %s AND display_name IS NULL
            RETURNING max(id)""", last_id, last_id + batch_size)
        if rows.empty: break
        last_id = rows.max_id
        checkpoint.save(last_id)                     # idempotent restarts
        while replica_lag() > max_replica_lag_s:     # throttle on real signals
            time.sleep(1)

Batch by primary-key range (not OFFSET), keep transactions short, commit per batch.
Throttle on observed health — replica lag, p99 latency — not a fixed sleep. The job should be the first thing to slow down, not your users. (Backpressure)
Idempotent + checkpointed: the IS NULL guard makes re-runs safe; the checkpoint makes restarts cheap. Long backfills will be interrupted.
For billion-row tables, run the backfill as a proper batch pipeline and budget days, not minutes. Boring is the goal.

Online DDL Mechanics

When the operation itself rewrites a big table, use the non-blocking machinery:

Postgres.

CREATE INDEX CONCURRENTLY — builds without blocking writes (slower; can't run in a transaction; on failure leaves an INVALID index to drop and retry).
Constraints in two phases: ALTER TABLE ... ADD CONSTRAINT ... NOT VALID (instant — enforces for new writes only), then VALIDATE CONSTRAINT (full scan with only a light lock). The supported route to NOT NULL on a big table goes through a NOT VALID check constraint.
Wrap all DDL in SET lock_timeout = '2s' + retry loops.

MySQL.

InnoDB online DDL (ALGORITHM=INPLACE/INSTANT) covers many cases, but still object-level and I/O-bound on huge tables.
gh-ost (GitHub) does the general case externally: create a ghost table with the new schema, copy rows in batches, apply ongoing changes by tailing the binlog (no triggers — unlike pt-online-schema-change, which uses triggers and adds write overhead), then atomically swap names. Pausable, throttleable on replica lag, and rehearsable.

The managed-platform versions of this (PlanetScale deploy requests, et al.) are the same ghost-table mechanics productized — with the addition of revertible schema changes, which is the direction of travel.

Migrations in the Delivery Pipeline

Migrations are code: versioned files (Flyway/Alembic/golang-migrate/Rails), applied by automation with an advisory lock so two deploys can't race (CI/CD & GitOps), recorded in a schema-history table. No human runs DDL in prod by hand.
Order of operations: expand-phase migrations apply before the code that uses them rolls out; contract-phase migrations apply after the code that stops using the old shape is fully rolled out and you've decided you won't roll back past it. Automate the first; gate the second on a human.
Lint in CI: reject unsafe operations (the table above) mechanically — squawk, strong_migrations, or custom rules. Add a CI check that every migration runs against a realistic-size dataset clone or at minimum EXPLAINs its locks.
Plan to roll forward. Auto-generated down-migrations either lose data (DROP COLUMN undone how?) or lie. Treat "down" as: ship a new forward migration that restores compatibility. The expand/contract discipline is what makes this safe — at every point, the previous code version still works, which is the only rollback that matters.
Don't mix DML and DDL in one migration file: schema steps are fast and lock-sensitive; data steps are slow and throttle-sensitive. Different tools, different supervision.

Checklist

[ ] Every migration is N−1 compatible (old code runs against new schema)
[ ] Breaking changes decomposed into expand → migrate → contract, each step separately deployed
[ ] lock_timeout + retry on all DDL; CONCURRENTLY / NOT VALID / gh-ost for big tables
[ ] Backfills: PK-range batched, idempotent, checkpointed, throttled on replica lag
[ ] Verification gate (counts/checksums/shadow reads) before read-switch; flags for gradual cutover
[ ] Contract step delayed past the rollback horizon; downstream readers (views, CDC, reports) checked
[ ] CI lints migrations for unsafe operations; apply is automated and serialized
[ ] Rollback story = roll forward; tested on a clone, not believed on faith

References

gh-ost — GitHub's triggerless online schema change; the design doc is a masterclass
pt-online-schema-change — the trigger-based predecessor
strong_migrations / squawk — unsafe-operation linters encoding the rules above
PostgreSQL: lock levels of DDL and Braintree: PostgreSQL migrations without downtime
Evolutionary Database Design — Fowler & Sadalage; expand/contract origins

Database Schema Migrations ​

TL;DR ​

Why Naive Migrations Cause Outages ​

Operation safety reference ​

Expand → Migrate → Contract ​

Backfills at scale ​

Online DDL Mechanics ​

Migrations in the Delivery Pipeline ​

Checklist ​

References ​