
Leases, Heartbeats, and Recovery

Workflow and job systems use leases to decide who is currently allowed to execute work. A lease is a time-bounded claim, not proof that every other worker has stopped. Heartbeats extend that claim while work is still progressing. Recovery begins when a lease expires, but correctness still depends on idempotency and fencing, because a worker whose lease has expired may keep running.

Lease vs Lock

| Mechanism | Meaning |
| --- | --- |
| Lock | Exclusive ownership until release, often dangerous if owner dies |
| Lease | Exclusive ownership until a deadline |
| Heartbeat | Owner proves it is still alive and extends the lease |
| Fencing token | Monotonic token used by downstreams to reject stale owners |

Leases are safer than indefinite locks because ownership lapses on its own when the holder dies, so recovery needs no manual release. They do not, by themselves, provide exactly-once execution.

Basic Lease Flow

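Each lease acquisition issues a new, larger fencing token. For example, if one worker acquires the lease with token 42 and, after that lease expires, another worker acquires it with token 43, the downstream should reject any late writes that still carry token 42.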
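The flow can be sketched with a small in-memory lease table. This is an illustration under stated assumptions, not a production design: `LeaseStore`, `Lease`, and their methods are hypothetical names, the store tracks a single job, and time is passed in explicitly so the example is deterministic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    owner: str
    token: int         # fencing token, strictly increasing per acquisition
    expires_at: float  # deadline on the store's clock


class LeaseStore:
    """Hypothetical in-memory lease table for one job (illustration only)."""

    def __init__(self) -> None:
        self._lease: Optional[Lease] = None
        self._next_token = 42  # arbitrary start so the tokens match the text

    def acquire(self, owner: str, now: float, ttl: float) -> Optional[Lease]:
        # Refuse while an existing lease is still inside its deadline.
        if self._lease is not None and now < self._lease.expires_at:
            return None
        self._lease = Lease(owner, self._next_token, now + ttl)
        self._next_token += 1
        return self._lease

    def heartbeat(self, owner: str, token: int, now: float, ttl: float) -> bool:
        lease = self._lease
        # Only the current holder may extend, and only before the deadline.
        if lease is None or lease.owner != owner or lease.token != token:
            return False
        if now >= lease.expires_at:
            return False
        lease.expires_at = now + ttl
        return True


store = LeaseStore()
a = store.acquire("worker-a", now=0.0, ttl=10.0)                    # token 42
extended = store.heartbeat("worker-a", a.token, now=8.0, ttl=10.0)  # deadline moves to 18
blocked = store.acquire("worker-b", now=12.0, ttl=10.0)             # None: lease still valid
b = store.acquire("worker-b", now=20.0, ttl=10.0)                   # token 43 after expiry
stale = store.heartbeat("worker-a", a.token, now=21.0, ttl=10.0)    # False: worker-a lost the lease
```

A failed heartbeat is the earliest signal a worker gets that it has lost ownership, but the worker may not notice in time, which is why downstream fencing is still needed.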

Lease Duration

| Lease too short | Lease too long |
| --- | --- |
| Duplicate attempts during normal GC pauses or slow calls | Slow recovery after worker death |
| Higher downstream duplicate pressure | Stuck jobs stay invisible longer |
| More heartbeat traffic | Fewer false expirations |

Set lease duration from observed P99 heartbeat delay plus a safety margin, not from average job runtime.
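As a rough sketch of that rule, the snippet below derives a lease duration from observed gaps between consecutive heartbeats. It assumes "heartbeat delay" means the gap the lease store sees between two successful heartbeats; the function names, the sample data, and the 2.5-second margin are all hypothetical.

```python
from typing import Sequence


def nearest_rank_percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # ceil(pct / 100 * n) in integer math
    return ordered[rank - 1]


def lease_duration_s(heartbeat_gaps_s: Sequence[float], margin_s: float) -> float:
    """P99 of observed heartbeat gaps plus a fixed safety margin."""
    return nearest_rank_percentile(heartbeat_gaps_s, 99) + margin_s


# 100 made-up samples: mostly on schedule, with a few slow heartbeats.
gaps = [5.0] * 97 + [5.4, 7.5, 12.0]
lease_s = lease_duration_s(gaps, margin_s=2.5)  # 7.5 + 2.5 = 10.0 seconds
```

With these samples the mean gap is about 5.1 seconds, so a lease sized from the average would expire during ordinary slow heartbeats. The single 12-second outlier would still cause a false expiration at 10 seconds, which is the trade-off the margin controls.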

Heartbeat Payloads

Heartbeats can store progress:

```json
{
  "processed_records": 250000,
  "last_object_key": "logs/2026/06/15/part-00042.gz",
  "updated_at": "2026-06-15T10:30:00Z"
}
```

Progress lets a retry resume instead of starting from zero, but only if the task is designed around checkpoint boundaries.
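A minimal sketch of resuming from such a payload, assuming each object key is a checkpoint boundary and that a heartbeat is sent only after an object is fully processed. `run_with_checkpoints`, `process_object`, and the short key names are hypothetical.

```python
from typing import Callable, List, Optional


def run_with_checkpoints(
    keys: List[str],
    process_object: Callable[[str], int],
    send_heartbeat: Callable[[dict], None],
    checkpoint: Optional[dict] = None,
) -> None:
    """Process keys in order, skipping everything up to the checkpointed key."""
    done = checkpoint["processed_records"] if checkpoint else 0
    start = keys.index(checkpoint["last_object_key"]) + 1 if checkpoint else 0
    for key in keys[start:]:
        done += process_object(key)  # returns the record count for this object
        # Report progress only after the object is complete.
        send_heartbeat({"processed_records": done, "last_object_key": key})


keys = ["part-00040.gz", "part-00041.gz", "part-00042.gz", "part-00043.gz"]
processed: List[str] = []
crash_once = {"part-00042.gz"}


def process_object(key: str) -> int:
    if key in crash_once:
        crash_once.discard(key)
        raise RuntimeError("simulated worker crash")
    processed.append(key)
    return 1000  # pretend every object holds 1000 records


heartbeats: List[dict] = []
try:
    run_with_checkpoints(keys, process_object, heartbeats.append)
except RuntimeError:
    pass  # first attempt dies while working on part-00042.gz

checkpoint = heartbeats[-1]  # what the lease store would have persisted
run_with_checkpoints(keys, process_object, heartbeats.append, checkpoint)
```

The retry starts at `part-00042.gz` instead of the beginning. Work done on the in-flight object before the crash is repeated, so the per-object step still has to be idempotent.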

Recovery Controller

Recovery should be a controller, not an ad hoc worker behavior. Controllers make stuck-state policy visible and testable.
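One way to make that policy explicit is a single reconcile pass that can be unit-tested on its own. The sketch below is an assumption-heavy illustration: the `Job` fields, the status names, and the `MAX_ATTEMPTS` limit are hypothetical, and a real controller would apply each change with a conditional write against the job store.

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_ATTEMPTS = 3  # hypothetical retry policy


@dataclass
class Job:
    id: str
    status: str  # "queued", "running", "succeeded", or "failed"
    lease_expires_at: Optional[float] = None
    attempts: int = 0


def reconcile_expired(jobs: List[Job], now: float) -> List[str]:
    """One controller pass: requeue or fail running jobs whose lease expired."""
    actions = []
    for job in jobs:
        if job.status != "running" or job.lease_expires_at is None:
            continue
        if now < job.lease_expires_at:
            continue  # lease still valid, leave the job alone
        job.lease_expires_at = None
        if job.attempts >= MAX_ATTEMPTS:
            job.status = "failed"
            actions.append(f"fail {job.id}")
        else:
            job.status = "queued"
            actions.append(f"requeue {job.id}")
    return actions


jobs = [
    Job("a", "running", lease_expires_at=100.0, attempts=1),
    Job("b", "running", lease_expires_at=300.0, attempts=1),
    Job("c", "running", lease_expires_at=50.0, attempts=3),
]
actions = reconcile_expired(jobs, now=200.0)  # ["requeue a", "fail c"]
```

Because the pass takes the job list and the current time as inputs and returns the actions it took, the stuck-state policy can be tested without running any workers.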

Orphan Detection

An orphan is work that is neither terminal nor runnable.

Examples:

  • Job has no lease but status is running.
  • Workflow waits for a timer that was never stored.
  • Activity completed but completion event was not appended.
  • Parent workflow canceled but child jobs continue.

Run reconciliation jobs that scan for impossible states and repair or alert.
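A reconciliation scan for two of the examples above might look like the following sketch. The `JobRow` shape and status names are hypothetical, and the function only reports findings; whether to repair or alert is left to the caller.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class JobRow:
    id: str
    status: str  # "queued", "running", "succeeded", "failed", or "canceled"
    lease_owner: Optional[str] = None
    parent_id: Optional[str] = None


def find_orphans(rows: List[JobRow]) -> List[Tuple[str, str]]:
    """Return (job id, reason) pairs for states that should be impossible."""
    by_id = {row.id: row for row in rows}
    findings = []
    for row in rows:
        if row.status == "running" and row.lease_owner is None:
            findings.append((row.id, "running without a lease"))
        parent = by_id.get(row.parent_id) if row.parent_id else None
        if (
            parent is not None
            and parent.status == "canceled"
            and row.status in ("queued", "running")
        ):
            findings.append((row.id, "parent canceled but child still active"))
    return findings


rows = [
    JobRow("wf-1", "canceled"),
    JobRow("job-1", "running", lease_owner="worker-a", parent_id="wf-1"),
    JobRow("job-2", "running"),
    JobRow("job-3", "succeeded", parent_id="wf-1"),
]
findings = find_orphans(rows)
```

Here `job-1` is flagged because its parent was canceled, and `job-2` because it is running with no lease; the finished `job-3` is left alone.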

Fencing

Fencing protects shared resources from stale workers:

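```sql
-- Accept the write only if no newer owner has written already.
-- <= (rather than <) lets the current owner write more than once
-- under the same token without being treated as stale.
UPDATE resources
SET value = :new_value,
    fencing_token = :token
WHERE id = :id
  AND fencing_token <= :token;
```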
If the update affects zero rows, the worker is stale and must stop.
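The same check can be exercised end to end with Python's built-in `sqlite3` module. This is a sketch under assumptions: the table layout and the `fenced_write` helper are hypothetical, the row is seeded with token 0, and the comparison uses `<=` so that the current owner can write repeatedly under one token.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE resources ("
    "id TEXT PRIMARY KEY, value TEXT, fencing_token INTEGER NOT NULL)"
)
conn.execute("INSERT INTO resources VALUES ('r1', 'initial', 0)")
conn.commit()


def fenced_write(conn: sqlite3.Connection, resource_id: str,
                 new_value: str, token: int) -> bool:
    """Apply the write only if no newer token has been recorded."""
    cur = conn.execute(
        """
        UPDATE resources
        SET value = :new_value, fencing_token = :token
        WHERE id = :id AND fencing_token <= :token
        """,
        {"new_value": new_value, "token": token, "id": resource_id},
    )
    conn.commit()
    return cur.rowcount == 1  # zero rows means the caller is stale


first = fenced_write(conn, "r1", "from-worker-a", 42)        # True
second = fenced_write(conn, "r1", "from-worker-b", 43)       # True: newer owner
late = fenced_write(conn, "r1", "late-from-worker-a", 42)    # False: stale token
again = fenced_write(conn, "r1", "worker-b-again", 43)       # True: same owner
final_value = conn.execute(
    "SELECT value FROM resources WHERE id = 'r1'"
).fetchone()[0]
```

The stale write from worker A never reaches the row, so the final value is the one written by the holder of token 43.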

Failure Modes

| Failure | Cause | Mitigation |
| --- | --- | --- |
| Split execution | Lease expires while original worker continues | Fencing and idempotent side effects |
| Lost heartbeat | Network blip | Lease margin and retry heartbeat |
| Zombie worker | Process stuck but still heartbeating | Progress-based heartbeats, not just liveness |
| Stuck running state | Worker died before releasing | Expired-lease scanner |
| Double completion | Two workers race to complete | Compare-and-set terminal transition |
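The compare-and-set terminal transition listed for double completion can be sketched as follows. The `JobState` class is hypothetical, and the in-process lock only stands in for whatever atomic conditional update the real job store offers.

```python
import threading
from typing import Dict, Optional


class JobState:
    """Hypothetical job record with an atomic status transition."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.status = "running"
        self.completed_by: Optional[str] = None

    def complete(self, expected: str, new_status: str, worker: str) -> bool:
        """Move expected -> new_status atomically; False if the state changed first."""
        with self._lock:
            if self.status != expected:
                return False
            self.status = new_status
            self.completed_by = worker
            return True


job = JobState()
results: Dict[str, bool] = {}


def finish(worker: str, status: str) -> None:
    results[worker] = job.complete("running", status, worker)


threads = [
    threading.Thread(target=finish, args=("worker-a", "succeeded")),
    threading.Thread(target=finish, args=("worker-b", "failed")),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

winners = [worker for worker, won in results.items() if won]
```

Whichever worker arrives first wins; the other sees `False` and should discard its result rather than overwrite the terminal state.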

Operational Metrics

  • Active leases.
  • Lease acquisition latency.
  • Heartbeat success rate.
  • Lease expiration count.
  • Recovery requeue count.
  • Fencing rejection count.
  • Orphaned job count.
  • Time from worker death to requeue.

A practical reference for distributed system design. Released under the MIT License.