
Background Jobs and Worker Pools

Background job systems execute work outside the request path: emails, exports, image processing, billing reconciliation, cache warming, webhook delivery, and cleanup. The hard parts are not enqueueing and consuming. The hard parts are duplicate execution, capacity isolation, graceful shutdown, poison jobs, unbounded retries, and proving that old work is still making progress.

Mental Model

A job is a durable intent to perform work. A worker lease is a temporary right to attempt it.

The queue is not the source of truth unless it can represent all job states, retries, attempts, and leases durably. Many production systems use both: a database for truth and a queue for wakeups.

Job Lifecycle

| State | Meaning |
| --- | --- |
| Pending | Created but not runnable yet |
| Runnable | Eligible for a worker |
| Leased | A worker owns the attempt until lease_until |
| Succeeded | Terminal success |
| Retryable failed | Failed but will run again after backoff |
| Dead | Terminal failure after policy exhaustion |
| Canceled | Terminal cancellation before completion |

Do not model every failure as a boolean. You need attempt count, next run time, error class, and last heartbeat for useful operations.
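As a rough illustration, the states and fields above could be carried on a single job record. This is a minimal sketch: the `State` and `Job` names and the `ALLOWED` transition map are assumptions made for the example, not a schema the text prescribes.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional


class State(Enum):
    PENDING = "pending"
    RUNNABLE = "runnable"
    LEASED = "leased"
    SUCCEEDED = "succeeded"
    RETRYABLE_FAILED = "retryable_failed"
    DEAD = "dead"
    CANCELED = "canceled"


# Assumed transition map. Terminal states have no outgoing edges;
# LEASED -> RUNNABLE models a lease that expired without a result.
ALLOWED = {
    State.PENDING: {State.RUNNABLE, State.CANCELED},
    State.RUNNABLE: {State.LEASED, State.CANCELED},
    State.LEASED: {State.SUCCEEDED, State.RETRYABLE_FAILED, State.DEAD,
                   State.CANCELED, State.RUNNABLE},
    State.RETRYABLE_FAILED: {State.RUNNABLE, State.DEAD, State.CANCELED},
    State.SUCCEEDED: set(),
    State.DEAD: set(),
    State.CANCELED: set(),
}


@dataclass
class Job:
    id: str
    state: State = State.PENDING
    attempts: int = 0                          # attempt count
    run_after: Optional[datetime] = None       # next run time
    error_class: Optional[str] = None          # class of the last failure
    last_heartbeat: Optional[datetime] = None  # last sign of life from a worker

    def transition(self, new_state: State) -> None:
        """Move to new_state, rejecting edges not in ALLOWED."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(
                f"illegal transition {self.state.value} -> {new_state.value}"
            )
        self.state = new_state
```

Rejecting illegal transitions in one place makes it harder for a late worker to flip a terminal job back to a live state by accident.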

Worker Lease Pattern

```sql
-- Claim one runnable job whose lease is absent or expired.
UPDATE jobs
SET lease_owner = :worker_id,
    lease_until = now() + interval '5 minutes',
    attempts = attempts + 1
WHERE id = (
  SELECT id
  FROM jobs
  WHERE status = 'runnable'
    AND run_after <= now()
    AND (lease_until IS NULL OR lease_until < now())
  ORDER BY priority DESC, run_after ASC
  LIMIT 1
  FOR UPDATE SKIP LOCKED  -- concurrent workers skip rows that are already being claimed
)
RETURNING *;
```

The lease gives recovery a clean rule: if the worker does not finish or heartbeat before expiration, another worker may retry. The job handler itself must still be idempotent because the old worker might be slow rather than dead.
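The heartbeat half of that rule can be sketched as a conditional update: extend the lease only if this worker still owns it and it has not expired. The example below uses Python's `sqlite3` as a stand-in database and epoch seconds for time; the `heartbeat` and `make_demo_db` helpers are hypothetical, and only the `lease_owner` and `lease_until` column names come from the query above.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory table with one job leased by worker-a until t=1000."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE jobs (id INTEGER PRIMARY KEY, lease_owner TEXT, lease_until REAL)"
    )
    conn.execute("INSERT INTO jobs VALUES (1, 'worker-a', 1000.0)")
    conn.commit()
    return conn


def heartbeat(conn, job_id, worker_id, now, extend_seconds=300.0) -> bool:
    """Extend the lease only if worker_id still holds an unexpired lease.

    Returns False when the lease was lost, in which case the caller
    should stop working on the job.
    """
    cur = conn.execute(
        "UPDATE jobs SET lease_until = ? "
        "WHERE id = ? AND lease_owner = ? AND lease_until >= ?",
        (now + extend_seconds, job_id, worker_id, now),
    )
    conn.commit()
    return cur.rowcount == 1
```

A worker that gets `False` back has learned that another worker may already be running the job, which is the case the idempotency requirement exists for.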

Queue Design

| Design | Strength | Risk |
| --- | --- | --- |
| Broker-only queue | Simple and high throughput | Harder to inspect and repair complex state |
| Database-backed jobs | Strong inspectability and transactions | Polling and locking can bottleneck |
| DB truth plus broker wakeup | Durable state plus responsive workers | More moving parts and reconciliation |
| Partitioned queues | High scale and isolation | Rebalancing and hot partition complexity |

Worker Pool Sizing

Worker capacity should be set by the bottleneck, not by queue depth alone.

| Bottleneck | Scaling signal | Protection |
| --- | --- | --- |
| CPU-bound jobs | CPU saturation and run duration | Worker autoscaling |
| DB-bound jobs | DB connections, lock waits, query latency | Concurrency caps per job type |
| Third-party API | 429s, timeout rate, vendor quotas | Token bucket per integration |
| Memory-heavy jobs | RSS, OOM kills, spill rate | Job class isolation |

Queue depth without age is misleading. A queue with 1M tiny jobs may be healthy; a queue with 10 old payment jobs may be an incident.
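For the third-party API row in the table above, a per-integration token bucket might look like the following. It is a minimal single-threaded sketch: the `TokenBucket` class, its parameters, and the integration name in `buckets` are all hypothetical, and a real worker pool would need a lock or a shared store around it.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate_per_sec` tokens per second."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock                 # injectable so tests can control time
        self.tokens = float(capacity)      # start full
        self.updated = clock()

    def try_acquire(self, cost=1.0) -> bool:
        """Take `cost` tokens if available; never blocks."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# One bucket per integration (hypothetical name and limits).
buckets = {"billing-vendor": TokenBucket(rate_per_sec=5, capacity=10)}
```

A worker that fails to acquire a token can reschedule the job with a short delay instead of calling the vendor and collecting a 429.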

Graceful Shutdown

Workers need a deploy contract:

  1. Stop accepting new leases.
  2. Finish current jobs within a drain window.
  3. Heartbeat long jobs while draining.
  4. Release unfinished leases, or let them expire.
  5. Persist enough progress for resumed execution.

If deploys kill workers abruptly, every deploy becomes a duplicate-execution test.
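The deploy contract above can be sketched with a stop flag and a bounded join. This is an illustrative outline only: `Worker`, `lease_next`, `run_job`, and `release` are hypothetical names, and heartbeating during the drain (step 3) and progress persistence (step 5) are left out to keep it short.

```python
import threading
import time


class Worker:
    def __init__(self, lease_next, run_job, release):
        self.lease_next = lease_next   # hypothetical: returns a job or None
        self.run_job = run_job         # hypothetical: executes one job
        self.release = release         # hypothetical: gives a lease back early
        self.stopping = threading.Event()
        self.current = None
        # Daemon thread so a job that overruns the drain window cannot block exit.
        self.thread = threading.Thread(target=self._loop, daemon=True)

    def _loop(self):
        while not self.stopping.is_set():
            job = self.lease_next()
            if job is None:
                self.stopping.wait(0.01)   # idle briefly, wake early on shutdown
                continue
            self.current = job
            self.run_job(job)
            self.current = None

    def start(self):
        self.thread.start()

    def shutdown(self, drain_seconds) -> bool:
        """Return True if the worker drained cleanly within the window."""
        self.stopping.set()                # step 1: stop accepting new leases
        self.thread.join(drain_seconds)    # step 2: let the current job finish
        job = self.current
        if self.thread.is_alive() and job is not None:
            self.release(job)              # step 4: release the unfinished lease
            return False
        return True
```

In a real process, `shutdown` would typically be called from a SIGTERM handler, with `drain_seconds` set below the orchestrator's kill timeout.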

Retry Policy

Retries should be explicit by error class.

| Error | Retry? | Policy |
| --- | --- | --- |
| Network timeout | Yes | Exponential backoff with jitter |
| Database deadlock | Yes | Short bounded retry |
| 429 rate limit | Yes | Backoff from Retry-After or quota state |
| Validation error | No | Mark dead with clear reason |
| Missing dependency | Maybe | Retry only if dependency can appear later |

Use idempotency for all side effects. Retries without idempotency are data corruption with a delay.
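One way to make the policy explicit is a lookup keyed by error class that returns either a delay or a decision to stop. The sketch below assumes hypothetical error class names and limits in `POLICY`, and uses the "full jitter" variant of exponential backoff, which is one common choice rather than the only one.

```python
import random
from typing import Optional

# error_class -> (max_attempts, base_delay_s, cap_s). Illustrative numbers.
# Classes absent from the map (for example validation errors) are not retried.
POLICY = {
    "network_timeout": (8, 1.0, 300.0),
    "db_deadlock": (3, 0.05, 1.0),
    "rate_limited": (10, 1.0, 600.0),
}


def next_delay(error_class, attempt, retry_after=None, rng=random) -> Optional[float]:
    """Seconds to wait before the next attempt, or None to mark the job dead.

    `attempt` is the number of attempts already made (1 after the first failure).
    """
    policy = POLICY.get(error_class)
    if policy is None:
        return None                       # non-retryable class
    max_attempts, base, cap = policy
    if attempt >= max_attempts:
        return None                       # retry budget exhausted
    if error_class == "rate_limited" and retry_after is not None:
        return min(cap, float(retry_after))   # trust the server's hint, capped
    # Full jitter: uniform in [0, min(cap, base * 2**attempt)].
    return rng.uniform(0.0, min(cap, base * 2 ** attempt))
```

Returning `None` instead of raising keeps the "mark dead" decision in the caller, next to the code that records the reason.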

Poison Jobs

A poison job always fails and consumes worker capacity forever unless isolated.

Mitigations:

  • Maximum attempts.
  • Dead letter queue with reason and payload.
  • Per-error-class retry budgets.
  • Circuit breaker for failing downstreams.
  • Quarantine queues for suspicious job types.
  • Manual replay tooling after code or data repair.
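The first two mitigations and the last one can be sketched together: count attempts, move the job to a dead-letter record with its reason and payload once the limit is hit, and allow an explicit replay later. `JobStore` and `DeadLetter` are hypothetical in-memory stand-ins for whatever durable store holds job state.

```python
from dataclasses import dataclass, field


@dataclass
class DeadLetter:
    job_id: str
    payload: dict
    reason: str
    attempts: int


@dataclass
class JobStore:
    max_attempts: int = 5
    attempts: dict = field(default_factory=dict)   # job_id -> attempts so far
    dead: dict = field(default_factory=dict)       # job_id -> DeadLetter

    def record_failure(self, job_id, payload, reason) -> bool:
        """Return True if the job may retry, False if it was dead-lettered."""
        n = self.attempts.get(job_id, 0) + 1
        self.attempts[job_id] = n
        if n >= self.max_attempts:
            self.dead[job_id] = DeadLetter(job_id, payload, reason, n)
            return False
        return True

    def replay(self, job_id) -> dict:
        """Manual replay after a fix: clear the dead letter and reset attempts."""
        letter = self.dead.pop(job_id)
        self.attempts[job_id] = 0
        return letter.payload
```

Keeping the payload and reason with the dead letter is what makes replay tooling practical after the underlying code or data has been repaired.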

Backpressure

Producer admission is part of the job system. If producers can enqueue infinite work, the queue becomes a latency debt ledger.

Use backpressure signals from workers and downstreams, such as oldest job age, queue depth, and throttling, to slow, defer, or reject producer writes.
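A producer-side admission check might look like the sketch below. The `QueueStats` type, the thresholds, and the rule that only low-priority work is shed under overload are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class QueueStats:
    depth: int            # runnable jobs waiting
    oldest_age_s: float   # age of the oldest runnable job, in seconds


def admit(stats: QueueStats, priority: str,
          max_depth: int = 10_000, max_age_s: float = 300.0) -> bool:
    """Decide whether a producer may enqueue a job right now.

    Under overload only high-priority work is admitted; everything else
    is rejected so the caller can slow down, drop, or retry later.
    """
    overloaded = stats.depth >= max_depth or stats.oldest_age_s >= max_age_s
    if not overloaded:
        return True
    return priority == "high"
```

Rejecting at enqueue time pushes the cost back to the producer while it can still react, instead of letting the backlog grow silently.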

Operational Metrics

  • Queue depth by job type, tenant, and priority.
  • Oldest runnable job age.
  • Job duration histogram by type.
  • Attempts per success.
  • Retry rate and dead letter rate.
  • Lease expiration count.
  • Worker utilization and active leases.
  • Downstream latency and throttling.
  • Time to drain during deploy.
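A few of these metrics can be derived directly from job rows. The sketch below assumes each job is a dict with hypothetical keys `type`, `state`, `run_after` (epoch seconds), and `attempts`; a production system would more likely compute the same numbers with a database query or a metrics pipeline.

```python
from collections import defaultdict


def queue_metrics(jobs, now):
    """Compute depth and oldest runnable age per job type, plus attempts per success."""
    depth = defaultdict(int)
    oldest = defaultdict(float)
    attempts = 0
    successes = 0
    for job in jobs:
        if job["state"] == "runnable" and job["run_after"] <= now:
            depth[job["type"]] += 1
            age = now - job["run_after"]
            oldest[job["type"]] = max(oldest[job["type"]], age)
        elif job["state"] == "succeeded":
            successes += 1
            attempts += job["attempts"]
    return {
        "depth": dict(depth),
        "oldest_age_s": dict(oldest),
        "attempts_per_success": attempts / successes if successes else None,
    }
```

Reporting age next to depth per job type is what lets an alert distinguish a large healthy backlog from a small stuck one.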

Failure Modes

| Failure | Cause | Fix |
| --- | --- | --- |
| Duplicate side effects | Lease expires while old worker still runs | Idempotency key and fencing token |
| Queue starvation | High-priority jobs never stop arriving | Aging, quotas, fair scheduling |
| Retry storm | Dependency outage triggers synchronized retries | Jitter, circuit breaker, global retry budget |
| Hidden stuck jobs | Only queue depth is monitored | Alert on oldest age and terminal state ratios |
| Worker pool collapse | One job class exhausts memory or DB connections | Isolate pools and set per-type concurrency |
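The fix in the first row, an idempotency key plus a fencing token, can be sketched as a check performed by whatever applies the side effect. `SideEffectStore` is a hypothetical in-memory stand-in, and using the job's attempt number as the fencing token is an assumption that works only if each new lease increments it.

```python
class SideEffectStore:
    """Stand-in for a downstream that enforces fencing and idempotency."""

    def __init__(self):
        self.highest_token = {}   # job_id -> highest fencing token seen
        self.applied = {}         # idempotency_key -> stored result

    def apply(self, job_id, fencing_token, idempotency_key, effect):
        """Run effect() at most once per key, and only for the newest lease."""
        if fencing_token < self.highest_token.get(job_id, 0):
            # A newer attempt has already written; the caller's lease is stale.
            raise PermissionError("stale lease: a newer attempt owns this job")
        self.highest_token[job_id] = fencing_token
        if idempotency_key in self.applied:
            return self.applied[idempotency_key]   # duplicate: return prior result
        result = effect()
        self.applied[idempotency_key] = result
        return result
```

The fencing check stops a slow old worker from writing after a newer attempt has started, while the idempotency key covers the case where the newer attempt repeats work the old one already completed.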

A practical reference for distributed system design. Released under the MIT License.