Background Jobs and Worker Pools

Background job systems execute work outside the request path: emails, exports, image processing, billing reconciliation, cache warming, webhook delivery, and cleanup. The hard parts are not enqueueing and consuming. The hard parts are duplicate execution, capacity isolation, graceful shutdown, poison jobs, unbounded retries, and proving that old work is still making progress.

Mental Model

A job is a durable intent to perform work. A worker lease is a temporary right to attempt it.

The queue is not the source of truth unless it can represent all job states, retries, attempts, and leases durably. Many production systems use both: a database for truth and a queue for wakeups.

Job Lifecycle

State	Meaning
Pending	Created but not runnable yet
Runnable	Eligible for a worker
Leased	A worker owns the attempt until `lease_until`
Succeeded	Terminal success
Retryable failed	Failed but will run again after backoff
Dead	Terminal failure after policy exhaustion
Canceled	Terminal cancellation before completion

Do not collapse failure into a single boolean. To operate the system you need, per job, the attempt count, next run time, error class, and last heartbeat.
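As a minimal sketch of the lifecycle above, a job record can carry its state together with the operational fields. The `Job` class, its field names, and the `ALLOWED` transition map are illustrative assumptions, not part of any particular library; adjust the transitions to your own policy.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional


class JobState(Enum):
    PENDING = "pending"
    RUNNABLE = "runnable"
    LEASED = "leased"
    SUCCEEDED = "succeeded"
    RETRYABLE_FAILED = "retryable_failed"
    DEAD = "dead"
    CANCELED = "canceled"


# Hypothetical transition map. Terminal states allow no further moves.
ALLOWED = {
    JobState.PENDING: {JobState.RUNNABLE, JobState.CANCELED},
    JobState.RUNNABLE: {JobState.LEASED, JobState.CANCELED},
    JobState.LEASED: {
        JobState.SUCCEEDED,
        JobState.RETRYABLE_FAILED,
        JobState.DEAD,
        JobState.CANCELED,
        JobState.RUNNABLE,  # lease expired without a result
    },
    JobState.RETRYABLE_FAILED: {JobState.RUNNABLE, JobState.DEAD, JobState.CANCELED},
    JobState.SUCCEEDED: set(),
    JobState.DEAD: set(),
    JobState.CANCELED: set(),
}


@dataclass
class Job:
    id: str
    state: JobState = JobState.PENDING
    attempts: int = 0
    run_after: Optional[datetime] = None
    error_class: Optional[str] = None
    last_heartbeat: Optional[datetime] = None

    def transition(self, new_state: JobState) -> None:
        # Reject moves the lifecycle does not allow, e.g. reviving a terminal job.
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state.name} -> {new_state.name}")
        self.state = new_state
```

Rejecting illegal transitions in one place keeps terminal states terminal, which makes replay tooling and dashboards easier to trust.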

Worker Lease Pattern

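```sql
-- Atomically claim one runnable job whose previous lease, if any, has expired.
UPDATE jobs
SET lease_owner = :worker_id,
    lease_until = now() + interval '5 minutes',
    attempts = attempts + 1
WHERE id = (
  SELECT id
  FROM jobs
  WHERE status = 'runnable'
    AND run_after <= now()
    AND (lease_until IS NULL OR lease_until < now())
  ORDER BY priority DESC, run_after ASC
  LIMIT 1
  -- SKIP LOCKED lets concurrent workers claim different rows without blocking.
  FOR UPDATE SKIP LOCKED
)
RETURNING *;
```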

The lease gives recovery a clean rule: if the worker neither finishes nor heartbeats before the lease expires, another worker may take over the job. The job handler must still be idempotent, because the original worker may be slow rather than dead and can keep running after its lease has lapsed.
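The acquire and heartbeat rules can be sketched in memory as follows. This is an illustration only: `Lease`, `try_acquire`, and `heartbeat` are hypothetical names, time is passed in explicitly as seconds, and a real system would apply the same conditions in a single database statement.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    owner: Optional[str] = None
    lease_until: float = 0.0  # seconds on whatever clock the caller uses


def try_acquire(lease: Lease, worker_id: str, now: float, ttl: float = 300.0) -> bool:
    """Take the lease if it is free or has expired (lease_until < now)."""
    if lease.owner is not None and lease.lease_until >= now:
        return False  # someone else still holds a live lease
    lease.owner = worker_id
    lease.lease_until = now + ttl
    return True


def heartbeat(lease: Lease, worker_id: str, now: float, ttl: float = 300.0) -> bool:
    """Extend the lease only if this worker still holds it and it has not expired."""
    if lease.owner != worker_id or lease.lease_until < now:
        return False  # lease lost; the worker should stop producing side effects
    lease.lease_until = now + ttl
    return True
```

A worker that sees `heartbeat` return `False` has lost ownership and should abandon the attempt rather than commit its result.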

Queue Design

Design	Strength	Risk
Broker-only queue	Simple and high throughput	Harder to inspect and repair complex state
Database-backed jobs	Strong inspectability and transactions	Polling and locking can bottleneck
DB truth plus broker wakeup	Durable state plus responsive workers	More moving parts and reconciliation
Partitioned queues	High scale and isolation	Rebalancing and hot partition complexity

Worker Pool Sizing

Worker capacity should be set by bottleneck, not by queue depth alone.

Bottleneck	Scaling signal	Protection
CPU-bound jobs	CPU saturation and run duration	Worker autoscaling
DB-bound jobs	DB connections, lock waits, query latency	Concurrency caps per job type
Third-party API	429s, timeout rate, vendor quotas	Token bucket per integration
Memory-heavy jobs	RSS, OOM kills, spill rate	Job class isolation

Queue depth without age is misleading. A queue with 1M tiny jobs may be healthy; a queue with 10 old payment jobs may be an incident.
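The token bucket listed as protection for third-party APIs might look like the sketch below. The `TokenBucket` class and its parameters are assumptions for illustration; it is single-process and not thread-safe, so a shared quota would need a lock or an external store.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity` and a sustained `rate` of tokens per second."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)  # start full
        self.updated = clock()

    def try_take(self, n: float = 1.0) -> bool:
        # Refill based on elapsed time, capped at capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False  # caller should delay the job instead of calling the vendor
```

One bucket per integration keeps a throttled vendor from consuming worker slots that other job types need.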

Graceful Shutdown

Workers need a deploy contract:

Stop accepting new leases.
Finish current jobs within a drain window.
Heartbeat long jobs while draining.
Release or let expire unfinished leases.
Persist enough progress for resumed execution.

If deploys kill workers abruptly, every deploy becomes a duplicate-execution test.
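A minimal sketch of the first two steps of that contract, assuming hypothetical `lease_next` and `run_job` callables supplied by the caller. Heartbeating, lease release, and progress persistence are not shown, and the drain window itself is assumed to be enforced by the process supervisor.

```python
import threading


class DrainingWorker:
    def __init__(self, lease_next, run_job):
        self._lease_next = lease_next  # returns a job, or None when nothing is runnable
        self._run_job = run_job
        self._stopping = threading.Event()

    def request_shutdown(self, *_args) -> None:
        # Accepts (signum, frame) so it can double as a signal handler.
        self._stopping.set()

    def run(self, idle_wait: float = 1.0) -> None:
        # Stop taking new leases once shutdown is requested.
        while not self._stopping.is_set():
            job = self._lease_next()
            if job is None:
                # Sleep, but wake immediately if shutdown is requested.
                self._stopping.wait(idle_wait)
                continue
            # The current job runs to completion before the flag is re-checked.
            self._run_job(job)
```

In a real process the handler could be registered with `signal.signal(signal.SIGTERM, worker.request_shutdown)` before calling `worker.run()`.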

Retry Policy

Retries should be explicit by error class.

Error	Retry?	Policy
Network timeout	Yes	Exponential backoff with jitter
Database deadlock	Yes	Short bounded retry
429 rate limit	Yes	Backoff from `Retry-After` or quota state
Validation error	No	Mark dead with clear reason
Missing dependency	Maybe	Retry only if dependency can appear later
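One way to encode part of this table is a function that maps an error to a retry delay. The exception classes `ValidationError` and `RateLimited` are hypothetical, the backoff uses the "full jitter" variant, and the choice to not retry unknown errors is an assumption of this sketch.

```python
import random
from typing import Optional


class ValidationError(Exception):
    """Hypothetical: the job payload can never succeed."""


class RateLimited(Exception):
    """Hypothetical: the downstream returned 429 with a Retry-After value."""

    def __init__(self, retry_after: float):
        super().__init__(f"rate limited, retry after {retry_after}s")
        self.retry_after = retry_after


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 300.0,
                  rng=random.random) -> float:
    """Full jitter: a uniform delay in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


def next_delay(error: Exception, attempt: int) -> Optional[float]:
    """Seconds to wait before the next attempt, or None to stop retrying."""
    if isinstance(error, ValidationError):
        return None  # mark dead with a clear reason
    if isinstance(error, RateLimited):
        return error.retry_after  # honor the server's hint
    if isinstance(error, TimeoutError):
        return backoff_delay(attempt)
    return None  # unknown errors are not retried in this sketch
```

Keeping the classification in one function makes the policy reviewable and testable, instead of scattering `except` clauses across job handlers.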

Use idempotency for all side effects. Retries without idempotency are data corruption with a delay.

Poison Jobs

A poison job always fails and consumes worker capacity forever unless isolated.

Mitigations:

Maximum attempts.
Dead letter queue with reason and payload.
Per-error-class retry budgets.
Circuit breaker for failing downstreams.
Quarantine queues for suspicious job types.
Manual replay tooling after code or data repair.
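The first two mitigations can be sketched together: cap attempts, and when the cap is hit, record the payload and reason for later replay. `FailureHandler` and `DeadLetter` are illustrative names, and the in-memory list stands in for a durable dead letter store.

```python
from dataclasses import dataclass, field


@dataclass
class DeadLetter:
    job_id: str
    payload: dict
    reason: str
    attempts: int


@dataclass
class FailureHandler:
    max_attempts: int = 5
    dead_letters: list = field(default_factory=list)

    def on_failure(self, job_id: str, payload: dict, attempts: int, reason: str) -> bool:
        """Return True if the job may retry, False if it was dead-lettered."""
        if attempts < self.max_attempts:
            return True
        # Keep enough context to diagnose and replay after a fix.
        self.dead_letters.append(DeadLetter(job_id, payload, reason, attempts))
        return False
```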

Backpressure

Producer admission is part of the job system. If producers can enqueue infinite work, the queue becomes a latency debt ledger.

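Feed signals from workers and downstreams, such as queue age, in-flight work, and throttling, back to producers so that they slow down, shed load, or are rejected at admission time.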
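A simple form of producer admission is a check run before every enqueue. The `admit` function, the `Overloaded` exception, and the default limits below are assumptions for illustration; real limits would come from measured capacity.

```python
class Overloaded(Exception):
    """Tells a producer to slow down, shed work, or retry later."""


def admit(depth: int, oldest_age_s: float,
          max_depth: int = 10_000, max_oldest_age_s: float = 600.0) -> None:
    """Raise Overloaded instead of accepting work the system cannot finish in time."""
    if depth >= max_depth:
        raise Overloaded(f"queue depth {depth} is at the limit of {max_depth}")
    if oldest_age_s > max_oldest_age_s:
        raise Overloaded(f"oldest job is {oldest_age_s:.0f}s old, limit {max_oldest_age_s:.0f}s")
```

Rejecting at admission surfaces overload to the caller immediately, rather than hiding it as growing queue latency.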

Operational Metrics

Queue depth by job type, tenant, and priority.
Oldest runnable job age.
Job duration histogram by type.
Attempts per success.
Retry rate and dead letter rate.
Lease expiration count.
Worker utilization and active leases.
Downstream latency and throttling.
Time to drain during deploy.
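Two of these metrics are easy to compute from job rows. The sketch below assumes each job is a dict with hypothetical `state`, `runnable_since`, and `attempts` keys, with times in seconds.

```python
from typing import Optional


def oldest_runnable_age(jobs: list, now: float) -> float:
    """Age in seconds of the longest-waiting runnable job, or 0.0 if none."""
    ages = [now - j["runnable_since"] for j in jobs if j["state"] == "runnable"]
    return max(ages, default=0.0)


def attempts_per_success(jobs: list) -> Optional[float]:
    """Mean attempts spent on jobs that eventually succeeded, or None if none did."""
    succeeded = [j for j in jobs if j["state"] == "succeeded"]
    if not succeeded:
        return None
    return sum(j["attempts"] for j in succeeded) / len(succeeded)
```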

Failure Modes

Failure	Cause	Fix
Duplicate side effects	Lease expires while old worker still runs	Idempotency key and fencing token
Queue starvation	High-priority jobs keep arriving, so lower-priority jobs never run	Aging, quotas, fair scheduling
Retry storm	Dependency outage triggers synchronized retries	Jitter, circuit breaker, global retry budget
Hidden stuck jobs	Only queue depth is monitored	Alert on oldest age and terminal state ratios
Worker pool collapse	One job class exhausts memory or DB connections	Isolate pools and set per-type concurrency
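The idempotency key and fencing token fix for duplicate side effects can be sketched as a store that rejects stale writers. `FencedStore` is a hypothetical in-memory stand-in for the system receiving the side effect, and the token is assumed to increase with every new lease on the job (for example, the attempt number).

```python
class FencedStore:
    def __init__(self):
        self.highest_token = 0
        self.applied_keys = set()
        self.value = None

    def write(self, token: int, idempotency_key: str, value) -> str:
        # A token lower than one already seen belongs to an expired lease.
        if token < self.highest_token:
            return "rejected_stale"
        self.highest_token = token
        # The same logical operation was already applied by an earlier attempt.
        if idempotency_key in self.applied_keys:
            return "duplicate"
        self.applied_keys.add(idempotency_key)
        self.value = value
        return "applied"
```

The fencing token stops a slow old worker from overwriting newer work, and the idempotency key stops a legitimate retry from applying the same effect twice.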
