
Workflow System Fundamentals

Workflow systems coordinate work that is too slow, failure-prone, stateful, or cross-service to keep inside a single request. A good workflow system makes progress durable: every state transition is recorded, every retry is bounded, and every external side effect is tied to an idempotency strategy. The core design problem is not "run code later"; it is "advance a business process despite crashes, duplicate delivery, partial failure, and changing code."

The Problem

Request/response systems are a bad fit for long-running work:

  • The client timeout is shorter than the business process.
  • One logical operation touches many services.
  • A step may need to wait for time, inventory, payment settlement, human approval, or a third-party callback.
  • Failures happen after some side effects already committed.
  • Operators need to answer "where is this order/import/model training run now?"

You can solve the first version with a queue. You need a workflow system when the queue message becomes a process with state.

Workflow vs Queue vs DAG

| Pattern | Best for | State owner | Typical failure mode |
| --- | --- | --- | --- |
| Message queue | Independent background tasks | Consumer code plus queue offset | Duplicate execution, poison messages |
| Workflow engine | Stateful business processes | Durable workflow history | Non-deterministic replay, bad compensation |
| DAG orchestrator | Data or batch dependency graphs | Scheduler metadata DB | Backfill storms, stuck dependencies |
| Distributed cron | Time-triggered jobs | Schedule store and lease owner | Missed ticks, duplicate ticks |

These overlap. A workflow engine often uses queues internally; a DAG system is a workflow engine specialized for dependency graphs; distributed cron often starts workflows.

Core Model

A workflow has four persistent objects:

| Object | Purpose |
| --- | --- |
| Workflow definition | The versioned code or declarative graph that describes allowed transitions |
| Workflow instance | A single run, such as order-123-refund or daily-report-2026-06-15 |
| History | Append-only events: started, step scheduled, step completed, timer fired, failed |
| Work item | A leaseable task a worker can execute now |

The important design choice is that history is the source of truth. A queue item may disappear, a worker may die, and an API process may restart. The history remains the durable proof of what happened.
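To make "history is the source of truth" concrete, here is a minimal sketch of deriving an instance's current state by folding over its events. The `Event` class, the `replay` function, and the event type strings are hypothetical names chosen for illustration; a real engine tracks far more than this.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Event:
    sequence: int
    event_type: str
    payload: dict[str, Any]


def replay(events: list[Event]) -> dict[str, Any]:
    """Fold an append-only history into the current view of one instance."""
    state: dict[str, Any] = {
        "status": "not_started",
        "pending_steps": [],
        "completed_steps": [],
    }
    # Replay strictly in sequence order so the result is deterministic.
    for event in sorted(events, key=lambda e: e.sequence):
        step = event.payload.get("step")
        if event.event_type == "started":
            state["status"] = "running"
        elif event.event_type == "step_scheduled":
            state["pending_steps"].append(step)
        elif event.event_type == "step_completed":
            if step in state["pending_steps"]:
                state["pending_steps"].remove(step)
            state["completed_steps"].append(step)
        elif event.event_type == "failed":
            state["status"] = "failed"
    return state


history = [
    Event(1, "started", {}),
    Event(2, "step_scheduled", {"step": "charge_card"}),
    Event(3, "step_completed", {"step": "charge_card"}),
    Event(4, "step_scheduled", {"step": "send_receipt"}),
]
state = replay(history)
```

Because the state is computed rather than stored, a scheduler that loses its queue can rebuild it: here `send_receipt` is still pending, so a task for it can be regenerated.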

State Machine

Terminal states must be explicit. "No queue messages left" is not a reliable terminal condition because messages can be delayed, lost by a bug, or waiting in another partition.
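One way to make terminal states explicit is a transition table in which terminal states are exactly the ones with no outgoing edges. The state names and the `ALLOWED` table below are assumptions for illustration, not a fixed standard.

```python
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    RUNNING = "running"
    WAITING = "waiting"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELED = "canceled"


# Assumed transition table: every status maps to the statuses it may move to.
ALLOWED: dict[Status, set[Status]] = {
    Status.PENDING: {Status.RUNNING, Status.CANCELED},
    Status.RUNNING: {Status.WAITING, Status.COMPLETED, Status.FAILED, Status.CANCELED},
    Status.WAITING: {Status.RUNNING, Status.FAILED, Status.CANCELED},
    Status.COMPLETED: set(),
    Status.FAILED: set(),
    Status.CANCELED: set(),
}

# Terminal states are declared by the table, not inferred from an empty queue.
TERMINAL = {status for status, targets in ALLOWED.items() if not targets}


def transition(current: Status, target: Status) -> Status:
    """Return the new status, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target


def is_terminal(status: Status) -> bool:
    return status in TERMINAL
```

With this shape, an instance that has no queued work but is still `RUNNING` or `WAITING` shows up as a candidate for reconciliation instead of being mistaken for finished.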

Guarantees

Most workflow systems can provide:

  • Durable start: once the start call commits, the workflow can be recovered.
  • At-least-once activity execution: an activity may run more than once after worker crash or timeout.
  • Exactly-once state transition: the workflow history should record each logical transition once.
  • Durable timers: waiting does not require a live process.
  • Recoverable progress: a scheduler can rebuild runnable work from history.

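They usually cannot provide exactly-once external side effects, because a worker can crash after the external call succeeds but before completion is recorded. Use idempotency keys, fencing tokens, dedupe tables, and compensating actions instead.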
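The sketch below illustrates how at-least-once execution can still yield a single external effect when the idempotency key is stable across retries and the external service deduplicates on it. `FakePaymentProvider`, `idempotency_key`, and `charge_activity` are hypothetical stand-ins; a real provider's API and dedupe guarantees will differ.

```python
class FakePaymentProvider:
    """Stand-in for an external service that deduplicates by idempotency key."""

    def __init__(self) -> None:
        self.charges: dict[str, dict] = {}

    def charge(self, idempotency_key: str, amount_cents: int) -> str:
        # A repeated key returns the original charge instead of creating a new one.
        if idempotency_key not in self.charges:
            self.charges[idempotency_key] = {
                "charge_id": f"ch_{len(self.charges) + 1}",
                "amount_cents": amount_cents,
            }
        return self.charges[idempotency_key]["charge_id"]


def idempotency_key(workflow_id: str, step: str) -> str:
    # Derived only from durable identifiers, so every retry produces the same key.
    return f"{workflow_id}:{step}"


def charge_activity(provider: FakePaymentProvider, workflow_id: str, amount_cents: int) -> str:
    return provider.charge(idempotency_key(workflow_id, "charge_card"), amount_cents)


provider = FakePaymentProvider()
# Simulate a retry: the worker crashed after charging but before recording completion.
first = charge_activity(provider, "order-123", 4200)
second = charge_activity(provider, "order-123", 4200)
```

The key must not include anything that changes between attempts, such as a timestamp or attempt number, or the retry becomes a second charge.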

Reference Architecture

Separate the control plane from execution. The control plane decides what should run; workers do the side-effecting work. This separation lets you scale workers independently, drain worker pools during deploys, and recover runnable tasks from metadata.
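As a rough illustration of the worker side of this split, the following sketch leases a runnable task with a conditional update, so a task held by a dead worker becomes claimable again once its lease expires. It uses SQLite with numeric timestamps for brevity; `claim_task` and the lease length are assumptions, and the columns follow the workflow_tasks table from the storage schema in this document.

```python
import sqlite3


def claim_task(conn, worker_id, now, lease_seconds=30.0):
    """Lease one runnable task for worker_id, or return None if none is free."""
    with conn:  # commits on success, rolls back on error
        row = conn.execute(
            """SELECT id FROM workflow_tasks
               WHERE status = 'pending' AND run_after <= ?
                 AND (lease_until IS NULL OR lease_until <= ?)
               ORDER BY run_after LIMIT 1""",
            (now, now),
        ).fetchone()
        if row is None:
            return None
        # Re-check the lease in the UPDATE so a concurrent claimer cannot win twice.
        cur = conn.execute(
            """UPDATE workflow_tasks SET lease_owner = ?, lease_until = ?
               WHERE id = ? AND (lease_until IS NULL OR lease_until <= ?)""",
            (worker_id, now + lease_seconds, row[0], now),
        )
        return row[0] if cur.rowcount == 1 else None


conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE workflow_tasks (
           id TEXT PRIMARY KEY, workflow_id TEXT NOT NULL, task_type TEXT NOT NULL,
           status TEXT NOT NULL, run_after REAL NOT NULL,
           lease_owner TEXT, lease_until REAL)"""
)
conn.execute(
    "INSERT INTO workflow_tasks VALUES "
    "('t1', 'order-123', 'charge_card', 'pending', 0, NULL, NULL)"
)
conn.commit()

first = claim_task(conn, "worker-a", now=100.0)   # worker-a holds the lease until 130
second = claim_task(conn, "worker-b", now=110.0)  # lease still active, nothing to claim
third = claim_task(conn, "worker-b", now=131.0)   # lease expired, task is recovered
```

The task stays `pending` while leased; only recording completion would change its status. That is what lets the control plane recover work from metadata alone when a worker disappears.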

Design Questions

  1. What is the unit of idempotency: workflow, step, side effect, or request?
  2. Which state transitions must be strongly consistent?
  3. How long can a workflow run?
  4. Can old workflow definitions continue after new code deploys?
  5. What happens when a worker dies after doing the external side effect but before recording completion?
  6. What is the maximum fan-out per workflow instance?
  7. Who owns cancellation semantics?
  8. How will operators inspect, replay, and repair a stuck instance?

Storage Model

Minimal relational schema:

```sql
-- One row per workflow run.
CREATE TABLE workflow_instances (
  id TEXT PRIMARY KEY,
  type TEXT NOT NULL,
  status TEXT NOT NULL,
  version INT NOT NULL,
  created_at TIMESTAMPTZ NOT NULL,
  updated_at TIMESTAMPTZ NOT NULL
);

-- Append-only history; (workflow_id, sequence) orders events within a workflow.
CREATE TABLE workflow_events (
  workflow_id TEXT NOT NULL,
  sequence BIGINT NOT NULL,
  event_type TEXT NOT NULL,
  payload JSONB NOT NULL,
  created_at TIMESTAMPTZ NOT NULL,
  PRIMARY KEY (workflow_id, sequence)
);

-- Leaseable work items; lease columns are NULL while no worker holds the task.
CREATE TABLE workflow_tasks (
  id TEXT PRIMARY KEY,
  workflow_id TEXT NOT NULL,
  task_type TEXT NOT NULL,
  status TEXT NOT NULL,
  run_after TIMESTAMPTZ NOT NULL,
  lease_owner TEXT,
  lease_until TIMESTAMPTZ
);
```

The workflow_events primary key gives each transition a total order per workflow. That order is what makes replay and debugging possible.
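The same primary key can also serve as an optimistic concurrency check: two writers that both try to append the next sequence number cannot both succeed. The sketch below shows this with SQLite; `append_event` and `HistoryConflict` are hypothetical names, and the table is trimmed to the columns needed for the example.

```python
import json
import sqlite3


class HistoryConflict(Exception):
    """Another writer already appended an event at this sequence."""


def append_event(conn, workflow_id, sequence, event_type, payload):
    """Append one event; the primary key rejects a duplicate sequence."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO workflow_events (workflow_id, sequence, event_type, payload) "
                "VALUES (?, ?, ?, ?)",
                (workflow_id, sequence, event_type, json.dumps(payload)),
            )
    except sqlite3.IntegrityError as exc:
        raise HistoryConflict(f"{workflow_id} already has sequence {sequence}") from exc


conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE workflow_events (
           workflow_id TEXT NOT NULL, sequence INTEGER NOT NULL,
           event_type TEXT NOT NULL, payload TEXT NOT NULL,
           PRIMARY KEY (workflow_id, sequence))"""
)

append_event(conn, "order-123", 1, "started", {})
append_event(conn, "order-123", 2, "step_scheduled", {"step": "charge_card"})

# A second writer working from a stale view tries to reuse sequence 2.
try:
    append_event(conn, "order-123", 2, "step_scheduled", {"step": "charge_card"})
    conflict = False
except HistoryConflict:
    conflict = True

count = conn.execute("SELECT COUNT(*) FROM workflow_events").fetchone()[0]
```

A writer that hits the conflict would typically reload the history and decide again, which is one way to keep each logical transition recorded once.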

Failure Modes

| Failure | Symptom | Mitigation |
| --- | --- | --- |
| Worker crash after side effect | Step repeats and charges/sends twice | Idempotency keys at the external service boundary |
| Lost task enqueue | Workflow stuck despite valid next step | Reconciliation scans history and regenerates tasks |
| Timer service outage | Delayed execution | Store timers durably; scan overdue timers |
| Hot workflow fan-out | Scheduler overload | Fan-out limits, child workflows, batching |
| Non-deterministic workflow code | Replay diverges | Versioned code paths, deterministic APIs |
| Unbounded history | Slow replay and large rows | Continue-as-new, snapshots, history compaction |
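For the lost task enqueue case, a reconciliation pass can be sketched as a comparison between what the history says should be runnable and what the task store actually holds. The function name, the `(event_type, step)` tuple shape, and the event type strings are assumptions for illustration.

```python
def missing_tasks(history, open_task_steps):
    """Return steps that were scheduled but are neither completed nor backed by an open task."""
    scheduled = []
    completed = set()
    for event_type, step in history:
        if event_type == "step_scheduled":
            scheduled.append(step)
        elif event_type == "step_completed":
            completed.add(step)
    return [
        step for step in scheduled
        if step not in completed and step not in open_task_steps
    ]


history = [
    ("started", None),
    ("step_scheduled", "reserve_inventory"),
    ("step_completed", "reserve_inventory"),
    ("step_scheduled", "charge_card"),
]

# The enqueue for charge_card was lost, so no open task exists for it.
to_regenerate = missing_tasks(history, open_task_steps=set())
# When the task is present, reconciliation has nothing to do.
nothing_to_do = missing_tasks(history, open_task_steps={"charge_card"})
```

Regenerated tasks can run a step that is already in flight elsewhere, so this only stays safe when activities are idempotent.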

Operational Metrics

  • Workflow start rate and completion rate.
  • Running, waiting, failed, and stuck instance counts.
  • Scheduler lag: now minus oldest runnable task.
  • Timer lag: now minus oldest overdue timer.
  • Activity retry rate by type and error class.
  • Queue depth and age by task queue.
  • Worker lease expirations.
  • History append latency and conflict rate.
  • Manual repair count.
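The lag metrics above reduce to the same arithmetic: the current time minus the due time of the oldest item that is already due. A small sketch, assuming timestamps in seconds and a hypothetical `scheduler_lag` helper:

```python
def scheduler_lag(now, run_after_times):
    """Seconds the oldest already-due task has waited; 0.0 if nothing is due."""
    due = [t for t in run_after_times if t <= now]
    return now - min(due) if due else 0.0


# Tasks due at t=40 and t=90 are runnable at now=100; the one at t=150 is not yet due.
lag = scheduler_lag(100.0, [40.0, 90.0, 150.0])
idle = scheduler_lag(100.0, [150.0])
```

Timer lag can be computed the same way over timer fire times. Alerting on the oldest item rather than an average keeps a single stuck task visible.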

When to Use

Use a workflow system when:

  • A process spans multiple services or commits.
  • Work must survive process restarts and deploys.
  • Waiting is part of the business logic.
  • Operators need a durable audit trail.
  • Compensation is more realistic than a global transaction.

When Not to Use

Avoid a workflow engine when:

  • The task is short, stateless, and naturally idempotent.
  • A plain message queue with retries is enough.
  • You cannot define ownership for stuck workflow repair.
  • The team will hide arbitrary application logic inside workflows without versioning discipline.

A practical reference for distributed system design. Released under the MIT License.