Durable Execution and Workflow Engines

Durable execution systems run workflow code as if it were a normal program while recording enough history to recover after crashes. The engine persists decisions, timers, activity completions, and signals; on restart, it replays history to reconstruct state. This model powers long-running order flows, payment workflows, provisioning, incident automation, and human approval processes.

Core Idea

The workflow function is a deterministic state machine. Activities perform non-deterministic side effects.

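Replay is what makes the process durable. It is also what makes determinism non-negotiable: replay re-executes the workflow code against recorded history, so the code must reach the same decisions on every run.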
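To make replay concrete, here is a minimal sketch, not any particular engine's API. The `Context` class, its `run_activity` method, and the `charge`/`ship` activities are hypothetical names introduced for illustration, and the "persisted" history is just an in-memory list.

```python
from typing import Any, Callable, List


class Context:
    """Hands activity results to workflow code, replaying recorded ones first."""

    def __init__(self, history: List[Any]) -> None:
        self.history = history   # stands in for persisted activity results
        self.cursor = 0          # next history entry to replay
        self.executed = 0        # activities actually run by this context

    def run_activity(self, fn: Callable[..., Any], *args: Any) -> Any:
        if self.cursor < len(self.history):
            result = self.history[self.cursor]   # replay: reuse recorded result
        else:
            result = fn(*args)                   # first execution: run and record
            self.history.append(result)
            self.executed += 1
        self.cursor += 1
        return result


def charge(amount: int) -> str:
    return f"charged:{amount}"


def ship(receipt: str) -> str:
    return f"shipped-after:{receipt}"


def order_workflow(ctx: Context) -> str:
    receipt = ctx.run_activity(charge, 42)
    return ctx.run_activity(ship, receipt)


history: List[Any] = []
first = Context(history)
result_first = order_workflow(first)

# Simulate a restart: a fresh context replays the same recorded history.
replayed = Context(history)
result_replayed = order_workflow(replayed)
```

On the second run no activity executes again; the workflow reaches the same result purely from recorded results, which only works because `order_workflow` issues the same calls in the same order.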

Workflow Code vs Activity Code

| Code type | Can call network? | Can read wall clock? | Retry owner | Determinism requirement |
| --- | --- | --- | --- | --- |
| Workflow code | No | Through engine API only | Engine | Strict |
| Activity code | Yes | Yes | Activity policy | Idempotent |

Workflow code decides. Activity code does.
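The "through engine API only" rule for the wall clock can be sketched as follows. `WorkflowClock` and `is_late` are hypothetical names, and the recorded list stands in for history entries; the point is only that the first execution records the time and every replay sees the recorded value.

```python
import time
from typing import List


class WorkflowClock:
    """Returns recorded timestamps on replay; reads the real clock only once."""

    def __init__(self, recorded: List[float]) -> None:
        self.recorded = recorded   # stands in for timestamps stored in history
        self.cursor = 0

    def now(self) -> float:
        if self.cursor == len(self.recorded):
            self.recorded.append(time.time())   # first execution: record it
        value = self.recorded[self.cursor]
        self.cursor += 1
        return value


def is_late(clock: WorkflowClock, deadline: float) -> bool:
    # Workflow code asks the engine for the time instead of calling time.time().
    return clock.now() > deadline


recorded: List[float] = []
first_clock = WorkflowClock(recorded)
first_decision = is_late(first_clock, deadline=0.0)

# A later replay sees the same timestamp, so it makes the same decision.
replay_clock = WorkflowClock(recorded)
replay_decision = is_late(replay_clock, deadline=0.0)
```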

Event History

Typical events:

  • WorkflowStarted
  • ActivityScheduled
  • ActivityStarted
  • ActivityCompleted
  • ActivityFailed
  • TimerStarted
  • TimerFired
  • SignalReceived
  • ChildWorkflowStarted
  • WorkflowCompleted

The history is an append-only log for one workflow instance. It gives deterministic replay, auditability, and recovery.
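As an illustration of how state falls out of the log, the sketch below folds a history into the set of activities that are still pending. The `Event` and `History` classes and the `activity_id` attribute are assumptions made for this example, not a real engine's schema.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Event:
    event_id: int
    kind: str
    attrs: Dict[str, str] = field(default_factory=dict)


class History:
    """Append-only log for one workflow instance."""

    def __init__(self) -> None:
        self._events: List[Event] = []

    def append(self, kind: str, **attrs: str) -> Event:
        event = Event(len(self._events) + 1, kind, dict(attrs))
        self._events.append(event)
        return event

    def events(self) -> Tuple[Event, ...]:
        return tuple(self._events)   # callers get a read-only view


def pending_activities(events: Iterable[Event]) -> Set[str]:
    """Rebuild which activities are scheduled but not yet finished."""
    pending: Set[str] = set()
    for event in events:
        if event.kind == "ActivityScheduled":
            pending.add(event.attrs["activity_id"])
        elif event.kind in ("ActivityCompleted", "ActivityFailed"):
            pending.discard(event.attrs["activity_id"])
    return pending


log = History()
log.append("WorkflowStarted")
log.append("ActivityScheduled", activity_id="a1")
log.append("ActivityScheduled", activity_id="a2")
log.append("ActivityStarted", activity_id="a1")
log.append("ActivityCompleted", activity_id="a1")

still_pending = pending_activities(log.events())
```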

Durable Timers

Sleeping inside a process is not durable. A durable timer is a persisted event:

```text
TimerStarted(id=payment-settlement, fire_at=2026-06-16T00:00:00Z)
...
TimerFired(id=payment-settlement)
```

No worker needs to stay alive while the workflow waits.
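A rough sketch of that idea, assuming a SQLite file as the persistent store: the table layout and the `open_store`, `start_timer`, and `fire_due` helpers are invented for this example. One "worker" writes the timer and exits; a different connection later finds it due.

```python
import os
import sqlite3
import tempfile
from typing import List


def open_store(path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS timers ("
        "id TEXT PRIMARY KEY, fire_at REAL NOT NULL, "
        "fired INTEGER NOT NULL DEFAULT 0)"
    )
    conn.commit()
    return conn


def start_timer(conn: sqlite3.Connection, timer_id: str, fire_at: float) -> None:
    # INSERT OR IGNORE keeps a re-issued TimerStarted from duplicating the row.
    conn.execute(
        "INSERT OR IGNORE INTO timers (id, fire_at) VALUES (?, ?)",
        (timer_id, fire_at),
    )
    conn.commit()


def fire_due(conn: sqlite3.Connection, now: float) -> List[str]:
    rows = conn.execute(
        "SELECT id FROM timers WHERE fired = 0 AND fire_at <= ? "
        "ORDER BY fire_at, id",
        (now,),
    ).fetchall()
    ids = [row[0] for row in rows]
    conn.executemany("UPDATE timers SET fired = 1 WHERE id = ?", [(i,) for i in ids])
    conn.commit()
    return ids


db_path = os.path.join(tempfile.mkdtemp(), "timers.db")

conn = open_store(db_path)
start_timer(conn, "payment-settlement", fire_at=1000.0)
conn.close()                      # the worker that started the timer goes away

conn = open_store(db_path)        # some other worker polls later
early = fire_due(conn, now=999.0)
due = fire_due(conn, now=1000.0)
again = fire_due(conn, now=2000.0)
conn.close()
```

Timestamps are plain floats here to keep the example deterministic; a real store would also need to handle concurrent pollers.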

Versioning

Long-running workflows may outlive deploys. If replay executes new code against old history, behavior can diverge.

Safe strategies:

  • Version markers in workflow history.
  • Continue-as-new to move long histories onto new code.
  • Keep old workflow workers until old runs drain.
  • Avoid non-deterministic iteration over maps or unordered sets.
  • Route workflow types by version.
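The version-marker strategy can be sketched like this. `VersionedContext`, `get_version`, and the `replaying` flag are hypothetical simplifications: a real engine would decide whether a marker exists from the position in history, not from a flag.

```python
from typing import Dict, List

DEFAULT_VERSION = 0   # assumed meaning: "history predates the change"


class VersionedContext:
    def __init__(self, markers: Dict[str, int], replaying: bool) -> None:
        self.markers = markers       # stands in for markers stored in history
        self.replaying = replaying

    def get_version(self, change_id: str, max_supported: int) -> int:
        if change_id in self.markers:
            return self.markers[change_id]     # replay of a run that recorded it
        if self.replaying:
            return DEFAULT_VERSION             # old run: keep the old code path
        self.markers[change_id] = max_supported  # new run: record newest version
        return max_supported


def fulfil(ctx: VersionedContext) -> List[str]:
    steps = ["charge"]
    if ctx.get_version("add-fraud-check", max_supported=1) >= 1:
        steps.append("fraud_check")            # step added in a later deploy
    steps.append("ship")
    return steps


old_run = fulfil(VersionedContext({}, replaying=True))

new_markers: Dict[str, int] = {}
new_run = fulfil(VersionedContext(new_markers, replaying=False))
new_run_replayed = fulfil(VersionedContext(new_markers, replaying=True))
```

Runs that started before the deploy keep skipping the new step on replay, while runs started afterwards record the marker and take the new path consistently.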

Side Effects

External side effects should live in activities with idempotency keys:

```text
idempotency_key = workflow_id + ":" + activity_id + ":" + attempt_independent_operation
```

Do not include attempt number in the idempotency key for a logical side effect. Attempts are retries of the same intent.
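A small sketch of why the attempt number stays out of the key. `PaymentGateway` is a hypothetical downstream service that deduplicates by key, and `charge_activity` is an illustrative activity; neither is a real API.

```python
from typing import Dict


def idempotency_key(workflow_id: str, activity_id: str, operation: str) -> str:
    # Deliberately no attempt number: every retry carries the same key.
    return f"{workflow_id}:{activity_id}:{operation}"


class PaymentGateway:
    """Hypothetical downstream service that deduplicates requests by key."""

    def __init__(self) -> None:
        self.charges: Dict[str, int] = {}
        self.calls = 0

    def charge(self, key: str, amount: int) -> str:
        self.calls += 1
        if key not in self.charges:      # apply the side effect only once
            self.charges[key] = amount
        return f"receipt-{key}"


def charge_activity(gateway: PaymentGateway, workflow_id: str,
                    attempt: int, amount: int) -> str:
    # `attempt` is accepted to show that it is intentionally ignored here.
    key = idempotency_key(workflow_id, "charge-card", "capture")
    return gateway.charge(key, amount)


gateway = PaymentGateway()
first_receipt = charge_activity(gateway, "order-17", attempt=1, amount=500)
retry_receipt = charge_activity(gateway, "order-17", attempt=2, amount=500)
```

The gateway is called twice but records one charge, which is the behavior a retry after a lost result needs.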

Saga Integration

Durable workflow engines are a natural implementation of sagas.

The engine records which forward steps succeeded, so compensation can run for exactly those steps.
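The compensation logic can be sketched as below. `run_saga` and the step names are illustrative, and the `completed` list is held in memory here, whereas a durable engine would derive it from persisted history.

```python
from typing import Callable, List, Tuple

# (name, forward action, compensating action)
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step], log: List[str]) -> bool:
    completed: List[Step] = []
    for step in steps:
        name, forward, _compensate = step
        try:
            forward()
        except Exception:
            log.append(f"{name}:failed")
            # Undo only the steps that succeeded, newest first.
            for done_name, _forward, compensate in reversed(completed):
                compensate()
                log.append(f"{done_name}:compensated")
            return False
        completed.append(step)
        log.append(f"{name}:done")
    return True


def ok() -> None:
    pass


def fail() -> None:
    raise RuntimeError("carrier unavailable")


saga_log: List[str] = []
succeeded = run_saga(
    [
        ("reserve", ok, ok),
        ("charge", ok, ok),
        ("ship", fail, ok),
    ],
    saga_log,
)
```

The failed `ship` step is not compensated because it never completed; `charge` and `reserve` are undone in reverse order.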

Scaling Considerations

| Pressure | Design response |
| --- | --- |
| Many waiting workflows | Store timers efficiently; do not keep workers hot |
| Huge histories | Snapshot or continue-as-new |
| Hot workflow instances | Limit signals and child fan-out |
| Activity throughput | Separate task queues per activity class |
| Worker deploys | Drain and version workers |
| Multi-tenant load | Namespace quotas and per-tenant task queues |
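The continue-as-new response to huge histories can be sketched as a run that stops once its history reaches a threshold and hands its state to a fresh run. `MAX_EVENTS`, `ContinueAsNew`, `run_once`, and `run_to_completion` are hypothetical names, and one history event per processed item is an assumption made to keep the example small.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

MAX_EVENTS = 3   # hypothetical per-run history limit


@dataclass
class ContinueAsNew:
    carried_total: int   # state handed to the next run
    next_index: int


def run_once(items: List[int], total: int,
             index: int) -> Tuple[Union[int, ContinueAsNew], int]:
    """Process items until done or until this run's history is full."""
    history_len = 0
    while index < len(items):
        if history_len >= MAX_EVENTS:
            return ContinueAsNew(total, index), history_len
        total += items[index]
        index += 1
        history_len += 1   # assume one history event per processed item
    return total, history_len


def run_to_completion(items: List[int]) -> Tuple[int, List[int]]:
    total, index = 0, 0
    history_sizes: List[int] = []
    while True:
        outcome, history_len = run_once(items, total, index)
        history_sizes.append(history_len)
        if isinstance(outcome, ContinueAsNew):
            # Start a fresh run with an empty history and the carried state.
            total, index = outcome.carried_total, outcome.next_index
            continue
        return outcome, history_sizes


final_total, history_sizes = run_to_completion([1, 2, 3, 4, 5, 6, 7])
```

The logical workflow still sums all seven items, but no single run's history grows past the limit.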

Failure Modes

| Failure | Cause | Mitigation |
| --- | --- | --- |
| Non-deterministic replay | Workflow code reads random/time/network | Deterministic workflow APIs and replay tests |
| Activity completes but result lost | Worker crashes after side effect | Idempotency key and activity retry |
| History too large | Long loops or chatty signals | Continue-as-new and aggregate events |
| Version break | New code cannot replay old history | Version markers and compatibility tests |
| Stuck workflow | Waiting for signal that never arrives | Timers, escalation, and operator repair |
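A replay test of the kind mentioned in the mitigations might look like the sketch below: record the commands a workflow emits, then re-run candidate code against that recording and fail on any divergence. `CommandRecorder`, `replay_check`, and the two workflow versions are illustrative names, not part of any real SDK.

```python
from typing import Callable, List, Optional


class NonDeterminismError(Exception):
    """Raised when replayed code diverges from the recorded commands."""


class CommandRecorder:
    def __init__(self, expected: Optional[List[str]] = None) -> None:
        self.expected = expected       # None means first execution: just record
        self.commands: List[str] = []

    def schedule(self, activity_name: str) -> None:
        position = len(self.commands)
        if self.expected is not None:
            if position >= len(self.expected):
                raise NonDeterminismError(
                    f"unexpected extra command {activity_name!r}"
                )
            if self.expected[position] != activity_name:
                raise NonDeterminismError(
                    f"expected {self.expected[position]!r}, got {activity_name!r}"
                )
        self.commands.append(activity_name)


def replay_check(workflow: Callable[[CommandRecorder], None],
                 recorded: List[str]) -> None:
    recorder = CommandRecorder(expected=recorded)
    workflow(recorder)
    if len(recorder.commands) != len(recorded):
        raise NonDeterminismError("workflow emitted fewer commands than recorded")


def workflow_v1(ctx: CommandRecorder) -> None:
    ctx.schedule("charge")
    ctx.schedule("ship")


def workflow_v2(ctx: CommandRecorder) -> None:
    ctx.schedule("charge")
    ctx.schedule("fraud_check")   # new step inserted without a version marker
    ctx.schedule("ship")


first_run = CommandRecorder()
workflow_v1(first_run)
recorded = list(first_run.commands)

replay_check(workflow_v1, recorded)   # same code: replays cleanly

reason = ""
try:
    replay_check(workflow_v2, recorded)
    diverged = False
except NonDeterminismError as exc:
    diverged = True
    reason = str(exc)
```

Running such a check against recorded histories before a deploy is one way to catch a version break early.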

Operational Metrics

  • Workflow task replay latency.
  • History size by workflow type.
  • Activity schedule-to-start latency.
  • Activity success/retry/failure counts.
  • Timer backlog.
  • Signal rate.
  • Non-determinism failures.
  • Continue-as-new count.
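Several of these metrics can be derived from the event history itself. As one hedged example, the sketch below computes activity schedule-to-start latency from timestamped events; the tuple layout and the `schedule_to_start` helper are assumptions for illustration.

```python
from typing import Dict, List, Tuple

# (timestamp in seconds, event type, activity id); an assumed layout
TimedEvent = Tuple[float, str, str]


def schedule_to_start(events: List[TimedEvent]) -> Dict[str, float]:
    """Seconds each activity waited between being scheduled and started."""
    scheduled_at: Dict[str, float] = {}
    latency: Dict[str, float] = {}
    for timestamp, kind, activity_id in events:
        if kind == "ActivityScheduled":
            scheduled_at[activity_id] = timestamp
        elif kind == "ActivityStarted" and activity_id in scheduled_at:
            latency[activity_id] = timestamp - scheduled_at[activity_id]
    return latency


latencies = schedule_to_start(
    [
        (10.0, "ActivityScheduled", "a1"),
        (10.5, "ActivityScheduled", "a2"),
        (12.0, "ActivityStarted", "a1"),
        (15.5, "ActivityStarted", "a2"),
    ]
)
```

A growing schedule-to-start latency usually means activities are waiting in a task queue for a free worker.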

When to Use

Use durable execution when the process is long-running, stateful, and business-critical. If all you need is "run this function later," a background job queue is easier.

A practical reference for distributed system design. Released under the MIT License.