Skip to content

DAG Orchestration

DAG orchestration coordinates work with explicit dependencies: extract data, transform it, train a model, publish metrics, refresh search indexes, or run analytics backfills. A DAG system turns dependency structure into runnable tasks while tracking data intervals, retries, backfills, and partial completion. It is a workflow engine specialized for graph-shaped batch and data work.

DAG Model

A DAG should be acyclic because cycles make readiness ambiguous. Repeated work is represented by separate runs over intervals, not by graph cycles.

Terms

TermMeaning
DAG definitionVersioned graph of tasks and dependencies
DAG runOne execution of the graph, often for a data interval
TaskA node in the graph
Task attemptOne try of a task
SensorA task waiting for external state
BackfillRunning historical intervals

Scheduler Responsibilities

The scheduler decides when a task is runnable:

text
task is runnable when:
  all upstream tasks succeeded
  task run_after <= now
  task concurrency limits allow it
  DAG run is not canceled
  required external conditions are met

Workers should not decide dependency readiness. They should execute leased tasks.

Data Intervals

For data systems, the interval is part of correctness.

Run labelData intervalCommon bug
2026-06-15 daily run2026-06-15T00:00Z to 2026-06-16T00:00ZTreating run date as execution date
Hourly run at 10:0009:00 to 10:00Reading incomplete late data
Backfill for MayMany daily intervalsOverloading live warehouse capacity

Make interval boundaries explicit in task inputs and output paths.

Idempotent Outputs

DAG tasks should write outputs deterministically:

text
s3://warehouse/orders_daily/dt=2026-06-15/_tmp/run_id=abc
s3://warehouse/orders_daily/dt=2026-06-15/part-000.parquet

Write to a temporary location, validate, then atomically publish or swap metadata. This avoids half-written partitions after task failure.

Backfills

Backfills are production load, not maintenance trivia.

Controls:

  • Separate backfill queue.
  • Concurrency caps by DAG and dataset.
  • Warehouse budget limits.
  • Read-only dry run for dependency expansion.
  • Backfill pause/resume.
  • Output overwrite policy.

Dynamic DAGs

Dynamic task generation is useful for partitioned work, but it can overwhelm the scheduler.

PatternRiskControl
One task per customerMillions of tasksBatch customers into shards
One task per fileScheduler metadata explosionManifest task plus worker-side batching
Runtime graph expansionHard to reason about retriesPersist expanded graph per run

Failure Semantics

FailureDesired behavior
Upstream task failsDownstream tasks remain blocked or skipped
Task times outRetry if idempotent; otherwise fail fast
Data quality check failsStop publish path and alert owner
Worker diesAttempt is retried after lease expiration
Scheduler diesReconstruct runnable tasks from metadata

Observability

DAG observability needs both graph and data views:

  • Critical path duration.
  • Task duration by attempt.
  • Queue wait vs execution time.
  • Failed dependency count.
  • Late data count.
  • Backfill progress by interval.
  • Dataset freshness.
  • Output row counts and quality checks.

When to Use

Use DAG orchestration for batch workflows, data pipelines, ML training pipelines, and dependency-heavy jobs. For request-time orchestration, use a workflow engine or service orchestration pattern instead.

A practical reference for distributed system design. Released under the MIT License.