Distributed Cron and Scheduling

Distributed cron turns time into work across a cluster. It sounds simple until clocks skew, deploys restart schedulers, daylight saving time changes local schedules, jobs run longer than their interval, and two regions both think they own the same tick. A production scheduler needs durable schedules, leases, missed-run policy, jitter, backpressure, and clear semantics for duplicate ticks.

The Problem

Classic cron assumes one machine:

  • Local disk stores the crontab.
  • Local clock defines time.
  • Local process starts work.
  • Local logs are enough for audit.

Distributed systems break those assumptions. A scheduler may serve millions of tenants, honor region-specific time zones, survive failover, and meet SLA-backed execution windows.

Scheduling Model

The scheduler should create a durable run record before work begins. The run record is the audit trail and dedupe key for the tick.

Schedule vs Run

| Object | Example | Mutability |
| --- | --- | --- |
| Schedule | "Every day at 09:00 Asia/Tokyo" | Mutable by users or config |
| Run | "schedule A for 2026-06-15T00:00Z" | Immutable identity, mutable status |

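Never use "current time rounded to the minute" alone as the run identity. Use (schedule_id, scheduled_time), where scheduled_time is the nominal fire time rather than the time the scheduler happened to wake up, so duplicate schedulers converge on the same run.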

Lease-Based Tick Claiming

```sql
INSERT INTO scheduled_runs (schedule_id, scheduled_at, status, created_at)
VALUES (:schedule_id, :scheduled_at, 'created', now())
ON CONFLICT (schedule_id, scheduled_at) DO NOTHING;
```

This turns duplicate detection into a database constraint, provided the table has a unique constraint or index on (schedule_id, scheduled_at). A scheduler crash after creating the run is recoverable because a reconciler can enqueue all created-but-not-started runs.
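The insert-and-reconcile flow can be sketched end to end with an in-memory SQLite database. This is a minimal illustration, not a reference implementation: the helper names `open_store`, `create_run`, and `runs_to_reconcile` are hypothetical, timestamps are stored as ISO-8601 text for simplicity, and it assumes a SQLite build that supports `ON CONFLICT ... DO NOTHING` (3.24 or later).

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory store with the unique run key (illustrative schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE scheduled_runs (
            schedule_id  TEXT NOT NULL,
            scheduled_at TEXT NOT NULL,  -- nominal fire time, UTC ISO-8601
            status       TEXT NOT NULL,
            created_at   TEXT NOT NULL,
            UNIQUE (schedule_id, scheduled_at)
        )
        """
    )
    return conn


def create_run(conn: sqlite3.Connection, schedule_id: str, scheduled_at: str) -> bool:
    """Try to create the run record; True only if this call inserted the row."""
    cur = conn.execute(
        """
        INSERT INTO scheduled_runs (schedule_id, scheduled_at, status, created_at)
        VALUES (?, ?, 'created', datetime('now'))
        ON CONFLICT (schedule_id, scheduled_at) DO NOTHING
        """,
        (schedule_id, scheduled_at),
    )
    conn.commit()
    return cur.rowcount == 1


def runs_to_reconcile(conn: sqlite3.Connection) -> list[tuple[str, str]]:
    """Runs that were recorded but never started; a reconciler would enqueue these."""
    return conn.execute(
        "SELECT schedule_id, scheduled_at FROM scheduled_runs "
        "WHERE status = 'created' ORDER BY scheduled_at"
    ).fetchall()
```

Two schedulers calling `create_run` for the same `(schedule_id, scheduled_at)` both succeed without error, but only one sees `True`, so only one of them should enqueue the work. The other can treat the conflict as a signal for the duplicate-run metric.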

Time Semantics

| Requirement | Design |
| --- | --- |
| UTC interval jobs | Store interval and next UTC fire time |
| Local business time | Store IANA time zone, not numeric offset |
| DST spring forward | Define skip or shift policy |
| DST fall back | Define once or twice policy |
| End-of-month | Define clamp or skip policy |
| SLA window | Store latest acceptable start time |

Time policy must be user-visible. Hidden defaults become billing and compliance bugs.
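As a rough sketch of what the DST rows could look like in code, the function below maps a local wall-clock fire time to zero, one, or two UTC instants using Python's standard `zoneinfo` module. The function name `resolve_fire_times` and the policy strings are assumptions for illustration; they mirror the "skip or shift" and "once or twice" choices in the table. It also assumes the IANA time zone database is available on the host.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def resolve_fire_times(
    local_naive: datetime,
    zone: str,
    gap_policy: str = "shift",     # "skip" or "shift" (spring forward)
    overlap_policy: str = "once",  # "once" or "twice" (fall back)
) -> list[datetime]:
    """Map a naive local wall-clock time to the UTC instants at which to fire."""
    tz = ZoneInfo(zone)
    # fold=0 and fold=1 select the two candidate readings of the wall-clock time.
    first_utc = local_naive.replace(tzinfo=tz, fold=0).astimezone(timezone.utc)
    second_utc = local_naive.replace(tzinfo=tz, fold=1).astimezone(timezone.utc)

    # A time inside a spring-forward gap does not survive a round trip via UTC.
    round_trip = first_utc.astimezone(tz).replace(tzinfo=None)
    if round_trip != local_naive:
        if gap_policy == "skip":
            return []
        # "shift": fire at the instant the pre-transition offset implies, which
        # lands after the gap (02:30 becomes 03:30 local in a one-hour gap).
        return [first_utc]

    # During fall back the same wall-clock time maps to two distinct instants.
    if first_utc != second_utc:
        if overlap_policy == "twice":
            return [first_utc, second_utc]
        return [first_utc]

    return [first_utc]
```

For example, 09:00 in Asia/Tokyo on 2026-06-15 resolves to the single instant 2026-06-15T00:00Z, while 02:30 in America/New_York on 2026-03-08 falls in a gap and yields either nothing or one shifted instant depending on the policy.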

Missed Runs

When the scheduler is down, it must decide what to do with missed ticks.

| Policy | Use when |
| --- | --- |
| Skip | Freshness matters more than completeness, such as cache refresh |
| Catch up all | Every period is legally or financially required |
| Catch up latest only | State is overwritten, such as sync snapshot |
| Bounded catch-up | Old work is useful only within a time window |

Backfills should not share unlimited capacity with live ticks. Use a separate queue or priority class.
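The four policies above can be expressed as a small selection function that a recovering scheduler applies to the ticks it missed. This is a sketch under assumptions: the enum `MissedRunPolicy`, the function `select_catch_up`, and the `max_age` parameter are hypothetical names, and routing the selected ticks to a separate backfill queue is left out.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum


class MissedRunPolicy(Enum):
    SKIP = "skip"
    CATCH_UP_ALL = "catch_up_all"
    CATCH_UP_LATEST = "catch_up_latest"
    BOUNDED = "bounded"


def select_catch_up(
    missed: list[datetime],
    policy: MissedRunPolicy,
    now: datetime,
    max_age: timedelta = timedelta(hours=1),  # only used by BOUNDED
) -> list[datetime]:
    """Return the missed ticks that should still be run, oldest first."""
    ticks = sorted(missed)
    if policy is MissedRunPolicy.SKIP:
        return []
    if policy is MissedRunPolicy.CATCH_UP_ALL:
        return ticks
    if policy is MissedRunPolicy.CATCH_UP_LATEST:
        return ticks[-1:]  # empty list if nothing was missed
    if policy is MissedRunPolicy.BOUNDED:
        return [t for t in ticks if now - t <= max_age]
    raise ValueError(f"unknown policy: {policy!r}")
```

Each returned tick would still go through the same run-creation path as a live tick, so the (schedule_id, scheduled_time) identity keeps catch-up idempotent if the recovery itself is retried.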

Jitter

If every tenant schedules at midnight, midnight becomes a self-inflicted incident.

Use deterministic jitter:

```text
offset_seconds = hash(schedule_id) % jitter_window_seconds
actual_fire_time = nominal_fire_time + offset_seconds
```

Deterministic jitter preserves predictability while spreading load.
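One detail matters when turning the formula into code: the hash must be stable across processes and restarts. Python's built-in `hash()` for strings is randomized per process by default, so the sketch below uses SHA-256 from `hashlib` instead. The function names are illustrative assumptions, not part of any particular scheduler.

```python
import hashlib
from datetime import datetime, timedelta, timezone


def jitter_offset_seconds(schedule_id: str, jitter_window_seconds: int) -> int:
    """Stable per-schedule offset in [0, jitter_window_seconds)."""
    if jitter_window_seconds <= 0:
        return 0  # jitter disabled
    digest = hashlib.sha256(schedule_id.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer; any stable hash would do.
    return int.from_bytes(digest[:8], "big") % jitter_window_seconds


def actual_fire_time(
    nominal_fire_time: datetime, schedule_id: str, jitter_window_seconds: int
) -> datetime:
    """Nominal fire time plus the schedule's deterministic jitter."""
    offset = jitter_offset_seconds(schedule_id, jitter_window_seconds)
    return nominal_fire_time + timedelta(seconds=offset)
```

Because the offset depends only on `schedule_id` and the window, every scheduler replica computes the same actual fire time. The run identity should still use the nominal fire time so that changing the jitter window does not change which run a tick maps to.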

Sharding the Scheduler

| Strategy | Strength | Risk |
| --- | --- | --- |
| Hash schedules by ID | Simple and balanced | Hot tenants still hot |
| Shard by time bucket | Efficient due scans | Hot buckets at common times |
| Shard by tenant | Isolation and quota control | Uneven tenant sizes |
| DB range scan with leases | Easy recovery | DB can become scheduler bottleneck |

For large systems, shard schedule ownership and run creation separately. Schedule scans are read-heavy; run creation is write-heavy.

Multi-Region Scheduling

Active-active scheduling needs a single owner per schedule or a conflict-proof run identity.

Options:

  • Home-region per schedule.
  • Global database unique constraint.
  • Region-specific schedules with explicit failover.
  • Active-passive scheduler with warm standby.

If a job has external side effects, cross-region duplicate ticks are worse than late ticks. Prefer stable ownership and explicit failover.

Operational Metrics

  • Schedule scan lag.
  • Oldest due schedule not evaluated.
  • Run creation latency.
  • Duplicate run conflict count.
  • Missed run count by policy.
  • Catch-up backlog.
  • Scheduler shard ownership churn.
  • Tick-to-worker-start latency.

Failure Modes

| Failure | Symptom | Mitigation |
| --- | --- | --- |
| Clock skew | Early or late ticks | Use NTP discipline; compare against database/server time |
| Scheduler split brain | Duplicate runs | Unique run key and idempotent start |
| Catch-up storm | Workers saturated after outage | Bounded catch-up and separate queues |
| DST ambiguity | Double billing or missed reports | Explicit timezone policy |
| Long-running overlap | Same job runs concurrently | Per-schedule concurrency policy |
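
The per-schedule concurrency policy in the last row can be as simple as a decision taken just before a run starts. The sketch below is one possible shape, with hypothetical names throughout; the three policy values are loosely modeled on the Allow, Forbid, and Replace options of Kubernetes CronJob's `concurrencyPolicy`, and actual cancellation of the older runs is left to the caller.

```python
from enum import Enum


class ConcurrencyPolicy(Enum):
    ALLOW = "allow"      # overlapping runs are acceptable
    FORBID = "forbid"    # drop the new run while an older one is active
    REPLACE = "replace"  # cancel active runs, then start the new one


def decide_start(
    policy: ConcurrencyPolicy, active_run_ids: list[str]
) -> tuple[bool, list[str]]:
    """Return (start_new_run, run_ids_to_cancel) for one schedule."""
    if not active_run_ids or policy is ConcurrencyPolicy.ALLOW:
        return True, []
    if policy is ConcurrencyPolicy.FORBID:
        return False, []
    if policy is ConcurrencyPolicy.REPLACE:
        return True, list(active_run_ids)
    raise ValueError(f"unknown policy: {policy!r}")
```

A run that is refused under the forbid policy should still be recorded with a terminal status, so the audit trail shows that the tick was seen and deliberately not started.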

A practical reference for distributed system design. Released under the MIT License.