DATAVERKET 019: Workflow retry, dead-letter, and reconciliation

Tags: Proposed, Integration, Workflow, Messaging, Operations, Retry, Dead letter

Defines retry, dead-letter, timeout, and reconciliation policy for Dataverket's event-driven workflows.

Author: Lars Solem

Status

Proposed on 2026-03-14 by Lars Solem.

Context

Dataverket is explicitly building an event-driven control plane on top of NATS with at-least-once delivery and idempotent consumers.

That is necessary, but not sufficient. The platform still needs a concrete operational policy for:

  • retries
  • poison messages
  • stuck workflows
  • replay
  • reconciliation after restart or partial failure

Without that, the event model remains conceptually correct but operationally unsafe.

Workflow safety also depends on what happens when an upstream or downstream service is unavailable for a longer period, not only on what happens to individual messages.

Decision

Dataverket adopts a workflow handling model based on:

  • bounded retries for transient failures
  • explicit failed task state for repeated or permanent failures
  • dead-letter handling for poison work
  • reconciliation loops as the final correctness mechanism

The platform must not rely on infinite retries or manual ad hoc recovery as the default behavior.

This policy applies both to message handling failures and to cross-service dependency failures.

Retry model

The default retry posture is:

  • retry transient failures with bounded exponential backoff
  • stop retrying when the failure is clearly permanent or the retry budget is exhausted
  • record retry attempts in task state and expose them through operational tooling

Retries must be visible to operators. A silently thrashing workflow is not acceptable.
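The retry posture above can be sketched as follows. This is a minimal illustration, not a chosen implementation: the class names, the attempt budget, and the delay values are all assumptions, and this ADR deliberately leaves exact retry durations open.

```python
import time
from dataclasses import dataclass, field


class TransientError(Exception):
    """Hypothetical marker for failures that may succeed on a later attempt."""


class PermanentError(Exception):
    """Hypothetical marker for failures that retrying cannot fix."""


@dataclass
class RetryPolicy:
    max_attempts: int = 5       # illustrative retry budget
    base_delay_s: float = 0.5   # illustrative first delay
    max_delay_s: float = 30.0   # cap that keeps the backoff bounded

    def delay_for(self, attempt: int) -> float:
        """Delay after failed attempt number `attempt` (1-based)."""
        return min(self.max_delay_s, self.base_delay_s * 2 ** (attempt - 1))


@dataclass
class TaskState:
    attempts: int = 0
    status: str = "pending"     # pending | succeeded | failed
    errors: list = field(default_factory=list)


def run_with_retry(operation, policy, state, sleep=time.sleep):
    """Run `operation`, recording every attempt in `state` so retries stay visible."""
    while state.attempts < policy.max_attempts:
        state.attempts += 1
        try:
            result = operation()
        except PermanentError as exc:
            # Clearly permanent: stop immediately, do not spend the budget.
            state.errors.append(repr(exc))
            state.status = "failed"
            return None
        except TransientError as exc:
            state.errors.append(repr(exc))
            if state.attempts < policy.max_attempts:
                sleep(policy.delay_for(state.attempts))
            continue
        state.status = "succeeded"
        return result
    # Retry budget exhausted: surface an explicit failed state.
    state.status = "failed"
    return None
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting. A production version would likely add jitter and persist `TaskState` rather than hold it in memory.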

Permanent failure model

When a workflow cannot complete successfully within its retry policy, it must transition into a visible failure state.

That failure state must include enough context to answer:

  • what operation failed
  • which resource was affected
  • what error class was observed
  • whether the work is safe to retry

Permanent failure should not disappear into logs alone.

Dependency failure must be classified explicitly. A workflow blocked because another service is unavailable is not the same thing as a malformed request or an irrecoverable domain error.

Relevant distinctions include at least:

  • dependency unavailable
  • dependency degraded but usable in reduced mode
  • policy validation failure
  • inventory or prerequisite missing
  • permanent domain rejection
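One way to carry these distinctions in a visible failure state is sketched below. The enum members mirror the list above, but the final error taxonomy is not decided by this ADR, and the `FailureRecord` shape, the field names, and the mapping from class to "safe to retry" are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class FailureClass(Enum):
    DEPENDENCY_UNAVAILABLE = "dependency_unavailable"
    DEPENDENCY_DEGRADED = "dependency_degraded"
    POLICY_VALIDATION_FAILURE = "policy_validation_failure"
    PREREQUISITE_MISSING = "prerequisite_missing"
    PERMANENT_DOMAIN_REJECTION = "permanent_domain_rejection"


# Assumption: only dependency problems are treated as automatically retryable
# here. A real service would decide this per operation.
RETRYABLE = frozenset({
    FailureClass.DEPENDENCY_UNAVAILABLE,
    FailureClass.DEPENDENCY_DEGRADED,
})


@dataclass(frozen=True)
class FailureRecord:
    operation: str              # what operation failed
    resource: str               # which resource was affected
    error_class: FailureClass   # what error class was observed
    detail: str = ""

    @property
    def safe_to_retry(self) -> bool:
        """Whether the work is safe to retry, under the assumption above."""
        return self.error_class in RETRYABLE
```

For example, `FailureRecord("attach_volume", "volume/vol-1", FailureClass.DEPENDENCY_UNAVAILABLE)` (hypothetical names) reports `safe_to_retry` as true, while the same record with `PERMANENT_DOMAIN_REJECTION` does not.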

Dead-letter model

Poison messages and repeatedly failing work must be isolated from normal processing after bounded retry exhaustion.

The dead-letter model must preserve:

  • original message identity
  • correlation and causation context
  • failure history
  • last known error classification

Dead-letter is an operator-visible quarantine, not a silent discard path.
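A dead-letter entry that preserves the four items above might look like the sketch below. The stream layout and storage are explicitly undecided, so `DeadLetterStore` is only an in-memory stand-in and every name here is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class FailureAttempt:
    attempt: int
    error_class: str
    detail: str


@dataclass(frozen=True)
class DeadLetterEntry:
    message_id: str                 # original message identity
    correlation_id: str             # correlation context
    causation_id: Optional[str]     # causation context, if any
    payload: bytes                  # original body, kept unmodified
    failure_history: Tuple[FailureAttempt, ...]

    @property
    def last_error_class(self) -> Optional[str]:
        """Last known error classification, or None without history."""
        if not self.failure_history:
            return None
        return self.failure_history[-1].error_class


class DeadLetterStore:
    """In-memory stand-in for whatever quarantine storage is chosen later."""

    def __init__(self):
        self._entries = {}

    def quarantine(self, entry: DeadLetterEntry) -> None:
        # Keyed by original message identity, so a redelivered duplicate
        # does not create a second quarantine entry.
        self._entries.setdefault(entry.message_id, entry)

    def list_entries(self):
        """Operator-facing listing: quarantined work stays inspectable."""
        return list(self._entries.values())
```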

Replay model

Replay must be explicit and controlled.

That means:

  • replay is a deliberate operator or system action
  • replay should preserve correlation context
  • replay should not be triggered accidentally by normal restart behavior
  • replay must still be safe under idempotent handling assumptions
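The sketch below illustrates these properties under simple assumptions: replay is a separate function that requires a named requester, it keeps the original message and correlation identifiers, and the handler deduplicates by message identity so replaying already-applied work is a no-op. The names `Message`, `IdempotentHandler`, and `replay` are hypothetical, and the in-memory set stands in for durable deduplication state.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Message:
    message_id: str
    correlation_id: str
    payload: str
    replayed_by: Optional[str] = None   # set only on deliberate replay


class IdempotentHandler:
    def __init__(self):
        self.processed_ids = set()
        self.applied = []

    def handle(self, msg: Message) -> bool:
        """Apply the message once; return False for an already-applied id."""
        if msg.message_id in self.processed_ids:
            return False
        self.applied.append(msg)
        self.processed_ids.add(msg.message_id)
        return True


def replay(msg: Message, requested_by: str, handler: IdempotentHandler) -> bool:
    """Deliberate replay: never called from normal startup paths."""
    if not requested_by:
        raise ValueError("replay requires an explicit requester")
    # Identity and correlation context are preserved; only the replay
    # marker is added.
    return handler.handle(replace(msg, replayed_by=requested_by))
```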

Reconciliation model

Reconciliation is the final correctness layer for Dataverket.

That means services must be able to:

  • compare desired state with actual state
  • detect partial application or missed events
  • converge resources toward intended state after interruption

Retries and dead-letter handling improve workflow safety, but reconciliation is what prevents the platform from drifting permanently after message loss, restart, or partial execution.

Reconciliation must also cover work that was blocked by a missing dependency and could not be completed at the original execution time.
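A reconciliation pass can be sketched as a diff between desired and actual state followed by convergence. The example below uses plain dictionaries as a stand-in for a service's resource domain; the function names and the create/update/delete vocabulary are assumptions for illustration.

```python
def plan_reconciliation(desired: dict, actual: dict) -> list:
    """Compare desired with actual state and list the actions that close the gap."""
    actions = []
    for key in sorted(desired.keys() - actual.keys()):
        actions.append(("create", key, desired[key]))    # missed or blocked work
    for key in sorted(desired.keys() & actual.keys()):
        if desired[key] != actual[key]:
            actions.append(("update", key, desired[key]))  # partial application
    for key in sorted(actual.keys() - desired.keys()):
        actions.append(("delete", key, None))             # leftover resource
    return actions


def reconcile(desired: dict, actual: dict) -> list:
    """Apply the plan to `actual` in place and return the actions taken."""
    actions = plan_reconciliation(desired, actual)
    for verb, key, value in actions:
        if verb == "delete":
            del actual[key]
        else:
            actual[key] = value
    return actions
```

Returning the list of actions is deliberate: it gives operators a record of what reconciliation changed. Running `reconcile` a second time on converged state returns an empty list.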

Reconciliation is not permission to ignore concurrency control. If a scarce resource can have only one correct winner, the workflow must preserve that exclusivity at write time and let reconciliation handle only the aftermath of partial execution or interruption.
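Preserving exclusivity at write time can be as simple as an atomic claim that only the first writer wins. The sketch below is an in-process illustration with hypothetical names; a real service would rely on its authoritative store's own conditional write rather than a Python lock.

```python
import threading


class ExclusiveClaims:
    """First claim wins; later claims by other owners are rejected."""

    def __init__(self):
        self._owners = {}
        self._lock = threading.Lock()

    def claim(self, resource: str, owner: str) -> bool:
        with self._lock:
            # setdefault stores `owner` only if the resource is unclaimed,
            # then returns whoever holds it.
            current = self._owners.setdefault(resource, owner)
            return current == owner

    def owner_of(self, resource: str):
        with self._lock:
            return self._owners.get(resource)
```

A repeated claim by the same owner succeeds, which keeps the operation safe under duplicate delivery. A competing claim fails at write time instead of being sorted out later by reconciliation.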

Timeout and cancellation model

Long-running tasks must support:

  • explicit timeout policies
  • explicit cancellation states
  • visibility into who or what cancelled the work
  • cleanup or compensating steps where appropriate

Timeouts must not leave workflow state ambiguous.
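One way to keep timeout and cancellation state unambiguous is to give each outcome its own terminal status and to record the cancelling actor. The sketch below assumes a hypothetical `LongRunningTask` record with caller-supplied timestamps; the status names are illustrative only.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class TaskStatus(Enum):
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    TIMED_OUT = "timed_out"
    CANCELLED = "cancelled"


TERMINAL = frozenset(TaskStatus) - {TaskStatus.RUNNING}


@dataclass
class LongRunningTask:
    task_id: str
    started_at: float               # seconds, supplied by the caller
    timeout_s: float                # explicit timeout policy
    status: TaskStatus = TaskStatus.RUNNING
    cancelled_by: Optional[str] = None

    def check_timeout(self, now: float) -> bool:
        """Move a running task to TIMED_OUT once its budget is spent."""
        if self.status is TaskStatus.RUNNING and now - self.started_at >= self.timeout_s:
            self.status = TaskStatus.TIMED_OUT
            return True
        return False

    def cancel(self, actor: str) -> bool:
        """Record who or what cancelled; terminal tasks are left untouched."""
        if self.status in TERMINAL:
            return False
        self.status = TaskStatus.CANCELLED
        self.cancelled_by = actor
        return True
```

Because a task in a terminal status cannot be cancelled or timed out again, a late timeout check never overwrites an earlier cancellation, and the reverse also holds. Cleanup or compensating steps would hang off the transition into `TIMED_OUT` or `CANCELLED`.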

Operator visibility requirements

Operators must be able to inspect:

  • current retry state
  • dead-lettered work
  • recent failure history
  • reconciliation actions taken after failure or restart
  • whether manual intervention is required

This visibility must be exposed through supported APIs or operator tooling, not hidden in internal logs only.

Service responsibilities

Each service is responsible for:

  • classifying transient versus permanent failures as accurately as is practical
  • classifying dependency failures separately from local execution failures
  • implementing idempotent handlers
  • preserving concurrency and exclusivity guarantees for its own authoritative resources
  • exposing enough state for task inspection
  • participating in reconciliation for its own resource domain
  • documenting whether its workflows can proceed in degraded mode when a dependency is impaired

Sentral is responsible for the cross-service task view, but domain services still own domain correctness.

Multi-datacenter implications

Because Dataverket is multi-datacenter, workflow handling must assume:

  • delayed cross-site delivery
  • temporary partitions
  • restart or failover of site-local workers
  • duplicate or late-arriving messages during recovery

Retry and reconciliation behavior must be safe under these conditions.
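Duplicate and late-arriving messages are commonly made harmless by guarding each write with a version check. The sketch below assumes every event carries a per-resource, monotonically increasing version, which this ADR does not specify; the function name and the store shape are hypothetical.

```python
def apply_if_newer(store: dict, resource: str, version: int, value) -> bool:
    """Apply an event only if it is newer than what the store already holds.

    `store` maps resource -> (version, value). Duplicates and late arrivals
    carry a version that is not greater than the stored one and are skipped.
    """
    current = store.get(resource)
    if current is not None and current[0] >= version:
        return False
    store[resource] = (version, value)
    return True
```

With this guard, a message delayed by a partition and delivered after a newer one is ignored instead of rolling the resource back.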

Explicit non-decisions for now

This ADR intentionally does not yet choose:

  • exact retry durations
  • exact dead-letter stream layout
  • final operator UI or CLI shape for replay
  • final error taxonomy

Those require later implementation detail or supporting ADRs.

Consequences

  • workflow handling becomes an explicit platform policy instead of service-by-service improvisation
  • operators gain a path to understand and recover failed work
  • reconciliation is elevated to a core correctness mechanism

Decision Outcome

Proposed. This ADR records the current preferred direction and still needs acceptance before it becomes binding.

More Information

Topics that still need supporting detail:

  • error taxonomy and classification model
  • dead-letter stream and retention design
  • replay authorization and operator workflow
  • compensating action patterns for destructive operations

Audit

  • 2026-03-14: ADR proposed.