Workflow retry, dead-letter, and reconciliation DATAVERKET 019

Status: Proposed
Tags: Integration, Workflow, Messaging, Operations, Retry, Dead letter

Defines retry, dead-letter, timeout, and reconciliation policy for Dataverket's event-driven workflows.

Author: Lars Solem
Updated: 2026-03-14

Context

Dataverket is explicitly building an event-driven control plane on top of NATS with at-least-once delivery and idempotent consumers.

That is necessary, but not sufficient. The platform still needs a concrete operational policy for:

retries
poison messages
stuck workflows
replay
reconciliation after restart or partial failure

Without that, the event model remains conceptually correct but operationally unsafe.

Workflow safety also depends on what happens when an upstream or downstream service is unavailable for a longer period, not only on what happens to individual messages.

Decision

Dataverket adopts a workflow handling model based on:

bounded retries for transient failures
explicit failed task state for repeated or permanent failures
dead-letter handling for poison work
reconciliation loops as the final correctness mechanism

The platform must not rely on infinite retries or manual ad hoc recovery as the default behavior.

This policy applies both to message handling failures and to cross-service dependency failures.

Retry model

The default retry posture is:

retry transient failures with bounded exponential backoff
stop retrying when the failure is clearly permanent or the retry budget is exhausted
record retry attempts in task state and operational visibility

Retries must be visible to operators. A silently thrashing workflow is not acceptable.
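
The retry posture above could be sketched roughly as follows. This is a minimal illustration, not a chosen implementation: the names RetryPolicy, TaskState, TransientError, and PermanentError are hypothetical, and the attempt count and delays are placeholders, since this ADR does not yet fix exact retry durations.

```python
import random
from dataclasses import dataclass, field


class TransientError(Exception):
    """Hypothetical marker: the failure may succeed on a later attempt."""


class PermanentError(Exception):
    """Hypothetical marker: retrying cannot fix this failure."""


@dataclass
class RetryPolicy:
    max_attempts: int = 5       # retry budget (placeholder value)
    base_delay_s: float = 0.5   # first backoff ceiling (placeholder value)
    max_delay_s: float = 30.0   # cap that keeps the backoff bounded

    def delay_for(self, attempt: int) -> float:
        """Delay after failed attempt number `attempt` (1-based), with jitter."""
        ceiling = min(self.max_delay_s, self.base_delay_s * 2 ** (attempt - 1))
        return random.uniform(0, ceiling)


@dataclass
class TaskState:
    attempts: int = 0
    status: str = "pending"     # pending | succeeded | failed
    errors: list = field(default_factory=list)


def run_with_retries(operation, policy, state, sleep):
    """Run `operation`, recording every attempt in `state` for operators."""
    while state.attempts < policy.max_attempts:
        state.attempts += 1
        try:
            result = operation()
        except PermanentError as exc:
            # Clearly permanent: stop immediately, do not spend the budget.
            state.errors.append(repr(exc))
            state.status = "failed"
            return None
        except TransientError as exc:
            state.errors.append(repr(exc))
            if state.attempts < policy.max_attempts:
                sleep(policy.delay_for(state.attempts))
            continue
        state.status = "succeeded"
        return result
    # Retry budget exhausted: surface a visible failed state.
    state.status = "failed"
    return None
```

Passing `sleep` in as a parameter (for example `time.sleep`) keeps the sketch testable; the attempt counter and error list in TaskState are what would feed operator visibility.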

Permanent failure model

When a workflow cannot complete successfully within its retry policy, it must transition into a visible failure state.

That failure state must include enough context to answer:

what operation failed
which resource was affected
what error class was observed
whether the work is safe to retry

A permanent failure must not be recorded only in logs.

Dependency failure must be classified explicitly. A workflow blocked because another service is unavailable is not the same thing as a malformed request or an irrecoverable domain error.

Relevant distinctions include at least:

dependency unavailable
dependency degraded but usable in reduced mode
policy validation failure
inventory or prerequisite missing
permanent domain rejection
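
As an illustration only, a failure record carrying the context listed above might look like the sketch below. The enum members mirror the distinctions in this section, but the final error taxonomy is not decided by this ADR; ErrorClass, FailureRecord, record_failure, and the RETRYABLE set are all hypothetical, and which classes count as retryable is an assumption made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class ErrorClass(Enum):
    # Mirrors the distinctions listed above; not a final taxonomy.
    DEPENDENCY_UNAVAILABLE = "dependency_unavailable"
    DEPENDENCY_DEGRADED = "dependency_degraded"
    POLICY_VALIDATION_FAILURE = "policy_validation_failure"
    PREREQUISITE_MISSING = "prerequisite_missing"
    PERMANENT_DOMAIN_REJECTION = "permanent_domain_rejection"


# Assumption for this sketch: only dependency problems are worth retrying.
RETRYABLE = frozenset({
    ErrorClass.DEPENDENCY_UNAVAILABLE,
    ErrorClass.DEPENDENCY_DEGRADED,
})


@dataclass(frozen=True)
class FailureRecord:
    operation: str           # what operation failed
    resource: str            # which resource was affected
    error_class: ErrorClass  # what error class was observed
    safe_to_retry: bool      # whether the work is safe to retry
    detail: str = ""


def record_failure(operation, resource, error_class, detail=""):
    """Build the visible failure state for a task that could not complete."""
    return FailureRecord(
        operation=operation,
        resource=resource,
        error_class=error_class,
        safe_to_retry=error_class in RETRYABLE,
        detail=detail,
    )
```

For example, `record_failure("attach-volume", "vm/42", ErrorClass.DEPENDENCY_UNAVAILABLE)` yields a record marked safe to retry, while a PERMANENT_DOMAIN_REJECTION does not. The operation and resource names here are invented.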

Dead-letter model

Poison messages and repeatedly failing work must be isolated from normal processing after bounded retry exhaustion.

The dead-letter model must preserve:

original message identity
correlation and causation context
failure history
last known error classification

Dead-letter is an operator-visible quarantine, not a silent discard path.
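
One possible shape for a dead-letter record that preserves the four items above is sketched here. The field names and the to_dead_letter helper are hypothetical, and this says nothing about the dead-letter stream layout, which remains an explicit non-decision.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Message:
    message_id: str       # original message identity
    correlation_id: str   # correlation context
    causation_id: str     # causation context
    payload: bytes


@dataclass(frozen=True)
class Attempt:
    attempt: int
    error_class: str
    detail: str


@dataclass(frozen=True)
class DeadLetter:
    original: Message        # kept unmodified
    failure_history: tuple   # every Attempt, in order
    last_error_class: str    # last known error classification
    quarantined_at: str      # ISO 8601 timestamp, UTC


def to_dead_letter(message, attempts, now=None):
    """Wrap a message whose retry budget is exhausted, without discarding it."""
    if not attempts:
        raise ValueError("dead-lettering requires at least one failed attempt")
    now = now or datetime.now(timezone.utc)
    return DeadLetter(
        original=message,
        failure_history=tuple(attempts),
        last_error_class=attempts[-1].error_class,
        quarantined_at=now.isoformat(),
    )
```

Keeping the original message intact inside the record, rather than copying selected fields, is a design choice of this sketch that would make later replay simpler.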

Replay model

Replay must be explicit and controlled.

That means:

replay is a deliberate operator or system action
replay should preserve correlation context
replay should not be triggered accidentally by normal restart behavior
replay must still be safe under idempotent handling assumptions
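
The replay properties above can be illustrated with a small sketch. Everything here is an assumption for the example: the Envelope fields, the replay function, and the in-memory IdempotentHandler stand in for whatever the services actually use. The point is that replay needs a named requester, keeps the message and correlation identity, and is harmless if the work was already applied.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Envelope:
    message_id: str
    correlation_id: str
    payload: str
    replayed_by: Optional[str] = None  # who or what requested the replay


class IdempotentHandler:
    """Applies each message_id at most once (in-memory stand-in)."""

    def __init__(self):
        self.results = {}      # message_id -> result of the first application
        self.side_effects = 0  # counts real applications, for illustration

    def handle(self, env):
        if env.message_id in self.results:
            return self.results[env.message_id]  # duplicate: no new effect
        self.side_effects += 1
        result = env.payload.upper()  # stand-in for the real work
        self.results[env.message_id] = result
        return result


def replay(env, requested_by, publish):
    """Republish a message as a deliberate, attributed action."""
    if not requested_by:
        raise ValueError("replay requires an explicit requester")
    # Same message_id and correlation_id, so idempotent handling still holds.
    replayed = replace(env, replayed_by=requested_by)
    publish(replayed)
    return replayed
```

Because replay is an explicit function call with a required requester, it cannot be triggered as a side effect of a worker restart in this sketch.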

Reconciliation model

Reconciliation is the final correctness layer for Dataverket.

That means services must be able to:

compare desired state with actual state
detect partial application or missed events
converge resources toward intended state after interruption

Retries and dead-letter handling improve workflow safety, but reconciliation is what prevents the platform from drifting permanently after message loss, restart, or partial execution.

Reconciliation must also cover work that was blocked by a missing dependency and could not be completed at the original execution time.

Reconciliation is not permission to ignore concurrency control. If a scarce resource can have only one correct winner, the workflow must enforce that exclusivity at write time and leave reconciliation to handle only the aftermath of partial execution or interruption.
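
A reconciliation pass of the kind described above can be sketched as a comparison of desired and actual state that produces convergence actions. This is a simplified illustration: state is modelled as plain dictionaries keyed by a hypothetical resource name, the function names are invented, and treating resources absent from desired state as deletions is an assumption of the example, not a decision of this ADR.

```python
def plan_reconciliation(desired, actual):
    """Return the (action, name) steps needed to make `actual` match `desired`."""
    actions = []
    for name in sorted(desired.keys() - actual.keys()):
        actions.append(("create", name))      # missed event or lost work
    for name in sorted(desired.keys() & actual.keys()):
        if desired[name] != actual[name]:
            actions.append(("update", name))  # partial application
    for name in sorted(actual.keys() - desired.keys()):
        actions.append(("delete", name))      # leftover from interrupted work
    return actions


def reconcile(desired, actual):
    """Apply the plan to `actual` in place and return what was done."""
    actions = plan_reconciliation(desired, actual)
    for action, name in actions:
        if action == "delete":
            del actual[name]
        else:
            actual[name] = desired[name]
    return actions
```

Running `reconcile` a second time on converged state returns an empty list, which is the property that makes the loop safe to run after every restart. The returned action list is also what a service could expose as "reconciliation actions taken".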

Timeout and cancellation model

Long-running tasks must support:

explicit timeout policies
explicit cancellation states
visibility into who or what cancelled the work
cleanup or compensating steps where appropriate

Timeouts must not leave workflow state ambiguous.
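
To make "no ambiguous state" concrete, the sketch below gives timeout and cancellation their own terminal states and records who or what ended the task. The Task fields, the "system:timeout" actor string, and the optional cleanup callback are hypothetical choices for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class TaskStatus(Enum):
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    TIMED_OUT = "timed_out"
    CANCELLED = "cancelled"


@dataclass
class Task:
    task_id: str
    started_at: float          # seconds, from any monotonic clock
    timeout_s: float           # explicit timeout policy
    status: TaskStatus = TaskStatus.RUNNING
    ended_by: Optional[str] = None
    cleanup_done: bool = False


def _finish(task, status, actor, cleanup):
    """Move a running task into a terminal state and run cleanup if given."""
    task.status = status
    task.ended_by = actor
    if cleanup is not None:
        cleanup(task)          # compensating step, where appropriate
        task.cleanup_done = True


def check_timeout(task, now, cleanup=None):
    """Time out a running task whose deadline has passed; return its status."""
    if task.status is TaskStatus.RUNNING and now - task.started_at >= task.timeout_s:
        _finish(task, TaskStatus.TIMED_OUT, "system:timeout", cleanup)
    return task.status


def cancel(task, actor, cleanup=None):
    """Cancel a running task on behalf of `actor`; False if already finished."""
    if task.status is not TaskStatus.RUNNING:
        return False
    _finish(task, TaskStatus.CANCELLED, actor, cleanup)
    return True
```

Only RUNNING tasks can transition, so a task that timed out cannot later be reported as cancelled, or the reverse.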

Operator visibility requirements

Operators must be able to inspect:

current retry state
dead-lettered work
recent failure history
reconciliation actions taken after failure or restart
whether manual intervention is required

This visibility must be exposed through supported APIs or operator tooling, not only through internal logs.

Service responsibilities

Each service is responsible for:

classifying transient versus permanent failures as accurately as is practical
classifying dependency failures separately from local execution failures
implementing idempotent handlers
preserving concurrency and exclusivity guarantees for its own authoritative resources
exposing enough state for task inspection
participating in reconciliation for its own resource domain
documenting whether its workflows can proceed in degraded mode when a dependency is impaired

Sentral is responsible for the cross-service task view, but domain services still own domain correctness.

Multi-datacenter implications

Because Dataverket is multi-datacenter, workflow handling must assume:

delayed cross-site delivery
temporary partitions
restart or failover of site-local workers
duplicate or late-arriving messages during recovery

Retry and reconciliation behavior must be safe under these conditions.
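
One way a consumer could stay safe under duplicate and late-arriving messages is sketched below. It assumes, purely for illustration, that each message carries a unique ID and a per-resource revision number that only increases; neither field is defined by this ADR, and ResourceProjection is a hypothetical in-memory stand-in.

```python
class ResourceProjection:
    """Applies updates once and never lets an older revision overwrite a newer one."""

    def __init__(self):
        self.seen_ids = set()  # message IDs already processed
        self.current = {}      # resource -> (revision, value)

    def apply(self, message_id, resource, revision, value):
        """Return 'applied', 'duplicate', or 'stale' for operator visibility."""
        if message_id in self.seen_ids:
            return "duplicate"  # redelivery during recovery
        self.seen_ids.add(message_id)
        held = self.current.get(resource)
        if held is not None and revision <= held[0]:
            return "stale"      # late arrival after a newer update
        self.current[resource] = (revision, value)
        return "applied"
```

In this sketch a message delayed by a partition is acknowledged but does not roll the resource back, and a redelivered message has no second effect.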

Explicit non-decisions for now

This ADR intentionally does not yet choose:

exact retry durations
exact dead-letter stream layout
final operator UI or CLI shape for replay
final error taxonomy

Those require later implementation detail or supporting ADRs.

Consequences

workflow handling becomes an explicit platform policy instead of service-by-service improvisation
operators gain a path to understand and recover failed work
reconciliation is elevated to a core correctness mechanism

Decision Outcome

Proposed. This ADR records the current preferred direction and still needs acceptance before it becomes binding.

This ADR expands the baseline set in 007-nats-subject-and-event-envelope.md.
Task inspection and operator visibility must fit 008-public-api-style.md and 017-observability-and-operations-baseline.md.
Reconciliation depends on the resource and inventory model in 009-resource-inventory-and-tenancy-model.md.

More Information

Topics expected to need follow-up design work or supporting ADRs:

error taxonomy and classification model
dead-letter stream and retention design
replay authorization and operator workflow
compensating action patterns for destructive operations

Audit

2026-03-14: ADR proposed.