End-to-end disaster recovery DATAVERKET 026

Proposed. Tags: Infrastructure, Recovery, Disaster recovery, Availability, Backup, Failover

Defines end-to-end disaster recovery as a platform-wide capability spanning control-plane state, workflow transport, secrets, and operator recovery workflows.

Author: Lars Solem

Status

Proposed on 2026-03-14 by Lars Solem.

Context

Dataverket already has a multi-datacenter failover model, PostgreSQL backup expectations, and storage durability guidance.

What is still missing is an explicit end-to-end disaster recovery posture across the whole platform. Failover and disaster recovery are related, but they are not the same thing.

A service may fail over while still losing state, and a platform may recover data without being able to resume control-plane operation quickly. The architecture needs one clear DR frame that spans those cases.

Decision

Dataverket treats disaster recovery as an explicit platform-wide capability, not a database-only concern.

The disaster recovery model must cover:

  • control-plane state recovery
  • workflow transport recovery
  • inventory and approval state recovery
  • infrastructure configuration recovery
  • secrets and credential-store recovery
  • product-specific data recovery expectations

DR versus failover

Dataverket distinguishes clearly between:

  • Failover: moving active responsibility to another site or component during an outage.

  • Disaster recovery: restoring platform capability and data after loss, corruption, or extended unavailability.

The platform must not use those terms interchangeably.

DR scope

The DR posture must account for at least:

  • PostgreSQL control-plane databases
  • NATS JetStream durable workflow state
  • inventory graph and trust state
  • pending approvals and operator workflow state
  • Talos machine and cluster configuration inputs
  • storage and persistence metadata
  • secrets backend state
  • network intent and rendered configuration history

Recovery classes

The platform should classify recovery needs at least into:

  1. Control-plane critical: services without which operators lose control or visibility.

  2. Platform capability critical: services required to provision, recover, or host tenant workloads.

  3. Tenant data critical: product-specific data with explicit durability and recovery expectations.

Classifying recovery needs this way avoids pretending that every system needs the same recovery point objective (RPO) and recovery time objective (RTO).

For PostgreSQL-backed control-plane state, these classes should be applied per bounded context or service database rather than once for “the cluster” as a whole.
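
As an illustration of per-database classification, the sketch below tags each service database with one of the three recovery classes. It is a minimal example, not a chosen design: the database names, bounded contexts, and class assignments are hypothetical, and no RPO or RTO values are attached because this ADR does not set them.

```python
from dataclasses import dataclass
from enum import Enum


class RecoveryClass(Enum):
    CONTROL_PLANE_CRITICAL = "control-plane critical"
    PLATFORM_CAPABILITY_CRITICAL = "platform capability critical"
    TENANT_DATA_CRITICAL = "tenant data critical"


@dataclass(frozen=True)
class ServiceDatabase:
    name: str                      # hypothetical database name
    bounded_context: str           # hypothetical bounded context
    recovery_class: RecoveryClass  # assigned per database, not per cluster


# Hypothetical inventory: every database carries its own class.
DATABASES = [
    ServiceDatabase("inventory_db", "inventory", RecoveryClass.CONTROL_PLANE_CRITICAL),
    ServiceDatabase("approvals_db", "approvals", RecoveryClass.CONTROL_PLANE_CRITICAL),
    ServiceDatabase("provisioning_db", "provisioning", RecoveryClass.PLATFORM_CAPABILITY_CRITICAL),
]


def group_by_class(databases):
    """Map each recovery class to the names of the databases assigned to it."""
    groups = {cls: [] for cls in RecoveryClass}
    for db in databases:
        groups[db.recovery_class].append(db.name)
    return groups
```

Grouping by class rather than by cluster keeps it visible when two databases on the same PostgreSQL cluster have different recovery expectations.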

Backup requirements

Backups must be:

  • regular
  • observable
  • restorable
  • scoped to the systems they protect
  • aligned with actual recovery priorities

Backups that cannot be restored under test conditions do not satisfy the DR requirement.
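
One way to make "regular", "observable", and "restorable" checkable is to record, per protected system, when the last backup and the last successful test restore happened. The sketch below is an assumption about how such a check could look. The `BackupRecord` fields and the age limits are hypothetical placeholders, not targets set by this ADR.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass(frozen=True)
class BackupRecord:
    system: str                               # the system this backup is scoped to
    last_backup_at: Optional[datetime]        # None if no backup was ever observed
    last_restore_test_at: Optional[datetime]  # None if never restored under test
    max_backup_age: timedelta                 # placeholder limit, not an ADR target
    max_restore_test_age: timedelta           # placeholder limit, not an ADR target


def backup_findings(record: BackupRecord, now: datetime) -> List[str]:
    """Return the reasons a backup fails the requirement (empty list if none)."""
    findings = []
    if record.last_backup_at is None:
        findings.append("no backup observed")
    elif now - record.last_backup_at > record.max_backup_age:
        findings.append("backup is stale")
    if record.last_restore_test_at is None:
        findings.append("never restored under test")
    elif now - record.last_restore_test_at > record.max_restore_test_age:
        findings.append("restore test is stale")
    return findings


def satisfies_dr_requirement(record: BackupRecord, now: datetime) -> bool:
    """A backup counts only if it is fresh and has a recent test restore."""
    return not backup_findings(record, now)
```

A record with a recent backup but no test restore fails the check, which mirrors the rule that an unrestorable backup does not satisfy the DR requirement.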

Recovery workflow requirements

The platform must define recovery workflows for:

  • control-plane rebuild or restore
  • database restore and promotion
  • JetStream restore or recovery path
  • secrets backend restore
  • inventory and approval state restore
  • operator recovery visibility during degraded conditions

Recovery should be modeled as an operational workflow, not as tribal knowledge.

Recovery claims should also be exercised in test environments that can simulate site loss, cross-site rebuild, and delayed restoration rather than relying only on document review.
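
To show what "modeled as an operational workflow" could mean in practice, the sketch below expresses recovery steps as a dependency graph and derives a valid execution order with Python's standard `graphlib` module. The dependencies between steps are purely hypothetical assumptions for illustration. This ADR does not define the order in which the workflows must run.

```python
from graphlib import TopologicalSorter

# Hypothetical dependencies: each step maps to the steps that must finish first.
# The step names follow the workflow list above; the ordering is an assumption.
RECOVERY_STEPS = {
    "secrets backend restore": set(),
    "database restore and promotion": {"secrets backend restore"},
    "JetStream restore": {"secrets backend restore"},
    "inventory and approval state restore": {"database restore and promotion"},
    "control-plane rebuild": {"database restore and promotion", "JetStream restore"},
}


def recovery_order(steps):
    """Return one execution order that respects every declared dependency."""
    return list(TopologicalSorter(steps).static_order())


if __name__ == "__main__":
    for position, step in enumerate(recovery_order(RECOVERY_STEPS), start=1):
        print(position, step)
```

Declaring the dependencies as data means a circular or missing prerequisite surfaces as an error (`graphlib.CycleError` for cycles) when the workflow is defined, not during an outage.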

Multi-datacenter requirements

Because Dataverket is multi-datacenter, DR must explicitly define:

  • what survives the loss of one datacenter
  • what must be restored from backup rather than failed over
  • whether recovery happens in-place, cross-site, or by rebuilding elsewhere
  • what operator actions are required during site loss

The platform should never imply that “multi-datacenter” automatically means “fully disaster-proof”.
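
The four questions above can be captured as an explicit declaration per component, so that gaps are detectable. The following is a minimal sketch under assumed names: `SiteLossPlan`, its fields, and the example values are hypothetical and do not describe the actual behavior of any Dataverket component.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class SiteLossBehavior(Enum):
    SURVIVES = "survives the loss of one datacenter"
    RESTORE_FROM_BACKUP = "must be restored from backup"


class RecoveryLocation(Enum):
    IN_PLACE = "in-place"
    CROSS_SITE = "cross-site"
    REBUILD_ELSEWHERE = "rebuild elsewhere"


@dataclass(frozen=True)
class SiteLossPlan:
    component: str
    behavior: SiteLossBehavior
    location: RecoveryLocation
    operator_actions: List[str]  # actions required during site loss


def undeclared_components(required: List[str], plans: List[SiteLossPlan]) -> List[str]:
    """Return the required components that have no site-loss plan declared."""
    declared = {plan.component for plan in plans}
    return [name for name in required if name not in declared]


# Hypothetical usage: two components in scope, only one has a declared plan.
required = ["PostgreSQL control-plane databases", "secrets backend state"]
plans = [
    SiteLossPlan(
        "PostgreSQL control-plane databases",
        SiteLossBehavior.SURVIVES,
        RecoveryLocation.CROSS_SITE,
        ["confirm promotion in the surviving site"],
    )
]
print(undeclared_components(required, plans))
```

A component with no declaration is reported, not silently assumed to survive, which is the failure mode the "multi-datacenter" caveat warns against.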

Exercising recovery

The DR posture must include recurring exercises for:

  • loss of one datacenter
  • restore from backup when failover is not sufficient
  • promotion or rebuild in an alternate site
  • operator runbook execution during partial or prolonged outage

These exercises may use simulation or staged environments rather than production, but they must be realistic enough to validate the stated recovery path.

Operator visibility requirements

Operators must be able to answer:

  • what data and state are protected
  • where backups or replicas exist
  • what the expected recovery path is
  • which systems are currently in a recoverable versus non-recoverable posture

DR posture must be inspectable, not assumed.
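
As a sketch of what an inspectable posture could look like, the example below renders one line per system that answers the four operator questions. The `PostureEntry` structure, the example system names, and the rule for "recoverable" (at least one copy exists and a recovery path is documented) are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PostureEntry:
    system: str
    protected_state: str          # what data and state are protected
    copy_locations: List[str]     # where backups or replicas exist
    recovery_path: Optional[str]  # name of the expected recovery workflow, if any

    @property
    def recoverable(self) -> bool:
        # Assumed rule: at least one copy exists and a recovery path is documented.
        return bool(self.copy_locations) and self.recovery_path is not None


def posture_report(entries: List[PostureEntry]) -> List[str]:
    """Render one summary line per system, sorted by system name."""
    lines = []
    for entry in sorted(entries, key=lambda e: e.system):
        status = "recoverable" if entry.recoverable else "NOT recoverable"
        copies = ", ".join(entry.copy_locations) or "none"
        path = entry.recovery_path or "undefined"
        lines.append(
            f"{entry.system}: {status} | state: {entry.protected_state} "
            f"| copies: {copies} | path: {path}"
        )
    return lines
```

Deriving the recoverable flag from recorded facts, with no hand-set status, keeps the report from drifting away from the actual state of backups and workflows.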

Explicit non-decisions for now

This ADR intentionally does not yet choose:

  • exact RPO and RTO targets
  • exact backup tooling across all domains
  • exact cross-site restore orchestration
  • exact product-specific DR guarantees

Those require later selection and product-specific ADRs.

Consequences

  • disaster recovery is now wider than PostgreSQL alone
  • platform claims about resilience must be backed by actual recovery design
  • service teams must think about recoverability as part of architecture, not only availability

Decision Outcome

Proposed. This ADR records the current preferred direction and still needs acceptance before it becomes binding.

More Information

  • RPO and RTO targets by recovery class
  • JetStream backup and recovery strategy
  • secrets backend backup and recovery strategy
  • infrastructure configuration backup and restore strategy

Audit

  • 2026-03-14: ADR proposed.