End-to-end disaster recovery DATAVERKET 026

Status: Proposed
Tags: Infrastructure, Recovery, Disaster recovery, Availability, Backup, Failover

Defines end-to-end disaster recovery as a platform-wide capability spanning control-plane state, workflow transport, secrets, and operator recovery workflows.

Author: Lars Solem
Updated: 2026-03-14

Context

Dataverket already has a multi-datacenter failover model, PostgreSQL backup expectations, and storage durability guidance.

What is still missing is an explicit end-to-end disaster recovery posture across the whole platform. Failover and disaster recovery are related, but they are not the same thing.

A service may fail over while still losing state, and a platform may recover data without being able to resume control-plane operation quickly. The architecture needs one clear DR frame that spans those cases.

Decision

Dataverket treats disaster recovery as an explicit platform-wide capability, not a database-only concern.

The disaster recovery model must cover:

control-plane state recovery
workflow transport recovery
inventory and approval state recovery
infrastructure configuration recovery
secrets and credential-store recovery
product-specific data recovery expectations

DR versus failover

Dataverket distinguishes clearly between:

Failover: Moving active responsibility to another site or component during an outage.
Disaster recovery: Restoring platform capability and data after loss, corruption, or extended unavailability.

The platform must not use those terms interchangeably.

DR scope

The DR posture must account for at least:

PostgreSQL control-plane databases
NATS JetStream durable workflow state
inventory graph and trust state
pending approvals and operator workflow state
Talos machine and cluster configuration inputs
storage and persistence metadata
secrets backend state
network intent and rendered configuration history

Recovery classes

The platform should classify recovery needs at least into:

Control-plane critical: Services without which operators lose control or visibility.
Platform capability critical: Services required to provision, recover, or host tenant workloads.
Tenant data critical: Product-specific data with explicit durability and recovery expectations.

Classifying recovery needs this way avoids pretending that every system needs the same recovery point objective (RPO) and recovery time objective (RTO).

For PostgreSQL-backed control-plane state, these classes should be applied per bounded context or service database rather than once for “the cluster” as a whole.
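
As an illustration only, the per-database classification could be recorded roughly as follows. This is a minimal sketch: the database names, `DatabaseAssignment`, `group_by_class`, and `unclassified` are hypothetical and not part of any existing Dataverket component.

```python
from dataclasses import dataclass
from enum import Enum


class RecoveryClass(Enum):
    """The three recovery classes named in this ADR."""
    CONTROL_PLANE_CRITICAL = "control-plane critical"
    PLATFORM_CAPABILITY_CRITICAL = "platform capability critical"
    TENANT_DATA_CRITICAL = "tenant data critical"


@dataclass(frozen=True)
class DatabaseAssignment:
    database: str                  # hypothetical service database name
    recovery_class: RecoveryClass


# Hypothetical assignments: one class per service database,
# not one class for the whole PostgreSQL cluster.
ASSIGNMENTS = [
    DatabaseAssignment("inventory", RecoveryClass.CONTROL_PLANE_CRITICAL),
    DatabaseAssignment("approvals", RecoveryClass.CONTROL_PLANE_CRITICAL),
    DatabaseAssignment("provisioning", RecoveryClass.PLATFORM_CAPABILITY_CRITICAL),
]


def group_by_class(assignments):
    """Return a mapping from each recovery class to its databases."""
    grouped = {cls: [] for cls in RecoveryClass}
    for assignment in assignments:
        grouped[assignment.recovery_class].append(assignment.database)
    return grouped


def unclassified(databases, assignments):
    """Return databases that have no recovery class assigned yet."""
    classified = {a.database for a in assignments}
    return [db for db in databases if db not in classified]
```

A check such as `unclassified(all_databases, ASSIGNMENTS)` makes gaps visible: a database without a class has no stated recovery expectation.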

Backup requirements

Backups must be:

regular
observable
restorable
scoped to the systems they protect
aligned with actual recovery priorities

Backups that cannot be restored under test conditions do not satisfy the DR requirement.
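
The rule that an untested backup does not count could be expressed as a check like the one below. This is a sketch under assumptions: `BackupStatus`, `satisfies_dr_requirement`, and both age thresholds are hypothetical, and this ADR does not choose actual values for them.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class BackupStatus:
    system: str
    last_backup_at: Optional[datetime]        # None if no backup exists
    last_restore_test_at: Optional[datetime]  # None if never restore-tested
    last_restore_test_passed: bool


def satisfies_dr_requirement(status, now, max_backup_age, max_restore_test_age):
    """A backup counts only if it is recent AND a recent restore test passed.

    Both thresholds are assumptions supplied by the caller.
    """
    if status.last_backup_at is None or status.last_restore_test_at is None:
        return False
    backup_is_recent = now - status.last_backup_at <= max_backup_age
    restore_is_proven = (
        status.last_restore_test_passed
        and now - status.last_restore_test_at <= max_restore_test_age
    )
    return backup_is_recent and restore_is_proven
```

With this shape, a system that backs up every hour but has never been restore-tested is reported as not satisfying the requirement, which matches the intent above.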

Recovery workflow requirements

The platform must define recovery workflows for:

control-plane rebuild or restore
database restore and promotion
JetStream restore or recovery path
secrets backend restore
inventory and approval state restore
operator recovery visibility during degraded conditions

Recovery should be modeled as an operational workflow, not as tribal knowledge.

Recovery claims should also be exercised in test environments that can simulate site loss, cross-site rebuild, and delayed restoration rather than relying only on document review.
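
One way to make a recovery workflow explicit rather than tribal knowledge is to record its steps and their prerequisites as data. The sketch below uses Python's standard `graphlib` module for ordering. The step names and, in particular, the dependencies between them are hypothetical; this ADR does not define the real sequencing.

```python
from graphlib import TopologicalSorter

# Maps each step to the steps that must complete before it.
# The dependency edges are illustrative assumptions only.
RECOVERY_STEPS = {
    "secrets-backend-restore": set(),
    "database-restore-and-promotion": {"secrets-backend-restore"},
    "jetstream-recovery": {"secrets-backend-restore"},
    "inventory-and-approval-state-restore": {"database-restore-and-promotion"},
    "control-plane-rebuild": {
        "database-restore-and-promotion",
        "jetstream-recovery",
        "inventory-and-approval-state-restore",
    },
}


def recovery_order(steps):
    """Return one valid execution order that respects all prerequisites.

    Raises graphlib.CycleError if the declared dependencies are circular.
    """
    return list(TopologicalSorter(steps).static_order())


if __name__ == "__main__":
    for position, step in enumerate(recovery_order(RECOVERY_STEPS), start=1):
        print(position, step)
```

Declaring the workflow as data also means a circular or missing prerequisite is caught when the plan is reviewed, not during an outage.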

Multi-datacenter requirements

Because Dataverket is multi-datacenter, DR must explicitly define:

what survives the loss of one datacenter
what must be restored from backup rather than failed over
whether recovery happens in-place, cross-site, or by rebuilding elsewhere
what operator actions are required during site loss

The platform should never imply that “multi-datacenter” automatically means “fully disaster-proof”.
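
The four questions above could be answered per component in a small record, so that missing answers are detectable. This is a sketch only: `SiteLossPlan`, the enum members, and the helper functions are hypothetical names, and no component's actual behavior is implied.

```python
from dataclasses import dataclass
from enum import Enum


class SiteLossBehavior(Enum):
    SURVIVES = "survives loss of one datacenter"
    FAILS_OVER = "fails over to another site"
    RESTORED_FROM_BACKUP = "must be restored from backup"


class RecoveryLocation(Enum):
    IN_PLACE = "in-place"
    CROSS_SITE = "cross-site"
    REBUILD_ELSEWHERE = "rebuild elsewhere"


@dataclass(frozen=True)
class SiteLossPlan:
    component: str
    behavior: SiteLossBehavior
    location: RecoveryLocation
    operator_actions: tuple = ()   # actions operators must take during site loss


def components_without_plan(components, plans):
    """Return components for which site-loss behavior is still undefined."""
    planned = {plan.component for plan in plans}
    return [c for c in components if c not in planned]


def needing_restore(plans):
    """Return components that cannot rely on failover alone."""
    return [
        plan.component
        for plan in plans
        if plan.behavior is SiteLossBehavior.RESTORED_FROM_BACKUP
    ]
```

Any component returned by `components_without_plan` is one where "multi-datacenter" is currently an assumption, not a defined recovery path.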

Exercising recovery

The DR posture must include recurring exercises for:

loss of one datacenter
restore from backup when failover is not sufficient
promotion or rebuild in an alternate site
operator runbook execution during partial or prolonged outage

These exercises may use simulation or staged environments rather than production, but they must be realistic enough to validate the stated recovery path.
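
Whether exercises are in fact recurring could be tracked with a check like the following. This is a sketch: the scenario identifiers mirror the list above, while `overdue_exercises` and the interval passed to it are hypothetical, since this ADR does not set an exercise cadence.

```python
from datetime import date, timedelta

# Scenario identifiers derived from the exercise list in this section.
SCENARIOS = (
    "loss-of-one-datacenter",
    "restore-from-backup",
    "alternate-site-promotion-or-rebuild",
    "operator-runbook-execution",
)


def overdue_exercises(last_run, today, interval):
    """Return scenarios never exercised or last exercised too long ago.

    last_run maps scenario name to the date of its most recent exercise.
    interval is an assumed maximum gap between exercises.
    """
    overdue = []
    for scenario in SCENARIOS:
        ran_on = last_run.get(scenario)
        if ran_on is None or today - ran_on > interval:
            overdue.append(scenario)
    return overdue
```

A scenario that has never been run is treated the same as one that is overdue, so a newly added scenario cannot silently go unexercised.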

Operator visibility requirements

Operators must be able to answer:

what data and state are protected
where backups or replicas exist
what the expected recovery path is
which systems are currently in a recoverable versus non-recoverable posture

DR posture must be inspectable, not assumed.
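
An inspectable posture could be as simple as one record per system that answers the four questions above. In this sketch, `PostureEntry` and `posture_summary` are hypothetical names, and the rule that a system is recoverable only when it has at least one backup or replica location and a stated recovery path is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PostureEntry:
    system: str
    protected_state: str       # what data and state are protected
    backup_locations: tuple    # where backups or replicas exist
    recovery_path: str         # the expected recovery path, empty if undefined

    @property
    def is_recoverable(self):
        # Assumed rule: needs a known location and a stated path.
        return bool(self.backup_locations) and bool(self.recovery_path)


def posture_summary(entries):
    """Split systems into recoverable and non-recoverable posture."""
    return {
        "recoverable": sorted(e.system for e in entries if e.is_recoverable),
        "non_recoverable": sorted(e.system for e in entries if not e.is_recoverable),
    }
```

Deriving the recoverable flag from the recorded facts, rather than storing it as a manually set field, keeps the summary from drifting away from what is actually known.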

Explicit non-decisions for now

This ADR intentionally does not yet choose:

exact RPO and RTO targets
exact backup tooling across all domains
exact cross-site restore orchestration
exact product-specific DR guarantees

Those choices are deferred to later selection work and to product-specific ADRs.

Consequences

disaster recovery is now wider than PostgreSQL alone
platform claims about resilience must be backed by actual recovery design
service teams must think about recoverability as part of architecture, not only availability

Decision Outcome

Proposed. This ADR records the current preferred direction and still needs acceptance before it becomes binding.

Multi-datacenter posture must align with 012-inter-datacenter-topology-and-failover.md.
PostgreSQL recovery must align with 018-postgresql-control-plane-ha-and-backup.md.
Storage durability and restore behavior must align with 015-storage-platform-and-persistence-strategy.md.
Operator visibility must align with 022-operator-visibility-and-control-surface.md.
Upgrade and restore sequencing must align with 025-upgrade-and-migration-strategy.md.

More Information

RPO and RTO targets by recovery class
JetStream backup and recovery strategy
secrets backend backup and recovery strategy
infrastructure configuration backup and restore strategy

Audit

2026-03-14: ADR proposed.