Status
Proposed on 2026-03-14 by Lars Solem.
Context
Dataverket already has a multi-datacenter failover model, PostgreSQL backup expectations, and storage durability guidance.
What is still missing is an explicit end-to-end disaster recovery posture across the whole platform. Failover and disaster recovery are related, but they are not the same thing.
A service may fail over while still losing state, and a platform may recover data without being able to resume control-plane operation quickly. The architecture needs one clear DR frame that spans those cases.
Decision
Dataverket treats disaster recovery as an explicit platform-wide capability, not a database-only concern.
The disaster recovery model must cover:
- control-plane state recovery
- workflow transport recovery
- inventory and approval state recovery
- infrastructure configuration recovery
- secrets and credential-store recovery
- product-specific data recovery expectations
DR versus failover
Dataverket distinguishes clearly between:
Failover: Moving active responsibility to another site or component during an outage.
Disaster recovery: Restoring platform capability and data after loss, corruption, or extended unavailability.
The platform must not use those terms interchangeably.
DR scope
The DR posture must account for at least:
- PostgreSQL control-plane databases
- NATS JetStream durable workflow state
- inventory graph and trust state
- pending approvals and operator workflow state
- Talos machine and cluster configuration inputs
- storage and persistence metadata
- secrets backend state
- network intent and rendered configuration history
Recovery classes
The platform should classify recovery needs at least into:
Control-plane critical: Services without which operators lose control or visibility.
Platform capability critical: Services required to provision, recover, or host tenant workloads.
Tenant data critical: Product-specific data with explicit durability and recovery expectations.
Classifying recovery needs this way avoids assuming that every system needs the same recovery point objective (RPO) and recovery time objective (RTO).
For PostgreSQL-backed control-plane state, these classes should be applied per bounded context or service database rather than once for “the cluster” as a whole.
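As an illustration of per-database classification, the following minimal sketch models the three recovery classes and assigns one to each service database. The database names (`inventory-db`, `approvals-db`, `provisioning-db`) and their assignments are hypothetical and are not decided by this ADR.

```python
from dataclasses import dataclass
from enum import Enum


class RecoveryClass(Enum):
    """The three recovery classes named in this ADR."""
    CONTROL_PLANE_CRITICAL = "control-plane critical"
    PLATFORM_CAPABILITY_CRITICAL = "platform capability critical"
    TENANT_DATA_CRITICAL = "tenant data critical"


@dataclass(frozen=True)
class ProtectedSystem:
    name: str  # hypothetical identifier for a service database
    recovery_class: RecoveryClass


# Hypothetical example: each service database is classified on its own,
# not once for the PostgreSQL cluster as a whole.
SYSTEMS = [
    ProtectedSystem("inventory-db", RecoveryClass.CONTROL_PLANE_CRITICAL),
    ProtectedSystem("approvals-db", RecoveryClass.CONTROL_PLANE_CRITICAL),
    ProtectedSystem("provisioning-db", RecoveryClass.PLATFORM_CAPABILITY_CRITICAL),
]


def group_by_class(systems):
    """Return a dict mapping every recovery class to the system names in it."""
    grouped = {recovery_class: [] for recovery_class in RecoveryClass}
    for system in systems:
        grouped[system.recovery_class].append(system.name)
    return grouped
```

Grouping by class makes it visible when a class has no members yet, or when one class has accumulated more systems than its eventual RPO and RTO targets can realistically cover.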
Backup requirements
Backups must be:
- regular
- observable
- restorable
- scoped to the systems they protect
- aligned with actual recovery priorities
Backups that cannot be restored under test conditions do not satisfy the DR requirement.
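The restorability rule above can be expressed as a simple check. This is a sketch under stated assumptions: the `BackupRecord` fields and the two age thresholds are hypothetical, and this ADR deliberately does not fix RPO or RTO values.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class BackupRecord:
    # All field names are illustrative assumptions, not an existing schema.
    system: str
    last_backup_at: Optional[datetime]
    last_restore_test_at: Optional[datetime]
    last_restore_test_passed: bool


def satisfies_dr_requirement(
    record: BackupRecord,
    now: datetime,
    max_backup_age: timedelta,
    max_restore_test_age: timedelta,
) -> bool:
    """A backup counts only if it is recent AND a recent restore test passed."""
    if record.last_backup_at is None or record.last_restore_test_at is None:
        return False
    backup_fresh = now - record.last_backup_at <= max_backup_age
    restore_proven = (
        record.last_restore_test_passed
        and now - record.last_restore_test_at <= max_restore_test_age
    )
    return backup_fresh and restore_proven
```

A backup that exists but has never been restored under test fails this check, which matches the intent that an untested backup does not count toward the DR requirement.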
Recovery workflow requirements
The platform must define recovery workflows for:
- control-plane rebuild or restore
- database restore and promotion
- JetStream restore or recovery path
- secrets backend restore
- inventory and approval state restore
- operator recovery visibility during degraded conditions
Recovery should be modeled as an operational workflow, not as tribal knowledge.
Recovery claims should also be exercised in test environments that can simulate site loss, cross-site rebuild, and delayed restoration rather than relying only on document review.
Multi-datacenter requirements
Because Dataverket is multi-datacenter, DR must explicitly define:
- what survives the loss of one datacenter
- what must be restored from backup rather than failed over
- whether recovery happens in-place, cross-site, or by rebuilding elsewhere
- what operator actions are required during site loss
The platform should never imply that “multi-datacenter” automatically means “fully disaster-proof”.
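One way to keep these definitions explicit is to require every in-scope component to declare its site-loss path and to flag any that have not. The sketch below assumes hypothetical component names and example assignments; none of them are decisions made by this ADR.

```python
from enum import Enum
from typing import Dict, Iterable, List


class SiteLossPath(Enum):
    """Possible answers to 'what happens when one datacenter is lost?'"""
    SURVIVES = "survives loss of one datacenter"
    FAILOVER = "fails over to another site"
    RESTORE_FROM_BACKUP = "restored from backup"
    REBUILD_ELSEWHERE = "rebuilt in an alternate site"


def undeclared_components(
    components: Iterable[str],
    declared: Dict[str, SiteLossPath],
) -> List[str]:
    """Return, sorted, the components that have no declared site-loss path."""
    return sorted(c for c in components if c not in declared)


# Hypothetical inventory and declarations, for illustration only.
COMPONENTS = ["postgresql-control-plane", "jetstream-workflow-state", "secrets-backend"]
DECLARED = {
    "postgresql-control-plane": SiteLossPath.FAILOVER,
    "jetstream-workflow-state": SiteLossPath.RESTORE_FROM_BACKUP,
}
```

With the example data, `undeclared_components(COMPONENTS, DECLARED)` reports `secrets-backend`, which is exactly the kind of gap that an implicit "multi-datacenter" claim would hide.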
Exercising recovery
The DR posture must include recurring exercises for:
- loss of one datacenter
- restore from backup when failover is not sufficient
- promotion or rebuild in an alternate site
- operator runbook execution during partial or prolonged outage
These exercises may use simulation or staged environments rather than production, but they must be realistic enough to validate the stated recovery path.
Operator visibility requirements
Operators must be able to answer:
- what data and state are protected
- where backups or replicas exist
- what the expected recovery path is
- which systems are currently in a recoverable versus non-recoverable posture
DR posture must be inspectable, not assumed.
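The four operator questions above map naturally onto a per-system posture record. The following is a minimal sketch; the `PostureEntry` fields, the rule for what counts as recoverable, and the output format are all assumptions made for illustration, not an existing Dataverket interface.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PostureEntry:
    # Hypothetical fields, one per operator question.
    system: str
    protected_state: str                                         # what is protected
    backup_locations: List[str] = field(default_factory=list)    # where copies exist
    recovery_path: str = ""                                      # expected recovery path
    restore_verified: bool = False                               # has a restore been exercised

    @property
    def recoverable(self) -> bool:
        """Assumed rule: copies exist, a path is defined, and a restore was verified."""
        return bool(self.backup_locations) and bool(self.recovery_path) and self.restore_verified


def render_posture(entries: List[PostureEntry]) -> List[str]:
    """Render one human-readable line per system, sorted by system name."""
    lines = []
    for entry in sorted(entries, key=lambda e: e.system):
        status = "recoverable" if entry.recoverable else "NOT recoverable"
        copies = ", ".join(entry.backup_locations) or "none"
        path = entry.recovery_path or "undefined"
        lines.append(
            f"{entry.system}: {status} "
            f"(state: {entry.protected_state}; copies: {copies}; path: {path})"
        )
    return lines
```

Treating a system as non-recoverable until all three conditions hold keeps the default pessimistic, so the report shows a gap instead of an assumed capability.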
Explicit non-decisions for now
This ADR intentionally does not yet choose:
- exact RPO and RTO targets
- exact backup tooling across all domains
- exact cross-site restore orchestration
- exact product-specific DR guarantees
Those choices are deferred to later tooling selection and to product-specific ADRs.
Consequences
- disaster recovery scope now extends beyond PostgreSQL to the whole platform
- platform claims about resilience must be backed by actual recovery design
- service teams must think about recoverability as part of architecture, not only availability
Decision Outcome
Proposed. This ADR records the current preferred direction and still needs acceptance before it becomes binding.
Related Decisions
- Multi-datacenter posture must align with 012-inter-datacenter-topology-and-failover.md.
- PostgreSQL recovery must align with 018-postgresql-control-plane-ha-and-backup.md.
- Storage durability and restore behavior must align with 015-storage-platform-and-persistence-strategy.md.
- Operator visibility must align with 022-operator-visibility-and-control-surface.md.
- Upgrade and restore sequencing must align with 025-upgrade-and-migration-strategy.md.
More Information
- RPO and RTO targets by recovery class
- JetStream backup and recovery strategy
- secrets backend backup and recovery strategy
- infrastructure configuration backup and restore strategy
Audit
- 2026-03-14: ADR proposed.