Status
Proposed on 2026-03-14 by Lars Solem.
Context
Dataverket intends to run across two or more datacenters and use NATS as the standard communication path between them.
That requires a concrete v1 stance on:
- how datacenters are modeled
- what kind of failover the platform supports
- how much cross-site consistency services may assume
Without this, the platform risks mixing single-site assumptions with multi-site marketing language.
Decision
Dataverket treats each datacenter as an explicit failure domain and adopts a site-active, service-level active/passive failover model for v1.
NATS is the standard internal control-plane transport within each datacenter and the standard inter-datacenter control-plane transport between sites.
NATS is an internal platform transport. It is not the public product integration surface exposed to tenants or external automation clients.
The v1 cross-site model is:
- two or more datacenters
- multiple datacenters may be active at the same time
- independent site-local operation where possible
- explicit inter-site links for control-plane and replication traffic
- service-specific failover rather than global transparent magic
Why site-active with service-level active/passive
The goal is to avoid wasting an entire datacenter as a permanently idle standby while still keeping failover behavior understandable.
This model allows:
- datacenter A to be active for some projects or services
- datacenter B to be active for other projects or services
- each individual service deployment to still use a clear active/passive failover posture where needed
The platform needs a realistic first step. Active/active multi-site systems are significantly harder because they require:
- stronger consistency design
- more complex split-brain handling
- more advanced traffic steering
- deeper service-by-service reconciliation logic
A site-active, service-level active/passive model allows Dataverket to use both datacenters productively while keeping failover semantics explicit and operable.
Control-plane communication
NATS is the standard communication path between datacenters for:
- control-plane coordination
- task and lifecycle signaling
- failover orchestration
- site-aware service events
The platform should prefer site-local service behavior where possible. Cross-site NATS is for explicit coordination, replication signaling, and orchestrated failover transitions, not for pretending the whole platform is one low-latency site.
Services must assume that inter-site links can be:
- slow
- partitioned
- temporarily unavailable
No service may assume globally consistent, zero-latency cross-site messaging.
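As a rough illustration of what tolerating slow or partitioned links means in practice, the sketch below wraps a cross-site call in an explicit time budget and surfaces failure instead of retrying forever. The function, exception name, and parameters are illustrative assumptions, not platform API:

```python
import time


class InterSiteUnavailable(Exception):
    """Raised when the remote site cannot be reached within the budget."""


def call_remote_site(send_request, *, attempts=3, timeout_s=2.0, backoff_s=0.5):
    """Attempt a cross-site request with bounded retries.

    `send_request(timeout_s)` is a hypothetical callable that performs one
    attempt and raises TimeoutError when the link is slow or partitioned.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return send_request(timeout_s)
        except TimeoutError as exc:
            last_error = exc
            time.sleep(backoff_s * (2 ** attempt))  # exponential backoff
    # Surface the failure explicitly rather than pretending the link is reliable.
    raise InterSiteUnavailable(
        f"remote site unreachable after {attempts} attempts"
    ) from last_error
```

The point is the shape, not the numbers: every cross-site interaction has a bounded budget and a visible failure mode.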
Site model
Each datacenter must have:
- its own local network fabric
- its own local inventory representation
- its own local compute and storage capacity
- explicit site identity in API, inventory, and event metadata
Shared resources that span sites must model that fact explicitly rather than pretending they are single-site.
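One minimal way to make site identity explicit in event metadata can be sketched as follows. The field names here are illustrative assumptions; the actual envelope is defined in 007-nats-subject-and-event-envelope.md:

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SiteScopedEvent:
    """Hypothetical site-tagged event metadata (illustrative fields only)."""
    event_id: str
    site: str      # explicit failure-domain identity, e.g. "dc-a"
    subject: str
    payload: dict

    def to_json(self) -> str:
        # Stable serialization so the site tag travels with every message.
        return json.dumps(asdict(self), sort_keys=True)


event = SiteScopedEvent(
    event_id="evt-1",
    site="dc-a",
    subject="platform.vm.created",
    payload={"vm": "vm-42"},
)
```

The serialized form carries `site` alongside the payload, so consumers, inventory, and audit trails can attribute every event to a concrete failure domain.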
Failover model
The v1 failover model is service-specific:
- both datacenters may host active workloads at the same time
- for a given project, environment, or service instance, one datacenter is primary and another may be standby
- control-plane services may have a designated primary site and one or more standby sites
- tenant workloads may be recreated or reactivated in another datacenter when the product supports it
- routing and traffic movement are orchestrated transitions, not implicit assumptions
Failover is therefore not defined as “everything keeps running automatically everywhere”. It is defined as “the platform can move or restore supported services to another datacenter in a controlled way”.
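The controlled-transition idea can be sketched as an explicit demote/promote step that refuses to run without approval. All names below are hypothetical, not the platform's actual orchestration interface:

```python
from enum import Enum


class Role(Enum):
    PRIMARY = "primary"
    STANDBY = "standby"


class ServiceSite:
    """Illustrative per-service role at one site; real orchestration would
    be driven by tracked tasks, not in-memory state."""
    def __init__(self, name: str, role: Role):
        self.name = name
        self.role = role


def fail_over(old_primary: ServiceSite, new_primary: ServiceSite, *, approved: bool):
    """Controlled promotion: validate, demote, then promote."""
    if not approved:
        raise PermissionError("failover requires explicit approval in v1")
    if new_primary.role is not Role.STANDBY:
        raise ValueError(f"{new_primary.name} is not a standby")
    old_primary.role = Role.STANDBY
    new_primary.role = Role.PRIMARY
```

The guard clauses encode the stance above: failover is a deliberate, validated transition, never an implicit side effect.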
Service categories
The v1 service categories are:
Category A: control-plane critical
Examples:
- Sentral
- Identitet integration components
- NATS infrastructure
These need explicit primary/standby or quorum-aware design from the beginning.
Category B: replicated platform services
Examples:
- container registry metadata and images
- object storage control state
- PostgreSQL service control plane
These should support replication-aware design, but may still expose active/passive failover in v1.
Category C: tenant workloads
Examples:
- VMs
- Kubernetes clusters
- apps
These should be placeable in a chosen datacenter, with different projects or workloads allowed to use different primary datacenters. Cross-site failover support is added per product capability rather than assumed universally.
NATS implications
Inter-datacenter NATS is part of the platform baseline.
That means:
- site identity must be present in message metadata where relevant
- failover workflows must be expressible through NATS commands and events
- consumers must tolerate delayed, duplicated, or replayed cross-site messages
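A consumer that tolerates duplicated or replayed cross-site messages can be sketched as deduplication by event ID. This in-memory version is illustrative only; a durable mechanism (for example JetStream's `Nats-Msg-Id` deduplication) would be used in practice:

```python
class IdempotentConsumer:
    """Drop duplicates and replays by event ID (sketch only)."""

    def __init__(self, max_ids: int = 10_000):
        self._seen = set()
        self._max_ids = max_ids
        self.processed = []

    def handle(self, event_id: str, payload: dict) -> bool:
        """Return True if the message was new and processed."""
        if event_id in self._seen:
            return False  # duplicate or replay: ignore safely
        if len(self._seen) >= self._max_ids:
            self._seen.clear()  # crude bound; real code needs ordered eviction
        self._seen.add(event_id)
        self.processed.append(payload)
        return True
```

The contract matters more than the implementation: handling the same event twice must be a no-op, because cross-site delivery will eventually repeat itself.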
The architecture does not require one globally uniform behavior for all subjects. A later topology ADR must classify which streams and subjects are:
- site-local only
- replicated or mirrored between sites
- forwarded cross-site only for explicit coordination workflows
NATS is the coordination layer, not proof that application data itself is replicated correctly.
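The real classification scheme belongs to the later topology ADR. As a sketch of the idea, subject scope could be derived from a naming convention; the prefixes below are invented for illustration:

```python
# Hypothetical prefix convention mapping subjects to cross-site scope.
SCOPE_PREFIXES = {
    "local.": "site-local",     # never leaves the datacenter
    "mirror.": "replicated",    # mirrored between sites
    "xsite.": "forwarded",      # forwarded only for explicit coordination
}


def classify_subject(subject: str) -> str:
    """Derive a subject's cross-site scope from its prefix."""
    for prefix, scope in SCOPE_PREFIXES.items():
        if subject.startswith(prefix):
            return scope
    # Default to the least cross-site coupling when unclassified.
    return "site-local"
```

Defaulting unclassified subjects to site-local keeps new traffic from silently becoming an inter-site dependency.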
Routing implications
Nett must support:
- routed inter-datacenter links
- site-aware traffic steering
- failover-triggered route and policy changes
- explicit separation between local east-west traffic and inter-site failover traffic
Data implications
This ADR does not claim that all data-bearing services are automatically multi-site safe.
Each stateful product still needs its own durability and replication design. In particular:
- object storage replication must be defined separately
- database replication and promotion must be defined separately
- VM disk replication policy must be defined separately
- Kubernetes cluster failover semantics must be defined separately
Operational model
Failover should be modeled as an orchestrated workflow with:
- detection input
- operator visibility
- explicit task tracking
- reversible or well-audited transitions where possible
The platform may automate parts of failover, but v1 should avoid hidden autonomous behavior that is hard to reason about during outages.
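An orchestrated, audited failover workflow might track state along these lines; the record type and field names are assumptions for illustration, not Sentral's actual task model:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class FailoverTask:
    """Illustrative audited failover task."""
    service: str
    from_site: str
    to_site: str
    trigger: str                         # detection input, e.g. "site-probe-loss"
    approved_by: Optional[str] = None    # operator visibility and approval
    log: list = field(default_factory=list)

    def record(self, step: str):
        """Append a timestamped step so every transition is auditable."""
        self.log.append((datetime.now(timezone.utc).isoformat(), step))
```

Keeping the trigger, approver, and step log on one record is what makes the transition reviewable during and after an outage.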
Testability requirements
The multi-datacenter posture must be testable in environments that can simulate more than one site and inject inter-site failure.
At minimum, the architecture should support repeated testing of:
- inter-site partition
- site-local service loss
- delayed or replayed cross-site messages
- failover triggering and operator approval paths
- recovery and rejoin after the failed site returns
Single-site integration tests and small hardware labs are not sufficient evidence for multi-datacenter failover claims on their own.
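A toy in-memory link like the following shows the kind of fault injection the test environments need to support: partition, buffered traffic, and replay on rejoin (which, per the NATS implications above, consumers must tolerate). It is a test-harness sketch, not platform code:

```python
class SimulatedLink:
    """Toy inter-site link for tests: can be partitioned and healed."""

    def __init__(self):
        self.partitioned = False
        self.delivered = []
        self.queued = []

    def send(self, msg):
        if self.partitioned:
            self.queued.append(msg)  # held until the partition heals
        else:
            self.delivered.append(msg)

    def heal(self):
        """End the partition and replay held messages."""
        self.partitioned = False
        self.delivered.extend(self.queued)
        self.queued.clear()
```

Even this trivial harness exercises the three behaviors above that single-site tests never reach: loss of connectivity, delayed delivery, and a burst of replayed messages at rejoin.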
Consequences
- multi-datacenter is now a real architectural constraint rather than a future option
- both datacenters can carry live workload in v1
- active/passive remains the default failover posture at the service or workload level
- NATS becomes a core inter-site dependency for platform coordination
- every major stateful service will need a follow-up replication or promotion design
- multi-site capabilities now require explicit failover and partition test coverage
Decision Outcome
Proposed. This ADR records the current preferred direction and still needs acceptance before it becomes binding.
Related Decisions
- This ADR builds on the NATS transport rules in 007-nats-subject-and-event-envelope.md.
- The network implications described here depend on the topology and service boundaries in 005-network-service-and-topology.md.
- Platform selection for switches and routers must satisfy the failover expectations defined here, as described in 013-network-platform-selection-criteria.md.
- Any eventual VM runtime choice must also be evaluated against this datacenter and failover model, as described in 011-vm-runtime-selection.md.
More Information
The following follow-up designs are expected before the model above is complete:
- inter-datacenter NATS topology
- database replication and promotion strategy
- object storage replication strategy
- VM disk replication and recovery model
- Kubernetes multi-site placement and failover model
Audit
- 2026-03-14: ADR proposed.