Inter-datacenter topology and failover (DATAVERKET-012)

Tags: infrastructure, topology, failover, multi-datacenter, NATS, recovery

Defines Dataverket's multi-datacenter failure-domain model and the initial site-active, service-level active/passive failover posture.

Author
Lars Solem
Updated

Status

Proposed on 2026-03-14 by Lars Solem.

Context

Dataverket intends to run across two or more datacenters and use NATS as the standard communication path between them.

That requires a concrete v1 stance on:

  • how datacenters are modeled
  • what kind of failover the platform supports
  • how much cross-site consistency services may assume

Without this, the platform risks mixing single-site assumptions with multi-site marketing language.

Decision

Dataverket treats each datacenter as an explicit failure domain and adopts a site-active, service-level active/passive failover model for v1.

NATS is the standard internal control-plane transport within each datacenter and the standard inter-datacenter control-plane transport between sites.

NATS is an internal platform transport. It is not the public product integration surface exposed to tenants or external automation clients.

The v1 cross-site model is:

  • two or more datacenters
  • both datacenters may be active at the same time
  • independent site-local operation where possible
  • explicit inter-site links for control-plane and replication traffic
  • service-specific failover rather than an implicit, globally transparent failover guarantee

Why site-active with service-level active/passive

The goal is to avoid wasting an entire datacenter as a permanently idle standby while still keeping failover behavior understandable.

This model allows:

  • datacenter A to be active for some projects or services
  • datacenter B to be active for other projects or services
  • each individual service deployment to still use a clear active/passive failover posture where needed

The platform needs a realistic first step. Active/active multi-site systems are significantly harder because they require:

  • stronger consistency design
  • more complex split-brain handling
  • more advanced traffic steering
  • deeper service-by-service reconciliation logic

A site-active, service-level active/passive model allows Dataverket to use both datacenters productively while keeping failover semantics explicit and operable.
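The model above can be sketched as a placement table: both datacenters carry live services at the same time, but each individual service keeps a single primary site and at most one passive standby. This is a minimal illustration; the site and service names are invented for the example, not part of the ADR.

```python
# Sketch of the site-active, service-level active/passive model.
# Site and service names are illustrative only.

from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Placement:
    service: str
    primary_site: str               # site that serves traffic for this service
    standby_site: Optional[str]     # optional passive standby; None = single-site

PLACEMENTS = [
    Placement("billing-api", primary_site="dc-a", standby_site="dc-b"),
    Placement("report-jobs", primary_site="dc-b", standby_site="dc-a"),
    Placement("dev-sandbox", primary_site="dc-b", standby_site=None),
]

def active_services(site: str) -> list:
    """Services that are currently active (primary) in the given datacenter."""
    return [p.service for p in PLACEMENTS if p.primary_site == site]

# Both datacenters carry live workload simultaneously,
# while every individual service stays active/passive.
print(active_services("dc-a"))  # ['billing-api']
print(active_services("dc-b"))  # ['report-jobs', 'dev-sandbox']
```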

Control-plane communication

NATS is the standard communication path between datacenters for:

  • control-plane coordination
  • task and lifecycle signaling
  • failover orchestration
  • site-aware service events

The platform should prefer site-local service behavior where possible, using cross-site NATS for explicit coordination, replication signaling, and orchestrated failover transitions. Cross-site NATS must not be used to pretend the whole platform is one low-latency site.

Services must assume that inter-site links can be:

  • slow
  • partitioned
  • temporarily unavailable

No service may assume globally consistent, zero-latency cross-site messaging.
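One practical consequence of these link assumptions: every cross-site call needs a bounded wait and an explicit give-up path. The sketch below shows one way to do that with a bounded retry and exponential backoff; the transport callable and exception type are stand-ins, since the ADR prescribes no concrete client API.

```python
# Sketch: cross-site calls must bound their wait and fail explicitly
# instead of blocking forever on a partitioned link. The transport
# (`send`) and the exception type are illustrative placeholders.

import time

class LinkPartitioned(Exception):
    """Raised by the (hypothetical) transport when the inter-site link is down."""

def call_cross_site(send, retries=3, base_delay=0.01):
    """Attempt a cross-site request a bounded number of times with backoff."""
    for attempt in range(retries):
        try:
            return send()
        except LinkPartitioned:
            if attempt == retries - 1:
                raise  # give up explicitly; the caller decides what to do
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff

# Simulated link that heals on the third attempt.
attempts = {"n": 0}
def flaky_send():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise LinkPartitioned()
    return "ack"

print(call_cross_site(flaky_send))  # "ack" after two failed attempts
```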

Site model

Each datacenter must have:

  • its own local network fabric
  • its own local inventory representation
  • its own local compute and storage capacity
  • explicit site identity in API, inventory, and event metadata

Shared resources that span sites must model that fact explicitly rather than pretending they are single-site.
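Explicit site identity in metadata can be as simple as a mandatory field in every event envelope, so consumers can tell local traffic from cross-site traffic. The field names and site names below are illustrative, not a schema decision.

```python
# Sketch: events carry explicit site identity so consumers can
# distinguish local from cross-site traffic. Field names are illustrative.

import json

def make_event(site: str, subject: str, payload: dict) -> str:
    envelope = {
        "site": site,        # explicit failure-domain identity
        "subject": subject,
        "payload": payload,
    }
    return json.dumps(envelope)

def is_local(event_json: str, my_site: str) -> bool:
    """True when the event originated in the consumer's own datacenter."""
    return json.loads(event_json)["site"] == my_site

evt = make_event("dc-a", "inventory.host.added", {"host": "h-0042"})
print(is_local(evt, "dc-a"))  # True
print(is_local(evt, "dc-b"))  # False
```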

Failover model

The v1 failover model is service-specific:

  • both datacenters may host active workloads at the same time
  • for a given project, environment, or service instance, one datacenter is primary and another may be standby
  • control-plane services may have a designated primary site and one or more standby sites
  • tenant workloads may be recreated or reactivated in another datacenter when the product supports it
  • routing and traffic movement are orchestrated transitions, not implicit assumptions

Failover is therefore not defined as “everything keeps running automatically everywhere”. It is defined as “the platform can move or restore supported services to another datacenter in a controlled way”.
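"Controlled" can be made concrete as a small state machine: a standby instance only becomes active through explicitly allowed transitions, and anything else is rejected. The states and action names below are a sketch, not a specified protocol.

```python
# Sketch of failover as an explicit, orchestrated transition rather than
# an implicit guarantee. States and actions are illustrative.

ALLOWED = {
    ("standby",   "promote"): "promoting",
    ("promoting", "confirm"): "active",
    ("active",    "demote"):  "standby",
}

def step(state: str, action: str) -> str:
    """Apply one failover action; reject anything not explicitly allowed."""
    try:
        return ALLOWED[(state, action)]
    except KeyError:
        raise ValueError(f"illegal transition: {action} from {state}")

s = "standby"
s = step(s, "promote")   # -> promoting
s = step(s, "confirm")   # -> active
print(s)                 # active
```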

Service categories

The v1 service categories are:

Category A: control-plane critical

Examples:

  • Sentral
  • Identitet integration components
  • NATS infrastructure

These need explicit primary/standby or quorum-aware design from the beginning.

Category B: replicated platform services

Examples:

  • container registry metadata and images
  • object storage control state
  • PostgreSQL service control plane

These should support replication-aware design, but may still expose active/passive failover in v1.

Category C: tenant workloads

Examples:

  • VMs
  • Kubernetes clusters
  • apps

These should be placeable in a chosen datacenter, with different projects or workloads allowed to use different primary datacenters. Cross-site failover support is added per product capability rather than assumed universally.

NATS implications

Inter-datacenter NATS is part of the platform baseline.

That means:

  • site identity must be present in message metadata where relevant
  • failover workflows must be expressible through NATS commands and events
  • consumers must tolerate delayed, duplicated, or replayed cross-site messages
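The duplicate-tolerance requirement usually means consumers deduplicate by message ID and apply each message at most once. A minimal sketch, assuming messages carry a unique ID (a real implementation would also bound the dedup window rather than remember IDs forever):

```python
# Sketch: an idempotent consumer that deduplicates by message ID, so
# duplicated or replayed cross-site deliveries are applied only once.
# In practice the `seen` set would need a bounded retention window.

def make_consumer(apply):
    seen = set()
    def handle(msg_id: str, payload) -> bool:
        if msg_id in seen:
            return False  # duplicate or replay: ignore
        seen.add(msg_id)
        apply(payload)
        return True
    return handle

applied = []
handle = make_consumer(applied.append)
handle("m-1", "create")
handle("m-1", "create")   # replayed copy, ignored
handle("m-2", "delete")
print(applied)            # ['create', 'delete']
```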

The architecture does not require one globally uniform behavior for all subjects. A later topology ADR must classify which streams and subjects are:

  • site-local only
  • replicated or mirrored between sites
  • forwarded cross-site only for explicit coordination workflows
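The classification above could be driven by a subject naming convention. The prefix scheme below (`local.`, `mirror.`, `xsite.`) is a hypothetical example only; the actual scheme is a decision for the later topology ADR.

```python
# Sketch of per-subject classification. The prefix convention is a
# hypothetical illustration, not a decision made in this ADR.

def classify(subject: str) -> str:
    if subject.startswith("local."):
        return "site-local"       # never leaves the datacenter
    if subject.startswith("mirror."):
        return "replicated"       # mirrored between sites
    if subject.startswith("xsite."):
        return "forwarded"        # explicit cross-site coordination only
    raise ValueError(f"unclassified subject: {subject}")

print(classify("local.metrics.node"))      # site-local
print(classify("mirror.registry.meta"))    # replicated
print(classify("xsite.failover.promote"))  # forwarded
```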

NATS is the coordination layer, not proof that application data itself is replicated correctly.

Routing implications

Nett must support:

  • routed inter-datacenter links
  • site-aware traffic steering
  • failover-triggered route and policy changes
  • explicit separation between local east-west traffic and inter-site failover traffic

Data implications

This ADR does not claim that all data-bearing services are automatically multi-site safe.

Each stateful product still needs its own durability and replication design. In particular:

  • object storage replication must be defined separately
  • database replication and promotion must be defined separately
  • VM disk replication policy must be defined separately
  • Kubernetes cluster failover semantics must be defined separately

Operational model

Failover should be modeled as an orchestrated workflow with:

  • detection input
  • operator visibility
  • explicit task tracking
  • reversible or well-audited transitions where possible

The platform may automate parts of failover, but v1 should avoid hidden autonomous behavior that is hard to reason about during outages.
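The workflow shape described above can be sketched as a task record with an explicit operator-approval gate and an audit trail. The structure and names are illustrative, not a committed API.

```python
# Sketch: failover as an audited workflow with an explicit operator
# approval gate. Field and method names are illustrative.

from dataclasses import dataclass, field

@dataclass
class FailoverTask:
    service: str
    from_site: str
    to_site: str
    approved: bool = False
    audit: list = field(default_factory=list)

    def log(self, entry: str):
        self.audit.append(entry)          # explicit task tracking

    def approve(self, operator: str):
        self.approved = True
        self.log(f"approved by {operator}")

    def execute(self) -> str:
        if not self.approved:
            raise PermissionError("failover requires operator approval")
        self.log(f"moved {self.service}: {self.from_site} -> {self.to_site}")
        return "done"

task = FailoverTask("billing-api", "dc-a", "dc-b")
task.log("detection: dc-a health check failing")
task.approve("ops@dataverket")
print(task.execute(), task.audit)
```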

Testability requirements

The multi-datacenter posture must be testable in environments that can simulate more than one site and inject inter-site failure.

At minimum, the architecture should support repeated testing of:

  • inter-site partition
  • site-local service loss
  • delayed or replayed cross-site messages
  • failover triggering and operator approval paths
  • recovery and rejoin after the failed site returns

Single-site integration tests and small hardware labs are not sufficient evidence for multi-datacenter failover claims on their own.
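A failure-injection harness for these cases can start very small: a simulated inter-site link that buffers traffic during a partition and delivers it late on rejoin. This is a test-scaffolding sketch, not a platform component.

```python
# Sketch: a simulated inter-site link for failure-injection tests.
# Messages sent during a partition are delayed, not lost, and arrive
# late when the link heals -- matching the rejoin test case above.

class SimulatedLink:
    def __init__(self):
        self.partitioned = False
        self.buffer = []       # messages held back during a partition
        self.delivered = []

    def send(self, msg):
        if self.partitioned:
            self.buffer.append(msg)
        else:
            self.delivered.append(msg)

    def heal(self):
        """Rejoin: flush buffered messages, which arrive late."""
        self.partitioned = False
        self.delivered.extend(self.buffer)
        self.buffer.clear()

link = SimulatedLink()
link.send("a")
link.partitioned = True
link.send("b")          # held back by the partition
link.heal()
print(link.delivered)   # ['a', 'b'] -- 'b' arrives after the rejoin
```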

Consequences

  • multi-datacenter is now a real architectural constraint rather than a future option
  • both datacenters can carry live workload in v1
  • active/passive remains the default failover posture at the service or workload level
  • NATS becomes a core inter-site dependency for platform coordination
  • every major stateful service will need a follow-up replication or promotion design
  • multi-site capabilities now require explicit failover and partition test coverage

Decision Outcome

Proposed. This ADR records the current preferred direction and still needs acceptance before it becomes binding.

More Information

  • inter-datacenter NATS topology
  • database replication and promotion strategy
  • object storage replication strategy
  • VM disk replication and recovery model
  • Kubernetes multi-site placement and failover model

Audit

  • 2026-03-14: ADR proposed.