Inter-datacenter topology and failover (DATAVERKET-012)

Tags: infrastructure, topology, failover, multi-datacenter, NATS, recovery

Defines Dataverket's multi-datacenter failure-domain model and the initial site-active, service-level active/passive failover posture.

Author
Lars Solem
Updated

Status

Proposed on 2026-03-14 by Lars Solem.

Context

Dataverket intends to run across two or more datacenters and use NATS as the standard communication path between them.

That requires a concrete v1 stance on:

  • how datacenters are modeled
  • what kind of failover the platform supports
  • how much cross-site consistency services may assume

Without this, the platform risks mixing single-site assumptions with multi-site marketing language.

Decision

Dataverket treats each datacenter as an explicit failure domain and adopts a site-active, service-level active/passive failover model for v1.

NATS is the standard internal control-plane transport within each datacenter and the standard inter-datacenter control-plane transport between sites.

NATS is an internal platform transport. It is not the public product integration surface exposed to tenants or external automation clients.

The v1 cross-site model is:

  • two or more datacenters
  • both datacenters may be active at the same time
  • independent site-local operation where possible
  • explicit inter-site links for control-plane and replication traffic
  • service-specific failover rather than an implicit, globally transparent failover guarantee

Why site-active with service-level active/passive

The goal is to avoid wasting an entire datacenter as a permanently idle standby while still keeping failover behavior understandable.

This model allows:

  • datacenter A to be active for some projects or services
  • datacenter B to be active for other projects or services
  • each individual service deployment to still use a clear active/passive failover posture where needed

The platform needs a realistic first step. Active/active multi-site systems are significantly harder because they require:

  • stronger consistency design
  • more complex split-brain handling
  • more advanced traffic steering
  • deeper service-by-service reconciliation logic

A site-active, service-level active/passive model allows Dataverket to use both datacenters productively while keeping failover semantics explicit and operable.
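The model above can be sketched as a placement table: both datacenters carry live services at the same time, but each individual service keeps a single primary site and at most one passive standby. This is a minimal illustration; the site and service names are invented for the example, not part of the ADR.

```python
# Sketch of the site-active, service-level active/passive model.
# Site and service names are illustrative only.

from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Placement:
    service: str
    primary_site: str               # site that serves traffic for this service
    standby_site: Optional[str]     # optional passive standby; None = single-site

PLACEMENTS = [
    Placement("billing-api", primary_site="dc-a", standby_site="dc-b"),
    Placement("report-jobs", primary_site="dc-b", standby_site="dc-a"),
    Placement("dev-sandbox", primary_site="dc-b", standby_site=None),
]

def active_services(site: str) -> list:
    """Services that are currently active (primary) in the given datacenter."""
    return [p.service for p in PLACEMENTS if p.primary_site == site]

# Both datacenters carry live workload simultaneously,
# while every individual service stays active/passive.
print(active_services("dc-a"))  # ['billing-api']
print(active_services("dc-b"))  # ['report-jobs', 'dev-sandbox']
```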

Control-plane communication

NATS is the standard communication path between datacenters for:

  • control-plane coordination
  • task and lifecycle signaling
  • failover orchestration
  • site-aware service events

The platform should prefer site-local service behavior where possible, using cross-site NATS for explicit coordination, replication signaling, and orchestrated failover transitions. Cross-site NATS must not be used to pretend the whole platform is one low-latency site.

Services must assume that inter-site links can be:

  • slow
  • partitioned
  • temporarily unavailable

No service may assume globally consistent, zero-latency cross-site messaging.
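One practical consequence of these link assumptions: every cross-site call needs a bounded wait and an explicit give-up path. The sketch below shows one way to do that with a bounded retry and exponential backoff; the transport callable and exception type are stand-ins, since the ADR prescribes no concrete client API.

```python
# Sketch: cross-site calls must bound their wait and fail explicitly
# instead of blocking forever on a partitioned link. The transport
# (`send`) and the exception type are illustrative placeholders.

import time

class LinkPartitioned(Exception):
    """Raised by the (hypothetical) transport when the inter-site link is down."""

def call_cross_site(send, retries=3, base_delay=0.01):
    """Attempt a cross-site request a bounded number of times with backoff."""
    for attempt in range(retries):
        try:
            return send()
        except LinkPartitioned:
            if attempt == retries - 1:
                raise  # give up explicitly; the caller decides what to do
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff

# Simulated link that heals on the third attempt.
attempts = {"n": 0}
def flaky_send():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise LinkPartitioned()
    return "ack"

print(call_cross_site(flaky_send))  # "ack" after two failed attempts
```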

Site model

Each datacenter must have:

  • its own local network fabric
  • its own local inventory representation
  • its own local compute and storage capacity
  • explicit site identity in API, inventory, and event metadata

Shared resources that span sites must model that fact explicitly rather than pretending they are single-site.
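Explicit site identity in metadata can be as simple as a mandatory field in every event envelope, so consumers can tell local traffic from cross-site traffic. The field names and site names below are illustrative, not a schema decision.

```python
# Sketch: events carry explicit site identity so consumers can
# distinguish local from cross-site traffic. Field names are illustrative.

import json

def make_event(site: str, subject: str, payload: dict) -> str:
    envelope = {
        "site": site,        # explicit failure-domain identity
        "subject": subject,
        "payload": payload,
    }
    return json.dumps(envelope)

def is_local(event_json: str, my_site: str) -> bool:
    """True when the event originated in the consumer's own datacenter."""
    return json.loads(event_json)["site"] == my_site

evt = make_event("dc-a", "inventory.host.added", {"host": "h-0042"})
print(is_local(evt, "dc-a"))  # True
print(is_local(evt, "dc-b"))  # False
```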

Failover model

The v1 failover model is service-specific:

  • both datacenters may host active workloads at the same time
  • for a given project, environment, or service instance, one datacenter is primary and another may be standby
  • control-plane services may have a designated primary site and one or more standby sites
  • tenant workloads may be recreated or reactivated in another datacenter when the product supports it
  • routing and traffic movement are orchestrated transitions, not implicit assumptions

Failover is therefore not defined as “everything keeps running automatically everywhere”. It is defined as “the platform can move or restore supported services to another datacenter in a controlled way”.
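"Controlled" can be made concrete as a small state machine: a standby instance only becomes active through explicitly allowed transitions, and anything else is rejected. The states and action names below are a sketch, not a specified protocol.

```python
# Sketch of failover as an explicit, orchestrated transition rather than
# an implicit guarantee. States and actions are illustrative.

ALLOWED = {
    ("standby",   "promote"): "promoting",
    ("promoting", "confirm"): "active",
    ("active",    "demote"):  "standby",
}

def step(state: str, action: str) -> str:
    """Apply one failover action; reject anything not explicitly allowed."""
    try:
        return ALLOWED[(state, action)]
    except KeyError:
        raise ValueError(f"illegal transition: {action} from {state}")

s = "standby"
s = step(s, "promote")   # -> promoting
s = step(s, "confirm")   # -> active
print(s)                 # active
```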

Service categories

The v1 service categories are:

Category A: control-plane critical

Examples:

  • Sentral
  • Identitet integration components
  • NATS infrastructure

These need explicit primary/standby or quorum-aware design from the beginning.

Category B: replicated platform services

Examples:

  • container registry metadata and images
  • object storage control state
  • PostgreSQL service control plane

These should support replication-aware design, but may still expose active/passive failover in v1.

Category C: tenant workloads

Examples:

  • VMs
  • Kubernetes clusters
  • apps

These should be placeable in a chosen datacenter, with different projects or workloads allowed to use different primary datacenters. Cross-site failover support is added per product capability rather than assumed universally.

NATS implications

Inter-datacenter NATS is part of the platform baseline.

That means:

  • site identity must be present in message metadata where relevant
  • failover workflows must be expressible through NATS commands and events
  • consumers must tolerate delayed, duplicated, or replayed cross-site messages
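The duplicate-tolerance requirement usually means consumers deduplicate by message ID and apply each message at most once. A minimal sketch, assuming messages carry a unique ID (a real implementation would also bound the dedup window rather than remember IDs forever):

```python
# Sketch: an idempotent consumer that deduplicates by message ID, so
# duplicated or replayed cross-site deliveries are applied only once.
# In practice the `seen` set would need a bounded retention window.

def make_consumer(apply):
    seen = set()
    def handle(msg_id: str, payload) -> bool:
        if msg_id in seen:
            return False  # duplicate or replay: ignore
        seen.add(msg_id)
        apply(payload)
        return True
    return handle

applied = []
handle = make_consumer(applied.append)
handle("m-1", "create")
handle("m-1", "create")   # replayed copy, ignored
handle("m-2", "delete")
print(applied)            # ['create', 'delete']
```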

The architecture does not require one globally uniform behavior for all subjects. A later topology ADR must classify which streams and subjects are:

  • site-local only
  • replicated or mirrored between sites
  • forwarded cross-site only for explicit coordination workflows
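The classification above could be driven by a subject naming convention. The prefix scheme below (`local.`, `mirror.`, `xsite.`) is a hypothetical example only; the actual scheme is a decision for the later topology ADR.

```python
# Sketch of per-subject classification. The prefix convention is a
# hypothetical illustration, not a decision made in this ADR.

def classify(subject: str) -> str:
    if subject.startswith("local."):
        return "site-local"       # never leaves the datacenter
    if subject.startswith("mirror."):
        return "replicated"       # mirrored between sites
    if subject.startswith("xsite."):
        return "forwarded"        # explicit cross-site coordination only
    raise ValueError(f"unclassified subject: {subject}")

print(classify("local.metrics.node"))      # site-local
print(classify("mirror.registry.meta"))    # replicated
print(classify("xsite.failover.promote"))  # forwarded
```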

NATS is the coordination layer, not proof that application data itself is replicated correctly.

Routing implications

Nett must support:

  • routed inter-datacenter links
  • site-aware traffic steering
  • failover-triggered route and policy changes
  • explicit separation between local east-west traffic and inter-site failover traffic

Data implications

This ADR does not claim that all data-bearing services are automatically multi-site safe.

Each stateful product still needs its own durability and replication design. In particular:

  • object storage replication must be defined separately
  • database replication and promotion must be defined separately
  • VM disk replication policy must be defined separately
  • Kubernetes cluster failover semantics must be defined separately

Operational model

Failover should be modeled as an orchestrated workflow with:

  • detection input
  • operator visibility
  • explicit task tracking
  • reversible or well-audited transitions where possible

The platform may automate parts of failover, but v1 should avoid hidden autonomous behavior that is hard to reason about during outages.
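The workflow shape described above can be sketched as a task record with an explicit operator-approval gate and an audit trail. The structure and names are illustrative, not a committed API.

```python
# Sketch: failover as an audited workflow with an explicit operator
# approval gate. Field and method names are illustrative.

from dataclasses import dataclass, field

@dataclass
class FailoverTask:
    service: str
    from_site: str
    to_site: str
    approved: bool = False
    audit: list = field(default_factory=list)

    def log(self, entry: str):
        self.audit.append(entry)          # explicit task tracking

    def approve(self, operator: str):
        self.approved = True
        self.log(f"approved by {operator}")

    def execute(self) -> str:
        if not self.approved:
            raise PermissionError("failover requires operator approval")
        self.log(f"moved {self.service}: {self.from_site} -> {self.to_site}")
        return "done"

task = FailoverTask("billing-api", "dc-a", "dc-b")
task.log("detection: dc-a health check failing")
task.approve("ops@dataverket")
print(task.execute(), task.audit)
```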

Testability requirements

The multi-datacenter posture must be testable in environments that can simulate more than one site and inject inter-site failure.

At minimum, the architecture should support repeated testing of:

  • inter-site partition
  • site-local service loss
  • delayed or replayed cross-site messages
  • failover triggering and operator approval paths
  • recovery and rejoin after the failed site returns

Single-site integration tests and small hardware labs are not sufficient evidence for multi-datacenter failover claims on their own.
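A failure-injection harness for these cases can start very small: a simulated inter-site link that buffers traffic during a partition and delivers it late on rejoin. This is a test-scaffolding sketch, not a platform component.

```python
# Sketch: a simulated inter-site link for failure-injection tests.
# Messages sent during a partition are delayed, not lost, and arrive
# late when the link heals -- matching the rejoin test case above.

class SimulatedLink:
    def __init__(self):
        self.partitioned = False
        self.buffer = []       # messages held back during a partition
        self.delivered = []

    def send(self, msg):
        if self.partitioned:
            self.buffer.append(msg)
        else:
            self.delivered.append(msg)

    def heal(self):
        """Rejoin: flush buffered messages, which arrive late."""
        self.partitioned = False
        self.delivered.extend(self.buffer)
        self.buffer.clear()

link = SimulatedLink()
link.send("a")
link.partitioned = True
link.send("b")          # held back by the partition
link.heal()
print(link.delivered)   # ['a', 'b'] -- 'b' arrives after the rejoin
```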

Consequences

  • multi-datacenter is now a real architectural constraint rather than a future option
  • both datacenters can carry live workload in v1
  • active/passive remains the default failover posture at the service or workload level
  • NATS becomes a core inter-site dependency for platform coordination
  • every major stateful service will need a follow-up replication or promotion design
  • multi-site capabilities now require explicit failover and partition test coverage

Decision Outcome

Proposed. This ADR records the current preferred direction and still needs acceptance before it becomes binding.

More Information

  • inter-datacenter NATS topology
  • database replication and promotion strategy
  • object storage replication strategy
  • VM disk replication and recovery model
  • Kubernetes multi-site placement and failover model

Audit

  • 2026-03-14: ADR proposed.