PostgreSQL control-plane HA and backup SENTRAL 018

Proposed Data Database Postgresql Availability Backup Recovery

Defines PostgreSQL high availability, backup, and recovery expectations for Dataverket control-plane state.

Author
Lars Solem
Updated

Status

Proposed on 2026-03-14 by Lars Solem.

Context

Dataverket uses PostgreSQL as the default relational system of record for control-plane state.

That makes PostgreSQL one of the most critical dependencies in the entire platform. If the relational control-plane state is unavailable or lost, the platform may lose visibility into inventory, tasks, desired state, tenancy, or billing context.

Until now, PostgreSQL has been treated as a default implementation assumption rather than as an explicitly designed availability and recovery dependency.

Decision

Dataverket treats control-plane PostgreSQL as a first-class critical service with explicit requirements for:

  • availability
  • backup
  • restore
  • operational visibility
  • failure-domain-aware deployment
  • replication posture
  • bounded-context-specific recovery targets
  • split-brain avoidance during failover and promotion

The platform must not rely on a single implicit PostgreSQL instance without HA and backup design.

Scope

This ADR applies to PostgreSQL used for control-plane and platform service state, including at least:

  • Sentral state
  • task and orchestration state where relational persistence is used
  • inventory and tenancy state
  • other service databases where PostgreSQL is the authoritative store

This ADR does not automatically define the tenant-facing managed PostgreSQL product. That product may reuse similar patterns, but it is a separate service concern.

Data ownership model

The platform should prefer:

  • one database per service
  • or, at minimum, one schema per bounded context

It should avoid:

  • one undifferentiated shared database for the entire platform
  • tight coupling between unrelated services through direct table sharing

This reduces blast radius and makes backup, migration, and recovery more tractable.

Different bounded contexts may justify different availability and recovery postures. The platform must not assume one uniform PostgreSQL policy fits tenancy, inventory, task state, audit history, and every later service database equally well.

Availability requirements

The control-plane PostgreSQL strategy must support:

  • instance or node failure without total control-plane collapse
  • planned maintenance with controlled service impact
  • clear primary/replica or equivalent role semantics
  • operator visibility into failover state

The exact HA technology remains open for now, but “manual restore from scratch after every database failure” is not acceptable.

Replication posture

For multi-datacenter operation, Dataverket must take an explicit stance on cross-site replication rather than treating “HA” as a sufficient answer.

The default v1 posture should be:

  • synchronous replication may be used within a local failure domain where latency and quorum behavior are acceptable
  • asynchronous replication is the safer default across datacenters unless a specific bounded context justifies stronger consistency
  • cross-site failover claims must state the resulting data-loss window explicitly

This is a posture, not a final tooling choice. It exists to prevent accidental promises of zero data loss across sites when the network and quorum tradeoffs have not been accepted deliberately.
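The default posture above could be expressed as a small policy check. The following is a minimal sketch, not a chosen tool: `ReplicationLink`, its field names, and `posture_violations` are hypothetical and exist only to illustrate the three rules.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ReplicationLink:
    """One primary-to-replica link. All field names are illustrative."""
    database_class: str
    crosses_datacenters: bool
    synchronous: bool
    # Written justification for stronger consistency across sites, if any.
    justification: Optional[str] = None
    # Stated worst-case data-loss window (seconds) for cross-site failover.
    stated_data_loss_window_s: Optional[float] = None


def posture_violations(link: ReplicationLink) -> List[str]:
    """Return human-readable violations of the default v1 posture."""
    problems = []
    # Asynchronous is the cross-site default; synchronous needs a stated reason.
    if link.crosses_datacenters and link.synchronous and not link.justification:
        problems.append("synchronous cross-site replication lacks a justification")
    # Every cross-site link must state its data-loss window explicitly.
    if link.crosses_datacenters and link.stated_data_loss_window_s is None:
        problems.append("cross-site link does not state its data-loss window")
    return problems


if __name__ == "__main__":
    link = ReplicationLink("inventory", crosses_datacenters=True, synchronous=True)
    for problem in posture_violations(link):
        print(f"{link.database_class}: {problem}")
```

Local synchronous links pass unchanged, which matches the posture: the check only constrains claims made across datacenters.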

Recovery targets by bounded context

PostgreSQL recovery expectations must be defined per service or bounded context, not only for “the database layer” in the abstract.

At minimum, each control-plane bounded context should declare:

  • target RPO class
  • target RTO class
  • whether cross-site replication is required, optional, or not assumed
  • whether failover is automatic, operator-approved, or manual

The exact numbers may be chosen later, but the classification itself is required early because it affects architecture.

Representative examples:

  • Inventory and tenancy: usually stricter RPO and promotion discipline, because incorrect or missing control state can block most of the platform.

  • Task and orchestration: may tolerate different loss or rebuild characteristics than inventory, but only if replay and reconciliation semantics make that safe.

  • Audit and event history: may tolerate slower recovery than the immediate control path, provided integrity and retention expectations remain explicit.
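The per-context declaration required in this section could be captured as a small typed record. This is a sketch under assumptions: the class names, the enum values, and the RPO/RTO class labels such as "strict" are hypothetical, since the ADR has not chosen numeric targets or a classification scheme.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List


class CrossSiteReplication(Enum):
    REQUIRED = "required"
    OPTIONAL = "optional"
    NOT_ASSUMED = "not_assumed"


class FailoverMode(Enum):
    AUTOMATIC = "automatic"
    OPERATOR_APPROVED = "operator_approved"
    MANUAL = "manual"


@dataclass(frozen=True)
class RecoveryTarget:
    """Recovery declaration for one bounded context."""
    bounded_context: str
    rpo_class: str  # hypothetical label, e.g. "strict"; exact numbers come later
    rto_class: str  # hypothetical label, e.g. "fast"
    cross_site: CrossSiteReplication
    failover: FailoverMode


def missing_declarations(
    contexts: Iterable[str], targets: Iterable[RecoveryTarget]
) -> List[str]:
    """Return bounded contexts that have not declared recovery targets."""
    declared = {t.bounded_context for t in targets}
    return sorted(set(contexts) - declared)


if __name__ == "__main__":
    targets = [
        RecoveryTarget(
            "inventory", "strict", "fast",
            CrossSiteReplication.REQUIRED, FailoverMode.OPERATOR_APPROVED,
        ),
    ]
    print(missing_declarations(["inventory", "tasks", "audit"], targets))
```

Keeping the classification separate from the numbers lets each bounded context declare its posture now and have the numeric targets filled in by a later ADR.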

Split-brain and promotion safety

PostgreSQL design must explicitly prevent unsafe dual-primary or ambiguous promotion behavior during partition or site loss.

That means the architecture must define:

  • who or what is allowed to promote a replica
  • what quorum, fencing, or operator-approval rules apply
  • what happens when sites cannot communicate reliably
  • how the previously active primary is prevented from resuming writes incorrectly after partition recovery

The platform should prefer temporary unavailability over silent split-brain corruption for control-plane relational state.
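The promotion rules above can be illustrated as a gate that blocks by default. This is a sketch only: the ADR has not selected a quorum or fencing technology, so `PromotionRequest`, its fields, and the strict-majority rule are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PromotionRequest:
    """Inputs to a promotion decision. All field names are illustrative."""
    votes_for: int            # voting members that agree the primary is gone
    cluster_size: int         # total number of voting members
    old_primary_fenced: bool  # old primary confirmed unable to accept writes
    requires_operator: bool   # policy of the bounded context
    operator_approved: bool


def decide_promotion(req: PromotionRequest) -> Tuple[bool, str]:
    """Return (allowed, reason). Any doubt results in a blocked promotion."""
    # Strict majority: a partitioned minority can never promote on its own.
    if req.votes_for * 2 <= req.cluster_size:
        return False, "no quorum: staying unavailable rather than risking dual primary"
    # Without fencing, the old primary could resume writes after the partition heals.
    if not req.old_primary_fenced:
        return False, "old primary is not confirmed fenced"
    if req.requires_operator and not req.operator_approved:
        return False, "operator approval is required and missing"
    return True, "quorum reached, old primary fenced, approvals satisfied"


if __name__ == "__main__":
    req = PromotionRequest(
        votes_for=1, cluster_size=3, old_primary_fenced=False,
        requires_operator=True, operator_approved=False,
    )
    print(decide_promotion(req))
```

The ordering encodes the preference stated above: when quorum or fencing cannot be confirmed, the gate returns a blocked state with a reason that operators can see, instead of promoting optimistically.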

Backup requirements

The baseline must support:

  • regular backups
  • point-in-time recovery capability where justified
  • tested restore workflows
  • backup visibility and failure alerting
  • retention policies appropriate to control-plane recovery needs

A backup that has not been restored successfully in testing should not be treated as trustworthy.
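The trust rule above could be made mechanical with a check like the following. It is a minimal sketch: `BackupRecord` and `is_trustworthy` are hypothetical names, and the 30-day maximum age for a restore test is an assumed value, not a requirement from this ADR.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class BackupRecord:
    """Metadata for one backup. Field names are illustrative."""
    backup_id: str
    completed_at: datetime
    # When this backup was last restored successfully in a test, if ever.
    last_successful_restore_test: Optional[datetime] = None


def is_trustworthy(
    backup: BackupRecord,
    now: datetime,
    max_test_age: timedelta = timedelta(days=30),  # assumed policy value
) -> bool:
    """A backup counts as trusted only after a recent, successful restore test."""
    tested_at = backup.last_successful_restore_test
    if tested_at is None:
        return False
    # A test recorded before the backup finished cannot have tested this backup.
    if tested_at < backup.completed_at:
        return False
    return now - tested_at <= max_test_age


if __name__ == "__main__":
    now = datetime(2026, 3, 14, tzinfo=timezone.utc)
    untested = BackupRecord("b1", now - timedelta(days=1))
    print(untested.backup_id, is_trustworthy(untested, now))
```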

Multi-datacenter implications

Because Dataverket is multi-datacenter, PostgreSQL design must state clearly:

  • whether replicas exist across sites
  • whether replication is synchronous or asynchronous for each relevant database class
  • what the failover posture is for each database class
  • what data-loss window is acceptable during a cross-site failure
  • what the promotion workflow is during site loss

Database classes with materially different consistency and recovery needs should not be forced into the same cross-site policy without justification.

The platform must not imply cross-datacenter failover for PostgreSQL-backed services unless the relational state has a matching recovery story.

Restore and recovery requirements

Operators must be able to answer:

  • what backup exists
  • how recent it is
  • how to restore it
  • how long restore is expected to take
  • what data loss is expected under each failure scenario
  • which bounded contexts have stricter or weaker recovery targets
  • whether a cross-site promotion is currently safe, blocked, or operator-gated

That means restore is a designed workflow, not an improvisation.
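Two of the questions above, expected restore duration and expected data loss, reduce to simple arithmetic once the inputs are measured. The sketch below assumes a restore made of a base-backup copy followed by WAL replay, with throughput figures taken from restore tests; the function names and the linear model are illustrative assumptions, not a measured result.

```python
from datetime import datetime, timedelta, timezone


def estimate_restore_seconds(
    base_backup_bytes: int,
    copy_bytes_per_s: int,
    wal_bytes_to_replay: int,
    replay_bytes_per_s: int,
) -> float:
    """Rough restore time: copy the base backup, then replay WAL on top of it."""
    return base_backup_bytes / copy_bytes_per_s + wal_bytes_to_replay / replay_bytes_per_s


def data_loss_window(failure_at: datetime, last_recoverable_at: datetime) -> timedelta:
    """Data written after the last recoverable point is lost in this scenario."""
    return max(failure_at - last_recoverable_at, timedelta(0))


if __name__ == "__main__":
    seconds = estimate_restore_seconds(
        base_backup_bytes=100_000_000_000,  # 100 GB base backup (example input)
        copy_bytes_per_s=200_000_000,       # 200 MB/s, as measured in a restore test
        wal_bytes_to_replay=10_000_000_000,
        replay_bytes_per_s=100_000_000,
    )
    failure = datetime(2026, 3, 14, 12, 0, tzinfo=timezone.utc)
    print(seconds, data_loss_window(failure, failure - timedelta(minutes=5)))
```

The numbers passed in are example inputs only. The estimate is only as good as the throughput measured during tested restores.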

Observability requirements

The PostgreSQL control-plane baseline must provide visibility into:

  • primary and replica health
  • replication lag where applicable
  • backup success and failure
  • storage health and capacity risk
  • failover state and recent role changes
  • promotion eligibility and fencing state where applicable
  • split-brain risk signals or blocked promotion state

PostgreSQL cannot be a black box if so much of the control plane depends on it.
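As one example of the replication-lag signal listed above, byte lag can be derived from two write-ahead-log positions. PostgreSQL prints an LSN as two hexadecimal numbers separated by a slash (the high and low 32 bits of a 64-bit position). The sketch below assumes the two LSN strings were already collected, for instance from `pg_current_wal_lsn()` on the primary and the `replay_lsn` column of `pg_stat_replication`; the thresholds and function names are hypothetical.

```python
def parse_lsn(lsn: str) -> int:
    """Convert a textual LSN such as '16/B374D848' into a 64-bit byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def replay_lag_bytes(primary_lsn: str, replica_replay_lsn: str) -> int:
    """Bytes of WAL the replica has not yet replayed (never negative)."""
    return max(parse_lsn(primary_lsn) - parse_lsn(replica_replay_lsn), 0)


def lag_status(
    lag_bytes: int,
    warn_bytes: int = 16 * 1024 * 1024,    # assumed threshold: one 16 MiB WAL segment
    crit_bytes: int = 1024 * 1024 * 1024,  # assumed threshold: 1 GiB
) -> str:
    """Classify lag for alerting. Thresholds are placeholders, not policy."""
    if lag_bytes >= crit_bytes:
        return "critical"
    if lag_bytes >= warn_bytes:
        return "warning"
    return "ok"


if __name__ == "__main__":
    lag = replay_lag_bytes("16/B374D848", "16/B3000000")
    print(lag, lag_status(lag))
```

Byte lag is only one signal. A time-based view of lag and the data-loss window it implies would need to be tracked per bounded context alongside it.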

Relationship to storage

PostgreSQL HA and recovery are inseparable from storage behavior.

That means:

  • storage durability assumptions must be explicit
  • backup and restore tooling must align with the storage strategy
  • cross-site database claims must align with the persistence strategy in the storage ADR

Explicit non-decisions for now

This ADR intentionally does not yet choose:

  • a specific PostgreSQL HA implementation
  • a specific backup tool
  • the exact numeric RPO and RTO targets
  • the exact quorum or fencing technology

Those require later implementation or selection ADRs.

Consequences

  • PostgreSQL is now treated as a deliberate availability dependency, not a quiet default
  • service teams must think about data ownership and restore boundaries earlier
  • platform failover claims must be honest about relational state
  • cross-site replication tradeoffs must now be explicit per bounded context rather than hidden behind generic “HA” language
  • split-brain avoidance becomes a first-class safety requirement

Decision Outcome

Proposed. This ADR records the current preferred direction and still needs acceptance before it becomes binding.

More Information

  • PostgreSQL HA implementation selection
  • backup and restore tooling selection
  • cross-datacenter replication and promotion policy
  • RPO and RTO targets for control-plane data

Audit

  • 2026-03-14: ADR proposed.