PostgreSQL control-plane HA and backup SENTRAL 018

Status: Proposed
Tags: Data, Database, Postgresql, Availability, Backup, Recovery

Defines PostgreSQL high availability, backup, and recovery expectations for Dataverket control-plane state.

Author: Lars Solem
Updated: 2026-03-14

Context

Dataverket uses PostgreSQL as the default relational system of record for control-plane state.

That makes PostgreSQL one of the most critical dependencies in the entire platform. If the relational control-plane state is unavailable or lost, the platform may lose visibility into inventory, tasks, desired state, tenancy, or billing context.

Until now, PostgreSQL has been treated as a default implementation assumption rather than as an explicitly designed availability and recovery dependency.

Decision

Dataverket treats control-plane PostgreSQL as a first-class critical service with explicit requirements for:

availability
backup
restore
operational visibility
failure-domain-aware deployment
replication posture
bounded-context-specific recovery targets
split-brain avoidance during failover and promotion

The platform must not rely on a single implicit PostgreSQL instance without HA and backup design.

Scope

This ADR applies to PostgreSQL used for control-plane and platform service state, including at least:

Sentral state
task and orchestration state where relational persistence is used
inventory and tenancy state
other service databases where PostgreSQL is the authoritative store

This ADR does not automatically define the tenant-facing managed PostgreSQL product. That product may reuse similar patterns, but it is a separate service concern.

Data ownership model

The platform should prefer:

one database per service
or at minimum one schema per bounded context

It should avoid:

one undifferentiated shared database for the entire platform
tight coupling between unrelated services through direct table sharing

This reduces blast radius and makes backup, migration, and recovery more tractable.

Different bounded contexts may justify different availability and recovery postures. The platform must not assume one uniform PostgreSQL policy fits tenancy, inventory, task state, audit history, and every later service database equally well.

Availability requirements

The control-plane PostgreSQL strategy must support:

instance or node failure without total control-plane collapse
planned maintenance with controlled service impact
clear primary/replica or equivalent role semantics
operator visibility into failover state

The exact HA technology remains open for now, but “manual restore from scratch after every database failure” is not acceptable.

Replication posture

For multi-datacenter operation, Dataverket must take an explicit stance on cross-site replication rather than treating “HA” as a sufficient answer.

The default v1 posture should be:

synchronous replication may be used within a local failure domain where latency and quorum behavior are acceptable
asynchronous replication is the safer default across datacenters unless a specific bounded context justifies stronger consistency
cross-site failover claims must state the resulting data-loss window explicitly

This is a posture, not a final tooling choice. It exists to prevent accidental promises of zero data loss across sites when the network and quorum tradeoffs have not been accepted deliberately.
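
The posture above can be expressed as a check that runs against a declared replication topology. The following is a minimal sketch, not a tooling choice: `ReplicationLink`, its fields, and `posture_violations` are hypothetical names introduced here for illustration, and the rule that cross-site synchronous replication needs a recorded justification is one reading of "unless a specific bounded context justifies stronger consistency".

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class ReplicationMode(Enum):
    SYNCHRONOUS = "synchronous"
    ASYNCHRONOUS = "asynchronous"


@dataclass(frozen=True)
class ReplicationLink:
    """One replication relationship for a database (hypothetical model)."""
    database: str
    mode: ReplicationMode
    cross_site: bool
    # Stated worst-case data-loss window, in seconds, for failover over this link.
    stated_data_loss_window_s: Optional[float] = None
    # Recorded reason for choosing synchronous replication across sites.
    sync_justification: Optional[str] = None


def posture_violations(link: ReplicationLink) -> List[str]:
    """Return the ways a link departs from the default v1 posture."""
    problems = []
    if link.cross_site and link.stated_data_loss_window_s is None:
        problems.append("cross-site link does not state its data-loss window")
    if (link.cross_site
            and link.mode is ReplicationMode.SYNCHRONOUS
            and not link.sync_justification):
        problems.append("cross-site synchronous replication lacks a justification")
    return problems
```

A local synchronous link produces no violations in this sketch, since the posture only constrains what is claimed across sites.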

Recovery targets by bounded context

PostgreSQL recovery expectations must be defined per service or bounded context, not only for “the database layer” in the abstract.

At minimum, each control-plane bounded context should declare:

target RPO class
target RTO class
whether cross-site replication is required, optional, or not assumed
whether failover is automatic, operator-approved, or manual

The exact numbers may be chosen later, but the classification itself is required early because it affects architecture.

Representative examples:

Inventory and tenancy: usually stricter RPO and promotion discipline, because incorrect or missing control state can block most of the platform.
Task and orchestration: may tolerate different loss or rebuild characteristics than inventory, but only if replay and reconciliation semantics make that safe.
Audit and event history: may tolerate slower recovery than the immediate control path, provided integrity and retention expectations remain explicit.
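
The four required declarations could be captured in a small typed record per bounded context. This is a sketch under assumptions: the class labels `STRICT`, `STANDARD`, and `RELAXED` are placeholders (this ADR does not define the class names or their numeric targets), and `RecoveryDeclaration` and `undeclared_contexts` are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List


class RecoveryClass(Enum):
    # Placeholder labels; the ADR leaves class names and numbers open.
    STRICT = "strict"
    STANDARD = "standard"
    RELAXED = "relaxed"


class CrossSiteReplication(Enum):
    REQUIRED = "required"
    OPTIONAL = "optional"
    NOT_ASSUMED = "not_assumed"


class FailoverMode(Enum):
    AUTOMATIC = "automatic"
    OPERATOR_APPROVED = "operator_approved"
    MANUAL = "manual"


@dataclass(frozen=True)
class RecoveryDeclaration:
    """The minimum a bounded context declares about its recovery posture."""
    bounded_context: str
    rpo_class: RecoveryClass
    rto_class: RecoveryClass
    cross_site_replication: CrossSiteReplication
    failover: FailoverMode


def undeclared_contexts(known_contexts: Iterable[str],
                        declarations: Iterable[RecoveryDeclaration]) -> List[str]:
    """List bounded contexts that have no recovery declaration yet."""
    declared = {d.bounded_context for d in declarations}
    return sorted(set(known_contexts) - declared)
```

A check like `undeclared_contexts` makes the "classification is required early" rule enforceable in review or CI, even before numeric targets exist.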

Split-brain and promotion safety

PostgreSQL design must explicitly prevent unsafe dual-primary or ambiguous promotion behavior during partition or site loss.

That means the architecture must define:

who or what is allowed to promote a replica
what quorum, fencing, or operator-approval rules apply
what happens when sites cannot communicate reliably
how the previously active primary is prevented from resuming writes incorrectly after partition recovery

The platform should prefer temporary unavailability over silent split-brain corruption for control-plane relational state.
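
The preference for unavailability over split-brain can be illustrated as a promotion gate that blocks by default. This is a simplified sketch, not the chosen quorum or fencing technology: `PromotionRequest`, its fields, and `decide_promotion` are hypothetical, and a strict-majority vote stands in for whatever quorum rule is eventually selected.

```python
from dataclasses import dataclass
from enum import Enum


class PromotionDecision(Enum):
    ALLOW = "allow"
    BLOCK = "block"
    NEEDS_OPERATOR = "needs_operator"


@dataclass(frozen=True)
class PromotionRequest:
    votes_for: int                    # voters that agree the primary is gone
    total_voters: int                 # size of the voting membership
    old_primary_fenced: bool          # old primary confirmed unable to accept writes
    operator_approval_required: bool  # policy for this bounded context
    operator_approved: bool = False


def decide_promotion(req: PromotionRequest) -> PromotionDecision:
    """Allow promotion only with quorum AND fencing; otherwise stay unavailable."""
    has_quorum = req.votes_for * 2 > req.total_voters  # strict majority
    if not has_quorum or not req.old_primary_fenced:
        # Without both, a second writable primary cannot be ruled out.
        return PromotionDecision.BLOCK
    if req.operator_approval_required and not req.operator_approved:
        return PromotionDecision.NEEDS_OPERATOR
    return PromotionDecision.ALLOW
```

With two sites and one voter each, a partition leaves each side with one of two votes, which is not a strict majority, so neither side promotes in this sketch.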

Backup requirements

The baseline must support:

regular backups
point-in-time recovery capability where justified
tested restore workflows
backup visibility and failure alerting
retention policies appropriate to control-plane recovery needs

A backup that has not been restored successfully in testing should not be treated as trustworthy.
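
One way to make that rule operational is to let only restore-tested backups count as recovery points. The sketch below assumes a hypothetical `BackupRecord` catalog entry with a `restore_test_passed` flag; no specific backup tool is implied.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Optional


@dataclass(frozen=True)
class BackupRecord:
    """Hypothetical catalog entry for one completed backup."""
    backup_id: str
    completed_at: datetime
    restore_test_passed: bool = False


def newest_trusted_backup(backups: Iterable[BackupRecord]) -> Optional[BackupRecord]:
    """Return the most recent backup that passed a restore test, if any."""
    tested = [b for b in backups if b.restore_test_passed]
    return max(tested, key=lambda b: b.completed_at, default=None)


def backup_age(backup: BackupRecord, now: datetime) -> timedelta:
    """Age of a backup; callers should use one consistent clock and timezone."""
    return now - backup.completed_at
```

Under this model, a newer backup that has never been restored does not shorten the effective recovery point; the age of the newest trusted backup is the number to alert on.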

Multi-datacenter implications

Because Dataverket is multi-datacenter, PostgreSQL design must state clearly:

whether replicas exist across sites
whether replication is synchronous or asynchronous for each relevant database class
what the failover posture is for each database class
what data loss window is acceptable in cross-site failure
what the promotion workflow is during site loss

Database classes with materially different consistency and recovery needs should not be forced into the same cross-site policy without justification.

The platform must not imply cross-datacenter failover for PostgreSQL-backed services unless the relational state has a matching recovery story.

Restore and recovery requirements

Operators must be able to answer:

what backup exists
how recent it is
how to restore it
how long restore is expected to take
what data loss is expected under each failure scenario
which bounded contexts have stricter or weaker recovery targets
whether a cross-site promotion is currently safe, blocked, or operator-gated

Restore must therefore be a designed and rehearsed workflow, not an improvisation.

Observability requirements

The PostgreSQL control-plane baseline must provide visibility into:

primary and replica health
replication lag where applicable
backup success and failure
storage health and capacity risk
failover state and recent role changes
promotion eligibility and fencing state where applicable
split-brain risk signals or blocked promotion state

PostgreSQL cannot be a black box if so much of the control plane depends on it.
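
As an illustration of the replication-lag and replica-health signals, the sketch below evaluates rows shaped like the output of a query against PostgreSQL's `pg_stat_replication` view (available on the primary; the `replay_lag` column exists in PostgreSQL 10 and later). The query text is shown only for context and is not executed here. The function names, the threshold, and the expected-replica list are assumptions for illustration.

```python
from typing import Iterable, List, Optional, Tuple

# Illustrative query, run on the primary. replay_lag can be NULL,
# for example when there has been no recent write activity.
REPLICA_LAG_SQL = """
SELECT application_name,
       sync_state,
       EXTRACT(EPOCH FROM replay_lag) AS replay_lag_s
FROM pg_stat_replication
"""

# (application_name, sync_state, replay_lag_s)
ReplicaRow = Tuple[str, str, Optional[float]]


def lag_alerts(rows: Iterable[ReplicaRow], max_lag_s: float) -> List[str]:
    """Describe every connected replica whose replay lag exceeds the threshold."""
    alerts = []
    for name, sync_state, lag_s in rows:
        if lag_s is not None and lag_s > max_lag_s:
            alerts.append(
                f"{name} ({sync_state}) replay lag {lag_s:.1f}s "
                f"exceeds {max_lag_s:.1f}s"
            )
    return alerts


def missing_replicas(rows: Iterable[ReplicaRow], expected: Iterable[str]) -> List[str]:
    """Expected replicas that are not connected; the view lists connected standbys only."""
    seen = {name for name, _, _ in rows}
    return sorted(set(expected) - seen)
```

Checking for missing replicas matters as much as checking lag: a disconnected standby simply disappears from the view and would otherwise raise no lag alert.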

Relationship to storage

PostgreSQL HA and recovery are inseparable from storage behavior.

That means:

storage durability assumptions must be explicit
backup and restore tooling must align with the storage strategy
cross-site database claims must align with the persistence strategy in the storage ADR

Explicit non-decisions for now

This ADR intentionally does not yet choose:

a specific PostgreSQL HA implementation
a specific backup tool
the exact numeric RPO and RTO targets
the exact quorum or fencing technology

Those require later implementation or selection ADRs.

Consequences

PostgreSQL is now treated as a deliberate availability dependency, not a quiet default
service teams must think about data ownership and restore boundaries earlier
platform failover claims must be honest about relational state
cross-site replication tradeoffs must now be explicit per bounded context rather than hidden behind generic “HA” language
split-brain avoidance becomes a first-class safety requirement

Decision Outcome

Proposed. This ADR records the current preferred direction and still needs acceptance before it becomes binding.

This ADR supports the data strategy described in platform-plan.md.
Storage assumptions must stay aligned with 015-storage-platform-and-persistence-strategy.md.
Multi-datacenter failover language must stay aligned with 012-inter-datacenter-topology-and-failover.md.
Observability expectations must align with 017-observability-and-operations-baseline.md.

More Information

Topics left for follow-up ADRs:

PostgreSQL HA implementation selection
backup and restore tooling selection
cross-datacenter replication and promotion policy
RPO and RTO targets for control-plane data

Audit

2026-03-14: ADR proposed.