Status
Proposed on 2026-03-14 by Lars Solem.
Context
Dataverket uses PostgreSQL as the default relational system of record for control-plane state.
That makes PostgreSQL one of the most critical dependencies in the entire platform. If the relational control-plane state is unavailable or lost, the platform may lose visibility into inventory, tasks, desired state, tenancy, or billing context.
Until now, PostgreSQL has been treated more as a default implementation assumption than as an explicitly designed availability and recovery dependency.
Decision
Dataverket treats control-plane PostgreSQL as a first-class critical service with explicit requirements for:
- availability
- backup
- restore
- operational visibility
- failure-domain-aware deployment
- replication posture
- bounded-context-specific recovery targets
- split-brain avoidance during failover and promotion
The platform must not rely on a single implicit PostgreSQL instance without an explicit HA and backup design.
Scope
This ADR applies to PostgreSQL used for control-plane and platform service state, including at least:
- Sentral state
- task and orchestration state where relational persistence is used
- inventory and tenancy state
- other service databases where PostgreSQL is the authoritative store
This ADR does not automatically define the tenant-facing managed PostgreSQL product. That product may reuse similar patterns, but it is a separate service concern.
Data ownership model
The platform should prefer:
- one database per service
- or, at minimum, one schema per bounded context
It should avoid:
- one undifferentiated shared database for the entire platform
- tight coupling between unrelated services through direct table sharing
This reduces blast radius and makes backup, migration, and recovery more tractable.
Different bounded contexts may justify different availability and recovery postures. The platform must not assume one uniform PostgreSQL policy fits tenancy, inventory, task state, audit history, and every later service database equally well.
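The ownership preference above could be checked mechanically. The following is a minimal sketch, assuming a hypothetical ownership map from service name to a (database, schema) pair; the function name and the map format are illustrative and not part of this ADR.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical type: (database name, schema name) owned by a service.
Store = Tuple[str, str]


def shared_stores(ownership: Dict[str, Store]) -> Dict[Store, List[str]]:
    """Return every (database, schema) pair that more than one service claims."""
    by_store: Dict[Store, List[str]] = defaultdict(list)
    for service, store in ownership.items():
        by_store[store].append(service)
    # Only stores with several owners indicate direct table sharing risk.
    return {store: sorted(svcs) for store, svcs in by_store.items() if len(svcs) > 1}
```

An empty result means no two services claim the same schema. A non-empty result is a prompt for review, not automatically an error.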
Availability requirements
The control-plane PostgreSQL strategy must support:
- instance or node failure without total control-plane collapse
- planned maintenance with controlled service impact
- clear primary/replica or equivalent role semantics
- operator visibility into failover state
The exact HA technology remains open for now, but “manual restore from scratch after every database failure” is not acceptable.
Replication posture
For multi-datacenter operation, Dataverket must take an explicit stance on cross-site replication rather than treating “HA” as a sufficient answer.
The default v1 posture should be:
- synchronous replication may be used within a local failure domain where latency and quorum behavior are acceptable
- asynchronous replication is the safer default across datacenters unless a specific bounded context justifies stronger consistency
- cross-site failover claims must state the resulting data-loss window explicitly
This is a posture, not a final tooling choice. It exists to prevent accidental promises of zero data loss across sites when the network and quorum tradeoffs have not been accepted deliberately.
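One way to keep the posture honest is to validate each declared replication link against it. The sketch below is illustrative only: the `ReplicationLink` model, its field names, and the `posture_violations` function are assumptions made for this example, not an agreed schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ReplicationLink:
    """One replication relationship for a database class (hypothetical model)."""
    database_class: str
    cross_site: bool
    synchronous: bool
    # Why a bounded context needs stronger consistency across sites, if it does.
    justification: Optional[str] = None
    # Whether the service claims it can fail over to another site.
    claims_cross_site_failover: bool = False
    # Stated worst-case data-loss window for that claim, in seconds.
    data_loss_window_seconds: Optional[int] = None


def posture_violations(link: ReplicationLink) -> List[str]:
    """Return human-readable violations of the default v1 posture."""
    problems: List[str] = []
    if link.cross_site and link.synchronous and not link.justification:
        problems.append("cross-site synchronous replication needs an explicit justification")
    if link.claims_cross_site_failover and link.data_loss_window_seconds is None:
        problems.append("cross-site failover claim must state its data-loss window")
    return problems
```

Local synchronous replication passes without extra fields, which matches the default posture. Only cross-site links carry the burden of justification and a stated loss window.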
Recovery targets by bounded context
PostgreSQL recovery expectations must be defined per service or bounded context, not only for “the database layer” in the abstract.
At minimum, each control-plane bounded context should declare:
- target RPO class
- target RTO class
- whether cross-site replication is required, optional, or not assumed
- whether failover is automatic, operator-approved, or manual
The exact numbers may be chosen later, but the classification itself is required early because it affects architecture.
Representative examples:
- Inventory and tenancy: usually stricter RPO and promotion discipline, because incorrect or missing control state can block most of the platform.
- Task and orchestration: may tolerate different loss or rebuild characteristics than inventory, but only if replay and reconciliation semantics make that safe.
- Audit and event history: may tolerate slower recovery than the immediate control path, provided integrity and retention expectations remain explicit.
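The required per-context declaration could be captured as a small typed record. This is a sketch under assumptions: the class labels (such as "strict" or "relaxed"), the type names, and the `missing_declarations` helper are hypothetical, and no numeric RPO or RTO values are implied.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List


class CrossSiteReplication(Enum):
    REQUIRED = "required"
    OPTIONAL = "optional"
    NOT_ASSUMED = "not_assumed"


class FailoverMode(Enum):
    AUTOMATIC = "automatic"
    OPERATOR_APPROVED = "operator_approved"
    MANUAL = "manual"


@dataclass(frozen=True)
class RecoveryTarget:
    """Recovery declaration for one bounded context (hypothetical schema)."""
    bounded_context: str
    rpo_class: str  # a class label only; exact numbers are chosen later
    rto_class: str
    cross_site: CrossSiteReplication
    failover: FailoverMode


def missing_declarations(contexts: Iterable[str],
                         declared: Iterable[RecoveryTarget]) -> List[str]:
    """Return the bounded contexts that have not declared a recovery target."""
    declared_names = {target.bounded_context for target in declared}
    return sorted(set(contexts) - declared_names)
```

A check like `missing_declarations` could run in review or CI so that a new service database cannot appear without a classification.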
Split-brain and promotion safety
PostgreSQL design must explicitly prevent unsafe dual-primary or ambiguous promotion behavior during partition or site loss.
That means the architecture must define:
- who or what is allowed to promote a replica
- what quorum, fencing, or operator-approval rules apply
- what happens when sites cannot communicate reliably
- how the previously active primary is prevented from resuming writes incorrectly after partition recovery
The platform should prefer temporary unavailability over silent split-brain corruption for control-plane relational state.
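The promotion rules can be expressed as a gate that fails closed. The following sketch assumes a simple majority quorum among voters and a boolean fencing signal; `ClusterView`, `decide_promotion`, and the field names are hypothetical, since this ADR does not choose a quorum or fencing technology.

```python
from dataclasses import dataclass
from enum import Enum


class PromotionDecision(Enum):
    SAFE = "safe"
    OPERATOR_GATED = "operator_gated"
    BLOCKED = "blocked"


@dataclass(frozen=True)
class ClusterView:
    """What the promoting party can currently observe (hypothetical model)."""
    voters_total: int
    voters_reachable: int
    old_primary_fenced: bool
    requires_operator_approval: bool
    operator_approved: bool = False


def decide_promotion(view: ClusterView) -> PromotionDecision:
    """Prefer temporary unavailability over a possible dual primary."""
    # A strict majority is required; an even split never has quorum.
    has_quorum = view.voters_reachable * 2 > view.voters_total
    if not has_quorum or not view.old_primary_fenced:
        return PromotionDecision.BLOCKED
    if view.requires_operator_approval and not view.operator_approved:
        return PromotionDecision.OPERATOR_GATED
    return PromotionDecision.SAFE
```

Note that a two-site split with one voter per site always yields a blocked promotion in this sketch, which is the intended bias toward unavailability over silent corruption.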
Backup requirements
The baseline must support:
- regular backups
- point-in-time recovery capability where justified
- tested restore workflows
- backup visibility and failure alerting
- retention policies appropriate to control-plane recovery needs
A backup that has not been restored successfully in testing should not be treated as trustworthy.
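The trust rule above can be made concrete as a predicate over backup metadata. This is a minimal sketch: the `BackupRecord` fields are hypothetical, and the 30-day maximum age for a restore test is an illustrative assumption, not a retention or testing policy set by this ADR.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class BackupRecord:
    """Metadata about one backup (hypothetical fields)."""
    backup_id: str
    taken_at: datetime
    # Time of the most recent successful test restore of this backup, if any.
    last_successful_restore_test: Optional[datetime] = None


def is_trustworthy(backup: BackupRecord, now: datetime,
                   max_test_age: timedelta = timedelta(days=30)) -> bool:
    """A backup counts as trustworthy only if a restore test succeeded recently."""
    tested = backup.last_successful_restore_test
    if tested is None:
        return False  # never restored in testing: do not rely on it
    # The age limit is an assumed policy knob; pick a value per bounded context.
    return now - tested <= max_test_age
```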
Multi-datacenter implications
Because Dataverket is multi-datacenter, PostgreSQL design must state clearly:
- whether replicas exist across sites
- whether replication is synchronous or asynchronous for each relevant database class
- what the failover posture is for each database class
- what data loss window is acceptable in cross-site failure
- what the promotion workflow is during site loss
Database classes with materially different consistency and recovery needs should not be forced into the same cross-site policy without justification.
The platform must not imply cross-datacenter failover for PostgreSQL-backed services unless the relational state has a matching recovery story.
Restore and recovery requirements
Operators must be able to answer:
- what backup exists
- how recent it is
- how to restore it
- how long restore is expected to take
- what data loss is expected under each failure scenario
- which bounded contexts have stricter or weaker recovery targets
- whether a cross-site promotion is currently safe, blocked, or operator-gated
That means restore is a designed workflow, not an improvisation.
Observability requirements
The PostgreSQL control-plane baseline must provide visibility into:
- primary and replica health
- replication lag where applicable
- backup success and failure
- storage health and capacity risk
- failover state and recent role changes
- promotion eligibility and fencing state where applicable
- split-brain risk signals or blocked promotion state
PostgreSQL cannot be a black box if so much of the control plane depends on it.
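As an illustration of the replication-lag signal, the sketch below pairs a query against PostgreSQL's `pg_stat_replication` view (run on the primary, PostgreSQL 10 or later for `replay_lag`) with a pure function that flags unhealthy replicas. The threshold, the row shape, and the `lagging_replicas` function are assumptions for this example; how the query is executed and where alerts go is left open.

```python
from typing import Iterable, List, Optional, Tuple

# Run on the primary. replay_lag is an interval and may be NULL when idle.
REPLICATION_LAG_SQL = """
SELECT application_name, state, sync_state,
       EXTRACT(EPOCH FROM replay_lag) AS replay_lag_seconds
FROM pg_stat_replication;
"""

# (application_name, state, sync_state, replay_lag_seconds)
Row = Tuple[str, str, str, Optional[float]]


def lagging_replicas(rows: Iterable[Row], threshold_seconds: float) -> List[str]:
    """Describe replicas that are not streaming or exceed the lag threshold."""
    flagged: List[str] = []
    for name, state, _sync_state, lag in rows:
        if state != "streaming":
            flagged.append(f"{name}: state={state}")
        elif lag is not None and lag > threshold_seconds:
            flagged.append(
                f"{name}: replay lag {lag:.0f}s exceeds {threshold_seconds:.0f}s"
            )
    return flagged
```

A replica missing from the view entirely is also a signal, so a real check would compare the returned names against the expected replica set.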
Relationship to storage
PostgreSQL HA and recovery are inseparable from storage behavior.
That means:
- storage durability assumptions must be explicit
- backup and restore tooling must align with the storage strategy
- cross-site database claims must align with the persistence strategy in the storage ADR
Explicit non-decisions for now
This ADR intentionally does not yet choose:
- a specific PostgreSQL HA implementation
- a specific backup tool
- the exact numeric RPO and RTO targets
- the exact quorum or fencing technology
Those require later implementation or selection ADRs.
Consequences
- PostgreSQL is now treated as a deliberate availability dependency, not a quiet default
- service teams must think about data ownership and restore boundaries earlier
- platform failover claims must be honest about relational state
- cross-site replication tradeoffs must now be explicit per bounded context rather than hidden behind generic “HA” language
- split-brain avoidance becomes a first-class safety requirement
Decision Outcome
Proposed. This ADR records the current preferred direction and still needs acceptance before it becomes binding.
Related Decisions
- This ADR supports the data strategy described in platform-plan.md.
- Storage assumptions must stay aligned with 015-storage-platform-and-persistence-strategy.md.
- Multi-datacenter failover language must stay aligned with 012-inter-datacenter-topology-and-failover.md.
- Observability expectations must align with 017-observability-and-operations-baseline.md.
More Information
- PostgreSQL HA implementation selection
- backup and restore tooling selection
- cross-datacenter replication and promotion policy
- RPO and RTO targets for control-plane data
Audit
- 2026-03-14: ADR proposed.