Status
Proposed on 2026-03-14 by Lars Solem.
Context
Dataverket is building a control plane with many critical moving parts:
- Sentral and other control-plane services
- NATS and JetStream
- PostgreSQL-backed state
- Talos-based clusters and nodes
- storage-backed products
- public APIs and internal event contracts
Without an explicit upgrade and migration strategy, the platform risks depending on flag-day upgrades, undocumented compatibility assumptions, or manual heroics during change windows.
Decision
Dataverket adopts a staged, compatibility-aware upgrade model.
The platform should prefer:
- rolling or staged upgrades where practical
- explicit compatibility windows
- reversible migrations where possible
- operator-visible upgrade state
- no requirement for whole-platform stop-the-world upgrades
Upgrade domains
The platform must treat these as separate upgrade domains:
- control-plane services
- public API contracts
- internal event schemas and consumers
- PostgreSQL schema and data migrations
- NATS and JetStream infrastructure
- Talos nodes and Kubernetes clusters
- storage and network control integrations
Different domains may need different safety rules, but all must fit one coherent upgrade posture.
Service upgrade model
For normal control-plane services, the preferred posture is:
- deploy new versions alongside or incrementally over old ones where possible
- preserve compatibility during rollout windows
- avoid changes that depend on a simultaneous hard cutover across all services
If a service requires a hard cutover, that should be treated as an exception requiring explicit review.
API evolution model
Public API evolution should follow:
- additive change by default
- explicit versioning for breaking changes
- compatibility windows for clients and SDKs
Breaking public behavior must not be hidden inside “minor” platform upgrades.
Event and message evolution model
Internal event and command schemas should evolve with:
- additive fields where possible
- consumer tolerance for unknown fields
- explicit schema versioning when semantics change
- upgrade sequencing that avoids immediately breaking lagging consumers
The platform must not assume that all consumers are upgraded at exactly the same time.
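As an illustration of the consumer-side rules above, here is a minimal sketch of a tolerant decoder. The event shape, the field names (`schema_version`, `node_id`, `state`, `region`), and the set of supported versions are hypothetical and are not taken from the event envelope ADR.

```python
import json
from dataclasses import dataclass

# Hypothetical: versions this consumer understands. Not defined by this ADR.
SUPPORTED_SCHEMA_VERSIONS = {1, 2}


class UnsupportedSchemaVersion(Exception):
    """Raised when an event declares semantics this consumer does not know."""


@dataclass(frozen=True)
class NodeEvent:
    schema_version: int
    node_id: str
    state: str
    region: str  # assumed to be added in version 2; defaulted for older producers


def decode_node_event(raw: bytes) -> NodeEvent:
    payload = json.loads(raw)
    # Producers that predate explicit versioning are treated as version 1.
    version = payload.get("schema_version", 1)
    if version not in SUPPORTED_SCHEMA_VERSIONS:
        # Semantics may have changed: fail loudly instead of guessing.
        raise UnsupportedSchemaVersion(f"schema_version={version}")
    # Only known fields are read; unknown fields in the payload are ignored.
    return NodeEvent(
        schema_version=version,
        node_id=payload["node_id"],
        state=payload["state"],
        region=payload.get("region", "unknown"),
    )
```

A consumer written this way keeps working when a newer producer adds fields, and it rejects events whose declared version it does not understand rather than misinterpreting them.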
Database migration model
Database changes must prefer:
- forward-compatible migrations where practical
- controlled sequencing between schema changes and application rollout
- explicit rollback or recovery expectations
- tested migration paths in non-production environments
An application change that only works after an immediate irreversible schema cutover should be treated as a risk.
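One common way to meet these preferences is an additive "expand, backfill, contract later" sequence. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in for PostgreSQL so that it runs anywhere; the `clusters` table and `display_name` column are hypothetical, and this ADR does not select a migration tool.

```python
import sqlite3


def expand(conn: sqlite3.Connection) -> None:
    # Additive step: a nullable column that old application versions can ignore.
    conn.execute("ALTER TABLE clusters ADD COLUMN display_name TEXT")


def backfill(conn: sqlite3.Connection) -> None:
    # Populate the new column for rows written before or during the rollout.
    conn.execute(
        "UPDATE clusters SET display_name = name WHERE display_name IS NULL"
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clusters (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO clusters (name) VALUES ('alpha')")

expand(conn)
# An old application version still writes without knowing about the new column.
conn.execute("INSERT INTO clusters (name) VALUES ('beta')")
backfill(conn)

rows = conn.execute(
    "SELECT name, display_name FROM clusters ORDER BY id"
).fetchall()
# The irreversible "contract" step (dropping or constraining old columns) is
# deliberately left out: it would run only after every writer is upgraded.
```

Because the schema change is additive, the application rollout and the schema change do not have to happen at the same instant, and rolling back the application does not require rolling back the schema.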
NATS and workflow considerations
Upgrade strategy must account for:
- in-flight tasks
- replayed or delayed messages
- consumer restarts
- compatibility between old and new workflow handlers
Upgrade logic must assume the event system is live, not empty.
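The following sketch shows one way a worker could stay safe on a live event system: it keeps handlers for both task versions during a rollout and skips tasks it has already processed. The task fields (`id`, `version`, `cluster`, `zone`) and the in-memory deduplication set are assumptions made for illustration; a real worker would need durable deduplication state.

```python
from typing import Callable, Dict, Optional, Set

# Stand-in for durable state; in-memory only for the purpose of this sketch.
processed_ids: Set[str] = set()


def handle_v1(task: dict) -> str:
    return f"provision {task['cluster']}"


def handle_v2(task: dict) -> str:
    # Hypothetical version 2 adds a required "zone" field.
    return f"provision {task['cluster']} in {task['zone']}"


# Old and new handlers coexist while in-flight version 1 tasks drain.
HANDLERS: Dict[int, Callable[[dict], str]] = {1: handle_v1, 2: handle_v2}


def process(task: dict) -> Optional[str]:
    if task["id"] in processed_ids:
        # Replayed or redelivered message: already handled, do nothing.
        return None
    handler = HANDLERS[task.get("version", 1)]
    result = handler(task)
    # Record the id only after the handler succeeds, so a failed attempt
    # can be retried when the message is delivered again.
    processed_ids.add(task["id"])
    return result
```

The version 1 handler would be removed only once no version 1 tasks can still be in flight or replayed.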
Talos and infrastructure upgrade model
Infrastructure upgrades should also be modeled explicitly for:
- Talos node upgrades
- Kubernetes control-plane and worker upgrades
- network automation components
- storage platform dependencies
These upgrades may require stricter orchestration than normal service rollouts, but they still need documented sequencing and operator visibility.
Migration model
Migration includes more than version bumps. The platform should also support:
- data migrations
- resource ownership transitions
- topology model changes
- policy model changes
- service decomposition or extraction over time
Migration paths should be designed, not improvised after the fact.
Operator visibility
Operators must be able to see:
- what is being upgraded
- current stage or rollout state
- blocking failures
- rollback or recovery expectations
- compatibility warnings where relevant
Upgrades are workflows and should be treated as such.
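To make the visibility list concrete, here is a minimal sketch of an upgrade status record that carries each item operators must see. The type names, stage names, and fields are hypothetical; the actual surface is governed by the operator visibility ADR.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Stage(Enum):
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    BLOCKED = "blocked"
    COMPLETED = "completed"
    ROLLED_BACK = "rolled_back"


@dataclass
class UpgradeStatus:
    target: str                      # what is being upgraded
    from_version: str
    to_version: str
    stage: Stage = Stage.PENDING     # current stage or rollout state
    blocking_failures: List[str] = field(default_factory=list)
    rollback_expectation: str = "unspecified"
    compatibility_warnings: List[str] = field(default_factory=list)

    def block(self, reason: str) -> None:
        # A blocking failure is recorded and surfaced, not silently retried.
        self.blocking_failures.append(reason)
        self.stage = Stage.BLOCKED

    def summary(self) -> str:
        return (
            f"{self.target} {self.from_version} -> {self.to_version}: "
            f"{self.stage.value}, {len(self.blocking_failures)} blocking failure(s)"
        )
```

For example, a status created for a hypothetical `sentral` upgrade from `1.4.0` to `1.5.0` starts in the `pending` stage, and calling `block("migration check failed")` moves it to `blocked` with the reason preserved for operators.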
Explicit non-decisions for now
This ADR intentionally does not yet choose:
- a specific rollout controller
- a specific database migration tool
- exact compatibility window lengths
- exact canary or blue/green implementation
Those require later implementation decisions.
Consequences
- upgrades become a designed platform behavior instead of an ad hoc operational exercise
- service teams must think about compatibility and sequencing earlier
- event-driven and multi-datacenter realities become part of the upgrade model
Decision Outcome
Proposed. This ADR records the current preferred direction and still needs acceptance before it becomes binding.
Related Decisions
- Public API behavior must align with 008-public-api-style.md.
- Event and workflow handling must align with 007-nats-subject-and-event-envelope.md and 019-workflow-retry-dead-letter-and-reconciliation.md.
- PostgreSQL migration risk must align with 018-postgresql-control-plane-ha-and-backup.md.
- Operator visibility must align with 022-operator-visibility-and-control-surface.md.
More Information
Topics expected to be detailed in follow-up decisions:
- rollout implementation model
- database migration tooling and rules
- event schema compatibility policy
- Talos and cluster upgrade policy
Audit
- 2026-03-14: ADR proposed.