Upgrade and migration strategy DATAVERKET 025

Tags: Proposed, Migration, Lifecycle, Upgrade, Compatibility, Rollout

Defines staged, compatibility-aware upgrade and migration strategy across Dataverket services, schemas, workflows, and infrastructure.

Author: Lars Solem
Updated: 2026-03-14

Status

Proposed on 2026-03-14 by Lars Solem.

Context

Dataverket is building a control plane with many critical moving parts:

Sentral and other control-plane services
NATS and JetStream
PostgreSQL-backed state
Talos-based clusters and nodes
storage-backed products
public APIs and internal event contracts

Without an explicit upgrade and migration strategy, the platform risks depending on flag-day upgrades, undocumented compatibility assumptions, or manual heroics during change windows.

Decision

Dataverket adopts a staged, compatibility-aware upgrade model.

The platform should prefer:

rolling or staged upgrades where practical
explicit compatibility windows
reversible migrations where possible
operator-visible upgrade state
no requirement for whole-platform stop-the-world upgrades

Upgrade domains

The platform must treat these as separate upgrade domains:

control-plane services
public API contracts
internal event schemas and consumers
PostgreSQL schema and data migrations
NATS and JetStream infrastructure
Talos nodes and Kubernetes clusters
storage and network control integrations

Different domains may need different safety rules, but all must fit one coherent upgrade posture.

Service upgrade model

For normal control-plane services, the preferred posture is:

deploy new versions alongside or incrementally over old ones where possible
preserve compatibility during rollout windows
avoid designs that depend on simultaneous hard cutovers across all services

If a service requires a hard cutover, that should be treated as an exception requiring explicit review.
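
One way to make "preserve compatibility during rollout windows" checkable is a version-skew gate that runs before a new version is allowed to join the versions already running. The sketch below is illustrative only: the function name `within_window`, the (major, minor) version tuples, and the default of one minor version of skew are assumptions for this example. This ADR deliberately does not set window lengths.

```python
def within_window(
    running: list[tuple[int, int]],
    candidate: tuple[int, int],
    max_minor_skew: int = 1,  # assumed policy value, not decided by this ADR
) -> bool:
    """Return True if `candidate` may run alongside every version in `running`.

    Versions are (major, minor) tuples. A major version difference is treated
    as a hard cutover and therefore rejected by this gate.
    """
    return all(
        major == candidate[0] and abs(candidate[1] - minor) <= max_minor_skew
        for major, minor in running
    )


# Instances on 1.4 and 1.5 are still serving traffic during the rollout.
print(within_window([(1, 4), (1, 5)], (1, 5)))  # True: at most one minor apart
print(within_window([(1, 4), (1, 5)], (1, 6)))  # False: 1.4 would lag by two
print(within_window([(1, 5)], (2, 0)))          # False: major change needs review
```

A rejected candidate is not necessarily forbidden. Under this ADR it would be an exception that goes through explicit review.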

API evolution model

Public API evolution should follow:

additive change by default
explicit versioning for breaking changes
compatibility windows for clients and SDKs

Breaking public behavior must not be hidden inside “minor” platform upgrades.
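
As a rough illustration of "additive change by default", the sketch below compares two versions of a response schema and reports removals and type changes as breaking, while allowing new fields. The flat field-to-type dictionary and the function name `breaking_changes` are assumptions made for this example. They are not taken from 008-public-api-style.md.

```python
def breaking_changes(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """List the reasons why `new` would break clients written against `old`.

    Added fields are allowed. Removed fields and changed types are reported.
    """
    problems = []
    for field, old_type in sorted(old.items()):
        if field not in new:
            problems.append(f"removed field: {field}")
        elif new[field] != old_type:
            problems.append(f"type changed: {field} ({old_type} -> {new[field]})")
    return problems


# Hypothetical schemas for a cluster resource.
v1 = {"id": "string", "name": "string"}
v1_additive = {"id": "string", "name": "string", "region": "string"}
v1_breaking = {"id": "integer", "region": "string"}

print(breaking_changes(v1, v1_additive))  # empty list: only a field was added
print(breaking_changes(v1, v1_breaking))  # two problems: needs a new API version
```

A check like this could run in CI so that a non-empty result forces an explicit version bump instead of shipping inside a minor upgrade.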

Event and message evolution model

Internal event and command schemas should evolve with:

additive fields where possible
consumer tolerance for unknown fields
explicit schema versioning when semantics change
upgrade sequencing that avoids breaking lagging consumers immediately

The platform must not assume that all consumers are upgraded at exactly the same time.
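
A minimal sketch of a consumer that tolerates unknown fields and lagging producers is shown below. The event name `NodeRegistered`, its fields, and the `schema_version` key are hypothetical. The real envelope is defined by 007-nats-subject-and-event-envelope.md and may differ.

```python
import json
from dataclasses import dataclass

MAX_SUPPORTED_VERSION = 2  # assumed for this example


@dataclass
class NodeRegistered:
    node_id: str
    cluster: str
    labels: dict


def decode_node_registered(raw: bytes) -> NodeRegistered:
    """Decode an event while ignoring fields this consumer does not know."""
    doc = json.loads(raw)
    version = doc.get("schema_version", 1)  # producers that omit it are treated as v1
    if version > MAX_SUPPORTED_VERSION:
        # Semantics may have changed; refuse instead of guessing.
        raise ValueError(f"unsupported schema_version {version}")
    return NodeRegistered(
        node_id=doc["node_id"],
        cluster=doc["cluster"],
        labels=doc.get("labels", {}),  # additive field from v2; absent in v1
    )


old_event = b'{"node_id": "n1", "cluster": "c1"}'
new_event = (
    b'{"schema_version": 2, "node_id": "n1", "cluster": "c1",'
    b' "labels": {"zone": "a"}, "future_field": 1}'
)
print(decode_node_registered(old_event))
print(decode_node_registered(new_event))  # future_field is silently ignored
```

Only the known keys are read, so a newer producer can add fields without breaking this consumer, and an older producer can omit fields that have defaults.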

Database migration model

Database changes must prefer:

forward-compatible migrations where practical
controlled sequencing between schema changes and application rollout
explicit rollback or recovery expectations
tested migration paths in non-production environments

An application change that only works after an immediate irreversible schema cutover should be treated as a risk.
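
The sequencing between schema changes and application rollout is often handled with an expand, backfill, contract pattern. The sketch below walks through the first two phases. It uses Python's built-in `sqlite3` module only so that the example runs without a server; the production target is PostgreSQL, and the `clusters` table and `region` column are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clusters (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO clusters (name) VALUES ('osl-1')")

# Phase 1 (expand): add the column as nullable so existing writers keep working.
conn.execute("ALTER TABLE clusters ADD COLUMN region TEXT")

# The old application version is still deployed and does not know about region.
conn.execute("INSERT INTO clusters (name) VALUES ('trd-1')")

# Phase 2 (backfill): populate rows written before or during the rollout.
conn.execute("UPDATE clusters SET region = 'unknown' WHERE region IS NULL")

rows = conn.execute("SELECT name, region FROM clusters ORDER BY id").fetchall()
print(rows)

# Phase 3 (contract) is deliberately left out: tightening constraints or
# dropping old columns should wait until no old application version remains.
```

Until the contract phase runs, rolling the application back does not require a schema rollback, which keeps the recovery expectation simple to state.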

NATS and workflow considerations

Upgrade strategy must account for:

in-flight tasks
replayed or delayed messages
consumer restarts
compatibility between old and new workflow handlers

Upgrade logic must assume the event system is live, not empty.
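
The sketch below shows one possible shape for a handler that survives replayed messages and accepts both old and new message versions during a rollout. The message fields (`task_id`, `version`, `node`, `nodes`) and the in-memory `completed` map are assumptions for illustration. Real handlers would follow 019-workflow-retry-dead-letter-and-reconciliation.md and keep this state durably.

```python
class TaskHandler:
    """Handles a hypothetical 'drain' task in v1 and v2 message formats."""

    def __init__(self) -> None:
        # task_id -> recorded result; stand-in for durable state
        self.completed: dict[str, str] = {}

    def handle(self, msg: dict) -> str:
        task_id = msg["task_id"]
        if task_id in self.completed:
            return "duplicate"  # replayed or redelivered: do not redo the work

        version = msg.get("version", 1)
        if version == 1:
            targets = [msg["node"]]       # v1 carried a single node
        elif version == 2:
            targets = list(msg["nodes"])  # v2 carries a list of nodes
        else:
            return "deferred"  # newer than this handler: leave for an upgraded one

        self.completed[task_id] = ",".join(targets)
        return "done"


handler = TaskHandler()
print(handler.handle({"task_id": "t1", "node": "n1"}))                       # done
print(handler.handle({"task_id": "t1", "node": "n1"}))                       # duplicate
print(handler.handle({"task_id": "t2", "version": 2, "nodes": ["n1", "n2"]}))  # done
print(handler.handle({"task_id": "t3", "version": 3, "plan": {}}))           # deferred
```

Returning "deferred" instead of failing keeps an old handler from consuming and losing work that only a newer handler understands.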

Talos and infrastructure upgrade model

Infrastructure upgrades should also be modeled explicitly for:

Talos node upgrades
Kubernetes control-plane and worker upgrades
network automation components
storage platform dependencies

These upgrades may require stricter orchestration than normal service rollouts, but they still need documented sequencing and operator visibility.
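
Documented sequencing for node upgrades can be as simple as a fixed order plus a health gate between steps. The sketch below upgrades control-plane nodes before workers, one node at a time, and stops at the first unhealthy node. The role names and the `upgrade` and `is_healthy` callables are placeholders; no Talos or Kubernetes API is called here.

```python
from typing import Callable, Optional


def upgrade_order(roles: dict[str, str]) -> list[str]:
    """Order nodes so that control-plane nodes are upgraded before workers."""
    control = sorted(n for n, r in roles.items() if r == "controlplane")
    workers = sorted(n for n, r in roles.items() if r == "worker")
    return control + workers


def rolling_upgrade(
    nodes: list[str],
    upgrade: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> tuple[list[str], Optional[str]]:
    """Upgrade nodes one at a time; stop at the first node that is not healthy.

    Returns (nodes upgraded and healthy, node that blocked the rollout or None).
    """
    done: list[str] = []
    for node in nodes:
        upgrade(node)
        if not is_healthy(node):
            return done, node  # leave remaining nodes untouched for the operator
        done.append(node)
    return done, None


roles = {"w-1": "worker", "cp-1": "controlplane", "cp-2": "controlplane"}
attempted: list[str] = []
done, blocked_on = rolling_upgrade(
    upgrade_order(roles),
    upgrade=attempted.append,          # records the call instead of upgrading
    is_healthy=lambda n: n != "cp-2",  # simulate cp-2 failing its health check
)
print(done, blocked_on, attempted)
```

The returned pair gives operators both the progress so far and the node that blocked the rollout, which is the kind of state the Operator visibility section asks for.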

Migration model

Migration includes more than version bumps. The platform should also support:

data migrations
resource ownership transitions
topology model changes
policy model changes
service decomposition or extraction over time

Migration paths should be designed, not improvised after the fact.

Operator visibility

Operators must be able to see:

what is being upgraded
current stage or rollout state
blocking failures
rollback or recovery expectations
compatibility warnings where relevant

Upgrades are workflows and should be treated as such.
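
To make the list above concrete, an upgrade could be represented as a small status record that a control surface can render. The sketch below is one possible shape only. The type names, stage values, and example versions are assumptions and are not defined by 022-operator-visibility-and-control-surface.md.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    BLOCKED = "blocked"
    COMPLETED = "completed"
    ROLLED_BACK = "rolled_back"


@dataclass
class UpgradeStatus:
    target: str                       # what is being upgraded
    from_version: str
    to_version: str
    stage: Stage = Stage.PENDING      # current stage or rollout state
    blocking_failures: list[str] = field(default_factory=list)
    rollback_expectation: str = "unspecified"
    compatibility_warnings: list[str] = field(default_factory=list)

    def block(self, reason: str) -> None:
        """Record a blocking failure and move the upgrade to BLOCKED."""
        self.blocking_failures.append(reason)
        self.stage = Stage.BLOCKED

    def summary(self) -> str:
        """One-line, operator-readable description of the upgrade."""
        line = f"{self.target}: {self.from_version} -> {self.to_version} [{self.stage.value}]"
        if self.blocking_failures:
            line += " blocked by: " + "; ".join(self.blocking_failures)
        return line


status = UpgradeStatus("sentral", "1.4.0", "1.5.0", rollback_expectation="redeploy 1.4.0")
status.stage = Stage.IN_PROGRESS
status.block("schema migration did not complete")
print(status.summary())
```

Keeping rollback expectations and compatibility warnings on the same record means an operator does not have to look them up elsewhere while a rollout is blocked.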

Explicit non-decisions for now

This ADR intentionally does not yet choose:

a specific rollout controller
a specific database migration tool
exact compatibility window lengths
exact canary or blue/green implementation

Those require later implementation decisions.

Consequences

upgrades become a designed platform behavior instead of an ad hoc operational exercise
service teams must think about compatibility and sequencing earlier
event-driven and multi-datacenter realities become part of the upgrade model

Decision Outcome

Proposed. This ADR records the current preferred direction and still needs acceptance before it becomes binding.

Public API behavior must align with 008-public-api-style.md.
Event and workflow handling must align with 007-nats-subject-and-event-envelope.md and 019-workflow-retry-dead-letter-and-reconciliation.md.
PostgreSQL migration risk must align with 018-postgresql-control-plane-ha-and-backup.md.
Operator visibility must align with 022-operator-visibility-and-control-surface.md.

More Information

Follow-up decisions are expected to cover:

rollout implementation model
database migration tooling and rules
event schema compatibility policy
Talos and cluster upgrade policy

Audit

2026-03-14: ADR proposed.