Upgrade and migration strategy (DATAVERKET 025)

Tags: Proposed, Migration, Lifecycle, Upgrade, Compatibility, Rollout

Defines a staged, compatibility-aware upgrade and migration strategy across Dataverket services, schemas, workflows, and infrastructure.

Author: Lars Solem

Status

Proposed on 2026-03-14 by Lars Solem.

Context

Dataverket is building a control plane with many critical moving parts:

  • Sentral and other control-plane services
  • NATS and JetStream
  • PostgreSQL-backed state
  • Talos-based clusters and nodes
  • storage-backed products
  • public APIs and internal event contracts

Without an explicit upgrade and migration strategy, the platform risks depending on flag-day upgrades, undocumented compatibility assumptions, or manual heroics during change windows.

Decision

Dataverket adopts a staged, compatibility-aware upgrade model.

The platform should prefer:

  • rolling or staged upgrades where practical
  • explicit compatibility windows
  • reversible migrations where possible
  • operator-visible upgrade state
  • no requirement for whole-platform stop-the-world upgrades

Upgrade domains

The platform must treat these as separate upgrade domains:

  • control-plane services
  • public API contracts
  • internal event schemas and consumers
  • PostgreSQL schema and data migrations
  • NATS and JetStream infrastructure
  • Talos nodes and Kubernetes clusters
  • storage and network control integrations

Different domains may need different safety rules, but all must fit one coherent upgrade posture.

Service upgrade model

For normal control-plane services, the preferred posture is:

  • deploy new versions alongside or incrementally over old ones where possible
  • preserve compatibility during rollout windows
  • avoid changes that depend on a simultaneous hard cutover across all services

If a service requires a hard cutover, that should be treated as an exception requiring explicit review.
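
A rollout window can be expressed as a version-skew check. The sketch below is illustrative only: the `MAX_MINOR_SKEW` value, the `Version` type, and the rule that a major bump counts as a hard cutover are assumptions, not part of this ADR, which leaves window lengths undecided.

```python
from dataclasses import dataclass

# Assumed policy value; the ADR does not fix compatibility window lengths.
MAX_MINOR_SKEW = 1


@dataclass(frozen=True)
class Version:
    major: int
    minor: int

    @classmethod
    def parse(cls, text: str) -> "Version":
        # Only major and minor matter for the window; patch is ignored.
        major, minor = text.split(".")[:2]
        return cls(int(major), int(minor))


def within_window(running: Version, target: Version) -> bool:
    """True if `running` may coexist with `target` during a rollout."""
    if running.major != target.major:
        return False  # assumed: a major bump is a hard cutover
    return abs(target.minor - running.minor) <= MAX_MINOR_SKEW


def needs_review(running_versions: list[str], target: str) -> list[str]:
    """Return the running versions that fall outside the window."""
    goal = Version.parse(target)
    return [
        v for v in running_versions
        if not within_window(Version.parse(v), goal)
    ]
```

For example, `needs_review(["1.4.2", "1.5.0", "1.2.9"], "1.5.1")` flags only `"1.2.9"`, which would be the instance needing an explicit review before the rollout proceeds.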

API evolution model

Public API evolution should follow:

  • additive change by default
  • explicit versioning for breaking changes
  • compatibility windows for clients and SDKs

Breaking public behavior must not be hidden inside “minor” platform upgrades.
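
One way to keep breaking changes out of minor upgrades is to classify each contract change before release. The following is a minimal sketch under simplifying assumptions: a response contract is modelled as a flat mapping of field name to type name, and `classify_change` is a hypothetical helper, not an existing Dataverket tool.

```python
def classify_change(old: dict[str, str], new: dict[str, str]) -> str:
    """Classify a response-contract change as unchanged, additive, or breaking.

    Assumption: contracts are flat {field_name: type_name} mappings.
    Removing a field or changing its type breaks existing clients;
    adding a field does not, provided clients ignore unknown fields.
    """
    removed = old.keys() - new.keys()
    retyped = {k for k in old.keys() & new.keys() if old[k] != new[k]}
    if removed or retyped:
        return "breaking"
    if new.keys() - old.keys():
        return "additive"
    return "unchanged"
```

A release check could refuse to ship a change classified as `"breaking"` unless the API version was bumped explicitly. Request contracts would need a stricter rule, since a new required input field also breaks clients.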

Event and message evolution model

Internal event and command schemas should evolve with:

  • additive fields where possible
  • consumer tolerance for unknown fields
  • explicit schema versioning when semantics change
  • upgrade sequencing that does not break lagging consumers the moment a producer is upgraded

The platform must not assume that all consumers are upgraded at exactly the same time.
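
Consumer tolerance for unknown fields is often implemented as a "tolerant reader". The sketch below assumes JSON payloads, a `schema_version` field defaulting to 1, and a hypothetical `NodeProvisioned` event; none of these names come from the actual Dataverket event contracts.

```python
import json
from dataclasses import dataclass

# Hypothetical schema versions this consumer understands.
SUPPORTED_VERSIONS = {1, 2}


@dataclass
class NodeProvisioned:
    node_id: str
    # Assumed to be added in v2; the default keeps v1 events readable.
    region: str = "unknown"


def decode(raw: bytes) -> NodeProvisioned:
    """Decode an event, ignoring fields this consumer does not know."""
    payload = json.loads(raw)
    version = payload.get("schema_version", 1)
    if version not in SUPPORTED_VERSIONS:
        # Semantics may have changed; fail loudly instead of guessing.
        raise ValueError(f"unsupported schema_version {version}")
    known = {k: payload[k] for k in ("node_id", "region") if k in payload}
    return NodeProvisioned(**known)
```

With this shape, a producer can add fields without coordinating with every consumer, while a semantic change is signalled through a new `schema_version` that lagging consumers reject explicitly.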

Database migration model

Database changes must prefer:

  • forward-compatible migrations where practical
  • controlled sequencing between schema changes and application rollout
  • explicit rollback or recovery expectations
  • tested migration paths in non-production environments

An application change that only works after an immediate irreversible schema cutover should be treated as a risk.
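
A common way to sequence schema changes with application rollout is the expand/contract pattern: add new structures first, run old and new application versions side by side, and remove old structures only after the rollout completes. The sketch below is an illustration, not the chosen tooling. It uses SQLite from the Python standard library purely so that it runs standalone (the platform itself uses PostgreSQL), and the `tenants` table and its columns are hypothetical.

```python
import sqlite3


def expand(db: sqlite3.Connection) -> None:
    # Step 1: additive change only; old readers are unaffected.
    db.execute("ALTER TABLE tenants ADD COLUMN display_name TEXT")


def backfill(db: sqlite3.Connection) -> None:
    # Step 2: populate the new column for existing rows.
    db.execute(
        "UPDATE tenants SET display_name = name WHERE display_name IS NULL"
    )


def old_reader(db: sqlite3.Connection) -> list[str]:
    # Old application version: only knows the original column.
    return [r[0] for r in db.execute("SELECT name FROM tenants ORDER BY id")]


def new_reader(db: sqlite3.Connection) -> list[str]:
    # New application version: tolerates rows that are not yet backfilled.
    query = "SELECT COALESCE(display_name, name) FROM tenants ORDER BY id"
    return [r[0] for r in db.execute(query)]


def demo() -> tuple:
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE tenants (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
    )
    db.execute("INSERT INTO tenants (name) VALUES ('acme'), ('globex')")
    expand(db)
    before = (old_reader(db), new_reader(db))
    backfill(db)
    after = (old_reader(db), new_reader(db))
    # Step 3 (contract), dropping the old column, is deliberately omitted:
    # it is the irreversible part and belongs after the rollout completes.
    return before, after
```

Both readers return the same rows before and after the backfill, which is the property that lets old and new application versions overlap during the rollout window.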

NATS and workflow considerations

Upgrade strategy must account for:

  • in-flight tasks
  • replayed or delayed messages
  • consumer restarts
  • compatibility between old and new workflow handlers

Upgrade logic must assume the event system is live, not empty.
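
The considerations above suggest handlers that are idempotent and that dispatch on the version a task was created with. The following sketch is an assumption-laden illustration: `TaskProcessor` is hypothetical, message IDs are assumed to be stable across redeliveries, and the in-memory set stands in for a durable deduplication store.

```python
from typing import Callable


class TaskProcessor:
    """Dispatches tasks by schema version and ignores redeliveries."""

    def __init__(self) -> None:
        # Assumption: a real system would persist this across restarts.
        self._seen: set[str] = set()
        self._handlers: dict[int, Callable[[dict], str]] = {}
        self.results: list[str] = []

    def register(self, version: int, handler: Callable[[dict], str]) -> None:
        self._handlers[version] = handler

    def handle(self, msg_id: str, version: int, body: dict) -> bool:
        """Process a message once; return False for a replayed message."""
        if msg_id in self._seen:
            return False
        handler = self._handlers.get(version)
        if handler is None:
            raise LookupError(f"no handler for task version {version}")
        self.results.append(handler(body))
        # Mark as seen only after success so a crash leads to a retry.
        self._seen.add(msg_id)
        return True
```

Keeping the old handler registered until no in-flight tasks of that version remain lets an upgraded consumer finish work that was started before the upgrade.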

Talos and infrastructure upgrade model

Infrastructure upgrades should also be modeled explicitly for:

  • Talos node upgrades
  • Kubernetes control-plane and worker upgrades
  • network automation components
  • storage platform dependencies

These upgrades may require stricter orchestration than normal service rollouts, but they still need documented sequencing and operator visibility.
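
Documented sequencing for node upgrades can be as simple as "one node at a time, stop at the first unhealthy node". The sketch below shows that gate in isolation; the `upgrade` and `healthy` callbacks are hypothetical placeholders for whatever tooling is eventually chosen, and no real Talos or Kubernetes API is implied.

```python
from typing import Callable, Iterable, Optional


def rolling_upgrade(
    nodes: Iterable[str],
    upgrade: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> tuple[list[str], Optional[str]]:
    """Upgrade nodes in order, halting at the first failed health check.

    Returns (successfully upgraded nodes, node that blocked the rollout).
    The second element is None when every node passed.
    """
    done: list[str] = []
    for node in nodes:
        upgrade(node)
        if not healthy(node):
            return done, node  # leave remaining nodes untouched
        done.append(node)
    return done, None
```

The returned pair gives operators both the progress so far and the blocking node, which keeps the remaining nodes on the old version until someone decides how to proceed.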

Migration model

Migration includes more than version bumps. The platform should also support:

  • data migrations
  • resource ownership transitions
  • topology model changes
  • policy model changes
  • service decomposition or extraction over time

Migration paths should be designed, not improvised after the fact.

Operator visibility

Operators must be able to see:

  • what is being upgraded
  • current stage or rollout state
  • blocking failures
  • rollback or recovery expectations
  • compatibility warnings where relevant

Upgrades are workflows and should be treated as such.
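
Treating an upgrade as a workflow implies a status record that operators can query. The sketch below shows one possible shape covering the items listed above; the stage names and the `UpgradeStatus` type are assumptions, since this ADR does not choose a rollout controller or a canary implementation.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    # Hypothetical stage names, for illustration only.
    PENDING = "pending"
    CANARY = "canary"
    ROLLING = "rolling"
    COMPLETE = "complete"
    BLOCKED = "blocked"


@dataclass
class UpgradeStatus:
    target: str  # what is being upgraded
    stage: Stage = Stage.PENDING
    blocking_failures: list[str] = field(default_factory=list)
    recovery_plan: str = "not documented"
    warnings: list[str] = field(default_factory=list)

    def block(self, reason: str) -> None:
        """Record a blocking failure and move to the BLOCKED stage."""
        self.blocking_failures.append(reason)
        self.stage = Stage.BLOCKED

    def summary(self) -> str:
        """Render an operator-facing, plain-text summary."""
        lines = [f"{self.target}: {self.stage.value}"]
        lines += [f"  blocked by: {r}" for r in self.blocking_failures]
        lines += [f"  warning: {w}" for w in self.warnings]
        lines.append(f"  recovery: {self.recovery_plan}")
        return "\n".join(lines)
```

Making `recovery_plan` a required part of the record, even when its value is "not documented", surfaces missing rollback expectations before the upgrade starts instead of during an incident.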

Explicit non-decisions for now

This ADR intentionally does not yet choose:

  • a specific rollout controller
  • a specific database migration tool
  • exact compatibility window lengths
  • exact canary or blue/green implementation

Those require later implementation decisions.

Consequences

  • upgrades become a designed platform behavior instead of an ad hoc operational exercise
  • service teams must think about compatibility and sequencing earlier
  • event-driven and multi-datacenter realities are now part of the upgrade model

Decision Outcome

Proposed. This ADR records the current preferred direction and still needs acceptance before it becomes binding.

More Information

  • rollout implementation model
  • database migration tooling and rules
  • event schema compatibility policy
  • Talos and cluster upgrade policy

Audit

  • 2026-03-14: ADR proposed.