Status
This document is a working platform plan derived from the accepted decisions in this repository and the current target architecture described by the team.
It is intentionally opinionated where the repository is silent, and those assumptions are called out as proposed decisions rather than accepted ones.
Inputs from existing repository decisions
The current repository establishes the following baseline:
- The parent system is Dataverket.
- The current accepted service names are:
  - Sentral: control plane, orchestration
  - Maskin: compute, VMs and bare metal
  - Plattform: Kubernetes platform
  - Identitet: identity and access management
  - Tjeneste: application deployment
  - Objekt: object storage
- The accepted CLI structure is a dv command family: dv, dvid, dvce, dvke, dvapp, dvos, dvnet
- The accepted out-of-band access model is:
  - ZITADEL + Teleport for normal access
  - dormant break-glass via hardware-rooted CA
- Talos integration is explicitly expected
Problem statement
Dataverket needs to become a full datacenter automation platform that can:
- Install and control bare-metal servers
- Manage switching, VLANs, L3 gateways, VPN, and logical tenant networks
- Support two or more datacenters with failure handling and service failover between them
- Provision Kubernetes clusters, VMs, databases, object storage, container registry, and web hosting
- Expose all capabilities through an API, SDKs, and CLI tools
- Use NATS as the backbone for inter-service communication
Platform principles
The plan should follow these design principles:
API-first: Every platform capability must exist as an internal API before it becomes a CLI command or UI feature.
Event-driven orchestration: Commands produce desired state and lifecycle events on NATS; workers converge infrastructure toward that state.
Hardware-aware control plane: The platform must understand BMCs, PXE/iPXE, switches, routers, hypervisors, and Talos node lifecycle.
Stateless node operating model: Bare-metal servers should be treated as disposable and re-installable. Local disk may exist, but platform correctness must not depend on artisanal node state.
Strong tenancy boundaries: Projects, environments, RBAC, network segmentation, secrets, and auditability must be built in from day one.
Sovereign operations: Avoid dependencies on hosted control planes for core operation. External services may exist at the edge, not in the platform core.
Multi-datacenter by design: The platform should model datacenters as first-class failure domains and support operation across two or more datacenters from the start.
Proposed target architecture
The architecture should distinguish clearly between:
Internal platform services: Control-plane and execution services Dataverket runs to operate the platform itself.
Platform offerings: The products and managed resource types Dataverket exposes to operators and tenants through the public API.
Those two layers are related, but they are not the same thing. Internal service boundaries should not be exposed blindly as public product boundaries.
The design must also distinguish between full availability, degraded operation, and dependency-blocked operation. Cross-service failure behavior should be designed explicitly rather than delegated to retries alone.
Control plane services
Sentral
Sentral is the system of record and orchestrator. It should own:
- projects, tenants, environments, quotas
- inventory and resource graph
- reconciliation workflow definitions
- audit log and task tracking
- public API entrypoint
Sentral should store desired state in PostgreSQL and publish commands/events on NATS JetStream.
Sentral should also understand datacenter placement, failure domains, and resource locality so that scheduling and failover decisions can span more than one site.
Sentral must not become a single undifferentiated monolith. Internally it should be treated as a set of bounded contexts:
- tenancy, projects, and environments
- inventory and resource graph
- task orchestration
- audit and policy-facing event history
- public API gateway and resource facade
Those areas may initially ship together, but they should have explicit internal boundaries and a path to later extraction if one part outgrows the others.
The inventory context should have its own PostgreSQL schema boundary from day one, even if Sentral still ships as one process initially. The API facade should compose over that boundary rather than sharing mutable tables freely.
Sentral also needs an explicit consistency model:
- optimistic concurrency for ordinary resource edits
- transactional reservation semantics for scarce allocations
- task-driven saga orchestration across service boundaries
Without that, asynchronous workflows and PostgreSQL-backed desired state still leave core race conditions undefined.
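The optimistic-concurrency part of that model can be sketched in Go. This is a minimal illustration against an in-memory map standing in for PostgreSQL; the type and method names are assumptions, not a committed Sentral API, and a real implementation would express the version check as `UPDATE ... WHERE id = $1 AND version = $2` and inspect the affected row count.

```go
package main

import (
	"errors"
	"fmt"
)

// Resource is a hypothetical Sentral-managed record with a version counter.
type Resource struct {
	Name    string
	Spec    string
	Version int
}

var ErrConflict = errors.New("version conflict: resource was modified concurrently")

// Store stands in for a PostgreSQL table in this sketch.
type Store struct {
	resources map[string]Resource
}

// Update applies an edit only if the caller's expected version still matches:
// the optimistic-concurrency check for ordinary resource edits.
func (s *Store) Update(name, newSpec string, expectedVersion int) error {
	r, ok := s.resources[name]
	if !ok {
		return fmt.Errorf("resource %q not found", name)
	}
	if r.Version != expectedVersion {
		return ErrConflict
	}
	r.Spec = newSpec
	r.Version++
	s.resources[name] = r
	return nil
}

func main() {
	s := &Store{resources: map[string]Resource{
		"net-a": {Name: "net-a", Spec: "vlan=100", Version: 1},
	}}
	// First writer succeeds and bumps the version.
	fmt.Println(s.Update("net-a", "vlan=101", 1)) // <nil>
	// A second writer holding the stale version must be rejected, not silently win.
	fmt.Println(s.Update("net-a", "vlan=102", 1)) // version conflict
}
```

Scarce allocations (IP pools, bare-metal reservations) need the stronger transactional reservation semantics instead, because last-writer-rejected is not enough when two requests both fit individually but not together.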
Identitet
Identitet should provide:
- user, group, and machine identity
- OIDC via ZITADEL
- service-to-service identity
- token issuance for CLI and SDKs
- RBAC and authorization inputs
- operator and service authentication policy
Teleport remains the operational access path for humans and privileged workflows.
Identitet is under-specified in the current document set and should become an explicit ADR early. Identity, authentication, authorization, and service-to-service trust are too central to remain implicit.
Nett
The CLI ADR already introduces dvnet, but the service naming ADR does not yet define a corresponding service. This plan proposes Nett as an explicit platform service.
Nett should be split into two categories:
Infrastructure-side Nett: Fabric and device-facing automation for switches, routers, underlay, and inter-datacenter connectivity.
Provider-side Nett: Tenant-facing network products and abstractions built on top of the fabric.
The infrastructure-side part should further decompose into subdomains such as:
- fabric topology and port intent
- L2 segmentation
- L3 routing and edge policy
- inter-datacenter transport
- rollout safety and drift detection
Nett should own:
- switch configuration generation and rollout
- VLAN and VRF lifecycle
- routed tenant networks
- load balancer IP pools
- VPN ingress/egress
- firewall policy intent
- IPAM and DNS integration
This service is required if compute, Kubernetes, and tenant networking are to be automated coherently.
The first version should prioritize infrastructure-side Nett before expanding the full provider-side surface.
Language and implementation guidance
The platform does not require a single language everywhere, but language choice should be deliberate.
Go
Advantages:
- strong fit for networked services, CLIs, controllers, and infrastructure tooling
- simple deployment and operational model
- mature ecosystem for Kubernetes, OpenAPI, and service development
Likely uses:
- Sentral APIs and control-plane services
- Nett and Maskin control loops
- CLI tools
- operator-facing APIs and service daemons
Rust
Advantages:
- strong correctness and memory-safety properties
- good fit where reliability and performance matter strongly
- attractive for security-sensitive or high-throughput components
Tradeoffs:
- steeper learning curve
- slower iteration for some teams and tooling flows
Likely uses:
- selected performance- or correctness-critical services
- protocol-heavy components
- security-sensitive agents or infrastructure workers
Python
Advantages:
- fast iteration
- strong automation ecosystem
- useful for glue code, discovery, validation, and operational tooling
Tradeoffs:
- weaker deployment discipline if allowed to sprawl
- easier to accumulate inconsistent runtime behavior
Likely uses:
- prototypes
- discovery scripts and validation jobs
- operational tooling and migration utilities
Recommended posture:
- default to Go for core control-plane services and CLIs
- use Rust selectively where the additional rigor is justified
- use Python intentionally for tooling, automation, and experimentation rather than as the default for all long-lived services
Development and local test strategy
The platform must be developable without requiring a full physical datacenter for every change.
The development model should include:
- local or CI-driven NATS and PostgreSQL
- simulated or containerized service dependencies
- virtualized Talos and VM test environments where possible
- multi-site simulation for partition, failover, and recovery scenarios
- lab environments for device and fabric integration testing
- clear separation between unit tests, integration tests, and hardware-in-the-loop tests
The local development story should allow most API, task, inventory, and reconciliation work to be tested without access to production hardware.
Practical development environments may include:
Docker Compose: For Sentral, NATS, PostgreSQL, supporting services, and most API- and workflow-level development.
Single-node Proxmox or equivalent virtualization host: For VM lifecycle, Talos-in-VM testing, network attachment experiments, and operator workflow validation on one machine.
Multi-site simulation environment: For cross-site partition testing, failover exercises, delayed-message behavior, and recovery drills that cannot be represented honestly in a single-site environment.
Small physical lab: Such as a single-board-computer cluster or a small number of low-cost nodes for hardware-adjacent provisioning, discovery, and failure testing.
Recommended posture:
- Docker Compose should be enough for most control-plane development
- a single-machine virtualization lab should cover most VM and cluster lifecycle testing
- a dedicated multi-site simulation layer should cover partition and failover behavior before any production multi-datacenter claims are trusted
- a small physical lab should be reserved for hardware-facing validation and edge cases that cannot be trusted in pure virtualization
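A Compose stack for control-plane development could look like the following sketch. The service names, image tags, ports, and environment variables are illustrative assumptions, not pinned platform choices; the one deliberate detail is enabling JetStream on NATS so durable workflow behavior can be exercised locally.

```yaml
# Sketch of a local control-plane stack; values are illustrative only.
services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: devonly        # local development only
    ports:
      - "5432:5432"
  nats:
    image: nats:2
    command: ["-js"]                    # enable JetStream for durable workflows
    ports:
      - "4222:4222"
  sentral:
    build: ./sentral                    # hypothetical local service build
    environment:
      DATABASE_URL: postgres://postgres:devonly@postgres:5432/postgres
      NATS_URL: nats://nats:4222
    depends_on:
      - postgres
      - nats
```

This is deliberately not production posture: no TLS, no persistence tuning, no HA. Its only job is to make API, task, and reconciliation work testable on a laptop.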
Maskin
Maskin should own:
- hardware inventory
- BMC integration and power control
- PXE/iPXE workflows
- Talos machine provisioning
- VM lifecycle
- bare-metal reservation and allocation
Maskin is the bridge between physical infrastructure and higher-level services.
For workflows that depend on network state, Maskin should distinguish between:
- work that requires fresh or validated Nett changes before provisioning can start
- work that may proceed using already-approved and already-realized network state
- work that must enter a dependency-blocked state when network correctness cannot be validated
Plattform
Plattform should own:
- lifecycle of Kubernetes management and workload clusters
- Talos cluster bootstrap and upgrades
- CNI, CSI, ingress, and policy defaults
- cluster classes / templates
- managed Kubernetes tenant offering
Kubernetes tenant isolation posture
The platform must be explicit about whether tenants receive dedicated clusters, shared-cluster isolation, or both.
The recommended v1 posture is:
- dedicated Kubernetes clusters are the default tenant-facing offering for workloads that require strong isolation or clearer operational boundaries
- shared clusters, if introduced, should be treated as a later or explicitly limited offering rather than the default assumption
- namespace isolation inside shared clusters is not by itself equivalent to tenant isolation for all security and operational purposes
This keeps the first Kubernetes product honest about isolation guarantees and avoids overcommitting Plattform to a high-complexity multi-tenant shared-cluster story too early.
Dedicated-cluster default
In v1, a managed Kubernetes cluster should normally be a project- or environment-scoped cluster with its own control plane and worker allocation model.
This gives Dataverket:
- clearer security boundaries
- simpler quota and capacity reasoning
- cleaner upgrade and failure-domain handling
- fewer cross-tenant policy interactions inside the same cluster
Shared-cluster caution
If Dataverket later offers shared clusters, that should be a distinct product posture with additional requirements such as:
- strong namespace and network policy isolation
- admission and policy enforcement suitable for multi-tenant clusters
- careful resource quota and noisy-neighbor controls
- explicit documentation that the isolation model differs from dedicated clusters
The platform should not imply that namespace scoping alone gives the same isolation properties as dedicated clusters.
Tjeneste
Tjeneste should own:
- app deployment workflows
- web hosting primitives
- service templates
- runtime binding to databases, secrets, object storage, and networking
Objekt
Objekt should own:
- S3-compatible object storage lifecycle
- buckets, access policies, quotas
- object storage integration for apps and platform internals
Additional proposed services
- Register: Harbor-backed container registry service
- Database: managed PostgreSQL, MySQL, and possibly Redis
- Logg or Observasjon: logs, metrics, traces, alerting
These can start as modules inside Sentral or Tjeneste, but should become explicit services once lifecycle complexity grows.
Internal services versus platform offerings
The first implementation should keep this distinction explicit:
Internal platform services
These are implementation and control-plane domains:
- Sentral
- Identitet
- Maskin
- Plattform
- Nett
- Objekt
- later internal services such as Register, Database, or Observasjon if they need their own lifecycle
These services are responsible for orchestration, execution, inventory, policy enforcement, and operating shared infrastructure.
Platform offerings
These are the resource types and managed capabilities exposed through the public API:
- projects and environments
- tasks and audit visibility
- virtual machines
- bare-metal allocations where offered
- Kubernetes clusters
- logical networks
- buckets and object storage access
- databases
- application deployment products
- image registry access where offered
An internal service may back several offerings, and one offering may depend on several internal services. For example, a managed Kubernetes cluster is a public offering even though it depends on Sentral, Plattform, Maskin, Nett, Identitet, and storage components internally.
Cross-service dependency and degraded-mode model
Internal services must document dependencies per workflow, not only per service.
The architecture should classify dependencies at least as:
Hard dependency: The workflow cannot proceed safely or correctly if the dependency is unavailable.
Soft dependency: The workflow may continue in reduced mode, with some capability delayed or temporarily unavailable.
Reconciliation dependency: The initial request may be accepted, but final convergence depends on the dependency returning later.
This classification should be explicit in task handling, operator visibility, and service implementation.
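The mapping from dependency class and dependency health to task state can be made explicit in code rather than left implicit in retry loops. A minimal Go sketch, where the class names follow the list above and the task-state vocabulary is a hypothetical example:

```go
package main

import "fmt"

// DependencyClass mirrors the hard / soft / reconciliation split above.
type DependencyClass int

const (
	Hard           DependencyClass = iota // workflow must not start or continue
	Soft                                  // workflow continues in reduced mode
	Reconciliation                        // accepted now, converges when dependency returns
)

// TaskState is a hypothetical operator-visible status vocabulary.
type TaskState string

const (
	Running           TaskState = "running"
	Degraded          TaskState = "degraded"
	DependencyBlocked TaskState = "dependency-blocked"
)

// Classify decides what an unavailable dependency means for a task:
// the explicit mapping this section asks for, instead of uniform retries.
func Classify(class DependencyClass, depHealthy bool) TaskState {
	if depHealthy {
		return Running
	}
	switch class {
	case Hard:
		return DependencyBlocked
	case Soft:
		return Degraded
	default: // Reconciliation: accept the request, converge later
		return Running
	}
}

func main() {
	fmt.Println(Classify(Hard, false)) // dependency-blocked
	fmt.Println(Classify(Soft, false)) // degraded
}
```

The point of making this a declared table per workflow is that the task system can report "dependency-blocked on Nett" instead of burning a retry budget against a dependency that cannot recover mid-retry.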
Degraded-mode requirements
For important workflows, the platform should define:
- which dependencies must be healthy before work starts
- which dependencies may fail after work starts without corrupting state
- whether the workflow pauses, fails, or continues in reduced mode
- what task state and operator-visible reason is exposed
- what reconciliation path completes the work later
Retries are not enough. The platform must know whether retrying is meaningful while the dependency remains unavailable.
Example: Maskin depending on Nett
Maskin should not provision a server onto unknown or unvalidated network state merely because a retry budget exists.
The intended behavior should be:
- if a server requires a new VLAN, routed attachment, or other fresh Nett action, the workflow is blocked until Nett validates and applies that prerequisite
- if the server can use already-approved and already-realized network state, Maskin may proceed and record that it relied on existing network configuration
- if Nett is unavailable and network correctness cannot be proven, the task must become visibly dependency-blocked rather than silently thrashing
The same pattern should apply across other boundaries such as Plattform to Maskin, Tjeneste to Objekt, and Sentral to execution services.
Storage platform direction
Storage is a first-class platform concern and cannot remain implicit inside VM, Kubernetes, or database services.
The platform needs at least three storage classes:
Object storage: Owned by Objekt for S3-compatible data and artifacts.
Block storage: Required for VMs, databases, and Kubernetes persistent volumes.
Shared filesystem or equivalent shared data service: Required only where products explicitly need shared file semantics.
The first version does not yet choose a storage backend, but the architecture must assume:
- block storage is required for VM and database products
- Kubernetes needs a CSI-integrated persistent volume strategy
- storage durability and replication are separate design concerns from compute placement
- multi-datacenter failover claims are incomplete until storage replication and recovery are defined
Observability and operations
Observability is a platform prerequisite, not a late optional product.
The first implementation phases must provide:
- centralized logs
- metrics for services, jobs, and infrastructure
- distributed tracing or equivalent workflow correlation where feasible
- task and event visibility across NATS-driven workflows
- health and drift visibility for provisioning, networking, and failover systems
Logg or Observasjon may start as a module, but the operational capability itself must exist in the early foundation phases.
Operator runbooks and day-2 operations
The platform needs an explicit day-2 operating model, not only dashboards and alerts.
For important failure classes and operational workflows, the architecture should define:
- when the platform may remediate automatically
- when it should pause and require operator approval
- when it can only surface a condition and require manual intervention
- how incidents are escalated based on severity and blast radius
- which runbook or supported response path applies
Automatic versus manual response
Automatic remediation is appropriate when the action is low-risk, bounded, and reversible enough to trust repeatedly.
Examples may include:
- bounded workflow retry
- reconciliation after transient dependency recovery
- safe replay of idempotent work
- restart or failover of low-blast-radius internal components where policy allows it
Operator approval or manual intervention is more appropriate when the action may affect shared infrastructure, data durability, or cross-site behavior.
Examples may include:
- site failover
- destructive recovery actions
- inventory trust overrides
- sensitive network recovery operations
- changes that cross tenant or datacenter blast-radius boundaries
Runbook posture
Important alerts, blocked workflows, and degraded conditions should map to explicit runbooks or supported operator procedures.
Those runbooks should define:
- triggering condition
- likely impact
- immediate safe checks
- supported API or CLI actions
- whether escalation is required
- expected audit trail and completion signal
The first operator surface does not need a sophisticated automation engine, but it should make the runbook path discoverable rather than relying on tribal memory.
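The runbook fields above could start as a machine-readable record that alerts and blocked tasks link to, which is enough to make the path discoverable without an automation engine. A Go sketch; the field names, runbook ID, and CLI subcommands in the example are hypothetical:

```go
package main

import "fmt"

// Runbook is a hypothetical machine-readable form of the fields listed above,
// so that alerts and blocked tasks can link to a discoverable procedure.
type Runbook struct {
	ID                  string
	TriggeringCondition string
	LikelyImpact        string
	SafeChecks          []string // immediate read-only checks
	SupportedActions    []string // supported API or CLI actions
	EscalationRequired  bool
	CompletionSignal    string // expected audit trail and completion signal
}

func main() {
	// Example entry; the condition, impact, and commands are illustrative.
	rb := Runbook{
		ID:                  "nett-rollout-stuck",
		TriggeringCondition: "switch rollout task blocked for more than 15 minutes",
		LikelyImpact:        "new tenant networks cannot be realized",
		SafeChecks:          []string{"inspect rollout task status", "inspect device drift report"},
		SupportedActions:    []string{"retry rollout", "abort rollout and revert intent"},
		EscalationRequired:  true,
		CompletionSignal:    "task reaches succeeded with matching audit entry",
	}
	fmt.Println(rb.ID, "escalation:", rb.EscalationRequired)
}
```

Keeping runbooks as data rather than wiki prose also means the task system can attach the right runbook ID to a dependency-blocked or degraded task automatically.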
Secrets and certificate management
Daily secrets handling is separate from break-glass access and must be designed explicitly.
The platform needs a secrets and certificate lifecycle covering:
- service-to-service credentials
- tenant-facing database and application credentials
- API keys and automation tokens
- TLS certificate issuance and rotation
- encryption key management for platform components
Identitet provides auth context, but it does not replace a secrets system.
Service-to-service transport security
Service-to-service identity must translate into actual transport protection.
The platform should assume:
- TLS on internal service communications where practical
- mTLS or equivalent strong service authentication for high-trust internal paths
- Kubernetes network policies or equivalent segmentation where workloads share clusters
- explicit policy for service-to-service communication between datacenters
Identity without transport security is incomplete.
Platform security baseline
The platform needs a coherent security strategy spanning:
- tenant and infrastructure network segmentation
- API authentication and authorization
- NATS authentication, subject authorization, and least-privilege publishing/subscribing
- service-to-service identity and transport security
- container and VM isolation expectations
- image and artifact supply-chain trust
Secrets management is only one part of the security model. The platform should treat these concerns as one architecture stream rather than scattered implementation details.
Bare-metal and Talos design
Day-zero bootstrap model
The platform needs an explicit story for how Dataverket itself is brought up in a new datacenter before normal self-hosting workflows exist.
The day-zero posture should assume a temporary manual bootstrap chain that establishes the minimum dependencies required for Sentral and the management cluster.
Day-zero assumptions
Before Dataverket can manage itself normally, operators must provide or establish at least:
- powered and cabled servers
- basic switch and router reachability
- management addressing for BMCs and network devices
- a bootstrap DNS and name-resolution path
- a bootstrap PKI or trust path sufficient to start internal services
- a way to host initial install assets for Talos and bootstrap manifests
These are not long-term manual operations, but they are unavoidable prerequisites for the first bring-up of an empty site.
Recommended day-zero sequence
The intended first-site bootstrap sequence is:
- Operators seed the initial inventory for racks, switches, links, BMC endpoints, and bootstrap hosts.
- Operators establish the minimum out-of-band management and provisioning networks.
- Operators bring up the first bootstrap services needed for DNS, image hosting, and Talos install assets.
- Maskin uses BMC and iPXE workflows to install Talos onto the first management-cluster nodes.
- Plattform bootstraps the first management Kubernetes cluster.
- Operators deploy the first stateful control-plane dependencies required for self-hosting, including PostgreSQL, NATS, and any required secrets backend.
- Operators deploy Sentral and the first operator-facing APIs onto the management environment.
- Identitet integration is connected so that normal authentication replaces bootstrap-only access paths.
- Once Sentral is healthy, subsequent inventory, platform, and product bring-up should move onto supported Dataverket workflows rather than continuing as ad hoc manual setup.
Bootstrap versus steady state
The day-zero model should distinguish clearly between:
Bootstrap dependencies: The minimum external or manually established services needed to start the platform once.
Steady-state dependencies: The services Dataverket expects to manage, upgrade, and recover during normal operation.
For example, bootstrap DNS or asset hosting may be simpler and more manual than the long-term managed equivalents. That is acceptable as long as the handoff into steady-state operation is designed explicitly.
Day-zero minimum viable control plane
The first goal is not “the whole product catalog”. The first goal is a minimal management plane that can take over further automation.
That minimum should include:
- management cluster on Talos
- PostgreSQL for control-plane state
- NATS for orchestration
- Sentral API reachability
- enough identity integration to authenticate operators normally
- enough observability to see whether the platform is healthy during continued bring-up
Until this minimum exists, the platform is still in bootstrap mode rather than normal operation.
Baseline decision
Talos Linux should be installed directly on servers. The platform should not depend on a general-purpose host OS.
Recommended provisioning model
Use network boot for installation, but do not assume diskless operation for Talos nodes.
Recommended flow:
- Server powers on through BMC-managed boot order.
- iPXE chainloads from the provisioning network.
- Maskin identifies the server from MAC, serial, or BMC identity.
- Maskin serves a per-node or per-role Talos installer configuration.
- Talos installs to local disk.
- Node reboots into the installed Talos system.
- Plattform completes cluster bootstrap and joins the node.
Why not purely diskless Talos by default
PXE-booted stateless nodes sound attractive, but the default platform design should prefer installed Talos on local disk because:
- Talos is designed around an installed, immutable OS model
- Kubernetes nodes benefit from predictable local state for kubelet, images, and upgrades
- diskless operation complicates reboot behavior, image caching, and failure domains
- debugging and lifecycle management become harder early in the project
Diskless or ephemeral netboot nodes may still be valuable for:
- rescue mode
- hardware validation
- installer environment
- special-purpose stateless workers
So the recommended design is:
- PXE/iPXE for provisioning
- Talos installed onto local disk for normal operation
- optional netboot mode for rescue and exceptional workloads
Provisioning components
Maskin will need:
- DHCP for provisioning network
- TFTP only if chainloading legacy PXE; otherwise prefer HTTP boot/iPXE
- per-node boot scripts
- image cache for Talos installer assets
- BMC control for one-shot boot override
- hardware discovery pipeline
Network and fabric automation plan
Goal
The platform must be able to configure switches and routers as part of workload provisioning rather than treating networking as a manual prerequisite.
Scope for Nett
Nett should provide declarative management for:
- switch ports
- MLAG or equivalent fabric relationships
- VLAN creation
- trunk and access profiles
- routed uplinks
- BGP for service and tenant routing where needed
- VPN endpoints
- ACL/firewall intent
- public and private IP pools
- DNS records tied to services and ingress
Implementation approach
The first implementation should be based on declarative intent + renderer + driver:
- Sentral stores desired network intent.
- Nett validates it against topology and policy.
- Nett renders vendor-specific configuration.
- Drivers push config over SSH, API, or NETCONF/RESTCONF depending on vendor.
- Nett records realized state and drift.
Do not couple the whole platform to a single switch vendor. Model:
- intent in a vendor-neutral schema
- drivers per vendor or NOS
- topology in inventory
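The intent-plus-driver split above can be sketched as a vendor-neutral struct rendered by per-vendor drivers. The struct fields and the two command dialects are illustrative assumptions (loosely EOS-style and Junos-style syntax), not a committed Nett schema; a real driver would also push the rendered config over SSH, NETCONF/RESTCONF, or a vendor API and record realized state.

```go
package main

import "fmt"

// VLANIntent is a vendor-neutral fragment of network intent; the field
// names are illustrative, not a committed Nett schema.
type VLANIntent struct {
	ID   int
	Name string
	Port string
}

// Driver renders vendor-specific configuration from neutral intent.
type Driver interface {
	Render(v VLANIntent) []string
}

type eosDriver struct{}   // hypothetical EOS-style syntax
type junosDriver struct{} // hypothetical Junos-style syntax

func (eosDriver) Render(v VLANIntent) []string {
	return []string{
		fmt.Sprintf("vlan %d", v.ID),
		fmt.Sprintf("   name %s", v.Name),
		fmt.Sprintf("interface %s", v.Port),
		fmt.Sprintf("   switchport access vlan %d", v.ID),
	}
}

func (junosDriver) Render(v VLANIntent) []string {
	return []string{
		fmt.Sprintf("set vlans %s vlan-id %d", v.Name, v.ID),
		fmt.Sprintf("set interfaces %s unit 0 family ethernet-switching vlan members %s", v.Port, v.Name),
	}
}

func main() {
	intent := VLANIntent{ID: 210, Name: "tenant-a", Port: "Ethernet12"}
	drivers := map[string]Driver{"eos": eosDriver{}, "junos": junosDriver{}}
	for name, d := range drivers {
		fmt.Println("#", name)
		for _, line := range d.Render(intent) {
			fmt.Println(line)
		}
	}
}
```

Because validation and rendering happen against the neutral schema, adding a switch vendor means adding one driver, not touching every service that expresses network intent.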
Network architecture recommendation
Start with a simple and operable fabric:
- dedicated out-of-band management network
- dedicated provisioning network
- underlay network for node-to-node transport
- tenant VLANs initially, with VRFs when multi-tenant pressure requires them
- BGP-based routed edges for Kubernetes load balancers, VPN, and public services
- explicit inter-datacenter connectivity for replication, control-plane coordination, and failover traffic
- datacenter-aware routing and policy so that services can be placed or failed over across sites
This keeps the first version realistic while leaving room for EVPN/VXLAN later if scale demands it.
Multi-datacenter design
Dataverket should treat each datacenter as a first-class failure domain.
The minimum supported topology is:
- two or more datacenters
- independent local network fabric per datacenter
- controlled inter-datacenter links for control-plane and service replication traffic
- resource inventory tagged with datacenter, rack, and failure-domain metadata
The platform should support:
- placement of resources into a specific datacenter
- replication-aware service design across datacenters
- failover of selected services between datacenters
- both datacenters carrying active workloads at the same time
- active/passive failover at the service, project, or workload level where that keeps behavior understandable
Not every workload must be instantly multi-site, but the platform control plane, inventory, networking, and service APIs must all be datacenter-aware from the start.
NATS architecture plan
NATS should be the platform’s event bus and command backbone, not the long-term system of record.
Recommended usage:
- Core NATS subjects for commands and events
- JetStream for durable workflows
- Request/reply for synchronous service interactions where latency matters
- Key/value and object store only for small coordination data, not core inventory
- Intra-datacenter NATS transport as the standard internal communication path within each site
- Inter-datacenter NATS communication as the standard control-plane communication path between sites
NATS is an internal platform transport. Public clients should integrate through the Sentral-owned API and task resources rather than directly through NATS subjects.
The default model should therefore be:
- each datacenter runs NATS as its local control-plane backbone
- internal services use NATS for normal intra-site commands, events, and task signaling
- cross-site NATS is used for explicit coordination, replication signaling, and failover workflows
Event model
Each service should use a subject naming convention such as:
- dv.<service>.cmd.<action>
- dv.<service>.evt.<entity>.<verb>
- dv.task.evt.<verb>
Examples:
- dv.maskin.cmd.provision
- dv.nett.cmd.apply
- dv.plattform.evt.cluster.ready
- dv.database.evt.instance.failed
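Small helpers can make the subject convention mechanical instead of tribal. A Go sketch; the function names are assumptions:

```go
package main

import "fmt"

// CmdSubject builds a subject following dv.<service>.cmd.<action>.
func CmdSubject(service, action string) string {
	return fmt.Sprintf("dv.%s.cmd.%s", service, action)
}

// EvtSubject builds a subject following dv.<service>.evt.<entity>.<verb>.
func EvtSubject(service, entity, verb string) string {
	return fmt.Sprintf("dv.%s.evt.%s.%s", service, entity, verb)
}

func main() {
	fmt.Println(CmdSubject("maskin", "provision"))           // dv.maskin.cmd.provision
	fmt.Println(EvtSubject("plattform", "cluster", "ready")) // dv.plattform.evt.cluster.ready
}
```

Centralizing subject construction in a shared library also gives one place to enforce the convention in tests and to scope NATS subject authorization per service.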
Reliability model
- Desired state lives in PostgreSQL
- Workflow state lives in PostgreSQL plus JetStream
- Events are append-only facts for integration and orchestration
- Consumers must be idempotent
- Every command must carry correlation ID, tenant/project context, and actor identity
- Datacenter identity must be part of placement and routing decisions where site locality matters
Cross-site NATS should be designed around explicit coordination between sites, not around the assumption that every subject is globally shared with uniform semantics.
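The idempotency requirement can be sketched with a command envelope carrying the mandated context fields and a consumer that deduplicates on message ID. This is an in-memory illustration with hypothetical field names; a real JetStream consumer would persist processed IDs (or make the underlying operation naturally idempotent) so redelivery after restart is still safe.

```go
package main

import "fmt"

// Command is a minimal envelope carrying the fields every command must have;
// the field names are illustrative.
type Command struct {
	ID     string // correlation / message ID
	Tenant string
	Actor  string
	Action string
}

// Consumer applies each command at most once by remembering processed IDs:
// the property "consumers must be idempotent" requires under redelivery.
type Consumer struct {
	seen    map[string]bool
	applied int
}

func (c *Consumer) Handle(cmd Command) {
	if c.seen[cmd.ID] {
		return // duplicate delivery: acknowledge without re-applying
	}
	c.seen[cmd.ID] = true
	c.applied++
}

func main() {
	c := &Consumer{seen: map[string]bool{}}
	cmd := Command{ID: "task-42", Tenant: "acme", Actor: "ops-user", Action: "provision"}
	c.Handle(cmd)
	c.Handle(cmd) // JetStream may redeliver; the effect must not double
	fmt.Println(c.applied) // 1
}
```

With at-least-once delivery, dedup-on-ID (or an idempotent target operation) is what turns "the message arrived twice" into a non-event instead of a double provision.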
Failure handling baseline
The first implementation must define baseline behavior for:
- retry policy for transient failures
- dead-letter handling for poison messages
- task timeout and cancellation
- operator visibility into stuck workflows
- replay and reconciliation after service restart
- dependency-blocked workflows when a required internal service is unavailable
- degraded-mode status when work continues with reduced capability
Event-driven architecture without these controls is not acceptable for production infrastructure automation.
PostgreSQL data strategy
PostgreSQL is the default relational system of record, but it must not be treated as one implicit shared dependency.
The platform should assume:
- database-per-service or schema-per-bounded-context rather than one shared undifferentiated database
- explicit HA and failover design for control-plane PostgreSQL
- backup and restore as first-class operational requirements
- explicit replication posture for cross-site behavior rather than generic “HA” language
- bounded-context-specific RPO and RTO classes
- split-brain prevention and promotion safety as explicit design concerns
If Sentral’s relational state is unavailable, large parts of the control plane stop. That risk must be designed for directly.
The default architectural posture should be:
- local high availability may justify synchronous behavior inside a site or failure domain
- cross-site replication should default to asynchronous unless a specific bounded context justifies stronger consistency and accepts the latency and quorum tradeoff
- the platform should prefer temporary unavailability over ambiguous dual-primary behavior during partition
Upgrade and migration strategy
The platform must support controlled upgrades for:
- Sentral and other control-plane services
- NATS infrastructure
- PostgreSQL control-plane state
- Talos-based clusters and nodes
- internal APIs and message schema evolution
The baseline assumption should be:
- rolling or staged upgrades where possible
- explicit compatibility windows for API and event schema changes
- tested migration paths for control-plane state
- no dependence on flag-day upgrades across the whole platform
Disaster recovery scope
Disaster recovery must cover more than PostgreSQL alone.
The platform should eventually define recovery for:
- PostgreSQL control-plane state
- NATS JetStream durable workflow state
- inventory and approval state
- Talos machine and cluster configuration inputs
- switch and router rendered configuration history
- secrets backend state
Failover language and disaster recovery language must not be conflated. A service may fail over without full data recovery, and that distinction must remain explicit.
Inventory bootstrap and trustworthiness
Inventory quality is a critical dependency, especially for Nett and Maskin.
The first version should assume a hybrid bootstrap model:
- manually curated initial inventory for sites, racks, links, devices, and port topology
- auto-discovery where safe and useful, such as BMC discovery and hardware facts
- drift detection between declared inventory and observed state
- operator approval for sensitive topology changes before they become the trusted source of truth
Inventory cannot start as magic auto-discovery, and it cannot remain undocumented manual folklore.
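The drift-detection step can be sketched as a diff between declared and observed state. The `PortLink` schema and drift kinds below are illustrative; the real inventory model is a Phase 0 decision:

```go
package main

import "fmt"

// PortLink describes one declared or observed switch-port connection.
type PortLink struct {
	Device, Port, PeerDevice string
}

// Drift is one mismatch between declared inventory and observed state.
// Topology drift is reported for operator approval, never auto-applied.
type Drift struct {
	Kind string // "missing", "unexpected", or "changed"
	Link PortLink
}

// detectDrift compares declared and observed links keyed by device+port.
func detectDrift(declared, observed []PortLink) []Drift {
	key := func(l PortLink) string { return l.Device + "/" + l.Port }
	obs := map[string]PortLink{}
	for _, l := range observed {
		obs[key(l)] = l
	}
	var drifts []Drift
	seen := map[string]bool{}
	for _, d := range declared {
		seen[key(d)] = true
		o, ok := obs[key(d)]
		switch {
		case !ok:
			drifts = append(drifts, Drift{"missing", d})
		case o.PeerDevice != d.PeerDevice:
			drifts = append(drifts, Drift{"changed", o})
		}
	}
	for _, o := range observed {
		if !seen[key(o)] {
			drifts = append(drifts, Drift{"unexpected", o})
		}
	}
	return drifts
}

func main() {
	declared := []PortLink{{"sw1", "eth1", "node-a"}, {"sw1", "eth2", "node-b"}}
	observed := []PortLink{{"sw1", "eth1", "node-a"}, {"sw1", "eth2", "node-c"}}
	for _, d := range detectDrift(declared, observed) {
		fmt.Printf("%s: %s/%s -> %s\n", d.Kind, d.Link.Device, d.Link.Port, d.Link.PeerDevice)
	}
}
```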
API, SDK, and CLI plan
Public control surface
Build one canonical control plane API in Sentral. Everything else should layer on top:
- REST or gRPC public API
- generated SDKs
- `dv` CLI family
- optional UI later
The public surface should describe platform offerings and control-plane concepts. It should not expose internal service boundaries one-to-one.
Recommended shape
- Start with a versioned HTTP API for operator ergonomics
- Use OpenAPI as the contract
- Generate SDKs from the API spec
- Keep async operations modeled as tasks with status polling in v1
- Include rate limiting, throttling, and abuse protection as part of the public API baseline
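The "async operations as tasks with status polling" model can be sketched with a small state machine. The states, fields, and endpoint path mentioned in the comments are assumptions for illustration, not a committed contract:

```go
package main

import "fmt"

// TaskState is the lifecycle of an async control-plane operation in the
// polling model proposed above.
type TaskState string

const (
	Pending   TaskState = "pending"
	Running   TaskState = "running"
	Blocked   TaskState = "blocked" // waiting on a dependency or approval
	Succeeded TaskState = "succeeded"
	Failed    TaskState = "failed"
)

// Task is the resource a client polls; the create call returns its ID.
type Task struct {
	ID     string
	State  TaskState
	Reason string // populated for blocked and failed states
}

// terminal reports whether polling can stop.
func terminal(s TaskState) bool { return s == Succeeded || s == Failed }

func main() {
	// A client would poll something like GET /v1/tasks/{id} until the
	// state is terminal; here the poll responses are simulated in memory.
	responses := []Task{
		{"task-42", Pending, ""},
		{"task-42", Blocked, "waiting for network net-1"},
		{"task-42", Running, ""},
		{"task-42", Succeeded, ""},
	}
	for _, t := range responses {
		fmt.Printf("%s: %s %s\n", t.ID, t.State, t.Reason)
		if terminal(t.State) {
			break
		}
	}
}
```

Making `blocked` a first-class, reason-carrying state is what lets tenants see dependency and approval gates instead of watching a task spin forever.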
API evolution should follow an explicit compatibility policy:
- additive changes are allowed within a major version when they preserve existing behavior
- breaking changes require a new major version
- supported major versions should overlap for a defined migration window
- SDK and CLI compatibility should track supported API majors explicitly
Internal NATS events remain important for orchestration and operator correlation, but they are not the required external client contract in v1.
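The compatibility policy reduces to a simple check on the server side. The concrete majors and the `accept` helper are illustrative:

```go
package main

import "fmt"

// supportedMajors is the set of API major versions currently inside their
// overlap window; the versions shown are examples only.
var supportedMajors = map[int]bool{1: true, 2: true}

// accept reports whether a client-requested major version is served.
// Additive changes never move a client off its major; only breaking
// changes introduce a new entry here, and an old major is retired only
// after its published migration window ends.
func accept(requestedMajor int) (bool, string) {
	if supportedMajors[requestedMajor] {
		return true, fmt.Sprintf("serving /v%d", requestedMajor)
	}
	return false, fmt.Sprintf("major v%d is outside the supported window", requestedMajor)
}

func main() {
	for _, m := range []int{1, 2, 3} {
		ok, msg := accept(m)
		fmt.Println(ok, msg)
	}
}
```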
Resource protection baseline
Tenant-facing APIs and control operations must support:
- rate limiting
- throttling of expensive operations
- concurrency controls for destructive or high-impact workflows
- protection against noisy-neighbor behavior at the API layer
The platform must not assume all clients are well-behaved.
Usage accounting and quotas
The platform still needs usage attribution and quota signals, but commercial billing is deferred until the core control plane is stable.
The near-term priority is:
- quota enforcement inputs
- operator-visible usage accounting
- reliable resource attribution to tenant, project, and environment
Full billing architecture should follow later, not shape the earliest bounded contexts.
Image and artifact lifecycle
The platform needs explicit lifecycle ownership for:
- Talos installer and upgrade images
- VM base images
- container images and registry content
- provider artifacts used during provisioning and recovery
Harbor is only one part of the artifact story. Image provenance, promotion, replication between datacenters, and retirement policy must all be designed.
Data residency and placement policy
Because Dataverket is multi-datacenter, the platform must support policy around where data and workloads may live.
The architecture should support:
- placing workloads in a specific datacenter
- restricting data-bearing services to allowed sites
- making failover behavior respect placement and residency rules
- surfacing these constraints through APIs and operator tooling
At the public contract level this should eventually include desired placement policy, current site, and policy-compliant failover targets where relevant.
Not every tenant will need data residency guarantees, but the platform should be able to express them.
Capacity planning and overcommit
Quotas alone are not enough. The platform must also define:
- placement policy under capacity pressure
- whether and where CPU or memory overcommit is allowed
- storage overcommit policy
- spread versus bin-packing defaults
- how site-level scarcity is surfaced to operators and users
These decisions directly affect placement, availability, and billing trustworthiness.
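The interaction of overcommit policy and admission can be sketched as follows. The ratios and field names are illustrative defaults, not decided policy:

```go
package main

import "fmt"

// siteCapacity models one site's schedulable resources under an explicit
// overcommit policy.
type siteCapacity struct {
	physicalCPU     float64
	cpuOvercommit   float64 // e.g. 4.0 means 4 vCPU per physical core
	allocatedVCPU   float64
	physicalMemGiB  float64 // memory is deliberately not overcommitted here
	allocatedMemGiB float64
}

// admit checks a placement request against effective capacity and returns
// the scarcity reason operators and users would see on rejection.
func (s siteCapacity) admit(vcpu, memGiB float64) (bool, string) {
	if s.allocatedVCPU+vcpu > s.physicalCPU*s.cpuOvercommit {
		return false, "cpu capacity exhausted at current overcommit ratio"
	}
	if s.allocatedMemGiB+memGiB > s.physicalMemGiB {
		return false, "memory capacity exhausted (no overcommit)"
	}
	return true, ""
}

func main() {
	site := siteCapacity{physicalCPU: 64, cpuOvercommit: 4,
		allocatedVCPU: 250, physicalMemGiB: 512, allocatedMemGiB: 500}
	ok, why := site.admit(8, 16) // 258 vCPU requested against a 256 vCPU ceiling
	fmt.Println(ok, why)
}
```

Surfacing the rejection reason is the "site-level scarcity" signal the list above calls for; silent placement failure would undermine billing and availability trust.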
CLI mapping
The current accepted CLI names fit this service layout:
- `dv`: projects, environments, quotas, modules, tasks
- `dvid`: login, tokens, identity
- `dvce`: bare metal, VM, hardware inventory
- `dvke`: cluster lifecycle
- `dvapp`: app and web hosting lifecycle
- `dvos`: buckets and object storage access
- `dvnet`: networks, VLANs, VPN, addresses
Product surface plan
The first service offerings should be staged rather than launched all at once.
Phase 1 offerings
- bare-metal inventory and provisioning
- managed Kubernetes clusters
- VM instances
- Harbor-backed image registry
- PostgreSQL service
- S3-compatible object storage
- VLAN-backed logical networks
These are public offerings. They should be distinguished from the internal services needed to deliver them.
Phase 2 offerings
- MySQL and Redis
- web hosting / app deployment
- VPN products for tenant ingress and admin access
- managed load balancers
- inter-datacenter placement and failover primitives
Phase 3 offerings
- private service mesh or east-west connectivity products
- self-service routed tenant networks
- advanced policy products
- cross-region or multi-site federation
Tenant onboarding and self-service flow
The platform should define one supported path from first login to first working workload.
That flow should be API-first and CLI-friendly, and should avoid undocumented operator intervention except where policy explicitly requires approval.
Onboarding goals
A tenant administrator should be able to:
- authenticate through Identitet and obtain a supported API or CLI session
- discover the tenants, projects, and roles they are allowed to manage
- create or access an initial project
- create an initial environment where the product requires environment scoping
- see which platform offerings are enabled for that tenant or project
- create the minimum prerequisite resources for a first workload
- launch a first workload and inspect its resulting tasks, status, and dependencies
Recommended first-use sequence
The intended v1 onboarding sequence is:
- The actor authenticates through `dvid login` or the equivalent OIDC-backed API flow.
- Sentral resolves tenant and role context from Identitet.
- The actor lists accessible tenants and selects one administrative scope.
- The actor creates or opens a project under that tenant.
- The actor creates one or more environments if the intended product is environment-scoped.
- The actor queries which offerings, limits, placement rules, and required prerequisites apply to that project.
- The actor creates the first prerequisite resources such as a logical network, object bucket, or database where the workload requires them.
- The actor creates the first workload resource such as a VM, Kubernetes cluster, or application deployment.
- The actor follows task state and dependency status through the public API or CLI until the resource becomes usable or enters a visible blocked state.
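The sequence above can be expressed against a thin client interface. Everything here is hypothetical: the `ControlPlane` interface is not the real SDK, and the in-memory `fake` only exists to make the flow demonstrable offline:

```go
package main

import "fmt"

// ControlPlane is a hypothetical slice of the public API used by the v1
// onboarding sequence; method names mirror the steps above.
type ControlPlane interface {
	Login() (actor string, err error)
	ListTenants(actor string) []string
	CreateProject(tenant, name string) string
	CreateEnvironment(project, name string) string
	CreateNetwork(env, name string) string
	CreateVM(env, network, name string) (taskID string)
	TaskState(taskID string) string
}

// onboard walks the recommended first-use sequence and returns the final
// task state of the first workload.
func onboard(cp ControlPlane) (string, error) {
	actor, err := cp.Login()
	if err != nil {
		return "", err
	}
	tenant := cp.ListTenants(actor)[0]
	project := cp.CreateProject(tenant, "first-project")
	env := cp.CreateEnvironment(project, "dev")
	net := cp.CreateNetwork(env, "net-1")
	task := cp.CreateVM(env, net, "vm-1")
	return cp.TaskState(task), nil
}

// fake is an in-memory stand-in so the sequence runs without a platform.
type fake struct{}

func (fake) Login() (string, error)               { return "alice", nil }
func (fake) ListTenants(string) []string          { return []string{"acme"} }
func (fake) CreateProject(_, n string) string     { return n }
func (fake) CreateEnvironment(_, n string) string { return n }
func (fake) CreateNetwork(_, n string) string     { return n }
func (fake) CreateVM(_, _, _ string) string       { return "task-1" }
func (fake) TaskState(string) string              { return "succeeded" }

func main() {
	state, _ := onboard(fake{})
	fmt.Println("first workload task:", state)
}
```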
Minimum self-service baseline
The first self-service-capable platform release should support at least:
- tenant and project discovery for an authenticated actor
- project creation or clearly modeled operator-provided project bootstrap
- environment creation where relevant
- visibility into enabled offerings and prerequisite requirements
- self-service creation of at least one networked workload path
- task visibility for provisioning progress, dependency blocking, and policy failure
Self-service does not mean every tenant can create every resource without policy. It means the supported path and any approval points are explicit in the API and task model.
Prerequisite discovery
The public API should make it possible to discover:
- whether the actor may create projects or only operate within existing ones
- whether an environment is required for a given product
- whether network creation is self-service or operator-gated
- which quotas, placement rules, and product enablement flags apply
- which dependencies must exist before a given workload request can succeed
Onboarding is not complete if tenants can create a VM request but cannot tell that a network, environment, or quota prerequisite is missing until deep in a failing workflow.
Operator-assisted onboarding
Some environments will require operator approval or operator-created baseline resources before full self-service is available.
That is acceptable, but the architecture should model it explicitly:
- operator-created tenant records
- operator-approved project bootstrap
- pre-provisioned shared networks or storage classes
- policy gates for high-impact offerings
These should appear as normal resources, approvals, and task states rather than as out-of-band support activity.
First working stack
The first end-to-end self-service story should be intentionally narrow.
A practical v1 target is:
- create project
- create environment
- create logical network or use a pre-approved shared network
- create VM or Kubernetes cluster
- optionally create object bucket or database binding
- observe resulting tasks and final usable endpoints
Once that path works coherently, broader onboarding for apps, databases, and advanced networking can extend the same model.
Proposed delivery roadmap
Phase 0: foundation
Phase 0 should not be treated as a flat list. Several items are prerequisites for others and should be delivered in dependency order.
Phase 0 dependency order
Service and resource foundations
- Accept service taxonomy, especially whether Nett is formalized
- Define API skeleton and core resource model before domain services harden their internals
- Define system of record domains and database boundaries
Inventory and failure-domain model
- Define hardware inventory schema
- Define datacenter and failure-domain inventory model
Control-plane communication and state
- Define NATS subject taxonomy and event envelope
- Define control-plane PostgreSQL HA and backup baseline
- Define observability baseline for logs, metrics, tracing, and task visibility
- Define secrets and certificate lifecycle baseline
Platform substrate selection
- Choose first supported switch/router vendors
- Choose first hypervisor or VM substrate
- Define first block-storage platform selection criteria
- Define initial block-storage backend shortlist
Phase 0 dependency notes
- Service taxonomy should precede stable API, subject, and ownership naming.
- The API skeleton and resource model should precede domain-specific internals so that tenancy, task, and public resource semantics do not drift.
- System-of-record boundaries should be set before service databases and orchestration behavior harden.
- Hardware inventory schema should precede the broader datacenter and failure-domain model because topology, placement, and site metadata build on inventory primitives.
- Inventory and datacenter modeling should precede vendor selection for networking and substrate choices where those choices depend on topology, placement, and failure-domain assumptions.
- Block-storage selection criteria and the backend shortlist should be available before metal and platform-service implementations harden around accidental local-disk or hypervisor-default assumptions.
- NATS, PostgreSQL, observability, and secrets are shared control-plane foundations and should be defined before Phase 1 implementation starts.
Phase 0 critical path
The minimum blocking chain into Phase 1 is:
- formalize service taxonomy
- define API and resource model
- define system-of-record boundaries
- define inventory and datacenter model
- define control-plane communication and state baselines
Vendor and substrate selection are still Phase 0 work, but they should follow the earlier architectural constraints rather than drive them prematurely.
The roadmap phases below are capability waves, not strict finish-to-start gates for every work item. Several tracks must progress in parallel, and later platform capabilities depend on partial results from multiple earlier waves.
In particular:
- metal, network, and storage work are parallel enabling tracks for platform services rather than isolated sequential product phases
- Kubernetes, VM, and database offerings depend on progress across several tracks at once
- CLI and SDK work should begin as soon as the first public API slice exists so that API ergonomics are validated continuously
Phase 1: control plane skeleton and API feedback loop
- Publish the first Sentral API surface for tenants, projects, environments, tasks, inventory, and placement inputs
- Implement `dvid login`
- Implement task tracking, audit baseline, and first generated Go SDK
- Implement early `dv`, `dvce`, `dvke`, and `dvnet` flows in parallel with the API
- Stand up NATS, PostgreSQL, and observability as shared control-plane foundations
- Expose inventory bootstrap workflows and operator-only APIs
- Define the first operator control surface for tasks, health, drift, and datacenter visibility
- Publish the first block-storage selection-criteria ADR and backend shortlist decision inputs
The CLI work in this phase is not polish deferred until the end. It is part of the primary feedback loop for validating API shape, task semantics, auth flows, and operator ergonomics.
Phase 2: metal enablement track
- Build provisioning network
- Implement BMC inventory and power control in Maskin
- Implement iPXE boot service
- Automate Talos installation to local disk
- Establish management Kubernetes cluster on Talos
- Integrate accepted break-glass workflow
Phase 3: network enablement track
- Implement Nett intent model
- Add switch port and VLAN automation
- Add IPAM and DNS workflow
- Add router module for L3 gateways and public edge
- Add VPN service primitives
- Add inter-datacenter network connectivity and routing model
Phase 4: storage and persistence track
- Deliver the selected first block-storage and persistent-volume strategy
- Define VM disk lifecycle and backup/recovery baseline
- Define database storage and replication baseline
Phase 5: platform service assembly
This phase depends on usable outputs from the metal, network, and storage tracks. It should not be read as meaning those tracks are fully complete before service work begins.
- Bootstrap Harbor
- Deliver object storage service
- Deliver PostgreSQL service
- Deliver Kubernetes cluster service
- Deliver VM service
Phase 6: higher-level tenant and developer products
- Add audit, usage, and task inspection
- Expose datacenter-aware placement and failover controls
- App hosting
- managed databases beyond PostgreSQL
- private networking products
- higher-level developer workflows
Immediate architecture decisions still needed
The repository does not yet answer these questions, and they should become ADRs before deep implementation starts:
- Is Nett an accepted top-level service name, matching `dvnet`?
- Which switch and router vendors are in scope for v1?
- What is the VM substrate for Maskin: KVM/libvirt, Proxmox, VMware, or something else?
- Which database engines are in scope for v1?
- Is Harbor a standalone service or a module under Tjeneste or Objekt?
- Will the public API be REST, gRPC, or both?
- What is the tenancy model: project-only, org/project, or org/project/environment?
- What IPAM and DNS source of truth will be used?
- Will load balancing be BGP-based, proxy-based, or both?
- Which VPN technologies are in scope: WireGuard, IPsec, OpenVPN, or a mix?
- Which services must support cross-datacenter failover in v1?
- Which service types can be active in both datacenters simultaneously, and which must remain active/passive per workload?
- What is the first block-storage and CSI strategy?
- What is the daily secrets-management and certificate-rotation model?
- What is the HA and backup strategy for control-plane PostgreSQL?
- What is the first observability stack?
- What is the first secrets backend?
- What is the first block-storage backend shortlist?
- What are the default overcommit and placement policies?
Recommended next documents
To turn this plan into an executable architecture, the next ADRs should be:
- `003-development-and-test-environments.md`
- `004-implementation-language-guidance.md`
- `005-network-service-and-topology.md`
- `006-bare-metal-provisioning-with-ipxe-and-talos.md`
- `007-nats-subject-and-event-envelope.md`
- `008-public-api-style.md`
- `009-resource-inventory-and-tenancy-model.md`
- `010-supported-network-vendors.md`
- `011-vm-runtime-selection.md`
- `012-inter-datacenter-topology-and-failover.md`
- `015-storage-platform-and-persistence-strategy.md`
- `016-secrets-and-certificate-lifecycle.md`
- `017-observability-and-operations-baseline.md`
- `018-postgresql-control-plane-ha-and-backup.md`
- `019-workflow-retry-dead-letter-and-reconciliation.md`
- `020-sentral-internal-decomposition.md`
- `021-inventory-bootstrap-and-drift-management.md`
- `022-operator-visibility-and-control-surface.md`
- `023-identity-and-access-model.md`
- `024-platform-security-strategy.md`
- `025-upgrade-and-migration-strategy.md`
- `026-end-to-end-disaster-recovery.md`
- `027-api-rate-limiting-and-resource-protection.md`
- `028-image-and-artifact-lifecycle.md`
- `029-data-residency-and-placement-policy.md`
- `030-capacity-planning-and-overcommit-policy.md`
- `031-service-to-service-transport-security.md`
Summary
The shortest defensible path is:
- use Talos on local disks, provisioned by PXE/iPXE
- add Nett as a first-class network automation service
- use NATS JetStream for orchestration, PostgreSQL for desired state
- make datacenters explicit failure domains and use NATS as the standard communication path between them
- allow both datacenters to carry live workloads, while keeping failover active/passive per service or workload where needed
- treat storage, observability, secrets, and PostgreSQL availability as platform prerequisites rather than later add-ons
- expose one canonical API and generate SDK/CLI on top
- launch the platform in layers: control-plane foundation, metal, network, storage, platform products, then tenant-facing products
That sequence keeps complexity bounded while still matching the long-term platform ambition.