Status
This document is a working platform plan derived from the accepted decisions in this repository and the current target architecture described by the team.
It is intentionally opinionated where the repository is silent, and those assumptions are called out as proposed decisions rather than accepted ones.
Inputs from existing repository decisions
The current repository establishes the following baseline:
- The parent system is Dataverket.
- The current accepted service names are:
  - Sentral: control plane, orchestration
  - Maskin: compute, VMs and bare metal
  - Plattform: Kubernetes platform
  - Identitet: identity and access management
  - Tjeneste: application deployment
  - Objekt: object storage
- The accepted CLI structure is a dv command family: dv, dvid, dvce, dvke, dvapp, dvos, dvnet
- The accepted out-of-band access model is:
  - ZITADEL + Teleport for normal access
  - dormant break-glass via hardware-rooted CA
- Talos integration is explicitly expected
Problem statement
Dataverket needs to become a full datacenter automation platform that can:
- Install and control bare-metal servers
- Manage switching, VLANs, L3 gateways, VPN, and logical tenant networks
- Support two or more datacenters with failure handling and service failover between them
- Provision Kubernetes clusters, VMs, databases, object storage, container registry, and web hosting
- Expose all capabilities through an API, SDKs, and CLI tools
- Use NATS as the backbone for inter-service communication
Platform principles
The plan should follow these design principles:
API-first: Every platform capability must exist as an internal API before it becomes a CLI command or UI feature.
Event-driven orchestration: Commands produce desired state and lifecycle events on NATS; workers converge infrastructure toward that state.
Hardware-aware control plane: The platform must understand BMCs, PXE/iPXE, switches, routers, hypervisors, and Talos node lifecycle.
Stateless node operating model: Bare-metal servers should be treated as disposable and re-installable. Local disk may exist, but platform correctness must not depend on artisanal node state.
Strong tenancy boundaries: Projects, environments, RBAC, network segmentation, secrets, and auditability must be built in from day one.
Sovereign operations: Avoid dependencies on hosted control planes for core operation. External services may exist at the edge, not in the platform core.
Multi-datacenter by design: The platform should model datacenters as first-class failure domains and support operation across two or more datacenters from the start.
Proposed target architecture
The architecture should distinguish clearly between:
Internal platform services: Control-plane and execution services Dataverket runs to operate the platform itself.
Platform offerings: The products and managed resource types Dataverket exposes to operators and tenants through the public API.
Those two layers are related, but they are not the same thing. Internal service boundaries should not be exposed blindly as public product boundaries.
The design must also distinguish between full availability, degraded operation, and dependency-blocked operation. Cross-service failure behavior should be designed explicitly rather than delegated to retries alone.
Control plane services
Sentral
Sentral is the system of record and orchestrator. It should own:
- projects, tenants, environments, quotas
- inventory and resource graph
- reconciliation workflow definitions
- audit log and task tracking
- public API entrypoint
Sentral should store desired state in PostgreSQL and publish commands/events on NATS JetStream.
Sentral should also understand datacenter placement, failure domains, and resource locality so that scheduling and failover decisions can span more than one site.
Sentral must not become a single undifferentiated monolith. Internally it should be treated as a set of bounded contexts:
- tenancy, projects, and environments
- inventory and resource graph
- task orchestration
- audit and policy-facing event history
- public API gateway and resource facade
Those areas may initially ship together, but they should have explicit internal boundaries and a path to later extraction if one part outgrows the others.
The inventory context should have its own PostgreSQL schema boundary from day one, even if Sentral still ships as one process initially. The API facade should compose over that boundary rather than sharing mutable tables freely.
Sentral also needs an explicit consistency model:
- optimistic concurrency for ordinary resource edits
- transactional reservation semantics for scarce allocations
- task-driven saga orchestration across service boundaries
Without that, asynchronous workflows and PostgreSQL-backed desired state still leave core race conditions undefined.
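The optimistic-concurrency part of that model can be sketched in Go. This is a minimal illustration against an in-memory map standing in for PostgreSQL; the type and method names are assumptions, not a committed Sentral API, and a real implementation would express the version check as `UPDATE ... WHERE id = $1 AND version = $2` and inspect the affected row count.

```go
package main

import (
	"errors"
	"fmt"
)

// Resource is a hypothetical Sentral-managed record with a version counter.
type Resource struct {
	Name    string
	Spec    string
	Version int
}

var ErrConflict = errors.New("version conflict: resource was modified concurrently")

// Store stands in for a PostgreSQL table in this sketch.
type Store struct {
	resources map[string]Resource
}

// Update applies an edit only if the caller's expected version still matches:
// the optimistic-concurrency check for ordinary resource edits.
func (s *Store) Update(name, newSpec string, expectedVersion int) error {
	r, ok := s.resources[name]
	if !ok {
		return fmt.Errorf("resource %q not found", name)
	}
	if r.Version != expectedVersion {
		return ErrConflict
	}
	r.Spec = newSpec
	r.Version++
	s.resources[name] = r
	return nil
}

func main() {
	s := &Store{resources: map[string]Resource{
		"net-a": {Name: "net-a", Spec: "vlan=100", Version: 1},
	}}
	// First writer succeeds and bumps the version.
	fmt.Println(s.Update("net-a", "vlan=101", 1)) // <nil>
	// A second writer holding the stale version must be rejected, not silently win.
	fmt.Println(s.Update("net-a", "vlan=102", 1)) // version conflict
}
```

Scarce allocations (IP pools, bare-metal reservations) need the stronger transactional reservation semantics instead, because last-writer-rejected is not enough when two requests both fit individually but not together.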
Identitet
Identitet should provide:
- user, group, and machine identity
- OIDC via ZITADEL
- service-to-service identity
- token issuance for CLI and SDKs
- RBAC and authorization inputs
- operator and service authentication policy
Teleport remains the operational access path for humans and privileged workflows.
Identitet is under-specified in the current document set and should become an explicit ADR early. Identity, authentication, authorization, and service-to-service trust are too central to remain implicit.
Nett
The CLI ADR already introduces dvnet, but the service naming ADR does not yet define a corresponding service. This plan proposes Nett as an explicit platform service.
Nett should be split into two categories:
Infrastructure-side Nett: Fabric and device-facing automation for switches, routers, underlay, and inter-datacenter connectivity.
Provider-side Nett: Tenant-facing network products and abstractions built on top of the fabric.
The infrastructure-side part should further decompose into subdomains such as:
- fabric topology and port intent
- L2 segmentation
- L3 routing and edge policy
- inter-datacenter transport
- rollout safety and drift detection
Nett should own:
- switch configuration generation and rollout
- VLAN and VRF lifecycle
- routed tenant networks
- load balancer IP pools
- VPN ingress/egress
- firewall policy intent
- IPAM and DNS integration
This service is required if compute, Kubernetes, and tenant networking are to be automated coherently.
The first version should prioritize infrastructure-side Nett before expanding the full provider-side surface.
Language and implementation guidance
The platform does not require a single language everywhere, but language choice should be deliberate.
Go
Advantages:
- strong fit for networked services, CLIs, controllers, and infrastructure tooling
- simple deployment and operational model
- mature ecosystem for Kubernetes, OpenAPI, and service development
Likely uses:
- Sentral APIs and control-plane services
- Nett and Maskin control loops
- CLI tools
- operator-facing APIs and service daemons
Rust
Advantages:
- strong correctness and memory-safety properties
- good fit where reliability and performance matter strongly
- attractive for security-sensitive or high-throughput components
Tradeoffs:
- steeper learning curve
- slower iteration for some teams and tooling flows
Likely uses:
- selected performance- or correctness-critical services
- protocol-heavy components
- security-sensitive agents or infrastructure workers
Python
Advantages:
- fast iteration
- strong automation ecosystem
- useful for glue code, discovery, validation, and operational tooling
Tradeoffs:
- weaker deployment discipline if allowed to sprawl
- easier to accumulate inconsistent runtime behavior
Likely uses:
- prototypes
- discovery scripts and validation jobs
- operational tooling and migration utilities
Recommended posture:
- default to Go for core control-plane services and CLIs
- use Rust selectively where the additional rigor is justified
- use Python intentionally for tooling, automation, and experimentation rather than as the default for all long-lived services
Development and local test strategy
The platform must be developable without requiring a full physical datacenter for every change.
The development model should include:
- local or CI-driven NATS and PostgreSQL
- simulated or containerized service dependencies
- virtualized Talos and VM test environments where possible
- multi-site simulation for partition, failover, and recovery scenarios
- lab environments for device and fabric integration testing
- clear separation between unit tests, integration tests, and hardware-in-the-loop tests
The local development story should allow most API, task, inventory, and reconciliation work to be tested without access to production hardware.
Practical development environments may include:
Docker Compose: For Sentral, NATS, PostgreSQL, supporting services, and most API- and workflow-level development.
Single-node Proxmox or equivalent virtualization host: For VM lifecycle, Talos-in-VM testing, network attachment experiments, and operator workflow validation on one machine.
Multi-site simulation environment: For cross-site partition testing, failover exercises, delayed-message behavior, and recovery drills that cannot be represented honestly in a single-site environment.
Small physical lab: Such as a single-board-computer cluster or a small number of low-cost nodes for hardware-adjacent provisioning, discovery, and failure testing.
Recommended posture:
- Docker Compose should be enough for most control-plane development
- a single-machine virtualization lab should cover most VM and cluster lifecycle testing
- a dedicated multi-site simulation layer should cover partition and failover behavior before any production multi-datacenter claims are trusted
- a small physical lab should be reserved for hardware-facing validation and edge cases that cannot be trusted in pure virtualization
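A Compose stack for control-plane development could look like the following sketch. The service names, image tags, ports, and environment variables are illustrative assumptions, not pinned platform choices; the one deliberate detail is enabling JetStream on NATS so durable workflow behavior can be exercised locally.

```yaml
# Sketch of a local control-plane stack; values are illustrative only.
services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: devonly        # local development only
    ports:
      - "5432:5432"
  nats:
    image: nats:2
    command: ["-js"]                    # enable JetStream for durable workflows
    ports:
      - "4222:4222"
  sentral:
    build: ./sentral                    # hypothetical local service build
    environment:
      DATABASE_URL: postgres://postgres:devonly@postgres:5432/postgres
      NATS_URL: nats://nats:4222
    depends_on:
      - postgres
      - nats
```

This is deliberately not production posture: no TLS, no persistence tuning, no HA. Its only job is to make API, task, and reconciliation work testable on a laptop.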
Maskin
Maskin should own:
- hardware inventory
- BMC integration and power control
- PXE/iPXE workflows
- Talos machine provisioning
- VM lifecycle
- bare-metal reservation and allocation
Maskin is the bridge between physical infrastructure and higher-level services.
For workflows that depend on network state, Maskin should distinguish between:
- work that requires fresh or validated Nett changes before provisioning can start
- work that may proceed using already-approved and already-realized network state
- work that must enter a dependency-blocked state when network correctness cannot be validated
Plattform
Plattform should own:
- lifecycle of Kubernetes management and workload clusters
- Talos cluster bootstrap and upgrades
- CNI, CSI, ingress, and policy defaults
- cluster classes / templates
- managed Kubernetes tenant offering
Kubernetes tenant isolation posture
The platform must be explicit about whether tenants receive dedicated clusters, shared-cluster isolation, or both.
The recommended v1 posture is:
- dedicated Kubernetes clusters are the default tenant-facing offering for workloads that require strong isolation or clearer operational boundaries
- shared clusters, if introduced, should be treated as a later or explicitly limited offering rather than the default assumption
- namespace isolation inside shared clusters is not by itself equivalent to tenant isolation for all security and operational purposes
This keeps the first Kubernetes product honest about isolation guarantees and avoids overcommitting Plattform to a high-complexity multi-tenant shared-cluster story too early.
Dedicated-cluster default
In v1, a managed Kubernetes cluster should normally be a project- or environment-scoped cluster with its own control plane and worker allocation model.
This gives Dataverket:
- clearer security boundaries
- simpler quota and capacity reasoning
- cleaner upgrade and failure-domain handling
- fewer cross-tenant policy interactions inside the same cluster
Shared-cluster caution
If Dataverket later offers shared clusters, that should be a distinct product posture with additional requirements such as:
- strong namespace and network policy isolation
- admission and policy enforcement suitable for multi-tenant clusters
- careful resource quota and noisy-neighbor controls
- explicit documentation that the isolation model differs from dedicated clusters
The platform should not imply that namespace scoping alone gives the same isolation properties as dedicated clusters.
Tjeneste
Tjeneste should own:
- app deployment workflows
- web hosting primitives
- service templates
- runtime binding to databases, secrets, object storage, and networking
Objekt
Objekt should own:
- S3-compatible object storage lifecycle
- buckets, access policies, quotas
- object storage integration for apps and platform internals
Additional proposed services
- Register: Harbor-backed container registry service
- Database: managed PostgreSQL, MySQL, and possibly Redis
- Logg or Observasjon: logs, metrics, traces, alerting
These can start as modules inside Sentral or Tjeneste, but should become explicit services once lifecycle complexity grows.
Internal services versus platform offerings
The first implementation should keep this distinction explicit:
Internal platform services
These are implementation and control-plane domains:
- Sentral
- Identitet
- Maskin
- Plattform
- Nett
- Objekt
- later internal services such as Register, Database, or Observasjon if they need their own lifecycle
These services are responsible for orchestration, execution, inventory, policy enforcement, and operating shared infrastructure.
Platform offerings
These are the resource types and managed capabilities exposed through the public API:
- projects and environments
- tasks and audit visibility
- virtual machines
- bare-metal allocations where offered
- Kubernetes clusters
- logical networks
- buckets and object storage access
- databases
- application deployment products
- image registry access where offered
An internal service may back several offerings, and one offering may depend on several internal services. For example, a managed Kubernetes cluster is a public offering even though it depends on Sentral, Plattform, Maskin, Nett, Identitet, and storage components internally.
Cross-service dependency and degraded-mode model
Internal services must document dependencies per workflow, not only per service.
The architecture should classify dependencies at least as:
Hard dependency: The workflow cannot proceed safely or correctly if the dependency is unavailable.
Soft dependency: The workflow may continue in reduced mode, with some capability delayed or temporarily unavailable.
Reconciliation dependency: The initial request may be accepted, but final convergence depends on the dependency returning later.
This classification should be explicit in task handling, operator visibility, and service implementation.
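The mapping from dependency class and dependency health to task state can be made explicit in code rather than left implicit in retry loops. A minimal Go sketch, where the class names follow the list above and the task-state vocabulary is a hypothetical example:

```go
package main

import "fmt"

// DependencyClass mirrors the hard / soft / reconciliation split above.
type DependencyClass int

const (
	Hard           DependencyClass = iota // workflow must not start or continue
	Soft                                  // workflow continues in reduced mode
	Reconciliation                        // accepted now, converges when dependency returns
)

// TaskState is a hypothetical operator-visible status vocabulary.
type TaskState string

const (
	Running           TaskState = "running"
	Degraded          TaskState = "degraded"
	DependencyBlocked TaskState = "dependency-blocked"
)

// Classify decides what an unavailable dependency means for a task:
// the explicit mapping this section asks for, instead of uniform retries.
func Classify(class DependencyClass, depHealthy bool) TaskState {
	if depHealthy {
		return Running
	}
	switch class {
	case Hard:
		return DependencyBlocked
	case Soft:
		return Degraded
	default: // Reconciliation: accept the request, converge later
		return Running
	}
}

func main() {
	fmt.Println(Classify(Hard, false)) // dependency-blocked
	fmt.Println(Classify(Soft, false)) // degraded
}
```

The point of making this a declared table per workflow is that the task system can report "dependency-blocked on Nett" instead of burning a retry budget against a dependency that cannot recover mid-retry.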
Degraded-mode requirements
For important workflows, the platform should define:
- which dependencies must be healthy before work starts
- which dependencies may fail after work starts without corrupting state
- whether the workflow pauses, fails, or continues in reduced mode
- what task state and operator-visible reason is exposed
- what reconciliation path completes the work later
Retries are not enough. The platform must know whether retrying is meaningful while the dependency remains unavailable.
Example: Maskin depending on Nett
Maskin should not provision a server onto unknown or unvalidated network state merely because a retry budget exists.
The intended behavior should be:
- if a server requires a new VLAN, routed attachment, or other fresh Nett action, the workflow is blocked until Nett validates and applies that prerequisite
- if the server can use already-approved and already-realized network state, Maskin may proceed and record that it relied on existing network configuration
- if Nett is unavailable and network correctness cannot be proven, the task must become visibly dependency-blocked rather than silently thrashing
The same pattern should apply across other boundaries such as Plattform to Maskin, Tjeneste to Objekt, and Sentral to execution services.
Storage platform direction
Storage is a first-class platform concern and cannot remain implicit inside VM, Kubernetes, or database services.
The platform needs at least three storage classes:
Object storage: Owned by Objekt for S3-compatible data and artifacts.
Block storage: Required for VMs, databases, and Kubernetes persistent volumes.
Shared filesystem or equivalent shared data service: Required only where products explicitly need shared file semantics.
The first version does not yet choose a storage backend, but the architecture must assume:
- block storage is required for VM and database products
- Kubernetes needs a CSI-integrated persistent volume strategy
- storage durability and replication are separate design concerns from compute placement
- multi-datacenter failover claims are incomplete until storage replication and recovery are defined
Observability and operations
Observability is a platform prerequisite, not a late optional product.
The first implementation phases must provide:
- centralized logs
- metrics for services, jobs, and infrastructure
- distributed tracing or equivalent workflow correlation where feasible
- task and event visibility across NATS-driven workflows
- health and drift visibility for provisioning, networking, and failover systems
Logg or Observasjon may start as a module, but the operational capability itself must exist in the early foundation phases.
Operator runbooks and day-2 operations
The platform needs an explicit day-2 operating model, not only dashboards and alerts.
For important failure classes and operational workflows, the architecture should define:
- when the platform may remediate automatically
- when it should pause and require operator approval
- when it can only surface a condition and require manual intervention
- how incidents are escalated based on severity and blast radius
- which runbook or supported response path applies
Automatic versus manual response
Automatic remediation is appropriate when the action is low-risk, bounded, and reversible enough to trust repeatedly.
Examples may include:
- bounded workflow retry
- reconciliation after transient dependency recovery
- safe replay of idempotent work
- restart or failover of low-blast-radius internal components where policy allows it
Operator approval or manual intervention is more appropriate when the action may affect shared infrastructure, data durability, or cross-site behavior.
Examples may include:
- site failover
- destructive recovery actions
- inventory trust overrides
- sensitive network recovery operations
- changes that cross tenant or datacenter blast-radius boundaries
Runbook posture
Important alerts, blocked workflows, and degraded conditions should map to explicit runbooks or supported operator procedures.
Those runbooks should define:
- triggering condition
- likely impact
- immediate safe checks
- supported API or CLI actions
- whether escalation is required
- expected audit trail and completion signal
The first operator surface does not need a sophisticated automation engine, but it should make the runbook path discoverable rather than relying on tribal memory.
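The runbook fields above could start as a machine-readable record that alerts and blocked tasks link to, which is enough to make the path discoverable without an automation engine. A Go sketch; the field names, runbook ID, and CLI subcommands in the example are hypothetical:

```go
package main

import "fmt"

// Runbook is a hypothetical machine-readable form of the fields listed above,
// so that alerts and blocked tasks can link to a discoverable procedure.
type Runbook struct {
	ID                  string
	TriggeringCondition string
	LikelyImpact        string
	SafeChecks          []string // immediate read-only checks
	SupportedActions    []string // supported API or CLI actions
	EscalationRequired  bool
	CompletionSignal    string // expected audit trail and completion signal
}

func main() {
	// Example entry; the condition, impact, and commands are illustrative.
	rb := Runbook{
		ID:                  "nett-rollout-stuck",
		TriggeringCondition: "switch rollout task blocked for more than 15 minutes",
		LikelyImpact:        "new tenant networks cannot be realized",
		SafeChecks:          []string{"inspect rollout task status", "inspect device drift report"},
		SupportedActions:    []string{"retry rollout", "abort rollout and revert intent"},
		EscalationRequired:  true,
		CompletionSignal:    "task reaches succeeded with matching audit entry",
	}
	fmt.Println(rb.ID, "escalation:", rb.EscalationRequired)
}
```

Keeping runbooks as data rather than wiki prose also means the task system can attach the right runbook ID to a dependency-blocked or degraded task automatically.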
Secrets and certificate management
Daily secrets handling is separate from break-glass access and must be designed explicitly.
The platform needs a secrets and certificate lifecycle covering:
- service-to-service credentials
- tenant-facing database and application credentials
- API keys and automation tokens
- TLS certificate issuance and rotation
- encryption key management for platform components
Identitet provides auth context, but it does not replace a secrets system.
Service-to-service transport security
Service-to-service identity must translate into actual transport protection.
The platform should assume:
- TLS on internal service communications where practical
- mTLS or equivalent strong service authentication for high-trust internal paths
- Kubernetes network policies or equivalent segmentation where workloads share clusters
- explicit policy for service-to-service communication between datacenters
Identity without transport security is incomplete.
Platform security baseline
The platform needs a coherent security strategy spanning:
- tenant and infrastructure network segmentation
- API authentication and authorization
- NATS authentication, subject authorization, and least-privilege publishing/subscribing
- service-to-service identity and transport security
- container and VM isolation expectations
- image and artifact supply-chain trust
Secrets management is only one part of the security model. The platform should treat these concerns as one architecture stream rather than scattered implementation details.
Bare-metal and Talos design
Day-zero bootstrap model
The platform needs an explicit story for how Dataverket itself is brought up in a new datacenter before normal self-hosting workflows exist.
The day-zero posture should assume a temporary manual bootstrap chain that establishes the minimum dependencies required for Sentral and the management cluster.
Day-zero assumptions
Before Dataverket can manage itself normally, operators must provide or establish at least:
- powered and cabled servers
- basic switch and router reachability
- management addressing for BMCs and network devices
- a bootstrap DNS and name-resolution path
- a bootstrap PKI or trust path sufficient to start internal services
- a way to host initial install assets for Talos and bootstrap manifests
These are not long-term manual operations, but they are unavoidable prerequisites for the first bring-up of an empty site.
Recommended day-zero sequence
The intended first-site bootstrap sequence is:
- Operators seed the initial inventory for racks, switches, links, BMC endpoints, and bootstrap hosts.
- Operators establish the minimum out-of-band management and provisioning networks.
- Operators bring up the first bootstrap services needed for DNS, image hosting, and Talos install assets.
- Maskin uses BMC and iPXE workflows to install Talos onto the first management-cluster nodes.
- Plattform bootstraps the first management Kubernetes cluster.
- Operators deploy the first stateful control-plane dependencies required for self-hosting, including PostgreSQL, NATS, and any required secrets backend.
- Operators deploy Sentral and the first operator-facing APIs onto the management environment.
- Identitet integration is connected so that normal authentication replaces bootstrap-only access paths.
- Once Sentral is healthy, subsequent inventory, platform, and product bring-up should move onto supported Dataverket workflows rather than continuing as ad hoc manual setup.
Bootstrap versus steady state
The day-zero model should distinguish clearly between:
Bootstrap dependencies: The minimum external or manually established services needed to start the platform once.
Steady-state dependencies: The services Dataverket expects to manage, upgrade, and recover during normal operation.
For example, bootstrap DNS or asset hosting may be simpler and more manual than the long-term managed equivalents. That is acceptable as long as the handoff into steady-state operation is designed explicitly.
Day-zero minimum viable control plane
The first goal is not “the whole product catalog”. The first goal is a minimal management plane that can take over further automation.
That minimum should include:
- management cluster on Talos
- PostgreSQL for control-plane state
- NATS for orchestration
- Sentral API reachability
- enough identity integration to authenticate operators normally
- enough observability to see whether the platform is healthy during continued bring-up
Until this minimum exists, the platform is still in bootstrap mode rather than normal operation.
Baseline decision
Talos Linux should be installed directly on servers. The platform should not depend on a general-purpose host OS.
Recommended provisioning model
Use network boot for installation, but do not assume diskless operation for Talos nodes.
Recommended flow:
- Server powers on through BMC-managed boot order.
- iPXE chainloads from the provisioning network.
- Maskin identifies the server from MAC, serial, or BMC identity.
- Maskin serves a per-node or per-role Talos installer configuration.
- Talos installs to local disk.
- Node reboots into the installed Talos system.
- Plattform completes cluster bootstrap and joins the node.
Why not purely diskless Talos by default
PXE-booted stateless nodes sound attractive, but the default platform design should prefer installed Talos on local disk because:
- Talos is designed around an installed, immutable OS model
- Kubernetes nodes benefit from predictable local state for kubelet, images, and upgrades
- diskless operation complicates reboot behavior, image caching, and failure domains
- debugging and lifecycle management become harder early in the project
Diskless or ephemeral netboot nodes may still be valuable for:
- rescue mode
- hardware validation
- installer environment
- special-purpose stateless workers
So the recommended design is:
- PXE/iPXE for provisioning
- Talos installed onto local disk for normal operation
- optional netboot mode for rescue and exceptional workloads
Provisioning components
Maskin will need:
- DHCP for provisioning network
- TFTP only if chainloading legacy PXE; otherwise prefer HTTP boot/iPXE
- per-node boot scripts
- image cache for Talos installer assets
- BMC control for one-shot boot override
- hardware discovery pipeline
Network and fabric automation plan
Goal
The platform must be able to configure switches and routers as part of workload provisioning rather than treating networking as a manual prerequisite.
Scope for Nett
Nett should provide declarative management for:
- switch ports
- MLAG or equivalent fabric relationships
- VLAN creation
- trunk and access profiles
- routed uplinks
- BGP for service and tenant routing where needed
- VPN endpoints
- ACL/firewall intent
- public and private IP pools
- DNS records tied to services and ingress
Implementation approach
The first implementation should be based on declarative intent + renderer + driver:
- Sentral stores desired network intent.
- Nett validates it against topology and policy.
- Nett renders vendor-specific configuration.
- Drivers push config over SSH, API, or NETCONF/RESTCONF depending on vendor.
- Nett records realized state and drift.
Do not couple the whole platform to a single switch vendor. Model:
- intent in a vendor-neutral schema
- drivers per vendor or NOS
- topology in inventory
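The intent-plus-driver split above can be sketched as a vendor-neutral struct rendered by per-vendor drivers. The struct fields and the two command dialects are illustrative assumptions (loosely EOS-style and Junos-style syntax), not a committed Nett schema; a real driver would also push the rendered config over SSH, NETCONF/RESTCONF, or a vendor API and record realized state.

```go
package main

import "fmt"

// VLANIntent is a vendor-neutral fragment of network intent; the field
// names are illustrative, not a committed Nett schema.
type VLANIntent struct {
	ID   int
	Name string
	Port string
}

// Driver renders vendor-specific configuration from neutral intent.
type Driver interface {
	Render(v VLANIntent) []string
}

type eosDriver struct{}   // hypothetical EOS-style syntax
type junosDriver struct{} // hypothetical Junos-style syntax

func (eosDriver) Render(v VLANIntent) []string {
	return []string{
		fmt.Sprintf("vlan %d", v.ID),
		fmt.Sprintf("   name %s", v.Name),
		fmt.Sprintf("interface %s", v.Port),
		fmt.Sprintf("   switchport access vlan %d", v.ID),
	}
}

func (junosDriver) Render(v VLANIntent) []string {
	return []string{
		fmt.Sprintf("set vlans %s vlan-id %d", v.Name, v.ID),
		fmt.Sprintf("set interfaces %s unit 0 family ethernet-switching vlan members %s", v.Port, v.Name),
	}
}

func main() {
	intent := VLANIntent{ID: 210, Name: "tenant-a", Port: "Ethernet12"}
	drivers := map[string]Driver{"eos": eosDriver{}, "junos": junosDriver{}}
	for name, d := range drivers {
		fmt.Println("#", name)
		for _, line := range d.Render(intent) {
			fmt.Println(line)
		}
	}
}
```

Because validation and rendering happen against the neutral schema, adding a switch vendor means adding one driver, not touching every service that expresses network intent.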
Network architecture recommendation
Start with a simple and operable fabric:
- dedicated out-of-band management network
- dedicated provisioning network
- underlay network for node-to-node transport
- tenant VLANs initially, with VRFs when multi-tenant pressure requires them
- BGP-based routed edges for Kubernetes load balancers, VPN, and public services
- explicit inter-datacenter connectivity for replication, control-plane coordination, and failover traffic
- datacenter-aware routing and policy so that services can be placed or failed over across sites
This keeps the first version realistic while leaving room for EVPN/VXLAN later if scale demands it.
Multi-datacenter design
Dataverket should treat each datacenter as a first-class failure domain.
The minimum supported topology is:
- two or more datacenters
- independent local network fabric per datacenter
- controlled inter-datacenter links for control-plane and service replication traffic
- resource inventory tagged with datacenter, rack, and failure-domain metadata
The platform should support:
- placement of resources into a specific datacenter
- replication-aware service design across datacenters
- failover of selected services between datacenters
- both datacenters carrying active workloads at the same time
- active/passive failover at the service, project, or workload level where that keeps behavior understandable
Not every workload must be instantly multi-site, but the platform control plane, inventory, networking, and service APIs must all be datacenter-aware from the start.
NATS architecture plan
NATS should be the platform’s event bus and command backbone, not the long-term system of record.
Recommended usage:
- Core NATS subjects for commands and events
- JetStream for durable workflows
- Request/reply for synchronous service interactions where latency matters
- Key/value and object store only for small coordination data, not core inventory
- Intra-datacenter NATS transport as the standard internal communication path within each site
- Inter-datacenter NATS communication as the standard control-plane communication path between sites
NATS is an internal platform transport. Public clients should integrate through the Sentral-owned API and task resources rather than directly through NATS subjects.
The default model should therefore be:
- each datacenter runs NATS as its local control-plane backbone
- internal services use NATS for normal intra-site commands, events, and task signaling
- cross-site NATS is used for explicit coordination, replication signaling, and failover workflows
Event model
Each service should use a subject naming convention such as:
- dv.<service>.cmd.<action>
- dv.<service>.evt.<entity>.<verb>
- dv.task.evt.<verb>
Examples:
- dv.maskin.cmd.provision
- dv.nett.cmd.apply
- dv.plattform.evt.cluster.ready
- dv.database.evt.instance.failed
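Small helpers can make the subject convention mechanical instead of tribal. A Go sketch; the function names are assumptions:

```go
package main

import "fmt"

// CmdSubject builds a subject following dv.<service>.cmd.<action>.
func CmdSubject(service, action string) string {
	return fmt.Sprintf("dv.%s.cmd.%s", service, action)
}

// EvtSubject builds a subject following dv.<service>.evt.<entity>.<verb>.
func EvtSubject(service, entity, verb string) string {
	return fmt.Sprintf("dv.%s.evt.%s.%s", service, entity, verb)
}

func main() {
	fmt.Println(CmdSubject("maskin", "provision"))           // dv.maskin.cmd.provision
	fmt.Println(EvtSubject("plattform", "cluster", "ready")) // dv.plattform.evt.cluster.ready
}
```

Centralizing subject construction in a shared library also gives one place to enforce the convention in tests and to scope NATS subject authorization per service.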
Reliability model
- Desired state lives in PostgreSQL
- Workflow state lives in PostgreSQL plus JetStream
- Events are append-only facts for integration and orchestration
- Consumers must be idempotent
- Every command must carry correlation ID, tenant/project context, and actor identity
- Datacenter identity must be part of placement and routing decisions where site locality matters
Cross-site NATS should be designed around explicit coordination between sites, not around the assumption that every subject is globally shared with uniform semantics.
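The idempotency requirement can be sketched with a command envelope carrying the mandated context fields and a consumer that deduplicates on message ID. This is an in-memory illustration with hypothetical field names; a real JetStream consumer would persist processed IDs (or make the underlying operation naturally idempotent) so redelivery after restart is still safe.

```go
package main

import "fmt"

// Command is a minimal envelope carrying the fields every command must have;
// the field names are illustrative.
type Command struct {
	ID     string // correlation / message ID
	Tenant string
	Actor  string
	Action string
}

// Consumer applies each command at most once by remembering processed IDs:
// the property "consumers must be idempotent" requires under redelivery.
type Consumer struct {
	seen    map[string]bool
	applied int
}

func (c *Consumer) Handle(cmd Command) {
	if c.seen[cmd.ID] {
		return // duplicate delivery: acknowledge without re-applying
	}
	c.seen[cmd.ID] = true
	c.applied++
}

func main() {
	c := &Consumer{seen: map[string]bool{}}
	cmd := Command{ID: "task-42", Tenant: "acme", Actor: "ops-user", Action: "provision"}
	c.Handle(cmd)
	c.Handle(cmd) // JetStream may redeliver; the effect must not double
	fmt.Println(c.applied) // 1
}
```

With at-least-once delivery, dedup-on-ID (or an idempotent target operation) is what turns "the message arrived twice" into a non-event instead of a double provision.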
Failure handling baseline
The first implementation must define baseline behavior for:
- retry policy for transient failures
- dead-letter handling for poison messages
- task timeout and cancellation
- operator visibility into stuck workflows
- replay and reconciliation after service restart
- dependency-blocked workflows when a required internal service is unavailable
- degraded-mode status when work continues with reduced capability
Event-driven architecture without these controls is not acceptable for production infrastructure automation.
PostgreSQL data strategy
PostgreSQL is the default relational system of record, but it must not be treated as one implicit shared dependency.
The platform should assume:
- database-per-service or schema-per-bounded-context rather than one shared undifferentiated database
- explicit HA and failover design for control-plane PostgreSQL
- backup and restore as first-class operational requirements
- explicit replication posture for cross-site behavior rather than generic “HA” language
- bounded-context-specific RPO and RTO classes
- split-brain prevention and promotion safety as explicit design concerns
If Sentral’s relational state is unavailable, large parts of the control plane stop. That risk must be designed for directly.
The default architectural posture should be:
- local high availability may justify synchronous behavior inside a site or failure domain
- cross-site replication should default to asynchronous unless a specific bounded context justifies stronger consistency and accepts the latency and quorum tradeoff
- the platform should prefer temporary unavailability over ambiguous dual-primary behavior during partition
Upgrade and migration strategy
The platform must support controlled upgrades for:
- Sentral and other control-plane services
- NATS infrastructure
- PostgreSQL control-plane state
- Talos-based clusters and nodes
- internal APIs and message schema evolution
The baseline assumption should be:
- rolling or staged upgrades where possible
- explicit compatibility windows for API and event schema changes
- tested migration paths for control-plane state
- no dependence on flag-day upgrades across the whole platform
Disaster recovery scope
Disaster recovery must cover more than PostgreSQL alone.
The platform should eventually define recovery for:
- PostgreSQL control-plane state
- NATS JetStream durable workflow state
- inventory and approval state
- Talos machine and cluster configuration inputs
- switch and router rendered configuration history
- secrets backend state
Failover language and disaster recovery language must not be conflated. A service may fail over without full data recovery, and that distinction must remain explicit.
Inventory bootstrap and trustworthiness
Inventory quality is a critical dependency, especially for Nett and Maskin.
The first version should assume a hybrid bootstrap model:
- manually curated initial inventory for sites, racks, links, devices, and port topology
- auto-discovery where safe and useful, such as BMC discovery and hardware facts
- drift detection between declared inventory and observed state
- operator approval for sensitive topology changes before they become the trusted source of truth
Inventory cannot start as magic auto-discovery, and it cannot remain undocumented manual folklore.
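The drift-detection step can be sketched as a diff between declared and observed state. The `PortLink` schema and drift kinds below are illustrative; the real inventory model is a Phase 0 decision:

```go
package main

import "fmt"

// PortLink describes one declared or observed switch-port connection.
type PortLink struct {
	Device, Port, PeerDevice string
}

// Drift is one mismatch between declared inventory and observed state.
// Topology drift is reported for operator approval, never auto-applied.
type Drift struct {
	Kind string // "missing", "unexpected", or "changed"
	Link PortLink
}

// detectDrift compares declared and observed links keyed by device+port.
func detectDrift(declared, observed []PortLink) []Drift {
	key := func(l PortLink) string { return l.Device + "/" + l.Port }
	obs := map[string]PortLink{}
	for _, l := range observed {
		obs[key(l)] = l
	}
	var drifts []Drift
	seen := map[string]bool{}
	for _, d := range declared {
		seen[key(d)] = true
		o, ok := obs[key(d)]
		switch {
		case !ok:
			drifts = append(drifts, Drift{"missing", d})
		case o.PeerDevice != d.PeerDevice:
			drifts = append(drifts, Drift{"changed", o})
		}
	}
	for _, o := range observed {
		if !seen[key(o)] {
			drifts = append(drifts, Drift{"unexpected", o})
		}
	}
	return drifts
}

func main() {
	declared := []PortLink{{"sw1", "eth1", "node-a"}, {"sw1", "eth2", "node-b"}}
	observed := []PortLink{{"sw1", "eth1", "node-a"}, {"sw1", "eth2", "node-c"}}
	for _, d := range detectDrift(declared, observed) {
		fmt.Printf("%s: %s/%s -> %s\n", d.Kind, d.Link.Device, d.Link.Port, d.Link.PeerDevice)
	}
}
```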
API, SDK, and CLI plan
Public control surface
Build one canonical control plane API in Sentral. Everything else should layer on top:
- REST or gRPC public API
- generated SDKs
- `dv` CLI family
- optional UI later
The public surface should describe platform offerings and control-plane concepts. It should not expose internal service boundaries one-to-one.
Recommended shape
- Start with a versioned HTTP API for operator ergonomics
- Use OpenAPI as the contract
- Generate SDKs from the API spec
- Keep async operations modeled as tasks with status polling in v1
- Include rate limiting, throttling, and abuse protection as part of the public API baseline
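The "async operations as tasks with status polling" model can be sketched with a small state machine. The states, fields, and endpoint path mentioned in the comments are assumptions for illustration, not a committed contract:

```go
package main

import "fmt"

// TaskState is the lifecycle of an async control-plane operation in the
// polling model proposed above.
type TaskState string

const (
	Pending   TaskState = "pending"
	Running   TaskState = "running"
	Blocked   TaskState = "blocked" // waiting on a dependency or approval
	Succeeded TaskState = "succeeded"
	Failed    TaskState = "failed"
)

// Task is the resource a client polls; the create call returns its ID.
type Task struct {
	ID     string
	State  TaskState
	Reason string // populated for blocked and failed states
}

// terminal reports whether polling can stop.
func terminal(s TaskState) bool { return s == Succeeded || s == Failed }

func main() {
	// A client would poll something like GET /v1/tasks/{id} until the
	// state is terminal; here the poll responses are simulated in memory.
	responses := []Task{
		{"task-42", Pending, ""},
		{"task-42", Blocked, "waiting for network net-1"},
		{"task-42", Running, ""},
		{"task-42", Succeeded, ""},
	}
	for _, t := range responses {
		fmt.Printf("%s: %s %s\n", t.ID, t.State, t.Reason)
		if terminal(t.State) {
			break
		}
	}
}
```

Making `blocked` a first-class, reason-carrying state is what lets tenants see dependency and approval gates instead of watching a task spin forever.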
API evolution should follow an explicit compatibility policy:
- additive changes are allowed within a major version when they preserve existing behavior
- breaking changes require a new major version
- supported major versions should overlap for a defined migration window
- SDK and CLI compatibility should track supported API majors explicitly
Internal NATS events remain important for orchestration and operator correlation, but they are not the required external client contract in v1.
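The compatibility policy reduces to a simple check on the server side. The concrete majors and the `accept` helper are illustrative:

```go
package main

import "fmt"

// supportedMajors is the set of API major versions currently inside their
// overlap window; the versions shown are examples only.
var supportedMajors = map[int]bool{1: true, 2: true}

// accept reports whether a client-requested major version is served.
// Additive changes never move a client off its major; only breaking
// changes introduce a new entry here, and an old major is retired only
// after its published migration window ends.
func accept(requestedMajor int) (bool, string) {
	if supportedMajors[requestedMajor] {
		return true, fmt.Sprintf("serving /v%d", requestedMajor)
	}
	return false, fmt.Sprintf("major v%d is outside the supported window", requestedMajor)
}

func main() {
	for _, m := range []int{1, 2, 3} {
		ok, msg := accept(m)
		fmt.Println(ok, msg)
	}
}
```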
Resource protection baseline
Tenant-facing APIs and control operations must support:
- rate limiting
- throttling of expensive operations
- concurrency controls for destructive or high-impact workflows
- protection against noisy-neighbor behavior at the API layer
The platform must not assume all clients are well-behaved.
Usage accounting and quotas
The platform still needs usage attribution and quota signals, but commercial billing is deferred until the core control plane is stable.
The near-term priority is:
- quota enforcement inputs
- operator-visible usage accounting
- reliable resource attribution to tenant, project, and environment
Full billing architecture should follow later, not shape the earliest bounded contexts.
Image and artifact lifecycle
The platform needs explicit lifecycle ownership for:
- Talos installer and upgrade images
- VM base images
- container images and registry content
- provider artifacts used during provisioning and recovery
Harbor is only one part of the artifact story. Image provenance, promotion, replication between datacenters, and retirement policy must all be designed.
Data residency and placement policy
Because Dataverket is multi-datacenter, the platform must support policy around where data and workloads may live.
The architecture should support:
- placing workloads in a specific datacenter
- restricting data-bearing services to allowed sites
- making failover behavior respect placement and residency rules
- surfacing these constraints through APIs and operator tooling
At the public contract level this should eventually include desired placement policy, current site, and policy-compliant failover targets where relevant.
Not every tenant will need data residency guarantees, but the platform should be able to express them.
Capacity planning and overcommit
Quotas alone are not enough. The platform must also define:
- placement policy under capacity pressure
- whether and where CPU or memory overcommit is allowed
- storage overcommit policy
- spread versus bin-packing defaults
- how site-level scarcity is surfaced to operators and users
These decisions directly affect placement, availability, and billing trustworthiness.
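The interaction of overcommit policy and admission can be sketched as follows. The ratios and field names are illustrative defaults, not decided policy:

```go
package main

import "fmt"

// siteCapacity models one site's schedulable resources under an explicit
// overcommit policy.
type siteCapacity struct {
	physicalCPU     float64
	cpuOvercommit   float64 // e.g. 4.0 means 4 vCPU per physical core
	allocatedVCPU   float64
	physicalMemGiB  float64 // memory is deliberately not overcommitted here
	allocatedMemGiB float64
}

// admit checks a placement request against effective capacity and returns
// the scarcity reason operators and users would see on rejection.
func (s siteCapacity) admit(vcpu, memGiB float64) (bool, string) {
	if s.allocatedVCPU+vcpu > s.physicalCPU*s.cpuOvercommit {
		return false, "cpu capacity exhausted at current overcommit ratio"
	}
	if s.allocatedMemGiB+memGiB > s.physicalMemGiB {
		return false, "memory capacity exhausted (no overcommit)"
	}
	return true, ""
}

func main() {
	site := siteCapacity{physicalCPU: 64, cpuOvercommit: 4,
		allocatedVCPU: 250, physicalMemGiB: 512, allocatedMemGiB: 500}
	ok, why := site.admit(8, 16) // 258 vCPU requested against a 256 vCPU ceiling
	fmt.Println(ok, why)
}
```

Surfacing the rejection reason is the "site-level scarcity" signal the list above calls for; silent placement failure would undermine billing and availability trust.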
CLI mapping
The current accepted CLI names fit this service layout:
- `dv`: projects, environments, quotas, modules, tasks
- `dvid`: login, tokens, identity
- `dvce`: bare metal, VM, hardware inventory
- `dvke`: cluster lifecycle
- `dvapp`: app and web hosting lifecycle
- `dvos`: buckets and object storage access
- `dvnet`: networks, VLANs, VPN, addresses
Product surface plan
The first service offerings should be staged rather than launched all at once.
Phase 1 offerings
- bare-metal inventory and provisioning
- managed Kubernetes clusters
- VM instances
- Harbor-backed image registry
- PostgreSQL service
- S3-compatible object storage
- VLAN-backed logical networks
These are public offerings. They should be distinguished from the internal services needed to deliver them.
Phase 2 offerings
- MySQL and Redis
- web hosting / app deployment
- VPN products for tenant ingress and admin access
- managed load balancers
- inter-datacenter placement and failover primitives
Phase 3 offerings
- private service mesh or east-west connectivity products
- self-service routed tenant networks
- advanced policy products
- cross-region or multi-site federation
Tenant onboarding and self-service flow
The platform should define one supported path from first login to first working workload.
That flow should be API-first and CLI-friendly, and should avoid undocumented operator intervention except where policy explicitly requires approval.
Onboarding goals
A tenant administrator should be able to:
- authenticate through Identitet and obtain a supported API or CLI session
- discover the tenants, projects, and roles they are allowed to manage
- create or access an initial project
- create an initial environment where the product requires environment scoping
- see which platform offerings are enabled for that tenant or project
- create the minimum prerequisite resources for a first workload
- launch a first workload and inspect its resulting tasks, status, and dependencies
Recommended first-use sequence
The intended v1 onboarding sequence is:
- The actor authenticates through `dvid login` or the equivalent OIDC-backed API flow.
- Sentral resolves tenant and role context from Identitet.
- The actor lists accessible tenants and selects one administrative scope.
- The actor creates or opens a project under that tenant.
- The actor creates one or more environments if the intended product is environment-scoped.
- The actor queries which offerings, limits, placement rules, and required prerequisites apply to that project.
- The actor creates the first prerequisite resources such as a logical network, object bucket, or database where the workload requires them.
- The actor creates the first workload resource such as a VM, Kubernetes cluster, or application deployment.
- The actor follows task state and dependency status through the public API or CLI until the resource becomes usable or enters a visible blocked state.
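The sequence above can be expressed against a thin client interface. Everything here is hypothetical: the `ControlPlane` interface is not the real SDK, and the in-memory `fake` only exists to make the flow demonstrable offline:

```go
package main

import "fmt"

// ControlPlane is a hypothetical slice of the public API used by the v1
// onboarding sequence; method names mirror the steps above.
type ControlPlane interface {
	Login() (actor string, err error)
	ListTenants(actor string) []string
	CreateProject(tenant, name string) string
	CreateEnvironment(project, name string) string
	CreateNetwork(env, name string) string
	CreateVM(env, network, name string) (taskID string)
	TaskState(taskID string) string
}

// onboard walks the recommended first-use sequence and returns the final
// task state of the first workload.
func onboard(cp ControlPlane) (string, error) {
	actor, err := cp.Login()
	if err != nil {
		return "", err
	}
	tenant := cp.ListTenants(actor)[0]
	project := cp.CreateProject(tenant, "first-project")
	env := cp.CreateEnvironment(project, "dev")
	net := cp.CreateNetwork(env, "net-1")
	task := cp.CreateVM(env, net, "vm-1")
	return cp.TaskState(task), nil
}

// fake is an in-memory stand-in so the sequence runs without a platform.
type fake struct{}

func (fake) Login() (string, error)               { return "alice", nil }
func (fake) ListTenants(string) []string          { return []string{"acme"} }
func (fake) CreateProject(_, n string) string     { return n }
func (fake) CreateEnvironment(_, n string) string { return n }
func (fake) CreateNetwork(_, n string) string     { return n }
func (fake) CreateVM(_, _, _ string) string       { return "task-1" }
func (fake) TaskState(string) string              { return "succeeded" }

func main() {
	state, _ := onboard(fake{})
	fmt.Println("first workload task:", state)
}
```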
Minimum self-service baseline
The first self-service-capable platform release should support at least:
- tenant and project discovery for an authenticated actor
- project creation or clearly modeled operator-provided project bootstrap
- environment creation where relevant
- visibility into enabled offerings and prerequisite requirements
- self-service creation of at least one networked workload path
- task visibility for provisioning progress, dependency blocking, and policy failure
Self-service does not mean every tenant can create every resource without policy. It means the supported path and any approval points are explicit in the API and task model.
Prerequisite discovery
The public API should make it possible to discover:
- whether the actor may create projects or only operate within existing ones
- whether an environment is required for a given product
- whether network creation is self-service or operator-gated
- which quotas, placement rules, and product enablement flags apply
- which dependencies must exist before a given workload request can succeed
Onboarding is not complete if tenants can create a VM request but cannot tell that a network, environment, or quota prerequisite is missing until deep in a failing workflow.
Operator-assisted onboarding
Some environments will require operator approval or operator-created baseline resources before full self-service is available.
That is acceptable, but the architecture should model it explicitly:
- operator-created tenant records
- operator-approved project bootstrap
- pre-provisioned shared networks or storage classes
- policy gates for high-impact offerings
These should appear as normal resources, approvals, and task states rather than as out-of-band support activity.
First working stack
The first end-to-end self-service story should be intentionally narrow.
A practical v1 target is:
- create project
- create environment
- create logical network or use a pre-approved shared network
- create VM or Kubernetes cluster
- optionally create object bucket or database binding
- observe resulting tasks and final usable endpoints
Once that path works coherently, broader onboarding for apps, databases, and advanced networking can extend the same model.
Proposed delivery roadmap
Phase 0: foundation
Phase 0 should not be treated as a flat list. Several items are prerequisites for others and should be delivered in dependency order.
Phase 0 dependency order
Service and resource foundations
- Accept service taxonomy, especially whether Nett is formalized
- Define API skeleton and core resource model before domain services harden their internals
- Define system of record domains and database boundaries
Inventory and failure-domain model
- Define hardware inventory schema
- Define datacenter and failure-domain inventory model
Control-plane communication and state
- Define NATS subject taxonomy and event envelope
- Define control-plane PostgreSQL HA and backup baseline
- Define observability baseline for logs, metrics, tracing, and task visibility
- Define secrets and certificate lifecycle baseline
Platform substrate selection
- Choose first supported switch/router vendors
- Choose first hypervisor or VM substrate
- Define first block-storage platform selection criteria
- Define initial block-storage backend shortlist
Phase 0 dependency notes
- Service taxonomy should precede stable API, subject, and ownership naming.
- The API skeleton and resource model should precede domain-specific internals so that tenancy, task, and public resource semantics do not drift.
- System-of-record boundaries should be set before service databases and orchestration behavior harden.
- Hardware inventory schema should precede the broader datacenter and failure-domain model because topology, placement, and site metadata build on inventory primitives.
- Inventory and datacenter modeling should precede vendor selection for networking and substrate choices where those choices depend on topology, placement, and failure-domain assumptions.
- Block-storage selection criteria and the backend shortlist should be available before metal and platform-service implementations harden around accidental local-disk or hypervisor-default assumptions.
- NATS, PostgreSQL, observability, and secrets are shared control-plane foundations and should be defined before Phase 1 implementation starts.
Phase 0 critical path
The minimum blocking chain into Phase 1 is:
- formalize service taxonomy
- define API and resource model
- define system-of-record boundaries
- define inventory and datacenter model
- define control-plane communication and state baselines
Vendor and substrate selection are still Phase 0 work, but they should follow the earlier architectural constraints rather than drive them prematurely.
The roadmap phases below are capability waves, not strict finish-to-start gates for every work item. Several tracks must progress in parallel, and later platform capabilities depend on partial results from multiple earlier waves.
In particular:
- metal, network, and storage work are parallel enabling tracks for platform services rather than isolated sequential product phases
- Kubernetes, VM, and database offerings depend on progress across several tracks at once
- CLI and SDK work should begin as soon as the first public API slice exists so that API ergonomics are validated continuously
Phase 1: control plane skeleton and API feedback loop
- Publish the first Sentral API surface for tenants, projects, environments, tasks, inventory, and placement inputs
- Implement `dvid login`
- Implement task tracking, audit baseline, and first generated Go SDK
- Implement early `dv`, `dvce`, `dvke`, and `dvnet` flows in parallel with the API
- Stand up NATS, PostgreSQL, and observability as shared control-plane foundations
- Expose inventory bootstrap workflows and operator-only APIs
- Define the first operator control surface for tasks, health, drift, and datacenter visibility
- Publish the first block-storage selection-criteria ADR and backend shortlist decision inputs
The CLI work in this phase is not polish deferred until the end. It is part of the primary feedback loop for validating API shape, task semantics, auth flows, and operator ergonomics.
Phase 2: metal enablement track
- Build provisioning network
- Implement BMC inventory and power control in Maskin
- Implement iPXE boot service
- Automate Talos installation to local disk
- Establish management Kubernetes cluster on Talos
- Integrate accepted break-glass workflow
Phase 3: network enablement track
- Implement Nett intent model
- Add switch port and VLAN automation
- Add IPAM and DNS workflow
- Add router module for L3 gateways and public edge
- Add VPN service primitives
- Add inter-datacenter network connectivity and routing model
Phase 4: storage and persistence track
- Deliver the selected first block-storage and persistent-volume strategy
- Define VM disk lifecycle and backup/recovery baseline
- Define database storage and replication baseline
Phase 5: platform service assembly
This phase depends on usable outputs from the metal, network, and storage tracks. It should not be read as meaning those tracks are fully complete before service work begins.
- Bootstrap Harbor
- Deliver object storage service
- Deliver PostgreSQL service
- Deliver Kubernetes cluster service
- Deliver VM service
Phase 6: higher-level tenant and developer products
- Add audit, usage, and task inspection
- Expose datacenter-aware placement and failover controls
- App hosting
- managed databases beyond PostgreSQL
- private networking products
- higher-level developer workflows
Immediate architecture decisions still needed
The repository does not yet answer these questions, and they should become ADRs before deep implementation starts:
- Is Nett an accepted top-level service name, matching `dvnet`?
- Which switch and router vendors are in scope for v1?
- What is the VM substrate for Maskin: KVM/libvirt, Proxmox, VMware, or something else?
- Which database engines are in scope for v1?
- Is Harbor a standalone service or a module under Tjeneste or Objekt?
- Will the public API be REST, gRPC, or both?
- What is the tenancy model: project-only, org/project, or org/project/environment?
- What IPAM and DNS source of truth will be used?
- Will load balancing be BGP-based, proxy-based, or both?
- Which VPN technologies are in scope: WireGuard, IPsec, OpenVPN, or a mix?
- Which services must support cross-datacenter failover in v1?
- Which service types can be active in both datacenters simultaneously, and which must remain active/passive per workload?
- What is the first block-storage and CSI strategy?
- What is the daily secrets-management and certificate-rotation model?
- What is the HA and backup strategy for control-plane PostgreSQL?
- What is the first observability stack?
- What is the first secrets backend?
- What is the first block-storage backend shortlist?
- What are the default overcommit and placement policies?
Recommended next documents
To turn this plan into an executable architecture, the next ADRs should be:
- `003-development-and-test-environments.md`
- `004-implementation-language-guidance.md`
- `005-network-service-and-topology.md`
- `006-bare-metal-provisioning-with-ipxe-and-talos.md`
- `007-nats-subject-and-event-envelope.md`
- `008-public-api-style.md`
- `009-resource-inventory-and-tenancy-model.md`
- `010-supported-network-vendors.md`
- `011-vm-runtime-selection.md`
- `012-inter-datacenter-topology-and-failover.md`
- `015-storage-platform-and-persistence-strategy.md`
- `016-secrets-and-certificate-lifecycle.md`
- `017-observability-and-operations-baseline.md`
- `018-postgresql-control-plane-ha-and-backup.md`
- `019-workflow-retry-dead-letter-and-reconciliation.md`
- `020-sentral-internal-decomposition.md`
- `021-inventory-bootstrap-and-drift-management.md`
- `022-operator-visibility-and-control-surface.md`
- `023-identity-and-access-model.md`
- `024-platform-security-strategy.md`
- `025-upgrade-and-migration-strategy.md`
- `026-end-to-end-disaster-recovery.md`
- `027-api-rate-limiting-and-resource-protection.md`
- `028-image-and-artifact-lifecycle.md`
- `029-data-residency-and-placement-policy.md`
- `030-capacity-planning-and-overcommit-policy.md`
- `031-service-to-service-transport-security.md`
Summary
The shortest defensible path is:
- use Talos on local disks, provisioned by PXE/iPXE
- add Nett as a first-class network automation service
- use NATS JetStream for orchestration, PostgreSQL for desired state
- make datacenters explicit failure domains and use NATS as the standard communication path between them
- allow both datacenters to carry live workloads, while keeping failover active/passive per service or workload where needed
- treat storage, observability, secrets, and PostgreSQL availability as platform prerequisites rather than later add-ons
- expose one canonical API and generate SDK/CLI on top
- launch the platform in layers: control-plane foundation, metal, network, storage, platform products, then tenant-facing products
That sequence keeps complexity bounded while still matching the long-term platform ambition.