VM runtime selection (MASKIN 011)

Proposed. Tags: Infrastructure, Compute, Vendor selection, Maskin, Virtualization, Runtime

Keeps the initial VM runtime open while defining the control-plane responsibilities and minimum capabilities the chosen substrate must support.

Author
Lars Solem
Updated

Status

Proposed on 2026-03-14 by Lars Solem.

Context

Dataverket needs a VM substrate for the Maskin service.

The platform already intends to manage bare metal directly and run Talos for Kubernetes-oriented hosts. What remains open is how virtual machines should be provisioned, scheduled, and controlled in a way that fits the rest of the control plane.

Decision

Dataverket leaves the choice of v1 VM runtime and its control surface open for now.

Maskin is responsible for:

  • hypervisor inventory
  • VM placement decisions
  • VM lifecycle management through the selected runtime
  • image preparation and attachment
  • network attachment through Nett-provided constructs

The platform does not yet commit to VMware, Proxmox, OpenStack, KVM with libvirt, or another VM control substrate.

Why keep this open

The repository does not yet establish enough constraints to lock the VM runtime responsibly.

The VM substrate choice affects:

  • host operating model
  • storage integration
  • live migration possibilities
  • failover behavior
  • operational tooling
  • how much Dataverket must build itself versus integrate

Locking a runtime before these factors are understood would constrain later storage, failover, and tooling decisions without a clear benefit.

What the eventual VM runtime must provide

The first selected VM runtime must support:

  • non-interactive automation
  • predictable VM lifecycle control
  • inventory integration
  • datacenter-aware placement
  • network attachment through Nett-managed constructs
  • image lifecycle integration
  • sufficient observability for task tracking and troubleshooting
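
One way to keep the runtime open while still building against these requirements is to have Maskin program against a narrow adapter contract. The following is a minimal sketch, not a decided interface: VMRuntime, VMSpec, VMState, InMemoryRuntime, and every field and method name are hypothetical and chosen only for illustration. The in-memory implementation stands in for a real runtime so that lifecycle logic can be exercised without a hypervisor.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Protocol


class VMState(Enum):
    STOPPED = "stopped"
    RUNNING = "running"


@dataclass(frozen=True)
class VMSpec:
    # Hypothetical fields; the real spec depends on the follow-up ADR.
    name: str
    datacenter: str
    image_ref: str    # reference to a platform-managed image version
    network_ref: str  # identifier of a Nett-managed attachment
    vcpus: int
    memory_mib: int


class VMRuntime(Protocol):
    """Hypothetical adapter contract, one implementation per candidate runtime."""

    def create(self, host: str, spec: VMSpec) -> str: ...
    def start(self, vm_id: str) -> None: ...
    def stop(self, vm_id: str) -> None: ...
    def delete(self, vm_id: str) -> None: ...
    def state(self, vm_id: str) -> VMState: ...
    def list_vms(self, host: str) -> list[str]: ...


class InMemoryRuntime:
    """Test double that satisfies VMRuntime without touching a hypervisor."""

    def __init__(self) -> None:
        self._hosts: dict[str, str] = {}       # vm_id -> host
        self._states: dict[str, VMState] = {}  # vm_id -> state

    def _require(self, vm_id: str) -> None:
        if vm_id not in self._states:
            raise KeyError(f"unknown VM: {vm_id}")

    def create(self, host: str, spec: VMSpec) -> str:
        vm_id = f"{host}/{spec.name}"
        if vm_id in self._states:
            raise ValueError(f"VM already exists: {vm_id}")
        self._hosts[vm_id] = host
        self._states[vm_id] = VMState.STOPPED
        return vm_id

    def start(self, vm_id: str) -> None:
        self._require(vm_id)
        self._states[vm_id] = VMState.RUNNING

    def stop(self, vm_id: str) -> None:
        self._require(vm_id)
        self._states[vm_id] = VMState.STOPPED

    def delete(self, vm_id: str) -> None:
        self._require(vm_id)
        del self._states[vm_id]
        del self._hosts[vm_id]

    def state(self, vm_id: str) -> VMState:
        self._require(vm_id)
        return self._states[vm_id]

    def list_vms(self, host: str) -> list[str]:
        # Sorted so inventory reconciliation sees a stable order.
        return sorted(v for v, h in self._hosts.items() if h == host)
```

A contract of this shape covers non-interactive automation, lifecycle control, and inventory integration. Placement, networking, and images stay outside the adapter and would be passed in as already-resolved references.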

Hypervisor model

The v1 hypervisor model is:

  • dedicated hypervisor hosts
  • a runtime-specific management interface selected later
  • host networking integrated with Nett-managed VLAN and bridge constructs where applicable

Each hypervisor host is part of operator-managed platform inventory, not a tenant-facing resource.

VM provisioning model

The standard VM lifecycle is:

  1. Sentral persists desired VM state.
  2. Maskin selects a suitable hypervisor in the requested datacenter that satisfies the project's constraints.
  3. Nett allocates or validates the required network attachment.
  4. Maskin creates or attaches the VM disk image.
  5. Maskin provisions the VM through the selected runtime.
  6. Maskin emits lifecycle events and task updates through NATS.
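
The lifecycle above can be sketched as a single orchestration function. This is an illustration under stated assumptions, not the Maskin implementation: provision_vm, its callable parameters, and the subject names such as "maskin.vm.placed" are all hypothetical. The publish callable stands in for a NATS publish so the sketch runs without a broker, and step 1 is represented by the desired-state dictionary passed in.

```python
from typing import Callable

# Stand-in for a NATS publish call: (subject, payload) -> None.
Publish = Callable[[str, dict], None]


def provision_vm(
    desired: dict,
    *,
    select_host: Callable[[dict], str],
    attach_network: Callable[[dict, str], str],
    prepare_disk: Callable[[dict, str], str],
    create_vm: Callable[[str, str, str], str],
    publish: Publish,
) -> str:
    """Run steps 2 to 6 for desired state that Sentral has already persisted."""
    name = desired["name"]
    try:
        # Step 2: placement within the requested datacenter.
        host = select_host(desired)
        publish("maskin.vm.placed", {"vm": name, "host": host})

        # Step 3: Nett allocates or validates the network attachment.
        network = attach_network(desired, host)
        publish("maskin.vm.network_ready", {"vm": name, "network": network})

        # Step 4: create or attach the disk image.
        disk = prepare_disk(desired, host)
        publish("maskin.vm.disk_ready", {"vm": name, "disk": disk})

        # Step 5: provision through the selected runtime.
        vm_id = create_vm(host, disk, network)
        publish("maskin.vm.provisioned", {"vm": name, "id": vm_id})
    except Exception as exc:
        # Step 6 also covers failures, so task tracking sees the outcome.
        publish("maskin.vm.failed", {"vm": name, "error": str(exc)})
        raise
    return vm_id
```

Passing each step in as a callable keeps the sequence testable before a runtime is chosen. A real implementation would also need idempotency and cleanup of partially created resources, which this sketch leaves out.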

Image model

The v1 image model should support:

  • operator-managed base images
  • immutable image versioning
  • cloud-init or equivalent guest initialization where the guest OS requires it
  • Talos images where Talos is used inside VMs

In the first version, Maskin should treat image definitions as platform-managed artifacts and should not accept arbitrary user-provided disks.
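
Immutable image versioning could look roughly like the sketch below. ImageVersion, ImageCatalog, and their fields are hypothetical names, and the integer version scheme is an assumption made for brevity. The point is only that a published (name, version) pair is never overwritten and that a record cannot be mutated after creation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ImageVersion:
    # frozen=True makes instances read-only after construction.
    name: str
    version: int
    sha256: str
    guest_init: str  # assumed values: "cloud-init", "talos", or "none"


class ImageCatalog:
    """Operator-managed catalog; a published version is never replaced."""

    def __init__(self) -> None:
        self._images: dict[tuple[str, int], ImageVersion] = {}

    def publish(self, image: ImageVersion) -> None:
        key = (image.name, image.version)
        if key in self._images:
            raise ValueError(f"{image.name} v{image.version} is already published")
        self._images[key] = image

    def resolve(self, name: str, version: Optional[int] = None) -> ImageVersion:
        """Return an exact version, or the highest published one if omitted."""
        if version is not None:
            return self._images[(name, version)]
        versions = [v for (n, v) in self._images if n == name]
        if not versions:
            raise KeyError(name)
        return self._images[(name, max(versions))]
```

A change to a base image then means publishing a new version rather than editing an existing one, so a VM record that pins a version keeps referring to the same content.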

Scheduling model

VM placement must consider:

  • datacenter
  • hypervisor capacity
  • network attachment availability
  • anti-affinity or spread requirements where requested
  • future failover intent when applicable

The initial scheduler can be simple and deterministic. It does not need to be a general-purpose cluster scheduler in v1.
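
A simple, deterministic scheduler of this kind can be sketched as a filter followed by a stable sort. Host, Request, place, and all field names are hypothetical. Choosing the host with the most free memory and breaking ties by name is one possible policy, not a decided one, and failover intent is left out of the sketch.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Host:
    name: str
    datacenter: str
    free_vcpus: int
    free_memory_mib: int
    networks: frozenset[str]             # Nett attachments available on the host
    groups: frozenset[str] = frozenset() # anti-affinity groups already placed here


@dataclass(frozen=True)
class Request:
    datacenter: str
    vcpus: int
    memory_mib: int
    network: str
    anti_affinity_group: Optional[str] = None


def place(request: Request, hosts: Iterable[Host]) -> Optional[Host]:
    """Return the chosen host, or None when no host satisfies the request."""
    candidates = [
        h for h in hosts
        if h.datacenter == request.datacenter
        and h.free_vcpus >= request.vcpus
        and h.free_memory_mib >= request.memory_mib
        and request.network in h.networks
        and (request.anti_affinity_group is None
             or request.anti_affinity_group not in h.groups)
    ]
    if not candidates:
        return None
    # Most free memory first, then most free vCPUs, then name as a tie-break,
    # so the same inventory always produces the same placement.
    return min(candidates, key=lambda h: (-h.free_memory_mib, -h.free_vcpus, h.name))
```

Because the result depends only on the request and the inventory snapshot, a placement can be replayed when troubleshooting, independent of the order in which hosts are listed.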

Explicit non-decisions for now

The following remain open until a later ADR:

  • exact VM runtime
  • exact hypervisor host OS
  • migration feature expectations
  • storage coupling model

Explicit non-goals for v1

The following are out of scope for the first version:

  • live migration as a hard requirement
  • tenant-direct hypervisor APIs

These may be reconsidered later if operational demand justifies them.

Consequences

  • Maskin gets a clear responsibility boundary, but not a locked runtime yet
  • VM runtime selection now needs a follow-up evaluation ADR before deep implementation starts
  • VM automation can still follow the same inventory and NATS patterns as bare metal
  • the team avoids locking itself to a substrate before storage and failover assumptions are clearer

Decision Outcome

Proposed. This ADR records the current preferred direction and still needs acceptance before it becomes binding.

More Information

  • VM platform selection criteria
  • guest image catalog model
  • storage model for VM disks
  • hypervisor host OS and hardening profile
  • live migration and failover policy

Audit

  • 2026-03-14: ADR proposed.