Bare-metal provisioning with iPXE and Talos (MASKIN 006)


Defines iPXE network boot, followed by installing Talos to local disk, as the standard bare-metal provisioning path for Dataverket.

Author
Lars Solem

Status

Proposed on 2026-03-14 by Lars Solem.

Context

Dataverket needs a repeatable way to bring up physical servers for Kubernetes and compute without depending on manual installation flows or a mutable host operating system.

The repository already establishes Talos as an expected operating model, including break-glass integration for Talos-based systems. The open question is how servers should be provisioned and whether Talos should run diskless or be installed onto local storage.

Decision

Dataverket provisions bare-metal servers through network boot with iPXE, and installs Talos Linux onto local disk for normal operation.

The platform does not adopt diskless Talos as the default runtime model.

Provisioning flow

The standard server lifecycle is:

  1. Maskin discovers the server through BMC and inventory data.
  2. Maskin sets a one-time network boot override through the BMC when provisioning is requested.
  3. The server boots into iPXE from the provisioning network.
  4. Maskin serves a node-specific or role-specific Talos installer profile.
  5. The Talos installer writes the target system to local disk.
  6. The server reboots from local disk into installed Talos.
  7. Plattform applies cluster bootstrap or join operations.
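
The seven steps above form a strictly ordered sequence, which can be sketched as a small state machine. This is an illustration only: the `Stage` names, the `ACTOR` table, and the `advance` helper are hypothetical and are not part of Maskin or Plattform. Actor attribution follows the wording of the steps; where a step names no component, the sketch assumes the server or the Talos installer is the actor.

```python
from enum import Enum


class Stage(Enum):
    """One member per step of the standard server lifecycle, in order."""
    DISCOVERED = 1
    BOOT_OVERRIDE_SET = 2
    IPXE_BOOTED = 3
    PROFILE_SERVED = 4
    INSTALLED_TO_DISK = 5
    BOOTED_FROM_DISK = 6
    CLUSTER_READY = 7


# Who drives each step (assumed where the step does not name a component).
ACTOR = {
    Stage.DISCOVERED: "Maskin",
    Stage.BOOT_OVERRIDE_SET: "Maskin",
    Stage.IPXE_BOOTED: "server",
    Stage.PROFILE_SERVED: "Maskin",
    Stage.INSTALLED_TO_DISK: "Talos installer",
    Stage.BOOTED_FROM_DISK: "server",
    Stage.CLUSTER_READY: "Plattform",
}


def advance(current: Stage) -> Stage:
    """Move to the next step; steps cannot be skipped or reordered."""
    if current is Stage.CLUSTER_READY:
        raise ValueError("lifecycle already complete")
    return Stage(current.value + 1)


def run_lifecycle() -> list[Stage]:
    """Walk a server from discovery to cluster bootstrap or join."""
    stage = Stage.DISCOVERED
    visited = [stage]
    while stage is not Stage.CLUSTER_READY:
        stage = advance(stage)
        visited.append(stage)
    return visited
```

Modelling the flow as explicit stages keeps the handover point visible: Maskin's responsibility ends once the node has booted from disk, and Plattform takes over for the final transition.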

Why iPXE

iPXE is chosen because it provides the capabilities this provisioning flow depends on:

  • HTTP-based asset delivery
  • dynamic boot scripting
  • easier per-node logic than legacy PXE alone
  • compatibility with BMC-driven one-shot network boot workflows

Legacy PXE and TFTP are permitted only as a chainload path into iPXE where hardware requires it.
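
To illustrate HTTP-based asset delivery combined with per-node boot scripting, the following sketch renders an iPXE script for a single node. The function name, the URL layout (`/talos/<version>/...`, `/config/<node>`), and the asset file names are assumptions made for this example. `talos.platform` and `talos.config` are Talos kernel parameters; a real deployment would likely need further kernel arguments from the Talos documentation.

```python
from urllib.parse import quote


def render_ipxe_script(base_url: str, node_id: str, talos_version: str) -> str:
    """Render a per-node iPXE script that fetches the Talos kernel and initramfs over HTTP."""
    base = base_url.rstrip("/")
    # Hypothetical layout on the provisioning HTTP server.
    assets = f"{base}/talos/{talos_version}"
    # The node-specific machine configuration is looked up by node identifier.
    config_url = f"{base}/config/{quote(node_id, safe='')}"
    lines = [
        "#!ipxe",
        f"kernel {assets}/vmlinuz talos.platform=metal talos.config={config_url}",
        f"initrd {assets}/initramfs.xz",
        "boot",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_ipxe_script("http://provision.example", "node-01", "v1.0.0"))
```

Because the script is generated per request, role-specific or node-specific logic lives in the service that renders it and not in static files on a TFTP server.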

Why installed Talos instead of diskless by default

Talos installed to local disk is the default because it gives the platform:

  • a stable and supported Talos lifecycle
  • predictable reboot behavior
  • simpler upgrades and rollback handling
  • local persistence for kubelet state and the container image cache
  • lower operational complexity during initial platform bring-up

Purely diskless Talos remains an optional mode for:

  • rescue environments
  • hardware diagnostics
  • installer environments
  • specialized stateless worker experiments

It is not the baseline contract for compute or Kubernetes products.

Identity and configuration model

Provisioning identity should be derived from hardware-backed attributes available before the OS is installed, such as:

  • BMC identity
  • serial number
  • MAC address
  • rack and port placement from inventory

Maskin generates the installer configuration and binds it to that hardware identity. Plattform owns higher-level cluster intent and post-install Talos cluster lifecycle.
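
One way to bind an installer configuration to hardware identity is to derive a stable key from the attributes listed above. The sketch below is an assumption about how that could look, not Maskin's actual data model: `HardwareIdentity`, `node_key`, and `bind_installer_config` are hypothetical names, and the choice to leave rack and port placement out of the key (so that re-racking a server does not change its identity) is a design assumption of this example.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class HardwareIdentity:
    """Attributes that are known before any OS is installed."""
    bmc_id: str
    serial_number: str
    mac_address: str
    rack: str
    switch_port: str

    def normalized_mac(self) -> str:
        # Accept both aa-bb-... and AA:BB:... spellings of the same address.
        return self.mac_address.strip().lower().replace("-", ":")

    def node_key(self) -> str:
        """Stable key derived from BMC identity, serial number, and MAC address."""
        material = "|".join(
            [
                self.bmc_id.strip(),
                self.serial_number.strip().upper(),
                self.normalized_mac(),
            ]
        )
        return hashlib.sha256(material.encode("utf-8")).hexdigest()[:16]


def bind_installer_config(identity: HardwareIdentity, role: str) -> dict:
    """Return a record tying a role-specific installer profile to one machine."""
    key = identity.node_key()
    return {
        "node_key": key,
        "role": role,
        "placement": f"{identity.rack}/{identity.switch_port}",
        "config_path": f"/config/{key}",  # hypothetical path on the config host
    }
```

Placement is still recorded alongside the key, so inventory reconciliation can detect a server that has moved without treating it as a new machine.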

Required provisioning components

The platform must provide:

  • a dedicated provisioning network
  • DHCP for provisioning
  • an iPXE boot endpoint
  • HTTP hosting for images and configuration
  • a Talos image cache
  • BMC control for power and one-shot boot order
  • hardware discovery and reconciliation
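
As an illustration of BMC control for one-shot boot order and power, the sketch below builds the two requests involved, assuming the BMC exposes the DMTF Redfish API. This ADR does not name a BMC protocol, so Redfish is an assumption here, as are the helper names and the example system ID. The property names (`BootSourceOverrideEnabled`, `BootSourceOverrideTarget`, `ResetType`) come from the Redfish ComputerSystem schema. The functions only construct the requests; sending them, authentication, and vendor quirks are out of scope.

```python
import json


def one_shot_pxe_boot_request(system_id: str) -> tuple[str, str, str]:
    """Build a request that makes the next boot, and only the next boot, use the network."""
    path = f"/redfish/v1/Systems/{system_id}"
    body = {
        "Boot": {
            "BootSourceOverrideEnabled": "Once",  # reverts after one boot
            "BootSourceOverrideTarget": "Pxe",
        }
    }
    return "PATCH", path, json.dumps(body, sort_keys=True)


def restart_request(system_id: str) -> tuple[str, str, str]:
    """Build a request that restarts the server so the boot override takes effect."""
    path = f"/redfish/v1/Systems/{system_id}/Actions/ComputerSystem.Reset"
    body = {"ResetType": "ForceRestart"}
    return "POST", path, json.dumps(body, sort_keys=True)


if __name__ == "__main__":
    for method, path, body in (
        one_shot_pxe_boot_request("1"),
        restart_request("1"),
    ):
        print(method, path, body)
```

Using a one-time override means the server returns to its normal boot order, local disk, on the reboot that follows installation without a second BMC call.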

Operational model

Bare-metal nodes are treated as disposable infrastructure:

  • reprovisioning must be routine
  • no manual host customization is allowed
  • drift is resolved by re-applying config or reinstalling the node
  • cluster membership is an orchestrated state transition, not a handcrafted procedure
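
The drift rule above (re-apply config or reinstall the node) can be expressed as a small decision function. The policy encoded here is an assumption chosen for illustration: a configuration mismatch is corrected in place, while an unreachable node or a mismatch in the install disk triggers a reinstall. `NodeState`, `Action`, and `resolve_drift` are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    NONE = "none"
    REAPPLY_CONFIG = "reapply_config"
    REINSTALL = "reinstall"


@dataclass(frozen=True)
class NodeState:
    config_hash: str   # hash of the machine configuration
    install_disk: str  # disk Talos is installed on


def resolve_drift(desired: NodeState, observed: Optional[NodeState]) -> Action:
    """Pick the remediation for a node; manual host fixes are never an option."""
    if observed is None:
        # Node cannot be inspected: treat it as disposable and reinstall.
        return Action.REINSTALL
    if observed.install_disk != desired.install_disk:
        # Assumed not to be fixable in place.
        return Action.REINSTALL
    if observed.config_hash != desired.config_hash:
        return Action.REAPPLY_CONFIG
    return Action.NONE
```

For example, `resolve_drift(NodeState("abc", "/dev/sda"), NodeState("def", "/dev/sda"))` selects `Action.REAPPLY_CONFIG`, while passing `None` as the observed state selects `Action.REINSTALL`.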

Consequences

  • Talos becomes the only supported host OS for the primary bare-metal path
  • local disks are required for normal Talos operation
  • provisioning network design becomes a core dependency for the platform
  • Maskin must integrate with BMCs early
  • diskless Talos is deferred to an explicit optional mode and does not shape the primary architecture

Decision Outcome

Proposed. This ADR records the current preferred direction and still needs acceptance before it becomes binding.

  • This ADR complements the Talos break-glass architecture by defining the normal provisioning path.
  • Nett must provide the provisioning network and related switch configuration required by this flow.
  • A later ADR should define supported BMC vendors and Talos hardware assumptions.

Audit

  • 2026-03-14: ADR proposed.