1. Overview
This document describes a bare metal provisioning system for Talos Linux nodes that participate in a BGP unnumbered IPv6-only network fabric. The system solves the chicken-and-egg problem of bootstrapping machines that have no network identity, no routable addresses, and no OS installed — into a fully meshed, self-healing IPv6 fabric.
Both delivery mechanisms present the same iPXE image to the machine’s UEFI firmware:
- Redfish API — iPXE mounted as virtual media via BMC (zero-touch, fully remote)
- USB stick — iPXE flashed to a physical USB drive (one physical touch)
The iPXE image is identical in both cases. The delivery method differs; the boot behavior does not.
Service Placement
This system is a Nett (network services) component. The entire bootstrap path is network infrastructure: ToR RA configuration, bootstrap prefix management, metadata server reachability via fabric routing, and machine identity → network identity mapping all live in Nett’s domain. Maskin (compute) consumes it — “provision machine X with role Y” → calls Nett’s metadata API → Nett handles boot/network plumbing.
2. Problem Statement
The target steady-state architecture is an IPv6-only network fabric where nodes establish BGP unnumbered sessions over link-local (fe80::) addresses, announce prefixes into the fabric, and form a fully routed mesh. No IPv4, no DHCP, no NAT.
At bare metal provisioning time: empty disk, no OS to run BGP, no addresses assigned, only IPv6 link-local connectivity. The fabric is self-bootstrapping once the OS is running, but a separate provisioning path is needed to break this cycle.
3. Design Goals
- Universal hardware support — works on any machine that can boot from USB or Redfish
- Zero-touch for BMC machines — fully remote provisioning via Redfish
- One physical touch for non-BMC machines — plug in USB, power on, done
- No network conflicts — provisioning uses the fabric’s own bootstrap prefix
- Reproducible and automatable — a single build tool generates all artifacts
- Secure by default — mTLS with per-tenant client certificates (step-ca); no unauthenticated access to machine configs
- Permanent boot interception — iPXE stays in the boot chain, enabling remote control of boot behavior
4. Architecture
4.1 How iPXE Gets Connectivity in the Fabric
Each server connects to two ToR switches via L3-only point-to-point links (BGP unnumbered). At boot, each link has only fe80:: link-local addresses. iPXE cannot use link-local in URLs (zone ID limitation) and cannot run BGP or any routing protocol — it is entirely passive.
The solution: ToRs advertise a routable bootstrap prefix via Router Advertisements on all server-facing ports. This is a lightweight, static ToR configuration:
```
# FRR on ToR — all server-facing interfaces
interface swp1-48
 ipv6 nd prefix 2001:db8:b007::/64    # or ULA for lab/testing
 no ipv6 nd suppress-ra
```

This is compatible with BGP unnumbered, which already uses RAs for link-local peer discovery.
Boot sequence:
```
iPXE autoconf6 on NIC0, NIC1
  → ToR-A sends RA: prefix 2001:db8:b007::/64
  → iPXE SLAAC → gets routable address + default route via fe80::ToR
  → iPXE HTTPS GET https://[metadata-server]/v1/boot/${mac}
    (routed: iPXE → fe80::ToR → fabric → metadata server)
```

The fe80:: address is used only at the routing layer (as the next-hop learned from the RA) — never in a URL.
4.2 Metadata Server
The metadata server is the single bootstrap control plane — one well-known anycast address, reachable from any server-facing port via the ToR and fabric routing. It serves boot decisions, kernel/initramfs artifacts, and Talos machine configs — all from one service, identified by MAC.
```
        Metadata Server(s) ── Anycast META::1/128 via BGP
                │
           [ Fabric ]
           ┌────┴────┐
        ToR-A      ToR-B      (both send RA: BOOT::/64)
          │          │
         NIC0      NIC1
        [ Server / iPXE (SLAAC) ]
```

Multiple metadata servers announce the same /128 via BGP — the fabric does ECMP for HA. No load balancer needed.
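As a sketch of the server side, each metadata server pins the anycast address to its loopback and announces the /128 over BGP unnumbered (FRR syntax; the ASN, interface name, and the address `2001:db8:da7a::1` are illustrative):

```
# FRR on each metadata server — announce the shared anycast /128
# (ASN, interface name, and address are illustrative)
interface lo
 ipv6 address 2001:db8:da7a::1/128
!
router bgp 65100
 neighbor swp1 interface remote-as external
 !
 address-family ipv6 unicast
  network 2001:db8:da7a::1/128
  neighbor swp1 activate
 exit-address-family
```

With two or more servers announcing the same /128, upstream routers ECMP across them; a failed server's announcement is withdrawn when its BGP session drops.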
4.3 Boot Decision Flow
The same iPXE image stays in the boot chain permanently — whether on a physical USB stick or mounted as BMC virtual media via Redfish. On every boot:
```
#!ipxe
# Get connectivity via ToR Router Advertisements
autoconf6 || true

# Machine identity: prefer SMBIOS UUID, fall back to MAC
isset ${uuid} && set id ${uuid} || set id ${mac}

# Ask metadata server: netboot or localboot?
#  - mTLS client cert authenticates this iPXE image to the metadata server
#  - Returns an iPXE script (netboot) or 404/exit (localboot)
#  - Timeout (3 s) falls through to localboot — safe default
chain --timeout 3000 https://[2001:db8:da7a::1]/v1/boot/${id} || goto localboot

:localboot
exit
```

When netbooting, the metadata server returns:
```
#!ipxe
kernel https://[2001:db8:da7a::1]/v1/artifacts/vmlinuz talos.platform=metal talos.config=https://[2001:db8:da7a::1]/v1/config/${id}
initrd https://[2001:db8:da7a::1]/v1/artifacts/initramfs.xz
boot
```

This means:
- First boot (empty disk): metadata server returns netboot script → Talos provisions
- Normal operation: metadata server returns 404/exit → iPXE passes through to disk
- Reprovision: metadata server returns netboot script → wipe and reinstall
- The USB stick or BMC virtual media never needs to be removed
One iPXE image per tenant works for all machines in that tenant — differentiation is server-side by UUID.
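The server-side decision can be sketched in a few lines (Python; the endpoint shape follows this document, while `NETBOOT_STATES` and the inventory structure are assumptions):

```python
# Minimal boot-decision sketch: map a machine's lifecycle state to an iPXE response.
# States follow Section 4.4; the inventory dict and metadata address are assumptions.

METADATA = "https://[2001:db8:da7a::1]"  # illustrative metadata server address
NETBOOT_STATES = {"new", "provisioning", "reprovision", "maintenance", "decommission"}

def boot_decision(machine_id: str, inventory: dict) -> tuple[int, str]:
    """Return (HTTP status, body) for GET /v1/boot/{id}."""
    machine = inventory.get(machine_id)
    if machine is None or machine["state"] not in NETBOOT_STATES:
        # Unknown or active machine: 404 makes iPXE fall through to localboot.
        return 404, ""
    script = (
        "#!ipxe\n"
        f"kernel {METADATA}/v1/artifacts/vmlinuz talos.platform=metal "
        f"talos.config={METADATA}/v1/config/{machine_id}\n"
        f"initrd {METADATA}/v1/artifacts/initramfs.xz\n"
        "boot\n"
    )
    return 200, script
```

Because the 404 path is the default, a metadata server outage degrades to localboot rather than blocking running machines.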
4.4 Machine Lifecycle States
| State | Boot decision | Trigger |
|---|---|---|
| New | Netboot | Machine enrolled in inventory |
| Provisioning | Netboot | First boot, installing to disk |
| Active | Localboot | Provisioning complete, normal ops |
| Reprovision | Netboot | Operator/API requests wipe + reinstall |
| Maintenance | Netboot | Boot into rescue/diag image |
| Decommission | Netboot | Wipe disk, boot into secure erase |
iPXE is stateless — it asks the metadata server every time. State transitions are managed by Nett (or by Maskin calling into Nett’s API).
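Since all state lives server-side, transitions can be guarded centrally. A sketch (the transition table itself is an assumption derived from the states above, not a specified policy):

```python
# Allowed lifecycle transitions (assumed; adapt to real operational policy).
TRANSITIONS = {
    "new": {"provisioning"},
    "provisioning": {"active"},
    "active": {"reprovision", "maintenance", "decommission"},
    "reprovision": {"provisioning"},
    "maintenance": {"active", "reprovision"},
    "decommission": set(),  # terminal state
}

def transition(state: str, target: str) -> str:
    """Validate a requested state change, raising on illegal moves."""
    if target not in TRANSITIONS.get(state, set()):
        raise ValueError(f"illegal transition {state} -> {target}")
    return target
```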
4.5 Nett / Maskin Integration
Nett owns the boot mechanism. Maskin calls into it for compute lifecycle:
```
Maskin: "provision machine X with role Y"
  → Nett metadata API: sets machine X to netboot with config Y
  → Machine boots, iPXE asks metadata server, gets config Y
  → Provisioned
  → Maskin: "machine X is active"
  → Nett metadata API: sets machine X to localboot
```

5. Key Design Decisions
5.1 iPXE as the Universal Boot Shim
iPXE is used because it:
- Runs on virtually any UEFI or BIOS system
- Supports HTTP/HTTPS natively (with `DOWNLOAD_PROTO_HTTPS`)
- Supports IPv6 global unicast and ULA (not link-local — zone ID limitation)
- Can be embedded on a tiny USB stick (~1 MB)
- Supports scripting logic — conditional branching, HTTP checks, timeouts, fallback
- Cannot run any routing protocol (BGP, OSPF) or send L2 discovery (LLDP) — entirely passive on the network control plane
Build requirements
| Feature | Build flag / config | Purpose |
|---|---|---|
| HTTPS support | DOWNLOAD_PROTO_HTTPS | Fetch from metadata server |
| IPv6 support | NET_PROTO_IPV6 | Bootstrap prefix connectivity |
| Embedded script | EMBED=script.ipxe | Bake in metadata server address |
| CA certificate | TRUST=root-ca.crt | Pin metadata server to platform CA (step-ca) |
| Client certificate | CERT=tenant.crt | mTLS — authenticates iPXE to metadata server |
| Client private key | PRIVKEY=tenant.key | mTLS — per-tenant key (step-ca issued) |
5.2 ToR Router Advertisements for Bootstrap Connectivity
In a dual-ToR BGP unnumbered fabric, each server port is an isolated L3 point-to-point link. The server has only fe80:: link-local addresses at boot. iPXE cannot use link-local in URLs (zone ID requirement) and cannot announce routes.
The ToRs solve this by always advertising a routable bootstrap prefix via RA on server-facing ports. This requires only a static configuration on the ToRs — compatible with BGP unnumbered, which already uses RAs for link-local peer discovery.
- iPXE receives the RA, does SLAAC, gets a routable address + default route via `fe80::ToR`
- Fully booted Talos nodes ignore the RA (`accept_ra=0` — standard for BGP unnumbered hosts)
- Zero dynamic logic on the ToR, zero detection needed
- The bootstrap prefix is always there on every port, harmless to running nodes
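On the Talos side, ignoring the bootstrap RA can be expressed via sysctls in the machine config. A minimal sketch (interface names are illustrative):

```yaml
# Talos machine config fragment (sketch) — ignore ToR bootstrap RAs once booted
machine:
  sysctls:
    net.ipv6.conf.eth0.accept_ra: "0"
    net.ipv6.conf.eth1.accept_ra: "0"
```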
5.3 Metadata Server API
The metadata server exposes three endpoints:
- `GET /v1/boot/{id}` — boot decision (returns iPXE script or 404)
- `GET /v1/artifacts/{name}` — kernel, initramfs
- `GET /v1/config/{id}` — full Talos machine config
The {id} is the machine’s SMBIOS UUID (preferred) or MAC address (fallback). All endpoints require a valid mTLS client certificate. The metadata server extracts the tenant identity from the certificate Subject and scopes responses accordingly — a tenant’s iPXE image can only access that tenant’s machine configs.
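Tenant extraction from the verified client certificate might look like this (Python sketch; the Subject CN convention `<tenant>.boot.<domain>` mirrors the step-ca example in Section 8.1 and is otherwise an assumption):

```python
# Derive the tenant from a TLS-verified client certificate's Subject CN.
# The CN convention "<tenant>.boot.<domain>" is an assumption.

def tenant_from_cn(common_name: str) -> str:
    tenant, sep, rest = common_name.partition(".boot.")
    if not sep or not tenant or not rest:
        raise PermissionError(f"unrecognized client CN: {common_name!r}")
    return tenant

def authorize(common_name: str, machine_tenant: str) -> bool:
    """A tenant's iPXE image may only fetch that tenant's machine configs."""
    return tenant_from_cn(common_name) == machine_tenant
```

The TLS layer (e.g. nginx, as in Section 8.1) verifies the certificate chain; this check only scopes an already-authenticated caller.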
5.4 iPXE as Permanent Boot Interceptor
The iPXE shim remains in the boot chain permanently, not just for initial provisioning:
- The same iPXE image persists: USB stick stays plugged in / BMC virtual media stays mounted
- UEFI boot order: iPXE first, disk second
- The iPXE `exit` command returns control to UEFI → boots the next entry (disk)
- Boot behavior changes without touching the machine — the decision is server-side
This transforms iPXE from a one-time installer into a remote management plane for the boot process.
5.5 Addressing: Real IPv6 Prefixes (ULA for Lab/Testing Only)
Production environments use real IPv6 prefixes (PI or provider-assigned) for the ToR bootstrap RA. The metadata server has a real routable address. This is just IPv6 working as designed — no special addressing.
ULA (fd00::/8) is supported only for lab, homelab, and testing environments where allocated IPv6 space is not available. ULA works identically — the ToR advertises a ULA prefix via RA, iPXE does SLAAC, the metadata server has a ULA address. The only difference is the addresses are not globally routable.
6. Delivery Mechanisms
Both paths deliver the same per-tenant iPXE image — identical for all machines within a tenant, differing only in the embedded client certificate. The delivery method is the only difference.
Redfish Path (BMC-equipped machines)
| Step | Action |
|---|---|
| Build iPXE image | Per-tenant image with embedded client cert + metadata server address |
| Deliver to machine | POST /redfish/v1/Managers/.../VirtualMedia — mount ISO via HTTP |
| Set boot override | PATCH /redfish/v1/Systems/.../Boot → BootSourceOverrideTarget |
| Trigger boot | POST /redfish/v1/Systems/.../Actions/ComputerSystem.Reset |
| Physical touch | None |
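The Redfish calls reduce to three small JSON payloads. A sketch (field names follow the DMTF Redfish schema; resource paths and the image URL are illustrative):

```python
# Redfish request payloads for the three steps above (sketch).
# Field names follow the DMTF Redfish schema; paths and URLs are illustrative.

def insert_media(image_url: str) -> dict:
    # POST .../VirtualMedia/<id>/Actions/VirtualMedia.InsertMedia
    return {"Image": image_url, "Inserted": True, "WriteProtected": True}

def boot_override() -> dict:
    # PATCH /redfish/v1/Systems/<id>
    return {"Boot": {
        "BootSourceOverrideTarget": "Cd",        # boot from virtual media
        "BootSourceOverrideEnabled": "Continuous",  # keep iPXE first every boot
    }}

def reset() -> dict:
    # POST .../Actions/ComputerSystem.Reset
    return {"ResetType": "ForceRestart"}
```

`Continuous` (rather than `Once`) matches the permanent-boot-interception design in Section 5.4.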
USB Path (non-BMC machines)
| Step | Action |
|---|---|
| Build iPXE image | Same per-tenant image as Redfish path |
| Deliver to machine | Flash .img to USB stick |
| Set boot override | Configure BIOS to boot from USB (once) |
| Trigger boot | Power on |
| Physical touch | Once — plug in USB, configure boot order, power on. USB stays permanently. |
7. Per-Node Configuration
In metadata server mode, one iPXE image per tenant works for all machines in that tenant. Per-node differentiation happens server-side based on SMBIOS UUID (or MAC fallback).
Machine config contents (per node)
Each Talos machine config, served by the metadata server, includes:
- Node role — control plane or worker
- Production IPv6 prefix — the address announced into the fabric via BGP
- BGP speaker configuration (FRR or similar) as a system extension or static pod
- BGP unnumbered peering over link-local on fabric-facing interfaces
- Kubernetes CNI configuration wired into the IPv6 fabric
- Cluster bootstrap secrets (certificates, tokens, encryption keys)
Once Talos boots with this config, it establishes BGP sessions with the ToRs, announces its production prefix, and the node joins the fabric. The bootstrap prefix from the ToR RA becomes irrelevant (ignored with accept_ra=0).
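A heavily abridged sketch of such a config (field names follow Talos's v1alpha1 machine config; all addresses and values are illustrative, and the BGP speaker extension is elided):

```yaml
# Abridged Talos machine config sketch — all values illustrative
version: v1alpha1
machine:
  type: worker                      # node role: controlplane or worker
  network:
    interfaces:
      - interface: lo
        addresses:
          - 2001:db8:100::1/128     # production address, announced via BGP
  sysctls:
    net.ipv6.conf.all.forwarding: "1"
cluster:
  controlPlane:
    endpoint: https://[2001:db8:100::10]:6443
```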
8. Security
8.1 mTLS with step-ca
All communication between iPXE and the metadata server is authenticated via mutual TLS (mTLS). Certificates are issued by step-ca, a lightweight ACME-capable certificate authority.
PKI structure:
```
Platform Root CA (step-ca)
├── Metadata server cert   (server identity, auto-renewed via ACME)
├── Tenant-A client cert   (baked into Tenant-A iPXE images)
├── Tenant-B client cert   (baked into Tenant-B iPXE images)
└── ...
```

At build time:
```
# Issue per-tenant client cert
step ca certificate "tenant-a.boot.dataverket.no" tenant-a.crt tenant-a.key \
  --provisioner boot-provisioner --not-after 8760h

# Compile into iPXE
make bin-x86_64-efi/ipxe.efi \
  EMBED=boot.ipxe TRUST=root-ca.crt CERT=tenant-a.crt PRIVKEY=tenant-a.key
```

At the metadata server:
```
# nginx example
ssl_verify_client on;
ssl_client_certificate /etc/ssl/root-ca.crt;
```

The server extracts the tenant identity from the client certificate Subject and scopes all responses to that tenant.
8.2 Security Properties
| Property | Status |
|---|---|
| Metadata API authentication | ✅ mTLS — per-tenant client cert (step-ca) |
| Cross-tenant isolation | ✅ Cert Subject → tenant scoping; wrong cert = no access |
| Certificate pinning | ✅ TRUST=root-ca.crt — only the platform CA is trusted |
| Rogue metadata server prevention | ✅ iPXE rejects servers not signed by platform CA |
| Machine identity | SMBIOS UUID (preferred) or MAC (fallback) |
| No unauthenticated endpoints | ✅ TLS handshake fails without valid client cert |
| `talos.config=` visible in /proc/cmdline | ⚠️ Low risk — Talos has no shell access |
| Optional: encrypted machine configs | ✅ Talos supports TPM-bound decryption |
| Cert rotation | Rebuild iPXE images with new cert (aligns with firmware update cycle) |
9. Build Tool UX
```
$ talos-fabric-builder --nodes nodes.yaml --metadata-server 2001:db8:da7a::1
Registering machine configs...  3 nodes registered
Uploading artifacts...          vmlinuz, initramfs.xz done
Redfish provisioning...         node-1, node-2 rebooted
USB image...                    usb-tenant-a.img written
3 nodes ready (2 Redfish, 1 USB)
```

The build tool generates one iPXE image per tenant (differing only in the embedded client cert) and registers per-node configs with the metadata server. BMC machines are provisioned automatically via Redfish; non-BMC machines get the same per-tenant USB image.
10. Open Questions
10.1 Bootstrap Phase for the First Metadata Server
The metadata server itself needs to be running and reachable via the fabric before any other machine can bootstrap. This implies:
- The first metadata server (and step-ca) are bootstrapped differently (e.g., from USB with a full Talos ISO or a direct-attached local server)
- Or: the metadata server runs outside the fabric (on a management host or VM) and is reachable via a static route
10.2 Production Address Assignment
How does the metadata server assign production IPv6 prefixes to nodes?
- Static mapping — prefix assigned per-node in a config file (Netbox integration)
- Dynamic allocation — metadata server allocates from a pool at first boot
- Hybrid — Netbox has the allocation, metadata server reads from Netbox API
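Under the dynamic option, allocation could be as simple as carving per-node /64s out of a pool. A sketch using Python's `ipaddress` module (the pool prefix is illustrative):

```python
import ipaddress

# Allocate per-node /64 production prefixes from a pool (sketch).
# The pool prefix is illustrative; persistence of the mapping is elided.

class PrefixAllocator:
    def __init__(self, pool: ipaddress.IPv6Network):
        self._subnets = pool.subnets(new_prefix=64)
        self._assigned: dict[str, ipaddress.IPv6Network] = {}

    def allocate(self, machine_id: str) -> ipaddress.IPv6Network:
        # Idempotent: a known machine keeps its prefix across reprovisions.
        if machine_id not in self._assigned:
            self._assigned[machine_id] = next(self._subnets)
        return self._assigned[machine_id]
```

Keyed by SMBIOS UUID, this keeps a node's prefix stable across wipe-and-reinstall cycles; the hybrid option replaces the in-memory dict with Netbox as the source of truth.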
11. Summary
Properties
| Property | Status |
|---|---|
| Works on any USB-bootable hardware | ✅ |
| Zero-touch for BMC machines (Redfish) | ✅ |
| One-touch for non-BMC machines (USB) | ✅ |
| No DHCP, PXE, TFTP, or firmware HTTP Boot | ✅ |
| Secure (mTLS via step-ca, per-tenant certs) | ✅ |
| One iPXE image per tenant, all machines | ✅ |
| Remote boot control (provision/localboot) | ✅ |
| Permanent boot interception | ✅ |
| Pure IPv6, fabric-native | ✅ |
| Anycast HA for metadata server | ✅ |
| Nett service, Maskin integration via API | ✅ |
Key Insights
ToR Router Advertisements + metadata server anycast solve the bootstrap connectivity problem cleanly within the existing fabric — no special infrastructure, no out-of-band networks, no internet dependency.
iPXE as a permanent boot interceptor transforms the USB stick or BMC virtual media from a one-time installer into a remote management plane, controllable by Nett (with Maskin calling into Nett for compute lifecycle).
The provisioning path is completely orthogonal to the production network. The bootstrap prefix from the ToR RA gets iPXE connected; the production prefix in the Talos config gets the node into the fabric via BGP. The boot interception mechanism allows Nett to force a return to provisioning at any time.
Appendix A: L2 Ethernet Plane Fabrics (Homelab / Test)
When to use
Production DC fabrics use L3-only point-to-point links between servers and dual ToRs (BGP unnumbered). This requires switches with per-port L3 routing — typically data center switches running SONiC, Cumulus, or similar.
In homelab and test environments, dedicated DC switches may not be available. An alternative is L2 Ethernet planes with a BGP mesh overlay. Each plane is a shared L2 broadcast domain (a commodity unmanaged switch), and nodes peer BGP directly with each other over the plane:
```
 Plane A (L2 switch)            Plane B (L2 switch)
┌──────────────────┐           ┌──────────────────┐
│ eth0  eth0  eth0 │           │ eth1  eth1  eth1 │
└──┬─────┬─────┬───┘           └──┬─────┬─────┬───┘
   │     │     │                  │     │     │
 Node-1 Node-2 Node-3           Node-1 Node-2 Node-3
```

Each node has one NIC per plane. Nodes form BGP sessions over each plane for redundancy. There are no dedicated router/switch control planes — the nodes are the routers.
Recommended: gateway node with metadata server
To keep the bootstrap architecture consistent with the L3 fabric design, designate one node (or a small dedicated machine) as the gateway node for each plane. This node runs:
- Router Advertisement daemon — advertises the bootstrap prefix on its plane interfaces
- Metadata server — serves boot decisions, artifacts, and machine configs
- BGP speaker — peers with the rest of the fabric, announces the metadata server's `/128`
```
        Plane A (L2 switch)
┌────────────────────────────┐
│  eth0   eth0   eth0   eth0 │
└───┬──────┬──────┬──────┬───┘
    │      │      │      │
 Gateway Node-1 Node-2 Node-3
 (RA + metadata)  (iPXE → SLAAC → metadata)
```

The gateway node sends RAs with the bootstrap prefix on its plane interface. New nodes running iPXE do autoconf6, get an address via SLAAC, and reach the metadata server — exactly like the L3 design where the ToR sends the RA, except here the gateway replaces the ToR in that role.
For the gateway node itself:
```
# radvd or FRR on the gateway — plane-facing interfaces
interface eth0
 ipv6 nd prefix fd00:b007:1ab:a::/64
 no ipv6 nd suppress-ra
interface eth1
 ipv6 nd prefix fd00:b007:1ab:b::/64
 no ipv6 nd suppress-ra
```

This means:
- iPXE sees the same thing regardless of L3 or L2 fabric — an RA with a bootstrap prefix, a default route, and a metadata server URL
- The gateway node is bootstrapped first (manually via Talos ISO or pre-installed), then it bootstraps all other nodes — the same chicken-and-egg resolution as Section 10.1
- Once all nodes are running BGP, the gateway is just another peer in the mesh — the metadata server's `/128` is announced via BGP, as in production
What stays the same
Everything else is identical to the L3 fabric design: the iPXE image, the metadata server API, the boot decision flow, the Talos machine configs, and the lifecycle management. The gateway node on L2 fills the same role as the ToR on L3 — providing RAs and a route to the metadata server.