Stateless "Dormant" Break-Glass Architecture

| Field | Value |
| --- | --- |
| Status | Accepted |
| Date | 2026-03-03 |
| Deciders | Infrastructure / Platform team |

Context

Primary day-to-day access is governed through ZITADEL (OIDC IdP) and Teleport (access gateway), providing short-lived, auditable sessions with no persistent credentials. This ADR defines the out-of-band emergency access (“break-glass”) architecture used when the primary stack is unavailable.

The design must satisfy four invariants:

  1. Stateless sovereignty — no static authorized_keys or K8s ServiceAccount tokens; access is validated cryptographically via CAs.
  2. Hardware-rooted identity — all user keys are non-exportable (Secure Enclave on macOS, YubiKey Bio on Linux).
  3. Decoupled trust — the offline CA (Nitrokey Start) is physically separated from the daily auth stack.
  4. Technical forward security — infrastructure is programmatically deaf to the offline CA until a signed, locally-delivered signal is detected on each node.

Component map

| Role | Implementation | Purpose |
| --- | --- | --- |
| User identity | Secure Enclave (Mac) / YubiKey Bio (Linux) | Non-exportable “prover” private keys. |
| Offline CA | Nitrokey Start (primary + backup) | SSH & K8s certificate authority. |
| Identity provider | ZITADEL + Teleport | Daily OIDC-based access (not part of break-glass path). |
| Signal transport | S3 bucket / USB / BMC virtual media | Out-of-band signal delivery. |
| Sentinel | systemd timer + shell script | Polls for signal, gates access. |

Decisions

D1 — Dormant-state access gating

Use SSH AuthorizedPrincipalsFile and K8s ClusterRoleBinding as the gate.

  • Normal ops: Nodes trust the Nitrokey CA public key, but the principals file and RBAC binding are empty. A validly signed certificate is still rejected.
  • Emergency ops: The Sentinel populates the allow-list only after verifying a signed signal.

No daemon or admission controller is required — sshd and kube-apiserver enforce the gate natively.
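
In sshd terms, the dormant gate is just two directives plus an empty file. A minimal sketch, with paths matching those used later in this ADR:

```
# /etc/ssh/sshd_config (dormant-gate excerpt)
TrustedUserCAKeys /etc/ssh/trusted-user-ca-keys.pem
AuthorizedPrincipalsFile /etc/ssh/break-glass-principals
```

`/etc/ssh/break-glass-principals` is pre-provisioned empty: sshd accepts the CA signature on a certificate but finds no matching principal, so the login is refused until the Sentinel writes principals into the file.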

D2 — Signal delivery via USB / virtual media (NoCloud CIDATA)

The signal is a vfat or iso9660 filesystem with the volume label CIDATA, following the cloud-init NoCloud datasource convention. It contains:

  • break-glass.json.sig — signed manifest with nonce, valid_from/valid_until timestamps, authorized principals, and a Nitrokey Ed25519 signature.
  • revoked_keys (optional) — an OpenSSH KRL file. When present, the Sentinel writes it to /etc/ssh/revoked_keys, allowing CA revocation through the same out-of-band channel as activation (see D8).
  • trusted-user-ca-keys.pem (optional) — replacement SSH CA public key. Delivered alongside a KRL during CA rotation (see D8).
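
A sketch of the manifest payload before signing (the values are invented for illustration; only the field names listed above come from this ADR):

```json
{
  "nonce": "3e1b5c9a-example-nonce",
  "valid_from": "2026-03-03T08:00:00Z",
  "valid_until": "2026-03-03T16:00:00Z",
  "principals": ["breakglass-admin"]
}
```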

Delivery options:

| Method | Target environment | Mechanism |
| --- | --- | --- |
| BMC virtual media | Bare-metal with iDRAC / iLO / Supermicro | Redfish VirtualMedia.InsertMedia mounts an ISO remotely; the OS sees a USB block device. No reboot. |
| Hypervisor attach | QEMU/KVM, VMware, Hyper-V | VM manager attaches the ISO as a virtual CD/USB. |
| Physical USB | Air-gapped or BMC-less hosts | Admin inserts a USB stick with a CIDATA-labelled vfat partition. |

The Sentinel detects the CIDATA-labelled block device via periodic lsblk --fs poll, mounts it read-only, and verifies the manifest signature.
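
The detection step can be sketched as a small parser over lsblk output (the function name and mount point are illustrative, not part of this ADR):

```shell
#!/bin/sh
# Sketch of the Sentinel's device-detection step.
# Reads `lsblk --fs -rn -o NAME,FSTYPE,LABEL` output on stdin and prints the
# first device whose filesystem label is CIDATA (cloud-init accepts either case).
find_cidata() {
  while read -r name fstype label _; do
    case "$label" in
      CIDATA|cidata) printf '%s\n' "$name"; return 0 ;;
    esac
  done
  return 1   # no signal media present
}

# In the real Sentinel this would be wired up roughly as:
#   dev=$(lsblk --fs -rn -o NAME,FSTYPE,LABEL | find_cidata) \
#     && mount -o ro "/dev/$dev" /mnt/cidata
```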

Why not IPMI in-band: requires vendor-specific kernel modules, can only carry small opaque byte strings (not a full manifest), and doesn’t work on VMs.

D3 — Dual-channel vs. degraded-mode activation

| Priority | Source | Requires network | Use case |
| --- | --- | --- | --- |
| 1 | S3 manifest + CIDATA media | Yes | Default: dual-channel verification. |
| 2 | CIDATA media alone | No | Infrastructure is down; USB carries the full signed payload. |

Degraded mode exists because the break-glass event most likely occurs when infrastructure (including S3) is unavailable. The CIDATA image carries the complete signed manifest — it is not a key-ID reference that requires a network lookup.
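
The priority order reduces to a simple fallback chain. A sketch, where `fetch_s3_manifest` and `read_cidata_manifest` are stand-ins for the real S3 fetch and CIDATA mount logic:

```shell
#!/bin/sh
# Default stubs (always fail); the real Sentinel replaces these with the
# actual S3 download and CIDATA mount-and-read steps.
fetch_s3_manifest()    { return 1; }
read_cidata_manifest() { return 1; }

# Try the networked channel first, then the local media; otherwise stay dormant.
acquire_manifest() {
  if fetch_s3_manifest 2>/dev/null; then
    echo "source=s3"
  elif read_cidata_manifest 2>/dev/null; then
    echo "source=cidata"
  else
    return 1   # no signal present; node remains dormant
  fi
}
```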

D4 — SSH & K8s CA signing via PKCS#11

The Nitrokey Start is an OpenPGP smartcard. The CA private key never leaves the hardware token.

  • SSH certificates: ssh-keygen -s with the Nitrokey exposed via PKCS#11 (pkcs11-provider or OpenSC). Alternatively, step-ca with a PKCS#11 KMS backend.
  • K8s client certificates: openssl with PKCS#11 engine or step certificate sign signs X.509 CSRs directly. Output is a client cert for kubectl / kubeconfig.
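
A concrete sketch of the ssh-keygen path, shown as a command transcript rather than something runnable here since it requires the token; the module path, key ID, and principal name are illustrative:

```shell
# Export the CA public key from the Nitrokey (OpenSC's PKCS#11 module):
ssh-keygen -D /usr/lib/opensc-pkcs11.so > trusted-user-ca-keys.pem

# Sign a user's hardware-backed public key; validity matches D7's 8 h TTL.
# The CA private key never leaves the token.
ssh-keygen -s trusted-user-ca-keys.pem -D /usr/lib/opensc-pkcs11.so \
  -I break-glass-session -n breakglass-admin -V +8h user_key.pub
```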

D5 — Dual-Nitrokey redundancy

Two Nitrokey Start devices hold identical key material:

  1. Generate CA key on an air-gapped workstation.
  2. Load onto both Nitrokeys during a key ceremony.
  3. Destroy the air-gapped copy.
  4. Store the backup in a tamper-evident, physically secured location (e.g. safe deposit box).

Either device can independently sign valid certificates. Loss of the primary does not require re-provisioning the fleet’s trusted CA public keys.

D6 — Sentinel poll interval: 60 seconds

The Sentinel runs as a systemd timer with OnUnitActiveSec=60s.

  • Activation latency: worst-case 60 s from CIDATA attach to access grant.
  • TTL enforcement precision: access persists at most 60 s past expiry before the next poll wipes it.
  • Load: one lsblk --fs invocation + at most one blkid + signature verification per minute. Negligible.
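
A minimal sketch of the timer pair, assuming the script is installed as /usr/local/sbin/break-glass-sentinel (unit names and paths illustrative):

```ini
# /etc/systemd/system/break-glass-sentinel.timer
[Unit]
Description=Break-glass Sentinel poll

[Timer]
OnBootSec=60s
OnUnitActiveSec=60s

[Install]
WantedBy=timers.target

# /etc/systemd/system/break-glass-sentinel.service
[Unit]
Description=Break-glass Sentinel

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/break-glass-sentinel
```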

D7 — Clock-skew tolerance: absorb drift with a longer TTL

During a break-glass event, NTP may be unreachable and node clocks may drift. Rather than adding clock-skew compensation logic (monotonic counters, configurable tolerance windows), the simpler engineering decision is to sign the manifest and certificates with a TTL that absorbs realistic drift.

Drift budget: commodity server hardware (no NTP) drifts ≤ 1 s/day. Even after a week without NTP, drift is under 10 s. Virtualized clocks (TSC passthrough or kvm-clock) are tighter. The only realistic large-drift scenario is a manual clock misconfiguration, which is outside the threat model.

Decision: sign with an 8-hour valid_until window. This provides a comfortable 4 h of effective working time even under a hypothetical ± 2 h drift (far exceeding any realistic hardware clock skew). The manifest payload contains explicit valid_from and valid_until UTC timestamps. The Sentinel evaluates:

if LocalTime < valid_from  → reject  (too early / replay)
if LocalTime > valid_until → reject  (expired)

SSH certificates are signed with a matching validity interval (ssh-keygen -V +8h). K8s client certificates use the same NotAfter. No additional code, no configuration knob.
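
The window check itself reduces to two integer comparisons on epoch seconds. A sketch (the function name is illustrative; the real script would derive the numbers from the manifest's UTC timestamps with `date -d "$ts" +%s`):

```shell
#!/bin/sh
# Sketch of the Sentinel's validity-window check over epoch seconds.
check_window() {  # args: <now> <valid_from> <valid_until>
  now=$1; from=$2; until_=$3
  [ "$now" -ge "$from" ]   || { echo "reject: too early / replay"; return 1; }
  [ "$now" -le "$until_" ] || { echo "reject: expired"; return 1; }
  echo "accept"
}
```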

D8 — CA revocation via S3 / CIDATA KRL

If a Nitrokey is compromised, the CA must be revoked fleet-wide. The KRL and replacement CA public key are distributed through the same channels the Sentinel already polls — S3 as the primary path, CIDATA mount as the fallback.

Setup:

  1. Pre-provision an empty KRL file on every node at /etc/ssh/revoked_keys.
  2. Reference it in sshd_config:
    RevokedKeys /etc/ssh/revoked_keys

Revocation procedure:

  1. Generate a new CA keypair on the air-gapped workstation. Load onto fresh Nitrokey pair. Destroy the air-gapped copy (same ceremony as D5).
  2. Generate a KRL revoking the compromised CA:
    ssh-keygen -k -f revoked_keys /path/to/compromised-ca.pub
  3. Primary — S3: Upload revoked_keys and trusted-user-ca-keys.pem (signed with the Minisign key) to the same S3 bucket the Sentinel already polls. Sentinel (next 60 s poll) fetches the files, verifies the Minisign signature, and writes them to /etc/ssh/.
  4. Fallback — CIDATA mount: If S3 is unreachable, build a CIDATA ISO containing both files:
    genisoimage -V cidata -o revoke.iso revoked_keys trusted-user-ca-keys.pem
    Mount fleet-wide via Redfish VirtualMedia.InsertMedia or hypervisor attach. Sentinel detects the CIDATA image and applies the files identically.

The Sentinel verifies both channels using the pre-deployed Minisign verification key, which is independent of the SSH CA being rotated. No backup CAs need to be pre-deployed — the Minisign key is the root of trust, delivering a replacement CA public key alongside the KRL atomically. sshd re-reads both files on every connection; no restart required.
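
The atomic delivery can be implemented with the usual write-then-rename pattern. A sketch of the install step, run only after `minisign -Vm` has already verified the file against the pre-deployed key (function name and temp-file scheme are illustrative):

```shell
#!/bin/sh
# Sketch: install an already-verified file without ever exposing a partial write.
install_verified() {  # args: <verified source> <destination>
  src=$1; dst=$2
  tmp="${dst}.tmp.$$"
  cp "$src" "$tmp"
  chmod 0644 "$tmp"
  mv "$tmp" "$dst"   # rename() is atomic within one filesystem
}

# Usage in the Sentinel:
#   install_verified /mnt/cidata/revoked_keys             /etc/ssh/revoked_keys
#   install_verified /mnt/cidata/trusted-user-ca-keys.pem /etc/ssh/trusted-user-ca-keys.pem
```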

Kubernetes — replace --client-ca-file:

kube-apiserver does not support KRL natively. The Sentinel can write a replacement CA bundle from the CIDATA image, but the apiserver requires a restart to pick it up. On Talos this is a machine-config patch applied by the Sentinel; on kubeadm clusters it is a control-plane manifest update. Both are already part of the CA rotation runbook.

Why not OCSP / CRL responder: requires a running endpoint — exactly the infrastructure likely to be down during a break-glass event.

D9 — Talos Linux integration

Talos lacks a traditional shell. The Sentinel runs as a privileged container or external orchestrator, detects the CIDATA signal, and applies a Talos machine-config patch to enable the talosctl admin API for the Nitrokey-signed certificate.

Talos already trusts SMBIOS for nocloud machine configuration at boot, establishing a hardware-to-OS trust path. The Sentinel extends this trust model at runtime using USB/virtual media with a CIDATA-labelled filesystem.

D10 — Fleet activation orchestration

Mounting the CIDATA ISO on 20+ nodes uses a simple automation script:

  • Bare-metal: Loop over BMC endpoints calling Redfish VirtualMedia.InsertMedia with the ISO URL. A single pre-built ISO is served from an admin laptop or local HTTP server.
  • VMs: Hypervisor CLI/API attaches the ISO as a virtual CD-ROM (e.g. virsh attach-disk, govc device.cdrom.insert, PowerCLI New-CDDrive).
  • Physical USB: For a handful of air-gapped nodes, manual insertion. Not intended for fleet-scale use.

The ISO is identical for all nodes — it contains no per-node state. The manifest’s authorized principals apply fleet-wide.
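
A sketch of the bare-metal loop; the Redfish manager ID ("1") and virtual-media ID ("CD") vary by vendor, and the helper names and host list are illustrative:

```shell
#!/bin/sh
# Sketch of fleet-wide virtual-media attach over Redfish.

# Build the InsertMedia request body for a given ISO URL.
build_insert_payload() {
  printf '{"Image": "%s"}' "$1"
}

# Attach the ISO on one BMC (not executed here; requires a reachable BMC):
#   attach_iso() {
#     curl -sk -u "$BMC_USER:$BMC_PASS" -H 'Content-Type: application/json' \
#       -d "$(build_insert_payload "$ISO_URL")" \
#       "https://$1/redfish/v1/Managers/1/VirtualMedia/CD/Actions/VirtualMedia.InsertMedia"
#   }
#   for bmc in $(cat bmc-hosts.txt); do attach_iso "$bmc"; done
```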

Architecture

(Sequence diagram, reconstructed as prose.) Pre-provisioning, one-time: the Admin deploys the CA public key, an empty KRL, an empty principals file, the Sentinel script, and the Minisign verification key to every node. Break-glass event: the Admin signs the session manifest and certificates (8 h TTL) with the Nitrokey Start, uploads the manifest to S3, builds a CIDATA ISO (manifest, plus KRL if needed), and mounts it fleet-wide via Redfish or the hypervisor. Every 60 s the Sentinel fetches the S3 manifest (primary) or checks for a CIDATA block device (fallback), verifies the Minisign signature, and, while valid_from ≤ now ≤ valid_until, populates the principals file and RBAC binding and applies any KRL or replacement CA key; once expired, it empties the principals file and deletes the binding (self-healing: access auto-revoked after valid_until). Operators connect with Nitrokey-signed certificates and are granted access on principal match.

Security lifecycle

Pre-provisioning

  1. Deploy Nitrokey CA public key to all nodes (/etc/ssh/trusted-user-ca-keys.pem, kube-apiserver --client-ca-file).
  2. Deploy empty KRL (/etc/ssh/revoked_keys) and add RevokedKeys directive to sshd_config.
  3. Deploy Sentinel script, systemd timer, and manifest verification public key (Minisign/age).
  4. Nodes are now dormant — CA-trusting but principal-rejecting.

Break-glass event

  1. Retrieve Nitrokey. Sign a session manifest: { nonce, valid_from, valid_until, principals[] }.
  2. Sign SSH certs (ssh-keygen -V +8h) and/or K8s client certs with matching validity.
  3. Push manifest to S3. Build CIDATA ISO (genisoimage -V cidata -o signal.iso break-glass.json.sig [revoked_keys]).
  4. Mount ISO fleet-wide via Redfish VirtualMedia.InsertMedia or hypervisor attach.
  5. Sentinel (next 60 s poll) detects CIDATA, verifies signature, populates principals/RBAC, and applies KRL if present.
  6. Operators connect with Nitrokey-signed SSH/K8s certificates.

Self-healing

  • Sentinel checks valid_until on every 60 s cycle. Once LocalTime > valid_until, it empties the principals file and deletes the RBAC binding.
  • The node returns to dormant state. No manual intervention required.
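
The wipe itself is deliberately trivial. A sketch (the binding name and `PRINCIPALS_FILE` variable are illustrative; kubectl is assumed available on control nodes):

```shell
#!/bin/sh
# Sketch of the Sentinel's return-to-dormant step.
wipe_access() {
  : > "$PRINCIPALS_FILE"   # truncate in place; sshd re-reads it per connection
  kubectl delete clusterrolebinding break-glass-admin --ignore-not-found
}
```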

Consequences

Positive

  • Zero standing access — no admin backdoor exists in etcd or /etc/ssh during normal operations.
  • No network dependency for activation — CIDATA USB works when S3, DNS, and NTP are all down.
  • No custom daemons — sshd and kube-apiserver enforce access natively; the Sentinel is a stateless shell script on a timer.
  • No clock-skew logic — the 8 h TTL absorbs any realistic hardware drift without additional code.
  • No custom revocation infrastructure — OpenSSH KRL is delivered through the same CIDATA channel as activation; no separate distribution path.

Negative

  • Physical token dependency — break-glass is impossible without physical access to a Nitrokey. This is by design, but means geographic distribution of the backup token matters.
  • ISO rebuild per event — each break-glass event requires signing a new manifest and building a new ISO. A small make break-glass script eliminates friction.
  • CA rotation is disruptive — revoking a compromised Nitrokey requires fleet-wide CIDATA mount (KRL for SSH) or config push (CA bundle for K8s). This is rare and reuses existing channels.