| Field | Value |
|---|---|
| Status | Accepted |
| Date | 2026-03-03 |
| Deciders | Infrastructure / Platform team |
Context
Primary day-to-day access is governed through ZITADEL (OIDC IdP) and Teleport (access gateway), providing short-lived, auditable sessions with no persistent credentials. This ADR defines the out-of-band emergency access (“break-glass”) architecture used when the primary stack is unavailable.
The design must satisfy four invariants:
- Stateless sovereignty: no static `authorized_keys` or K8s ServiceAccount tokens; access is validated cryptographically against CA signatures.
- Hardware-rooted identity: all user keys are non-exportable (Secure Enclave on macOS, YubiKey Bio on Linux).
- Decoupled trust: the offline CA (Nitrokey Start) is physically separated from the daily auth stack.
- Technical forward security: infrastructure is programmatically deaf to the offline CA until a signed, locally delivered signal is detected on each node.
Component map
| Role | Implementation | Purpose |
|---|---|---|
| User identity | Secure Enclave (Mac) / YubiKey Bio (Linux) | Non-exportable “prover” private keys. |
| Offline CA | Nitrokey Start (primary + backup) | SSH & K8s certificate authority. |
| Identity provider | ZITADEL + Teleport | Daily OIDC-based access (not part of break-glass path). |
| Signal transport | S3 bucket / USB / BMC virtual media | Out-of-band signal delivery. |
| Sentinel | systemd timer + shell script | Polls for signal, gates access. |
Decisions
D1 — Dormant-state access gating
Use SSH `AuthorizedPrincipalsFile` and a K8s `ClusterRoleBinding` as the gate.
- NormalOps: Nodes trust the Nitrokey CA public key, but the principals file and RBAC binding are empty. A validly signed certificate is still rejected.
- EmergOps: The Sentinel populates the allow-list only after verifying a signed signal.
No daemon or admission controller is required — sshd and kube-apiserver enforce the gate natively.
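The EmergOps transition can be sketched as a pair of small shell functions. This is a minimal sketch under stated assumptions: `sshd_config` is assumed to point `AuthorizedPrincipalsFile` at a hypothetical per-user directory `/etc/ssh/auth_principals/%u`, the binding name `break-glass` is hypothetical, and "empty" RBAC is modelled as the binding being absent. Break-glass K8s certificates are assumed to carry the group in the subject's `O` field.

```shell
#!/usr/bin/env bash
# Sketch: open the SSH and K8s gates after the manifest has been verified.
# Assumed sshd_config line (hypothetical path):
#   AuthorizedPrincipalsFile /etc/ssh/auth_principals/%u
# An empty or missing file makes sshd reject every certificate for that login.

# open_ssh_gate DIR LOGIN PRINCIPAL...
# Writes one principal per line to DIR/LOGIN via temp file + rename,
# so sshd never reads a half-written allow-list.
open_ssh_gate() {
  local dir=$1 login=$2 tmp
  shift 2
  [ "$#" -gt 0 ] || return 1                 # refuse to "open" with no principals
  tmp=$(mktemp "$dir/.$login.XXXXXX") || return 1
  if printf '%s\n' "$@" > "$tmp" && chmod 0644 "$tmp" && mv -f "$tmp" "$dir/$login"; then
    return 0
  fi
  rm -f "$tmp"
  return 1
}

# open_k8s_gate GROUP
# Binds the certificate group (O= field) to cluster-admin; binding name is hypothetical.
open_k8s_gate() {
  kubectl create clusterrolebinding break-glass \
    --clusterrole=cluster-admin --group="$1"
}
```

Writing the file through a rename keeps each update atomic from sshd's point of view, which reads the principals file on every authentication attempt.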
D2 — Signal delivery via USB / virtual media (NoCloud CIDATA)
The signal is a vfat or iso9660 filesystem with volume label CIDATA (cloud-init also accepts lowercase `cidata`), following cloud-init's NoCloud convention for a drive with a labelled filesystem. It contains:
- `break-glass.json.sig`: signed manifest with `nonce`, `valid_from`/`valid_until` timestamps, authorized principals, and a Nitrokey Ed25519 signature.
- `revoked_keys` (optional): an OpenSSH KRL file. When present, the Sentinel writes it to `/etc/ssh/revoked_keys`, allowing CA revocation through the same out-of-band channel as activation (see D8).
- `trusted-user-ca-keys.pem` (optional): replacement SSH CA public key, delivered alongside a KRL during CA rotation (see D8).
Delivery options:
| Method | Target environment | Mechanism |
|---|---|---|
| BMC virtual media | Bare-metal with iDRAC / iLO / Supermicro | Redfish VirtualMedia.InsertMedia mounts an ISO remotely; OS sees a USB block device. No reboot. |
| Hypervisor attach | QEMU/KVM, VMware, Hyper-V | VM manager attaches ISO as virtual CD/USB. |
| Physical USB | Air-gapped or BMC-less hosts | Admin inserts a USB stick with CIDATA-labelled vfat partition. |
The Sentinel detects the CIDATA-labelled block device via a periodic `lsblk --fs` poll, mounts it read-only, and verifies the manifest signature.
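The detection step can be sketched as a filter over `lsblk` output. This assumes a util-linux version that offers the `PATH` column (older versions can use `NAME` and prepend `/dev/`). The mount point in the usage comment is hypothetical. The match is case-insensitive because `genisoimage -V cidata` produces the lowercase label.

```shell
#!/usr/bin/env bash
# Sketch: locate a NoCloud signal volume.
# Expected input: output of `lsblk -rno PATH,LABEL` (raw mode, no header).

# find_cidata: print the first device whose label is cidata/CIDATA.
find_cidata() {
  local path label
  while read -r path label; do
    case "$label" in
      cidata|CIDATA) printf '%s\n' "$path"; return 0 ;;
    esac
  done
  return 1
}

# mount_cidata_ro DEVICE MOUNTPOINT
# Read-only, and no exec/suid/device files: the media is untrusted until verified.
mount_cidata_ro() {
  mount -o ro,noexec,nosuid,nodev "$1" "$2"
}

# Typical use inside the Sentinel (mount point is hypothetical):
#   dev=$(lsblk -rno PATH,LABEL | find_cidata) && mount_cidata_ro "$dev" /run/break-glass
```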
Why not IPMI in-band: requires vendor-specific kernel modules, can only carry small opaque byte strings (not a full manifest), and doesn’t work on VMs.
D3 — Dual-channel vs. degraded-mode activation
| Priority | Source | Requires network | Use case |
|---|---|---|---|
| 1 | S3 manifest + CIDATA media | Yes | Default: dual-channel verification. |
| 2 | CIDATA media alone | No | Infrastructure is down; USB carries full signed payload. |
Degraded mode exists because the break-glass event most likely occurs when infrastructure (including S3) is unavailable. The CIDATA image carries the complete signed manifest — it is not a key-ID reference that requires a network lookup.
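The priority order can be sketched as a selection function over manifests that have already been fetched. One assumption goes beyond the table above: "dual-channel" is taken to mean the S3 copy must be byte-identical to the CIDATA copy, and the sketch fails closed when the two disagree. Signature verification still runs separately on whichever copy is used. S3 alone never activates, in line with the "locally delivered signal" invariant.

```shell
#!/usr/bin/env bash
# Sketch of D3's priority order. Arguments are paths to fetched manifests;
# pass an empty string for an unavailable channel.

# select_mode CIDATA_MANIFEST S3_MANIFEST
# Prints "dual", "degraded", or "none"; non-zero exit means do not activate.
select_mode() {
  local media=$1 s3=$2
  if [ -z "$media" ] || [ ! -f "$media" ]; then
    echo none; return 1               # no local media: S3 alone never activates
  fi
  if [ -n "$s3" ] && [ -f "$s3" ]; then
    if cmp -s "$media" "$s3"; then
      echo dual; return 0             # priority 1: both channels agree
    fi
    echo none; return 1               # channels disagree: fail closed
  fi
  echo degraded; return 0             # priority 2: S3 unreachable, media alone
}
```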
D4 — SSH & K8s CA signing via PKCS#11
The Nitrokey Start is an OpenPGP smartcard. The CA private key never leaves the hardware token.
- SSH certificates: `ssh-keygen -s` with the Nitrokey exposed through a PKCS#11 module (OpenSC) via `-D`. Alternatively, `step-ca` with a PKCS#11 KMS backend.
- K8s client certificates: `openssl` with a PKCS#11 engine (or `pkcs11-provider` on OpenSSL 3), or `step certificate sign`, signs X.509 CSRs directly. The output is a client certificate for `kubectl` / kubeconfig.
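SSH signing through PKCS#11 can be sketched as a thin wrapper around `ssh-keygen`. With `-D`, `ssh-keygen` takes the *public* half of the CA via `-s` and performs the signature on the token. The module path, file names, and identity naming are assumptions. A `DRY_RUN` switch prints the command instead of running it.

```shell
#!/usr/bin/env bash
# Sketch: issue a break-glass SSH user certificate with the CA key on the Nitrokey.
# Hypothetical default: OpenSC's PKCS#11 module path on Debian/Ubuntu x86_64.
PKCS11_MODULE=${PKCS11_MODULE:-/usr/lib/x86_64-linux-gnu/opensc-pkcs11.so}

# run: print the command instead of executing it when DRY_RUN is set.
run() {
  if [ -n "${DRY_RUN:-}" ]; then printf '%s\n' "$*"; else "$@"; fi
}

# sign_ssh_cert CA_PUB USER_PUB IDENTITY PRINCIPALS [VALIDITY]
# PRINCIPALS is comma-separated; the private CA key never leaves the token.
sign_ssh_cert() {
  local ca_pub=$1 user_pub=$2 identity=$3 principals=$4 validity=${5:-+8h}
  run ssh-keygen -s "$ca_pub" -D "$PKCS11_MODULE" \
    -I "$identity" -n "$principals" -V "$validity" "$user_pub"
}

# Usage: sign_ssh_cert nitrokey-ca.pub alice.pub alice-bg alice-bg
#   -> writes alice-cert.pub next to alice.pub
```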
D5 — Dual-Nitrokey redundancy
Two Nitrokey Start devices hold identical key material:
- Generate CA key on an air-gapped workstation.
- Load onto both Nitrokeys during a key ceremony.
- Destroy the air-gapped copy.
- Store the backup in a tamper-evident, physically secured location (e.g. safe deposit box).
Either device can independently sign valid certificates. Loss of the primary does not require re-provisioning the fleet’s trusted CA public keys.
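After the ceremony, it is worth confirming that both tokens expose the same CA key, since fleet trust depends on it. A minimal sketch follows, assuming each token's public key has been exported with `ssh-keygen -D <module> > primary.pub` (then `backup.pub` with the other token inserted). The file names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: compare exported public keys from the primary and backup Nitrokeys.

# key_material FILE: "type base64" per key, comments dropped, sorted.
key_material() {
  awk 'NF >= 2 { print $1, $2 }' "$1" | sort
}

# same_ca_key A B: succeed only if both files list exactly the same, non-empty key set.
same_ca_key() {
  local a b
  a=$(key_material "$1"); b=$(key_material "$2")
  [ -n "$a" ] && [ "$a" = "$b" ]
}

# Usage: same_ca_key primary.pub backup.pub && echo "tokens match"
```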
D6 — Sentinel poll interval: 60 seconds
The Sentinel runs as a systemd timer with OnUnitActiveSec=60s.
- Activation latency: worst-case 60 s from CIDATA attach to access grant.
- TTL enforcement precision: access persists at most 60 s past expiry before the next poll wipes it.
- Load: one `lsblk --fs` invocation plus at most one `blkid` and one signature verification per minute. Negligible.
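The timer can be sketched as a pair of unit files. The unit names and script path are hypothetical. Two details affect the 60 s figures above. First, `OnUnitActiveSec=` only re-arms after the service has run once, so `OnBootSec=` starts the cycle. Second, `AccuracySec=` defaults to 1 min, which could roughly double the worst-case latency, so it is lowered here.

```shell
#!/usr/bin/env bash
# Sketch: write the Sentinel's systemd units into DIR (default /etc/systemd/system).

write_sentinel_units() {
  local dir=${1:-/etc/systemd/system}
  cat > "$dir/break-glass-sentinel.service" <<'EOF'
[Unit]
Description=Break-glass Sentinel (single poll)

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/break-glass-sentinel
EOF
  cat > "$dir/break-glass-sentinel.timer" <<'EOF'
[Unit]
Description=Run the break-glass Sentinel every 60 s

[Timer]
OnBootSec=60s
OnUnitActiveSec=60s
AccuracySec=1s

[Install]
WantedBy=timers.target
EOF
}

# Usage: write_sentinel_units && systemctl daemon-reload \
#          && systemctl enable --now break-glass-sentinel.timer
```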
D7 — Clock-skew tolerance: absorb drift with a longer TTL
During a break-glass event, NTP may be unreachable and node clocks may drift. Rather than adding clock-skew compensation logic (monotonic counters, configurable tolerance windows), the simpler engineering decision is to sign the manifest and certificates with a TTL that absorbs realistic drift.
Drift budget: free-running oscillators on commodity server hardware (no NTP) are typically accurate to tens of ppm, i.e. a few seconds per day (50 ppm × 86,400 s ≈ 4.3 s/day). Even after a week without NTP, drift stays well under a minute. Virtualized clocks (TSC passthrough or kvm-clock) are tighter. The only realistic large-drift scenario is a manual clock misconfiguration, which is outside the threat model.
Decision: sign with an 8-hour `valid_from`/`valid_until` window. Even if individual nodes drift by a hypothetical ± 2 h in opposite directions (far beyond any realistic hardware clock skew), the interval that is valid on every node is still 8 h − 2 h − 2 h = 4 h of working time. The manifest payload contains explicit `valid_from` and `valid_until` UTC timestamps. The Sentinel evaluates:
```
if LocalTime < valid_from  → reject (too early / replay)
if LocalTime > valid_until → reject (expired)
```

SSH certificates are signed with a matching validity interval (`ssh-keygen -V +8h`). K8s client certificates use the same `NotAfter`. No additional code, no configuration knob.
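The two comparisons translate directly into a shell check. For simplicity, this sketch assumes the timestamps are Unix epoch seconds (UTC). Extracting them from the manifest is not shown. Boundaries are inclusive, matching the strict inequalities in the rejection rules.

```shell
#!/usr/bin/env bash
# Sketch of the Sentinel's validity-window check (epoch seconds assumed).

# check_window NOW VALID_FROM VALID_UNTIL
# Prints "ok", "too-early", or "expired"; non-zero exit means the gate stays closed.
check_window() {
  local now=$1 valid_from=$2 valid_until=$3
  if [ "$now" -lt "$valid_from" ]; then echo too-early; return 1; fi
  if [ "$now" -gt "$valid_until" ]; then echo expired; return 1; fi
  echo ok
}

# Example: 8 h window starting at t=1000, checked 1 h in:
#   check_window $((1000 + 3600)) 1000 $((1000 + 8 * 3600))   # prints "ok"
```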
D8 — CA revocation via S3 / CIDATA KRL
If a Nitrokey is compromised, the CA must be revoked fleet-wide. The KRL and replacement CA public key are distributed through the same channels the Sentinel already polls — S3 as the primary path, CIDATA mount as the fallback.
Setup:
- Pre-provision an empty KRL file on every node at `/etc/ssh/revoked_keys`.
- Reference it in `sshd_config`: `RevokedKeys /etc/ssh/revoked_keys`
Revocation procedure:
- Generate a new CA keypair on the air-gapped workstation, load it onto a fresh Nitrokey pair, and destroy the air-gapped copy (same ceremony as D5).
- Generate a KRL revoking the compromised CA key. Listing the CA public key itself causes sshd to reject every certificate it signed:
  `ssh-keygen -k -f revoked_keys /path/to/compromised-ca.pub`
- Primary (S3): Upload `revoked_keys` and `trusted-user-ca-keys.pem`, each signed with the Minisign key, to the same S3 bucket the Sentinel already polls. On its next 60 s poll, the Sentinel fetches the files, verifies the Minisign signatures, and writes them to `/etc/ssh/`.
- Fallback (CIDATA mount): If S3 is unreachable, build a CIDATA ISO containing both files and their signatures. Joliet/Rock Ridge (`-J -r`) keeps the long filenames intact:
  `genisoimage -V cidata -J -r -o revoke.iso revoked_keys revoked_keys.minisig trusted-user-ca-keys.pem trusted-user-ca-keys.pem.minisig`
  Mount it fleet-wide via Redfish `VirtualMedia.InsertMedia` or a hypervisor attach. The Sentinel detects the CIDATA image and applies the files identically.
The Sentinel verifies files from both channels with the pre-deployed Minisign verification key, which is independent of the SSH CA being rotated. No backup CAs need to be pre-deployed: because the Minisign key is the root of trust, a replacement CA public key can be delivered in the same update as the KRL. `sshd` re-reads both files on every authentication attempt, so no restart is required.
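Applying a revocation bundle can be sketched as "verify everything, then install each file by rename." The following are assumptions: each file ships with a detached `<name>.minisig` signature (minisign's default naming), and the verification key lives at a hypothetical `MINISIGN_PUB` path. Each file is replaced atomically, but the pair is not, so the KRL is installed first.

```shell
#!/usr/bin/env bash
# Sketch: apply a KRL and/or replacement CA from a channel directory
# (CIDATA mount point or S3 download directory).
MINISIGN_PUB=${MINISIGN_PUB:-/etc/break-glass/minisign.pub}   # hypothetical path

# verify_file FILE: minisign reads FILE.minisig by default.
verify_file() {
  minisign -V -q -p "$MINISIGN_PUB" -m "$1"
}

# apply_revocation SRC_DIR DEST_DIR
# Verifies every present file before touching DEST_DIR; any bad signature changes nothing.
apply_revocation() {
  local src=$1 dest=$2 f tmp files=()
  for f in revoked_keys trusted-user-ca-keys.pem; do   # KRL first on purpose
    [ -f "$src/$f" ] && files+=("$f")
  done
  [ "${#files[@]}" -gt 0 ] || return 0                 # nothing to apply
  for f in "${files[@]}"; do
    verify_file "$src/$f" || return 1
  done
  for f in "${files[@]}"; do
    tmp=$(mktemp "$dest/.$f.XXXXXX") || return 1
    cp "$src/$f" "$tmp" && chmod 0644 "$tmp" && mv -f "$tmp" "$dest/$f" || return 1
  done
}
```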
Kubernetes — replace --client-ca-file:
kube-apiserver does not support KRL natively. The Sentinel can write a replacement CA bundle from the CIDATA image, but the apiserver requires a restart to pick it up. On Talos this is a machine-config patch applied by the Sentinel; on kubeadm clusters it is a control-plane manifest update. Both are already part of the CA rotation runbook.
Why not OCSP / CRL responder: requires a running endpoint — exactly the infrastructure likely to be down during a break-glass event.
D9 — Talos Linux integration
Talos lacks a traditional shell. The Sentinel runs as a privileged container or external orchestrator, detects the CIDATA signal, and applies a Talos machine-config patch to enable the talosctl admin API for the Nitrokey-signed certificate.
Talos already trusts SMBIOS for nocloud machine configuration at boot, establishing a hardware-to-OS trust path. The Sentinel extends this trust model at runtime using USB/virtual media with a CIDATA-labelled filesystem.
D10 — Fleet activation orchestration
Mounting the CIDATA ISO on 20+ nodes uses a simple automation script:
- Bare-metal: Loop over BMC endpoints calling Redfish `VirtualMedia.InsertMedia` with the ISO URL. A single pre-built ISO is served from an admin laptop or local HTTP server.
- VMs: The hypervisor CLI/API attaches the ISO as a virtual CD-ROM (e.g. `virsh attach-disk`, `govc device.cdrom.insert`, PowerCLI `New-CDDrive`).
- Physical USB: For a handful of air-gapped nodes, manual insertion. Not intended for fleet-scale use.
The ISO is identical for all nodes — it contains no per-node state. The manifest’s authorized principals apply fleet-wide.
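The bare-metal loop can be sketched with `curl` against the standard Redfish `VirtualMedia.InsertMedia` action. The following are assumptions and vendor-specific details: the manager and virtual-media resource IDs differ between iDRAC, iLO, and Supermicro (some firmware exposes virtual media under `Systems/` instead), credentials come from hypothetical `BMC_USER`/`BMC_PASS` variables, and TLS verification is disabled because many BMCs ship self-signed certificates.

```shell
#!/usr/bin/env bash
# Sketch of D10's bare-metal path via Redfish.

# insert_media_url BMC_HOST MANAGER_ID MEDIA_ID
insert_media_url() {
  printf 'https://%s/redfish/v1/Managers/%s/VirtualMedia/%s/Actions/VirtualMedia.InsertMedia\n' \
    "$1" "$2" "$3"
}

# insert_media BMC_HOST MANAGER_ID MEDIA_ID ISO_URL
insert_media() {
  curl -fsS -k -u "$BMC_USER:$BMC_PASS" \
    -H 'Content-Type: application/json' \
    -X POST -d "{\"Image\": \"$4\", \"Inserted\": true, \"WriteProtected\": true}" \
    "$(insert_media_url "$1" "$2" "$3")"
}

# Usage: the same ISO for every node, served from an admin laptop (URL hypothetical).
#   for bmc in bmc01 bmc02 bmc03; do
#     insert_media "$bmc" iDRAC.Embedded.1 CD http://10.0.0.5:8000/signal.iso
#   done
```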
Architecture
Security lifecycle
Pre-provisioning
- Deploy the Nitrokey CA to all nodes: the OpenSSH-format SSH CA public key to `/etc/ssh/trusted-user-ca-keys.pem`, and the X.509 CA certificate to the kube-apiserver `--client-ca-file`.
- Deploy an empty KRL (`/etc/ssh/revoked_keys`) and add the `RevokedKeys` directive to `sshd_config`.
- Deploy the Sentinel script, systemd timer, and manifest verification public key (Minisign).
- Nodes are now dormant: CA-trusting but principal-rejecting.
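The SSH half of pre-provisioning can be sketched against a `ROOT` prefix so it can be tried on a scratch directory. The `auth_principals/` directory is a hypothetical location for the per-user allow-lists. Despite its `.pem` suffix, `TrustedUserCAKeys` expects OpenSSH-format public key lines. A zero-byte `revoked_keys` is valid because sshd also accepts a plain list of keys, here an empty one.

```shell
#!/usr/bin/env bash
# Sketch: idempotent SSH pre-provisioning under ROOT.
# Assumes sshd_config has no trailing Match block (directives after Match are scoped to it).

provision_ssh() {
  local root=$1 ca_pub=$2 cfg line
  install -d -m 0755 "$root/etc/ssh/auth_principals"                  # empty dir = dormant
  install -m 0644 "$ca_pub" "$root/etc/ssh/trusted-user-ca-keys.pem"
  [ -e "$root/etc/ssh/revoked_keys" ] || : > "$root/etc/ssh/revoked_keys"   # empty KRL
  cfg="$root/etc/ssh/sshd_config"
  touch "$cfg"
  # Append each directive only if it is not already present verbatim.
  while read -r line; do
    grep -qxF "$line" "$cfg" || printf '%s\n' "$line" >> "$cfg"
  done <<'EOF'
TrustedUserCAKeys /etc/ssh/trusted-user-ca-keys.pem
AuthorizedPrincipalsFile /etc/ssh/auth_principals/%u
RevokedKeys /etc/ssh/revoked_keys
EOF
}
```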
Break-glass event
- Retrieve the Nitrokey. Sign a session manifest: `{ nonce, valid_from, valid_until, principals[] }`.
- Sign SSH certificates (`ssh-keygen -V +8h`) and/or K8s client certificates with matching validity.
- Push the manifest to S3. Build the CIDATA ISO (`genisoimage -V cidata -J -r -o signal.iso break-glass.json.sig [revoked_keys]`).
- Mount the ISO fleet-wide via Redfish `VirtualMedia.InsertMedia` or hypervisor attach.
- The Sentinel (next 60 s poll) detects CIDATA, verifies the signature, populates principals/RBAC, and applies the KRL if present.
- Operators connect with Nitrokey-signed SSH/K8s certificates.
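The manifest and ISO steps could be wrapped in a small script. This sketch relies on several assumptions: timestamps are encoded as Unix epoch seconds, principals contain no characters that need JSON escaping, and the Nitrokey signing step is a hypothetical `sign_manifest` that is not shown. The `-J -r` flags add Joliet and Rock Ridge names. Without them, plain ISO9660 level 1 truncates filenames to 8.3.

```shell
#!/usr/bin/env bash
# Sketch of the manifest + ISO steps of a break-glass event.

# make_manifest NONCE VALID_FROM TTL_SECONDS PRINCIPAL...
make_manifest() {
  local nonce=$1 from=$2 ttl=$3 p list=""
  shift 3
  for p in "$@"; do list+="${list:+, }\"$p\""; done
  printf '{"nonce": "%s", "valid_from": %d, "valid_until": %d, "principals": [%s]}\n' \
    "$nonce" "$from" "$((from + ttl))" "$list"
}

# new_nonce: 128 random bits as 32 hex characters.
new_nonce() {
  od -An -N16 -tx1 /dev/urandom | tr -d ' \n'
}

# build_signal_iso OUT FILE...
build_signal_iso() {
  local out=$1; shift
  genisoimage -quiet -V cidata -J -r -o "$out" "$@"
}

# Usage:
#   make_manifest "$(new_nonce)" "$(date -u +%s)" $((8 * 3600)) alice-bg > break-glass.json
#   sign_manifest break-glass.json        # hypothetical: produces break-glass.json.sig
#   build_signal_iso signal.iso break-glass.json.sig
```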
Self-healing
- The Sentinel checks `valid_until` on every 60 s cycle. Once `LocalTime > valid_until`, it empties the principals file and deletes the RBAC binding.
- The node returns to the dormant state. No manual intervention is required.
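The self-healing step can be sketched as follows. Assumptions: the per-user principals files live in one directory (hypothetical `/etc/ssh/auth_principals`), the binding is named `break-glass` (hypothetical), and times are Unix epoch seconds. The files are truncated rather than deleted, which returns sshd to rejecting all certificates.

```shell
#!/usr/bin/env bash
# Sketch of the Sentinel's self-healing step.

# close_gate PRINCIPALS_DIR: truncate every allow-list file.
close_gate() {
  local f
  for f in "$1"/*; do
    [ -f "$f" ] && : > "$f"
  done
  return 0
}

# heal_if_expired NOW VALID_UNTIL PRINCIPALS_DIR
heal_if_expired() {
  [ "$1" -gt "$2" ] || return 0                 # still inside the window
  close_gate "$3"
  if command -v kubectl >/dev/null 2>&1; then
    kubectl delete clusterrolebinding break-glass --ignore-not-found
  fi
}
```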
Consequences
Positive
- Zero standing access: no admin backdoor exists in `etcd` or `/etc/ssh` during normal operations.
- No network dependency for activation: CIDATA USB works when S3, DNS, and NTP are all down.
- No custom daemons: `sshd` and `kube-apiserver` enforce access natively; the Sentinel is a stateless shell script on a timer.
- No clock-skew logic: the 8 h TTL absorbs any realistic hardware drift without additional code.
- No custom revocation infrastructure: the OpenSSH KRL is delivered through the same S3 and CIDATA channels as activation, with no separate distribution path.
Negative
- Physical token dependency: break-glass is impossible without physical access to a Nitrokey. This is by design, but it means the geographic placement of the backup token matters.
- ISO rebuild per event: each break-glass event requires signing a new manifest and building a new ISO. A small `make break-glass` script eliminates the friction.
- CA rotation is disruptive: revoking a compromised Nitrokey requires a fleet-wide KRL and CA push for SSH (S3, or CIDATA mount as fallback) and a CA bundle update with an apiserver restart for K8s. This is rare and reuses existing channels.