Conversation Summary (2026-03-03)

This document summarizes the architecture ideas, design decisions, and conclusions from the conversation. It is written to be reusable in future sessions.


1) Core Goal and Philosophy

We are designing an edge / DC platform to manage:

  • KVM/QEMU VMs (primary)
  • Bare metal workloads (also important)
  • Migration and backup across heterogeneous environments (VM↔VM, VM↔BM, BM↔BM)
  • Provisioning with strong security properties (TPM attestation, signed approvals)
  • A consistent “minimal code, minimal attack surface” approach

Guiding principles

  • Minimize custom LOC (reduce attack surface and operational complexity).
  • Prefer immutable OS images (Talos/Kairos evaluated; Talos highlighted for immutability/no SSH).
  • Use IPv6 and unnumbered patterns to reduce IP management.
  • Unify operations around NATS for control-plane and (optionally) data-plane transport.
  • Use EDA/DDD concepts: commands, queries, events.

2) Key Breakthrough: NBD-over-NATS for Universal Block Streaming

Learning / conclusion

The key primitive is tunneling block-device traffic (NBD-like semantics) over NATS, so that a single mechanism can support:

  • Live-ish migration streams
  • Disk copy / restore
  • Backups (e.g., to Kopia via stdin)
  • VM↔BM conversions
  • Provisioning (streaming into a RAM-booted installer)

This collapses many otherwise separate systems (storage replication, migration protocols, backup pipelines) into one reusable transport + protocol.
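To make the transport concrete, here is a minimal sketch of a wire format an NBD bridge might use for chunks published on the data plane. The `Chunk` struct, its field names, and the fixed-width layout are illustrative assumptions, not a finalized protocol:

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// Chunk is a hypothetical wire format for one block-device extent
// published on a data-plane subject (e.g., under data.blk.>).
type Chunk struct {
	Seq    uint64 // monotonic sequence number, used for ACK/backpressure
	Offset uint64 // byte offset within the exported device
	Data   []byte // raw block payload
}

// Encode serializes the chunk with fixed-width big-endian headers so
// the receiving bridge can parse it without any schema machinery.
func (c *Chunk) Encode() []byte {
	buf := new(bytes.Buffer)
	binary.Write(buf, binary.BigEndian, c.Seq)
	binary.Write(buf, binary.BigEndian, c.Offset)
	binary.Write(buf, binary.BigEndian, uint32(len(c.Data)))
	buf.Write(c.Data)
	return buf.Bytes()
}

// Decode parses a payload produced by Encode.
func Decode(p []byte) (*Chunk, error) {
	if len(p) < 20 {
		return nil, fmt.Errorf("short chunk: %d bytes", len(p))
	}
	c := &Chunk{
		Seq:    binary.BigEndian.Uint64(p[0:8]),
		Offset: binary.BigEndian.Uint64(p[8:16]),
	}
	if n := binary.BigEndian.Uint32(p[16:20]); int(n) != len(p)-20 {
		return nil, fmt.Errorf("length mismatch")
	}
	c.Data = p[20:]
	return c, nil
}

func main() {
	in := &Chunk{Seq: 1, Offset: 4096, Data: []byte("blockdata")}
	out, err := Decode(in.Encode())
	fmt.Println(err == nil, out.Seq, out.Offset, string(out.Data))
}
```

The same frame works unchanged for migration, restore, backup-to-Kopia, and provisioning streams, which is the point of the collapse described above.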

IPv6 relevance

  • Direct QEMU NBD over IPv6 is generally possible, but not required for the chosen design.
  • In the chosen design, QMP and NBD endpoints stay local; only NATS uses the network.
  • Therefore, IPv6 is primarily a concern for NATS connectivity, which supports IPv6.

Conclusion: Original design stands; QEMU doesn’t need to speak NBD over the network.


3) Networking: Unnumbered BGP and “No BGP on Bare Metal” Option

Question explored

Do we need FRR on bare metal for BGP unnumbered?

Learning / conclusion

No—FRR/BGP on bare metal is not strictly required.

Bare metal can:

  • Set service/identity IPs as IPv6 /128 loopbacks (for services)
  • Use a default route to the ToR via its link-local address (learned from RA or configured statically)
  • Report identity + loopbacks + adjacency info via the NATS agent
  • Optionally use LLDP (lldpd) to identify which ToR/port it is connected to
  • Have the control plane (or a route injector) program ToR routing so the fabric can reach the BM loopbacks

Security conclusion

“No BGP on hosts” removes host-side BGP security concerns:

  • No TCP/179 listener on hosts
  • No BGP state machine / BGP daemon CVE exposure on hosts
  • No MD5/TCP-AO key distribution to every tenant host
  • A compromised host cannot directly inject routes into the fabric via BGP

This was explicitly recognized as a security win.


4) Routing / ToR Interaction Model (Bare Metal)

Bare metal model (no host FRR)

  • BM host config: loopbacks + default route
  • BM host does not peer BGP
  • ToR learns host link-local via ND; control plane binds prefix → port

Mechanisms to install routes on ToR:

  • Centralized route injector (GoBGP/FRR) subscribing to NATS updates (optional)
  • Or direct ToR programming (e.g., gNMI / vendor APIs) (not finalized here)

5) Traffic Patterns for Large Migrations (Cross-Rack Reality)

Problem identified

Most migrations will be cross-rack, which reduces the value of “rack-local” routing shortcuts unless the design ensures data does not hairpin or overload a centralized point.

Leafnodes discussion (as optional optimization)

  • Leafnodes per rack can help for:
    • connection aggregation
    • failure isolation (reconnect storms localized)
    • rate-limiting / shaping at the edge
  • But cross-rack data still traverses the core/hub in typical leaf→core→leaf topology.

Conclusion: Leafnodes are useful, but they do not eliminate a potential core bandwidth hotspot for cross-rack bulk migration.

Whitebox switch running leafnode?

  • Running NATS leafnode on a Broadcom whitebox switch control-plane is technically possible but likely CPU constrained for bulk (e.g., multi-stream 100G).
  • Also increases operational and security risk on network devices.
  • Preferred: leafnodes on rack controllers or compute hosts, not on the switch.

6) Strategic Choice: Separate Control-Plane vs Data-Plane Messaging

Key conclusion

To avoid bulk data (migrations) starving control traffic, split messaging into two planes from day one:

  • Control plane NATS connection (small messages, high priority)
  • Data plane NATS connection (bulk block streaming, no JetStream initially)

Even if both connect to the same cluster in dev initially, having two connections from day one makes the future split low-risk.

Decision: Two connections from day one is desired.
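One way to make the split stick in code is a thin wrapper per connection that refuses subjects from the wrong plane. This is a sketch under assumptions: `PlaneConn` and its fields are hypothetical, and the `Publish` func stands in for a real NATS connection's publish method; in dev, both wrappers can point at the same cluster.

```go
package main

import (
	"fmt"
	"strings"
)

// PlaneConn wraps one messaging connection and enforces the
// ctl/data split at publish time, so bulk traffic can never be
// sent over the control connection by accident.
type PlaneConn struct {
	Plane   string // "ctl" or "data"
	Publish func(subject string, payload []byte) error
}

// Send rejects any subject that does not belong to this plane.
func (p *PlaneConn) Send(subject string, payload []byte) error {
	if !strings.HasPrefix(subject, p.Plane+".") {
		return fmt.Errorf("subject %q not allowed on %s plane", subject, p.Plane)
	}
	return p.Publish(subject, payload)
}

func main() {
	noop := func(string, []byte) error { return nil }
	ctl := &PlaneConn{Plane: "ctl", Publish: noop}
	fmt.Println(ctl.Send("ctl.cmd.vm.start", nil) == nil) // accepted
	fmt.Println(ctl.Send("data.blk.x.chunk", nil) != nil) // rejected
}
```

When the clusters physically split later, only the wrapper's connection wiring changes; callers are untouched.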


7) Subject Taxonomy: Plane + EDA Intent

Desired dimensions

  1. Plane: ctl vs data
  2. EDA intent: cmd, qry, evt (plus a dedicated bulk channel, blk, for high-rate payloads)

Conclusion

Use one subject per message that encodes both plane and intent:

  • Control plane:

    • ctl.cmd.>: commands
    • ctl.qry.>: queries (request/reply)
    • ctl.evt.>: events (often JetStream-friendly)
  • Data plane:

    • data.cmd.>: setup/teardown
    • data.evt.>: progress/done/error
    • data.blk.> (recommended): raw high-rate block chunks + ACK/backpressure

This makes ACLs, quotas, and operational guarantees much easier.
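The taxonomy above is small enough to encode as a builder that rejects invalid plane/intent combinations, so typos fail fast instead of becoming dead subjects. A sketch with hypothetical helper names:

```go
package main

import (
	"fmt"
	"strings"
)

// allowed mirrors the taxonomy in this section: blk exists only on
// the data plane, qry only on the control plane.
var allowed = map[string][]string{
	"ctl":  {"cmd", "qry", "evt"},
	"data": {"cmd", "evt", "blk"},
}

// Subject composes "plane.intent.parts..." and validates the
// plane/intent pair against the taxonomy.
func Subject(plane, intent string, parts ...string) (string, error) {
	ok := false
	for _, i := range allowed[plane] {
		if i == intent {
			ok = true
		}
	}
	if !ok {
		return "", fmt.Errorf("intent %q not valid on plane %q", intent, plane)
	}
	return strings.Join(append([]string{plane, intent}, parts...), "."), nil
}

func main() {
	s, _ := Subject("data", "blk", "migr-7", "chunk")
	fmt.Println(s) // data.blk.migr-7.chunk
	_, err := Subject("ctl", "blk", "x")
	fmt.Println(err != nil) // true: bulk traffic never on ctl
}
```

ACLs and quotas then reduce to prefix matches on `plane.intent`, which is exactly what NATS subject permissions express.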


8) Development Environment Approach: Tilt + VMs + Compose

Dev environment learning

Tilt targeting VMs as the default developer path is approachable and aligns with the project constraints, as long as Tilt is used as an orchestrator (scripts/Makefile do the heavy lifting).

Why VMs are good

  • You can simulate a mini edge DC locally/CI:
    • NATS control + NATS data endpoints
    • compute nodes (Kairos and/or Talos as VMs)
    • optional ToR simulation later
  • Quick iteration: build agent → push into VM → restart → test

Talos + Kairos as VMs

  • Both can run as VMs for dev workflows.
  • Kairos can be convenient early for iteration.
  • Talos helps validate the “immutable/no SSH” production posture.

Compose alongside Tilt

Running docker compose on the side is a practical way to host:

  • NATS clusters (ctl+data)
  • simulated external services
  • observability stack
  • control-plane service stubs/mocks

Tilt can orchestrate both Compose and VM lifecycles.
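A minimal compose sketch for the two NATS endpoints (service names, ports, and image tag are assumptions; the official `nats` image's entrypoint is `nats-server`, so `command` passes server flags):

```yaml
services:
  nats-ctl:
    image: nats:2.10
    command: ["-js"]        # JetStream enabled for the control plane
    ports: ["4222:4222"]
  nats-data:
    image: nats:2.10        # core NATS only; no JetStream for bulk streaming
    ports: ["4223:4222"]
```

Agents in dev VMs then point their ctl and data connections at the two host ports, exercising the two-connection design from section 6 before any real split exists.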


9) Diagrams

9.1 Messaging planes + EDA intent

The node agent (VM host / BM host / RAM agent) holds two bus connections:

  • Control NATS (small, strict, JetStream OK), via the ctl bus connection:
    • ctl.cmd.>
    • ctl.qry.>
    • ctl.evt.>
  • Data NATS (bulk, tuned, NO JetStream), via the data bus connection:
    • data.cmd.>
    • data.evt.>
    • data.blk.> (chunks/acks)

9.2 Bare metal without FRR/BGP

Participants: Bare Metal Host (no FRR), lldpd (optional), NATS (ctl), Control Plane, ToR (SONiC/FRR)

  1. Host configures lo with service loopbacks (/128).
  2. Host installs a default route via the ToR link-local address (RA or static).
  3. (Optional) lldpd discovers the ToR and port.
  4. Host publishes loopbacks + adjacency (ToR/port if known) to NATS (ctl).
  5. NATS delivers the announcement to the control plane.
  6. Control plane programs the route on the ToR (prefix -> port / next-hop).
  7. ToR returns ack/telemetry (optional).

9.3 Migration data path concept (NBD tunneled over NATS)

Data path:

  QEMU source (local QMP + local NBD endpoint)
    → NBD bridge (NATS client)
    → NATS data plane (chunks on data.blk.blocks.op.chunk)
    → NBD bridge (NATS client)
    → QEMU target (writes to disk)

Both bridges receive their control (setup/teardown) over the NATS control plane.


10) Final Conclusions Checklist

Networking

  • Bare metal can still use loopback /128 service IPs without running FRR.
  • Avoiding BGP on BM removes a significant security surface.
  • LLDP can help identify ToR port for route programming.

Migration transport

  • Keep QMP + NBD local; tunnel over NATS.
  • IPv6 is not a blocker for QEMU migration in this design; NATS over IPv6 is the key.

Scaling / traffic

  • Cross-rack migrations are the norm → plan for a real bulk transport plane.
  • Leafnodes help with connection aggregation and shaping but do not avoid core traversal for cross-rack in standard topology.
  • Avoid running NATS leafnodes on switch CPUs for high-throughput data.

Messaging and architecture

  • Use two NATS connections from day one: control + data.
  • Use EDA intent prefixes (cmd/qry/evt) and a separate bulk prefix (blk) for high-rate payloads.

Dev environment

  • Tilt driving VM-based environments is a good default.
  • Compose can host external dependencies and simulated services cleanly.
  • Use profiles (vm vs lab) later to target real environments with the same artifacts and contracts.