Conversation Summary (2026-03-03)

This document summarizes the architecture ideas, design decisions, and conclusions from the conversation. It is written to be reusable in future sessions.


1) Core Goal and Philosophy

We are designing an edge / DC platform to manage:

  • KVM/QEMU VMs (primary)
  • Bare metal workloads (also important)
  • Migration and backup across heterogeneous environments (VM↔VM, VM↔BM, BM↔BM)
  • Provisioning with strong security properties (TPM attestation, signed approvals)
  • A consistent “minimal code, minimal attack surface” approach

Guiding principles

  • Minimize custom LOC (reduce attack surface and operational complexity).
  • Prefer immutable OS images (Talos/Kairos evaluated; Talos highlighted for immutability/no SSH).
  • Use IPv6 and unnumbered patterns to reduce IP management.
  • Unify operations around NATS for control-plane and (optionally) data-plane transport.
  • Use EDA/DDD concepts: commands, queries, events.

2) Key Breakthrough: NBD-over-NATS for Universal Block Streaming

Learning / conclusion

The key primitive is tunneling block-device traffic (NBD-like semantics) over NATS, so that a single mechanism can support:

  • Live-ish migration streams
  • Disk copy / restore
  • Backups (e.g., to Kopia via stdin)
  • VM↔BM conversions
  • Provisioning (streaming into a RAM-booted installer)

This collapses many otherwise separate systems (storage replication, migration protocols, backup pipelines) into one reusable transport + protocol.
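To make the transport concrete, here is a minimal sketch of a wire format an NBD bridge might use for chunks published on the data plane. The `Chunk` struct, its field names, and the fixed-width layout are illustrative assumptions, not a finalized protocol:

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// Chunk is a hypothetical wire format for one block-device extent
// published on a data-plane subject (e.g., under data.blk.>).
type Chunk struct {
	Seq    uint64 // monotonic sequence number, used for ACK/backpressure
	Offset uint64 // byte offset within the exported device
	Data   []byte // raw block payload
}

// Encode serializes the chunk with fixed-width big-endian headers so
// the receiving bridge can parse it without any schema machinery.
func (c *Chunk) Encode() []byte {
	buf := new(bytes.Buffer)
	binary.Write(buf, binary.BigEndian, c.Seq)
	binary.Write(buf, binary.BigEndian, c.Offset)
	binary.Write(buf, binary.BigEndian, uint32(len(c.Data)))
	buf.Write(c.Data)
	return buf.Bytes()
}

// Decode parses a payload produced by Encode.
func Decode(p []byte) (*Chunk, error) {
	if len(p) < 20 {
		return nil, fmt.Errorf("short chunk: %d bytes", len(p))
	}
	c := &Chunk{
		Seq:    binary.BigEndian.Uint64(p[0:8]),
		Offset: binary.BigEndian.Uint64(p[8:16]),
	}
	if n := binary.BigEndian.Uint32(p[16:20]); int(n) != len(p)-20 {
		return nil, fmt.Errorf("length mismatch")
	}
	c.Data = p[20:]
	return c, nil
}

func main() {
	in := &Chunk{Seq: 1, Offset: 4096, Data: []byte("blockdata")}
	out, err := Decode(in.Encode())
	fmt.Println(err == nil, out.Seq, out.Offset, string(out.Data))
}
```

The same frame works unchanged for migration, restore, backup-to-Kopia, and provisioning streams, which is the point of the collapse described above.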

IPv6 relevance

  • Direct QEMU NBD over IPv6 is generally possible, but not required for the chosen design.
  • In the chosen design, QMP and NBD endpoints stay local; only NATS uses the network.
  • Therefore, IPv6 is primarily a concern for NATS connectivity, which supports IPv6.

Conclusion: Original design stands; QEMU doesn’t need to speak NBD over the network.


3) Networking: Unnumbered BGP and “No BGP on Bare Metal” Option

Question explored

Do we need FRR on bare metal for BGP unnumbered?

Learning / conclusion

No—FRR/BGP on bare metal is not strictly required.

Bare metal can:

  • Set service/identity IPs as IPv6 /128 loopbacks (for services)
  • Use a default route to the ToR via its link-local address (learned from RA or configured statically)
  • Report identity + loopbacks + adjacency info via the NATS agent
  • Optionally use LLDP (lldpd) to identify which ToR/port it is connected to
  • Have the control plane (or a route injector) program ToR routing so the fabric can reach the BM loopbacks

Security conclusion

“No BGP on hosts” removes host-side BGP security concerns:

  • No TCP/179 listener on hosts
  • No BGP state machine / BGP daemon CVE exposure on hosts
  • No MD5/TCP-AO key distribution to every tenant host
  • A compromised host cannot directly inject routes into the fabric via BGP

This was explicitly recognized as a security win.


4) Routing / ToR Interaction Model (Bare Metal)

Bare metal model (no host FRR)

  • BM host config: loopbacks + default route
  • BM host does not peer BGP
  • ToR learns host link-local via ND; control plane binds prefix → port

Mechanisms to install routes on ToR:

  • Centralized route injector (GoBGP/FRR) subscribing to NATS updates (optional)
  • Or direct ToR programming (e.g., gNMI / vendor APIs) (not finalized here)

5) Traffic Patterns for Large Migrations (Cross-Rack Reality)

Problem identified

Most migrations will be cross-rack, which reduces the value of “rack-local” routing shortcuts unless the design ensures data does not hairpin or overload a centralized point.

Leafnodes discussion (as optional optimization)

  • Leafnodes per rack can help for:
    • connection aggregation
    • failure isolation (reconnect storms localized)
    • rate-limiting / shaping at the edge
  • But cross-rack data still traverses the core/hub in typical leaf→core→leaf topology.

Conclusion: Leafnodes are useful, but they do not eliminate a potential core bandwidth hotspot for cross-rack bulk migration.

Whitebox switch running leafnode?

  • Running NATS leafnode on a Broadcom whitebox switch control-plane is technically possible but likely CPU constrained for bulk (e.g., multi-stream 100G).
  • Also increases operational and security risk on network devices.
  • Preferred: leafnodes on rack controllers or compute hosts, not on the switch.

6) Strategic Choice: Separate Control-Plane vs Data-Plane Messaging

Key conclusion

To avoid bulk data (migrations) starving control traffic, split messaging into two planes from day one:

  • Control plane NATS connection (small messages, high priority)
  • Data plane NATS connection (bulk block streaming, no JetStream initially)

Even if both connect to the same cluster in dev initially, having two connections from day one makes the future split low-risk.

Decision: Two connections from day one is desired.
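One way to make the split stick in code is a thin wrapper per connection that refuses subjects from the wrong plane. This is a sketch under assumptions: `PlaneConn` and its fields are hypothetical, and the `Publish` func stands in for a real NATS connection's publish method; in dev, both wrappers can point at the same cluster.

```go
package main

import (
	"fmt"
	"strings"
)

// PlaneConn wraps one messaging connection and enforces the
// ctl/data split at publish time, so bulk traffic can never be
// sent over the control connection by accident.
type PlaneConn struct {
	Plane   string // "ctl" or "data"
	Publish func(subject string, payload []byte) error
}

// Send rejects any subject that does not belong to this plane.
func (p *PlaneConn) Send(subject string, payload []byte) error {
	if !strings.HasPrefix(subject, p.Plane+".") {
		return fmt.Errorf("subject %q not allowed on %s plane", subject, p.Plane)
	}
	return p.Publish(subject, payload)
}

func main() {
	noop := func(string, []byte) error { return nil }
	ctl := &PlaneConn{Plane: "ctl", Publish: noop}
	fmt.Println(ctl.Send("ctl.cmd.vm.start", nil) == nil) // accepted
	fmt.Println(ctl.Send("data.blk.x.chunk", nil) != nil) // rejected
}
```

When the clusters physically split later, only the wrapper's connection wiring changes; callers are untouched.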


7) Subject Taxonomy: Plane + EDA Intent

Desired dimensions

  1. Plane: ctl vs data
  2. EDA intent: cmd, qry, evt (plus a dedicated bulk channel, blk, for high-rate payloads)

Conclusion

Use one subject per message that encodes both plane and intent:

  • Control plane:

    • ctl.cmd.>: commands
    • ctl.qry.>: queries (request/reply)
    • ctl.evt.>: events (often JetStream-friendly)
  • Data plane:

    • data.cmd.>: setup/teardown
    • data.evt.>: progress/done/error
    • data.blk.> (recommended): raw high-rate block chunks + ACK/backpressure

This makes ACLs, quotas, and operational guarantees much easier.
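The taxonomy above is small enough to encode as a builder that rejects invalid plane/intent combinations, so typos fail fast instead of becoming dead subjects. A sketch with hypothetical helper names:

```go
package main

import (
	"fmt"
	"strings"
)

// allowed mirrors the taxonomy in this section: blk exists only on
// the data plane, qry only on the control plane.
var allowed = map[string][]string{
	"ctl":  {"cmd", "qry", "evt"},
	"data": {"cmd", "evt", "blk"},
}

// Subject composes "plane.intent.parts..." and validates the
// plane/intent pair against the taxonomy.
func Subject(plane, intent string, parts ...string) (string, error) {
	ok := false
	for _, i := range allowed[plane] {
		if i == intent {
			ok = true
		}
	}
	if !ok {
		return "", fmt.Errorf("intent %q not valid on plane %q", intent, plane)
	}
	return strings.Join(append([]string{plane, intent}, parts...), "."), nil
}

func main() {
	s, _ := Subject("data", "blk", "migr-7", "chunk")
	fmt.Println(s) // data.blk.migr-7.chunk
	_, err := Subject("ctl", "blk", "x")
	fmt.Println(err != nil) // true: bulk traffic never on ctl
}
```

ACLs and quotas then reduce to prefix matches on `plane.intent`, which is exactly what NATS subject permissions express.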


8) Development Environment Approach: Tilt + VMs + Compose

Dev environment learning

Tilt targeting VMs as the default developer path is approachable and aligns with the project constraints, as long as Tilt is used as an orchestrator (scripts/Makefile do the heavy lifting).

Why VMs are good

  • You can simulate a mini edge DC locally/CI:
    • NATS control + NATS data endpoints
    • compute nodes (Kairos and/or Talos as VMs)
    • optional ToR simulation later
  • Quick iteration: build agent → push into VM → restart → test

Talos + Kairos as VMs

  • Both can run as VMs for dev workflows.
  • Kairos can be convenient early for iteration.
  • Talos helps validate the “immutable/no SSH” production posture.

Compose alongside Tilt

Running docker compose on the side is a practical way to host:

  • NATS clusters (ctl+data)
  • simulated external services
  • observability stack
  • control-plane service stubs/mocks

Tilt can orchestrate both Compose and VM lifecycles.
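A minimal compose sketch for the two NATS endpoints (service names, ports, and image tag are assumptions; the official `nats` image's entrypoint is `nats-server`, so `command` passes server flags):

```yaml
services:
  nats-ctl:
    image: nats:2.10
    command: ["-js"]        # JetStream enabled for the control plane
    ports: ["4222:4222"]
  nats-data:
    image: nats:2.10        # core NATS only; no JetStream for bulk streaming
    ports: ["4223:4222"]
```

Agents in dev VMs then point their ctl and data connections at the two host ports, exercising the two-connection design from section 6 before any real split exists.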


9) Diagrams

9.1 Messaging planes + EDA intent

The node agent (VM host / BM host / RAM agent) holds two bus connections:

  • Control NATS (small, strict, JetStream OK), via the ctl bus connection:
    • ctl.cmd.>
    • ctl.qry.>
    • ctl.evt.>
  • Data NATS (bulk, tuned, NO JetStream), via the data bus connection:
    • data.cmd.>
    • data.evt.>
    • data.blk.> (chunks/acks)

9.2 Bare metal without FRR/BGP

Participants: Bare Metal Host (no FRR), lldpd (optional), NATS (ctl), Control Plane, ToR (SONiC/FRR)

  1. Host configures lo with service loopbacks (/128).
  2. Host installs a default route via the ToR link-local address (RA or static).
  3. (Optional) lldpd discovers the ToR and port.
  4. Host publishes loopbacks + adjacency (ToR/port if known) to NATS (ctl).
  5. NATS delivers the announcement to the control plane.
  6. Control plane programs the route on the ToR (prefix -> port / next-hop).
  7. ToR returns ack/telemetry (optional).

9.3 Migration data path concept (NBD tunneled over NATS)

Data path:

  QEMU source (local QMP + local NBD endpoint)
    → NBD bridge (NATS client)
    → NATS data plane (chunks on data.blk.blocks.op.chunk)
    → NBD bridge (NATS client)
    → QEMU target (writes to disk)

Both bridges receive their control (setup/teardown) over the NATS control plane.


10) Final Conclusions Checklist

Networking

  • Bare metal can still use loopback /128 service IPs without running FRR.
  • Avoiding BGP on BM removes a significant security surface.
  • LLDP can help identify ToR port for route programming.

Migration transport

  • Keep QMP + NBD local; tunnel over NATS.
  • IPv6 is not a blocker for QEMU migration in this design; NATS over IPv6 is the key.

Scaling / traffic

  • Cross-rack migrations are the norm → plan for a real bulk transport plane.
  • Leafnodes help with connection aggregation and shaping but do not avoid core traversal for cross-rack in standard topology.
  • Avoid running NATS leafnodes on switch CPUs for high-throughput data.

Messaging and architecture

  • Use two NATS connections from day one: control + data.
  • Use EDA intent prefixes (cmd/qry/evt) and a separate bulk prefix (blk) for high-rate payloads.

Dev environment

  • Tilt driving VM-based environments is a good default.
  • Compose can host external dependencies and simulated services cleanly.
  • Use profiles (vm vs lab) later to target real environments with the same artifacts and contracts.