Status
Proposed on 2026-03-14 by Lars Solem.
Context
Dataverket is building an event-driven, multi-service, multi-datacenter platform that will automate provisioning, networking, clusters, VMs, storage, and failover.
That architecture is not operable without a shared observability baseline. Logs, metrics, traces, task visibility, and health signals cannot be treated as optional product polish added after the platform already exists.
Decision
Dataverket adopts observability and operational visibility as a Phase 0 and Phase 1 platform baseline.
The platform must provide a shared operational capability for:
- logs
- metrics
- traces or equivalent workflow correlation
- task inspection
- alerting inputs
- health and drift visibility
This capability may initially be delivered as modules rather than standalone services, but it is a required foundation for all later platform products.
Observability alone is not sufficient. The platform must also make those signals actionable within a day-2 operating model that includes remediation boundaries, escalation inputs, and runbook linkage.
Scope
The observability baseline must cover:
- control-plane services
- background workers
- NATS-driven workflows
- provisioning workflows
- network automation workflows
- storage and persistence workflows
- multi-datacenter coordination paths
Logging requirements
The baseline must support:
- centralized log collection
- structured logs where practical
- correlation between logs and task or workflow identifiers
- retention suitable for incident investigation
Logs must be useful for debugging platform behavior, not just for raw archival.
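As an illustration of structured logs that carry task and workflow identifiers, the following is a minimal sketch using only Python's standard `logging` module. The field names (`site`, `service`, `task_id`, `workflow_id`, `correlation_id`) are assumptions made for the example; this ADR does not define a log schema or select a logging backend.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    # Hypothetical context fields; not defined by this ADR.
    CONTEXT_FIELDS = ("site", "service", "task_id", "workflow_id", "correlation_id")

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Copy only the context fields that were attached to the record,
        # for example through the `extra` argument of a logging call.
        for name in self.CONTEXT_FIELDS:
            value = getattr(record, name, None)
            if value is not None:
                entry[name] = value
        return json.dumps(entry, sort_keys=True)


def format_example() -> dict:
    """Format one record and return it parsed back into a dict."""
    record = logging.LogRecord(
        name="provisioning", level=logging.ERROR, pathname="example.py",
        lineno=0, msg="VM create failed: %s", args=("quota exceeded",),
        exc_info=None,
    )
    record.task_id = "task-123"   # illustrative identifiers
    record.site = "dc-a"
    return json.loads(JsonFormatter().format(record))
```

Because the identifiers are plain fields in each entry, any backend that indexes JSON fields could filter all log lines for one task or workflow.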
Metrics requirements
The baseline must support metrics for:
- service health and latency
- task throughput and failure rates
- NATS delivery and consumer health
- provisioning success and failure
- network automation success, drift, and rollout health
- control-plane PostgreSQL health
- storage and capacity visibility
Metrics should support both dashboards and alerting.
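To make "task throughput and failure rates" concrete, here is a small in-memory sketch. It is not a proposal for a metrics stack, which this ADR leaves open; the class name, the label set (site, workflow, outcome), and the outcome values are assumptions made for illustration.

```python
from collections import Counter


class TaskMetrics:
    """In-memory counters keyed by (site, workflow, outcome)."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def record(self, site: str, workflow: str, outcome: str) -> None:
        """Count one finished task; outcome is "success" or "failure"."""
        self._counts[(site, workflow, outcome)] += 1

    def failure_rate(self, site: str, workflow: str) -> float:
        """Fraction of finished tasks that failed, or 0.0 if none finished."""
        ok = self._counts[(site, workflow, "success")]
        failed = self._counts[(site, workflow, "failure")]
        total = ok + failed
        return failed / total if total else 0.0
```

Keeping the site as a label on every counter lets the same series feed both a per-site dashboard and a per-site alert threshold.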
Trace and workflow correlation requirements
The platform must provide a way to correlate work across service boundaries.
This may be distributed tracing, or an equivalent model built from:
- correlation IDs
- causation IDs
- task identifiers
- event lineage
For Dataverket, workflow correlation is mandatory even if the initial implementation is not a full tracing stack.
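The identifier-based model above could be sketched as follows. This is an illustrative assumption, not the envelope defined in 007-nats-subject-and-event-envelope.md: the `EventMeta` type, its field names, and the helper functions are hypothetical.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class EventMeta:
    """Hypothetical correlation fields carried by every event."""
    correlation_id: str                  # shared by all events in one workflow
    causation_id: Optional[str] = None   # id of the event that triggered this one
    task_id: Optional[str] = None
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))


def start_workflow(task_id: str) -> EventMeta:
    """Create the root event of a workflow; it has no causation ID."""
    return EventMeta(correlation_id=str(uuid.uuid4()), task_id=task_id)


def caused_by(parent: EventMeta) -> EventMeta:
    """Derive a follow-up event that keeps the parent's correlation ID."""
    return EventMeta(
        correlation_id=parent.correlation_id,
        causation_id=parent.event_id,
        task_id=parent.task_id,
    )


def lineage(event: EventMeta, by_id: dict[str, EventMeta]) -> list[str]:
    """Follow causation links back to the root; returns event ids, root first."""
    chain = [event.event_id]
    current = event
    while current.causation_id is not None and current.causation_id in by_id:
        current = by_id[current.causation_id]
        chain.append(current.event_id)
    return list(reversed(chain))
```

In this sketch the correlation ID groups everything that belongs to one workflow, while the causation chain reconstructs the order in which services reacted to each other, which is the minimum needed for event lineage without a full tracing stack.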
Task and operator visibility
Operators must be able to answer:
- what is running
- what failed
- why it failed
- what it was trying to affect
- whether it is safe to retry
- which dependency is unavailable or degraded
- whether the affected workflow is blocked, retrying, or running in reduced mode
That means task state, recent events, and key failure signals must be inspectable through supported operational surfaces.
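One way to picture an inspectable task record that answers the operator questions above is the sketch below. The type names, states, and fields are assumptions for illustration; the actual task model is expected to follow 008-public-api-style.md.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class TaskState(Enum):
    RUNNING = "running"
    RETRYING = "retrying"
    BLOCKED = "blocked"
    DEGRADED = "degraded"    # running in reduced mode
    FAILED = "failed"
    SUCCEEDED = "succeeded"


@dataclass
class TaskStatus:
    """Hypothetical operator-facing view of one task."""
    task_id: str
    state: TaskState
    target: str                            # what the task is trying to affect
    failure_reason: Optional[str] = None   # why it failed, if it did
    retry_safe: Optional[bool] = None      # None means "not yet known"
    degraded_dependencies: list[str] = field(default_factory=list)
    recent_events: list[str] = field(default_factory=list)


def summarize(status: TaskStatus) -> str:
    """Build a one-line summary suitable for a CLI or dashboard row."""
    parts = [f"{status.task_id} [{status.state.value}] target={status.target}"]
    if status.failure_reason:
        parts.append(f"reason={status.failure_reason}")
    if status.retry_safe is not None:
        parts.append(f"retry_safe={'yes' if status.retry_safe else 'no'}")
    if status.degraded_dependencies:
        parts.append("degraded=" + ",".join(status.degraded_dependencies))
    return " ".join(parts)
```

Modelling `retry_safe` as an explicit three-valued field (yes, no, unknown) keeps the surface honest: an operator should not have to infer retry safety from the absence of a warning.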
Alerting and incident inputs
The baseline must support alerting inputs for:
- platform service health degradation
- stuck or repeatedly failing workflows
- NATS consumer or stream issues
- inventory drift that blocks automation
- storage and database risk conditions
- inter-datacenter communication problems
- critical dependency failures between internal services
- degraded-mode activation for important workflows
This ADR does not define final alert routing or on-call process, but it requires that the data needed for alerting exists.
Alerting inputs should also carry enough classification to support:
- automatic remediation when pre-approved by policy
- operator-action-required incidents
- escalation to higher-severity operational response when control-plane safety or site availability is at risk
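The three response classes above could be derived from an alerting input roughly as follows. This is a sketch under stated assumptions: the `AlertInput` fields, the condition names in `PRE_APPROVED`, and the rule that safety or availability risk overrides automatic remediation are illustrative choices, not policy defined by this ADR.

```python
from dataclasses import dataclass
from enum import Enum


class Response(Enum):
    AUTO_REMEDIATE = "auto_remediate"
    OPERATOR_ACTION = "operator_action"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class AlertInput:
    """Hypothetical classification data attached to an alerting input."""
    source: str
    condition: str
    threatens_control_plane: bool = False
    threatens_site_availability: bool = False


# Hypothetical policy: conditions pre-approved for automatic remediation.
PRE_APPROVED = frozenset({"consumer_lagging", "worker_crashloop"})


def classify(alert: AlertInput, pre_approved: frozenset = PRE_APPROVED) -> Response:
    """Map an alerting input to one of the three response classes."""
    # Assumption: control-plane safety and site availability risks always
    # escalate, even when the condition would otherwise be auto-remediated.
    if alert.threatens_control_plane or alert.threatens_site_availability:
        return Response.ESCALATE
    if alert.condition in pre_approved:
        return Response.AUTO_REMEDIATE
    return Response.OPERATOR_ACTION
```

The point of the sketch is that classification needs data present on the input itself; routing and on-call handling can then be decided later without changing what services emit.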
Multi-datacenter requirements
Because Dataverket is multi-datacenter, observability must support:
- site-aware signals
- visibility into inter-site message and failover paths
- ability to distinguish local failure from cross-site dependency failure
- enough context to debug failover and degraded-mode behavior
Operators should also be able to see service dependency health in a way that explains platform impact, not just raw component status. For example, a healthy NATS cluster combined with a failing Nett service has very different consequences for provisioning than for object storage.
Observability that cannot tell operators which site is failing is insufficient.
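A minimal sketch of site-aware signals combined with a dependency map shows how local failure can be separated from cross-site dependency failure. The dependency map below is hypothetical (it assumes provisioning needs both NATS and Nett while object storage needs only NATS), as are the site names and the `Signal` type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    """Health of one component at one site."""
    site: str
    component: str
    healthy: bool


# Hypothetical dependency map: workflow -> components it needs.
DEPENDS_ON = {
    "provisioning": {"nats", "nett"},
    "object-storage": {"nats"},
}


def impact(workflow: str, local_site: str, signals: list[Signal]) -> dict[str, list[str]]:
    """Split a workflow's failing dependencies into local and cross-site."""
    needed = DEPENDS_ON[workflow]
    failing = [s for s in signals if s.component in needed and not s.healthy]
    return {
        "local": sorted(s.component for s in failing if s.site == local_site),
        "cross_site": sorted(
            f"{s.site}/{s.component}" for s in failing if s.site != local_site
        ),
    }
```

With the same set of raw signals, the two workflows get different answers, which is the distinction between platform impact and component status.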
Operational baseline
The first production-capable platform release must provide:
- dashboards or equivalent operator views
- searchable logs or equivalent log inspection workflow
- task and workflow inspection
- baseline alerting inputs
- incident-friendly correlation across services
- linkage from important alerts or blocked workflows to the relevant runbook or supported response path
The implementation may evolve, but the capability must exist from the beginning.
Explicit non-decisions for now
This ADR intentionally does not yet choose:
- a specific metrics stack
- a specific logging backend
- a specific tracing backend
- a final dashboard product
- a final alert routing product
Those require later implementation or selection ADRs.
Consequences
- observability is no longer deferrable as a late-stage enhancement
- platform teams must emit usable operational signals from the first services they build
- debugging NATS workflows, failover, and provisioning becomes an explicit architectural requirement
Decision Outcome
Proposed. This ADR records the current preferred direction and still needs acceptance before it becomes binding.
Related Decisions
- This ADR supports the operational assumptions in platform-plan.md.
- Workflow visibility depends on the envelope and correlation model in 007-nats-subject-and-event-envelope.md.
- Task inspection must fit the public API and task model direction in 008-public-api-style.md.
- Multi-datacenter observability must align with 012-inter-datacenter-topology-and-failover.md.
More Information
Topics expected to need follow-up decisions or documentation:
- metrics stack selection
- logging stack selection
- tracing and workflow correlation implementation
- alerting and incident routing model
- task inspection surface and operator tooling
Audit
- 2026-03-14: ADR proposed.