DATAVERKET-017: Observability and operations baseline

Proposed | Infrastructure | Observability | Operations | Monitoring | Alerts | Tracing

Defines observability, operational visibility, and actionable monitoring as a baseline requirement for the Dataverket platform.

Author
Lars Solem
Updated

Status
Proposed on 2026-03-14 by Lars Solem.

Context

Dataverket is building an event-driven, multi-service, multi-datacenter platform that will automate provisioning, networking, clusters, VMs, storage, and failover.

That architecture is not operable without a shared observability baseline. Logs, metrics, traces, task visibility, and health signals cannot be treated as optional product polish added after the platform already exists.

Decision

Dataverket adopts observability and operational visibility as a Phase 0 and Phase 1 platform baseline.

The platform must provide a shared operational capability for:

  • logs
  • metrics
  • traces or equivalent workflow correlation
  • task inspection
  • alerting inputs
  • health and drift visibility

This capability may initially be delivered as modules rather than standalone services, but it is a required foundation for all later platform products.

Observability alone is not sufficient. The platform must also make those signals actionable within a day-2 operating model that includes remediation boundaries, escalation inputs, and runbook linkage.

Scope

The observability baseline must cover:

  • control-plane services
  • background workers
  • NATS-driven workflows
  • provisioning workflows
  • network automation workflows
  • storage and persistence workflows
  • multi-datacenter coordination paths

Logging requirements

The baseline must support:

  • centralized log collection
  • structured logs where practical
  • correlation between logs and task or workflow identifiers
  • retention suitable for incident investigation

Logs must be useful for debugging platform behavior, not merely archived.
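One way to meet the structured-log and correlation requirements is a JSON formatter that carries workflow identifiers on every record. This is a minimal sketch using the Python standard library; the field names (`task_id`, `correlation_id`) are illustrative, not a fixed Dataverket schema.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object so log pipelines can index fields."""

    def format(self, record):
        entry = {
            "level": record.levelname,
            "message": record.getMessage(),
            # Correlation fields are illustrative names, not a fixed schema.
            "task_id": getattr(record, "task_id", None),
            "correlation_id": getattr(record, "correlation_id", None),
        }
        return json.dumps(entry)


def make_logger(stream):
    """Attach the JSON formatter to a logger writing to the given stream."""
    logger = logging.getLogger("dataverket.example")
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger
```

A caller would then pass identifiers through the standard `extra` mechanism, e.g. `logger.info("vm provision started", extra={"task_id": "t-123", "correlation_id": "c-9"})`, and the log backend can join those fields against task and workflow records.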

Metrics requirements

The baseline must support metrics for:

  • service health and latency
  • task throughput and failure rates
  • NATS delivery and consumer health
  • provisioning success and failure
  • network automation success, drift, and rollout health
  • control-plane PostgreSQL health
  • storage and capacity visibility

Metrics should support both dashboards and alerting.
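To show how one set of counters can feed both dashboards and alerting, here is a tiny in-memory sketch: success and failure counters per label set, with a derived failure rate an alert rule could threshold on. Metric names and labels are illustrative only; a real deployment would use whatever metrics stack a later ADR selects.

```python
from collections import defaultdict


class Metrics:
    """Tiny in-memory metrics sketch; names and labels are illustrative only."""

    def __init__(self):
        self.counters = defaultdict(float)

    def inc(self, name, labels=(), value=1.0):
        # labels is an iterable of (key, value) pairs, e.g. (("site", "dc1"),)
        self.counters[(name, tuple(sorted(labels)))] += value

    def failure_rate(self, name, labels=()):
        """Derived signal: failures / (successes + failures) for one label set."""
        key = tuple(sorted(labels))
        ok = self.counters[(name + "_success", key)]
        failed = self.counters[(name + "_failure", key)]
        total = ok + failed
        return failed / total if total else 0.0
```

The same counters that render a provisioning dashboard per site can drive an alert such as "failure rate above 10% for 5 minutes", which is why the baseline treats dashboards and alerting as two consumers of one signal, not two pipelines.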

Trace and workflow correlation requirements

The platform must provide a way to correlate work across service boundaries.

This may be distributed tracing, or an equivalent model built from:

  • correlation IDs
  • causation IDs
  • task identifiers
  • event lineage

For Dataverket, workflow correlation is mandatory even if the initial implementation is not a full tracing stack.
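The correlation-ID model above can be sketched without any tracing stack: every event carries the workflow's correlation ID plus the ID of the event that caused it, so lineage can be reconstructed from the event log alone. The envelope below is a minimal illustration; the field names are assumptions, not a defined Dataverket event schema.

```python
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    """Minimal event envelope; field names are illustrative, not a fixed schema."""
    name: str
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    correlation_id: str = ""  # shared across an entire workflow
    causation_id: str = ""    # the event that directly caused this one


def start_workflow(name):
    """The root event uses its own ID as the workflow correlation ID."""
    eid = str(uuid.uuid4())
    return Event(name=name, event_id=eid, correlation_id=eid)


def caused_by(parent, name):
    """Children keep the workflow's correlation_id; causation_id points at the parent."""
    return Event(name=name,
                 correlation_id=parent.correlation_id,
                 causation_id=parent.event_id)
```

Filtering a log store by `correlation_id` yields the whole workflow; walking `causation_id` links yields the causal chain, which is the "equivalent model" this section permits in place of full distributed tracing.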

Task and operator visibility

Operators must be able to answer:

  • what is running
  • what failed
  • why it failed
  • what it was trying to affect
  • whether it is safe to retry
  • which dependency is unavailable or degraded
  • whether the affected workflow is blocked, retrying, or running in reduced mode

That means task state, recent events, and key failure signals must be inspectable through supported operational surfaces.
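The operator questions above map almost one-to-one onto fields of a task summary record. The sketch below shows one such shape and a one-line rendering an operational surface (CLI or UI) could use; the field names and state values are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskView:
    """Operator-facing task summary; fields mirror the operator questions (illustrative)."""
    task_id: str
    state: str                        # e.g. "running", "failed", "retrying", "blocked"
    target: str                       # what the task was trying to affect
    error: Optional[str] = None       # why it failed, if it did
    retry_safe: bool = False          # whether retrying is known to be idempotent
    blocked_on: Optional[str] = None  # unavailable or degraded dependency


def summarize(task):
    """One line an operator surface (CLI, UI) could render."""
    line = f"{task.task_id} [{task.state}] target={task.target}"
    if task.error:
        line += f" error={task.error!r}"
    if task.blocked_on:
        line += f" blocked_on={task.blocked_on}"
    line += " retry=safe" if task.retry_safe else " retry=unknown"
    return line
```

The point of the sketch is the inverse mapping: if a field cannot be populated for a given task, one of the operator questions above cannot be answered, which is exactly the gap this requirement is meant to close.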

Alerting and incident inputs

The baseline must support alerting inputs for:

  • platform service health degradation
  • stuck or repeatedly failing workflows
  • NATS consumer or stream issues
  • inventory drift that blocks automation
  • storage and database risk conditions
  • inter-datacenter communication problems
  • critical dependency failures between internal services
  • degraded-mode activation for important workflows

This ADR does not define final alert routing or on-call process, but it requires that the data needed for alerting exists.

Alerting inputs should also carry enough classification to support:

  • automatic remediation when pre-approved by policy
  • operator-action-required incidents
  • escalation to higher-severity operational response when control-plane safety or site availability is at risk
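The three response classes above can be made concrete as a classification step between signal and routing. This is a sketch under assumed policy tables; the alert kinds and the pre-approved set are hypothetical examples, not Dataverket policy.

```python
from enum import Enum


class AlertAction(Enum):
    AUTO_REMEDIATE = "auto-remediate"
    OPERATOR_ACTION = "operator-action-required"
    ESCALATE = "escalate"


# Hypothetical policy tables; real contents would come from a later policy ADR.
PRE_APPROVED = {"stuck-consumer-restart", "cache-node-replace"}
CONTROL_PLANE_RISK = {"control-plane-db-degraded", "site-partition"}


def classify(alert_kind):
    """Map an alert kind to one of the three response classes described above."""
    if alert_kind in CONTROL_PLANE_RISK:
        # Control-plane safety or site availability at risk: escalate first.
        return AlertAction.ESCALATE
    if alert_kind in PRE_APPROVED:
        return AlertAction.AUTO_REMEDIATE
    return AlertAction.OPERATOR_ACTION
```

The design choice worth noting is ordering: escalation checks run before automation checks, so a pre-approved remediation can never mask a control-plane safety condition.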

Multi-datacenter requirements

Because Dataverket is multi-datacenter, observability must support:

  • site-aware signals
  • visibility into inter-site message and failover paths
  • ability to distinguish local failure from cross-site dependency failure
  • enough context to debug failover and degraded-mode behavior

Operators should also be able to see service dependency health in a way that explains platform impact, not just raw component status. A green NATS cluster and a red Nett service mean something very different for provisioning than for object storage.

Observability that cannot tell operators which site is failing is insufficient.
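The local-versus-cross-site distinction can be reduced to a small decision over per-site health signals. The sketch below assumes a simple site-to-healthy map as input; that shape and the diagnosis labels are illustrative, not a real Dataverket API.

```python
def locate_failure(local_site, dependency_health):
    """Coarse diagnosis from site-aware health signals.

    dependency_health maps site name -> bool (healthy?). The map shape and
    the returned labels are illustrative assumptions.
    """
    local_ok = dependency_health.get(local_site, False)
    remote_ok = any(ok for site, ok in dependency_health.items()
                    if site != local_site)
    if local_ok and remote_ok:
        return "healthy"
    if not local_ok and remote_ok:
        # Only this site sees the dependency as down: local fault.
        return "local-failure"
    if local_ok and not remote_ok:
        # This site is fine but its remote peers are not: cross-site fault.
        return "cross-site-dependency-failure"
    return "platform-wide-failure"
```

This is the minimum the requirement demands: without per-site signals as input, none of these four diagnoses can be distinguished, and failover debugging degenerates into guesswork.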

Operational baseline

The first production-capable platform release must provide:

  • dashboards or equivalent operator views
  • searchable logs or equivalent log inspection workflow
  • task and workflow inspection
  • baseline alerting inputs
  • incident-friendly correlation across services
  • linkage from important alerts or blocked workflows to the relevant runbook or supported response path

The implementation may evolve, but the capability must exist from the beginning.

Explicit non-decisions for now

This ADR intentionally does not yet choose:

  • a specific metrics stack
  • a specific logging backend
  • a specific tracing backend
  • a final dashboard product
  • a final alert routing product

Those require later implementation or selection ADRs.

Consequences

  • observability is no longer deferrable as a late-stage enhancement
  • platform teams must emit usable operational signals from the first services they build
  • debugging NATS workflows, failover, and provisioning becomes an explicit architectural requirement

Decision Outcome

Proposed. This ADR records the current preferred direction and still needs acceptance before it becomes binding.

More Information

Follow-up ADRs are expected for:

  • metrics stack selection
  • logging stack selection
  • tracing and workflow correlation implementation
  • alerting and incident routing model
  • task inspection surface and operator tooling

Audit

  • 2026-03-14: ADR proposed.