DATAVERKET-017: Observability and operations baseline

Proposed | Infrastructure | Observability | Operations | Monitoring | Alerts | Tracing

Defines observability, operational visibility, and actionable monitoring as a baseline requirement for the Dataverket platform.

Author
Lars Solem
Updated

Status
Proposed on 2026-03-14 by Lars Solem.

Context

Dataverket is building an event-driven, multi-service, multi-datacenter platform that will automate provisioning, networking, clusters, VMs, storage, and failover.

That architecture is not operable without a shared observability baseline. Logs, metrics, traces, task visibility, and health signals cannot be treated as optional product polish added after the platform already exists.

Decision

Dataverket adopts observability and operational visibility as a Phase 0 and Phase 1 platform baseline.

The platform must provide a shared operational capability for:

  • logs
  • metrics
  • traces or equivalent workflow correlation
  • task inspection
  • alerting inputs
  • health and drift visibility

This capability may initially be delivered as modules rather than standalone services, but it is a required foundation for all later platform products.

Observability alone is not sufficient. The platform must also make those signals actionable within a day-2 operating model that includes remediation boundaries, escalation inputs, and runbook linkage.

Scope

The observability baseline must cover:

  • control-plane services
  • background workers
  • NATS-driven workflows
  • provisioning workflows
  • network automation workflows
  • storage and persistence workflows
  • multi-datacenter coordination paths

Logging requirements

The baseline must support:

  • centralized log collection
  • structured logs where practical
  • correlation between logs and task or workflow identifiers
  • retention suitable for incident investigation

Logs must be useful for debugging platform behavior, not merely archived.
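One way to meet the structured-log and correlation requirements is a JSON formatter that carries workflow identifiers on every record. This is a minimal sketch using the Python standard library; the field names (`task_id`, `correlation_id`) are illustrative, not a fixed Dataverket schema.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object so log pipelines can index fields."""

    def format(self, record):
        entry = {
            "level": record.levelname,
            "message": record.getMessage(),
            # Correlation fields are illustrative names, not a fixed schema.
            "task_id": getattr(record, "task_id", None),
            "correlation_id": getattr(record, "correlation_id", None),
        }
        return json.dumps(entry)


def make_logger(stream):
    """Attach the JSON formatter to a logger writing to the given stream."""
    logger = logging.getLogger("dataverket.example")
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger
```

A caller would then pass identifiers through the standard `extra` mechanism, e.g. `logger.info("vm provision started", extra={"task_id": "t-123", "correlation_id": "c-9"})`, and the log backend can join those fields against task and workflow records.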

Metrics requirements

The baseline must support metrics for:

  • service health and latency
  • task throughput and failure rates
  • NATS delivery and consumer health
  • provisioning success and failure
  • network automation success, drift, and rollout health
  • control-plane PostgreSQL health
  • storage and capacity visibility

Metrics should support both dashboards and alerting.
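To show how one set of counters can feed both dashboards and alerting, here is a tiny in-memory sketch: success and failure counters per label set, with a derived failure rate an alert rule could threshold on. Metric names and labels are illustrative only; a real deployment would use whatever metrics stack a later ADR selects.

```python
from collections import defaultdict


class Metrics:
    """Tiny in-memory metrics sketch; names and labels are illustrative only."""

    def __init__(self):
        self.counters = defaultdict(float)

    def inc(self, name, labels=(), value=1.0):
        # labels is an iterable of (key, value) pairs, e.g. (("site", "dc1"),)
        self.counters[(name, tuple(sorted(labels)))] += value

    def failure_rate(self, name, labels=()):
        """Derived signal: failures / (successes + failures) for one label set."""
        key = tuple(sorted(labels))
        ok = self.counters[(name + "_success", key)]
        failed = self.counters[(name + "_failure", key)]
        total = ok + failed
        return failed / total if total else 0.0
```

The same counters that render a provisioning dashboard per site can drive an alert such as "failure rate above 10% for 5 minutes", which is why the baseline treats dashboards and alerting as two consumers of one signal, not two pipelines.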

Trace and workflow correlation requirements

The platform must provide a way to correlate work across service boundaries.

This may be distributed tracing, or an equivalent model built from:

  • correlation IDs
  • causation IDs
  • task identifiers
  • event lineage

For Dataverket, workflow correlation is mandatory even if the initial implementation is not a full tracing stack.
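The correlation-ID model above can be sketched without any tracing stack: every event carries the workflow's correlation ID plus the ID of the event that caused it, so lineage can be reconstructed from the event log alone. The envelope below is a minimal illustration; the field names are assumptions, not a defined Dataverket event schema.

```python
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    """Minimal event envelope; field names are illustrative, not a fixed schema."""
    name: str
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    correlation_id: str = ""  # shared across an entire workflow
    causation_id: str = ""    # the event that directly caused this one


def start_workflow(name):
    """The root event uses its own ID as the workflow correlation ID."""
    eid = str(uuid.uuid4())
    return Event(name=name, event_id=eid, correlation_id=eid)


def caused_by(parent, name):
    """Children keep the workflow's correlation_id; causation_id points at the parent."""
    return Event(name=name,
                 correlation_id=parent.correlation_id,
                 causation_id=parent.event_id)
```

Filtering a log store by `correlation_id` yields the whole workflow; walking `causation_id` links yields the causal chain, which is the "equivalent model" this section permits in place of full distributed tracing.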

Task and operator visibility

Operators must be able to answer:

  • what is running
  • what failed
  • why it failed
  • what it was trying to affect
  • whether it is safe to retry
  • which dependency is unavailable or degraded
  • whether the affected workflow is blocked, retrying, or running in reduced mode

That means task state, recent events, and key failure signals must be inspectable through supported operational surfaces.
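The operator questions above map almost one-to-one onto fields of a task summary record. The sketch below shows one such shape and a one-line rendering an operational surface (CLI or UI) could use; the field names and state values are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskView:
    """Operator-facing task summary; fields mirror the operator questions (illustrative)."""
    task_id: str
    state: str                        # e.g. "running", "failed", "retrying", "blocked"
    target: str                       # what the task was trying to affect
    error: Optional[str] = None       # why it failed, if it did
    retry_safe: bool = False          # whether retrying is known to be idempotent
    blocked_on: Optional[str] = None  # unavailable or degraded dependency


def summarize(task):
    """One line an operator surface (CLI, UI) could render."""
    line = f"{task.task_id} [{task.state}] target={task.target}"
    if task.error:
        line += f" error={task.error!r}"
    if task.blocked_on:
        line += f" blocked_on={task.blocked_on}"
    line += " retry=safe" if task.retry_safe else " retry=unknown"
    return line
```

The point of the sketch is the inverse mapping: if a field cannot be populated for a given task, one of the operator questions above cannot be answered, which is exactly the gap this requirement is meant to close.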

Alerting and incident inputs

The baseline must support alerting inputs for:

  • platform service health degradation
  • stuck or repeatedly failing workflows
  • NATS consumer or stream issues
  • inventory drift that blocks automation
  • storage and database risk conditions
  • inter-datacenter communication problems
  • critical dependency failures between internal services
  • degraded-mode activation for important workflows

This ADR does not define final alert routing or on-call process, but it requires that the data needed for alerting exists.

Alerting inputs should also carry enough classification to support:

  • automatic remediation when pre-approved by policy
  • operator-action-required incidents
  • escalation to higher-severity operational response when control-plane safety or site availability is at risk
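The three response classes above can be made concrete as a classification step between signal and routing. This is a sketch under assumed policy tables; the alert kinds and the pre-approved set are hypothetical examples, not Dataverket policy.

```python
from enum import Enum


class AlertAction(Enum):
    AUTO_REMEDIATE = "auto-remediate"
    OPERATOR_ACTION = "operator-action-required"
    ESCALATE = "escalate"


# Hypothetical policy tables; real contents would come from a later policy ADR.
PRE_APPROVED = {"stuck-consumer-restart", "cache-node-replace"}
CONTROL_PLANE_RISK = {"control-plane-db-degraded", "site-partition"}


def classify(alert_kind):
    """Map an alert kind to one of the three response classes described above."""
    if alert_kind in CONTROL_PLANE_RISK:
        # Control-plane safety or site availability at risk: escalate first.
        return AlertAction.ESCALATE
    if alert_kind in PRE_APPROVED:
        return AlertAction.AUTO_REMEDIATE
    return AlertAction.OPERATOR_ACTION
```

The design choice worth noting is ordering: escalation checks run before automation checks, so a pre-approved remediation can never mask a control-plane safety condition.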

Multi-datacenter requirements

Because Dataverket is multi-datacenter, observability must support:

  • site-aware signals
  • visibility into inter-site message and failover paths
  • ability to distinguish local failure from cross-site dependency failure
  • enough context to debug failover and degraded-mode behavior

Operators should also be able to see service dependency health in a way that explains platform impact, not just raw component status. A green NATS cluster and a red Nett service mean something very different for provisioning than for object storage.

Observability that cannot tell operators which site is failing is insufficient.
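The local-versus-cross-site distinction can be reduced to a small decision over per-site health signals. The sketch below assumes a simple site-to-healthy map as input; that shape and the diagnosis labels are illustrative, not a real Dataverket API.

```python
def locate_failure(local_site, dependency_health):
    """Coarse diagnosis from site-aware health signals.

    dependency_health maps site name -> bool (healthy?). The map shape and
    the returned labels are illustrative assumptions.
    """
    local_ok = dependency_health.get(local_site, False)
    remote_ok = any(ok for site, ok in dependency_health.items()
                    if site != local_site)
    if local_ok and remote_ok:
        return "healthy"
    if not local_ok and remote_ok:
        # Only this site sees the dependency as down: local fault.
        return "local-failure"
    if local_ok and not remote_ok:
        # This site is fine but its remote peers are not: cross-site fault.
        return "cross-site-dependency-failure"
    return "platform-wide-failure"
```

This is the minimum the requirement demands: without per-site signals as input, none of these four diagnoses can be distinguished, and failover debugging degenerates into guesswork.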

Operational baseline

The first production-capable platform release must provide:

  • dashboards or equivalent operator views
  • searchable logs or equivalent log inspection workflow
  • task and workflow inspection
  • baseline alerting inputs
  • incident-friendly correlation across services
  • linkage from important alerts or blocked workflows to the relevant runbook or supported response path

The implementation may evolve, but the capability must exist from the beginning.

Explicit non-decisions for now

This ADR intentionally does not yet choose:

  • a specific metrics stack
  • a specific logging backend
  • a specific tracing backend
  • a final dashboard product
  • a final alert routing product

Those require later implementation or selection ADRs.

Consequences

  • observability is no longer deferrable as a late-stage enhancement
  • platform teams must emit usable operational signals from the first services they build
  • debugging NATS workflows, failover, and provisioning becomes an explicit architectural requirement

Decision Outcome

Proposed. This ADR records the current preferred direction and still needs acceptance before it becomes binding.

More Information

Follow-up ADRs are expected for:

  • metrics stack selection
  • logging stack selection
  • tracing and workflow correlation implementation
  • alerting and incident routing model
  • task inspection surface and operator tooling

Audit

  • 2026-03-14: ADR proposed.