Status
Proposed on 2026-03-14 by Lars Solem.
Context
Dataverket now has separate decisions for observability, workflow handling, inventory drift, and public APIs.
What is still missing is an explicit operator-facing model for how operators gain situational awareness across the platform. Operators need one coherent way to understand:
- what the platform is doing
- what failed
- what is degraded
- what needs approval or intervention
- what is happening in each datacenter
Without this, observability remains a collection of signals rather than a usable operational surface.
Decision
Dataverket provides a unified operator control surface for platform visibility and operational action.
This control surface may be composed of:
- API endpoints
- CLI workflows
- dashboards or an operator UI
However it is composed, it must behave as one coherent operator view rather than a scattered set of unrelated tools.
The operator model must cover not only visibility, but also how operators respond: runbooks, escalation, and the boundary between automated remediation and manual intervention.
Minimum operator questions the platform must answer
The operator control surface must make it possible to answer:
- what tasks are currently running
- what failed recently
- what retries or dead-lettered workflows exist
- what services or sites are degraded
- what inventory drift or pending approvals exist
- what resources are affected by an incident
- what actions are safe to retry, cancel, or approve
- which runbook or response path applies
- whether the platform is attempting automatic remediation or waiting for a human decision
Core operator views
The first version should provide at least these views:
- Task view: running, queued, failed, retried, cancelled, and dead-lettered work.
- Event and incident view: recent important platform events, correlated by task, resource, and site.
- Health view: service, dependency, and site-level health signals.
- Inventory and drift view: current declared inventory, discovered changes, trust level, and pending approvals.
- Datacenter view: site-local and cross-site status, including communication and failover-relevant signals.
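As an illustration only, the task view could be backed by a per-site summary like the sketch below. The state names come from the task view description above; the `Task` fields, the identifier format, and the site names `dc-a` and `dc-b` are hypothetical and not part of this decision.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class TaskState(Enum):
    # States listed for the task view in this ADR.
    RUNNING = "running"
    QUEUED = "queued"
    FAILED = "failed"
    RETRIED = "retried"
    CANCELLED = "cancelled"
    DEAD_LETTERED = "dead_lettered"


@dataclass(frozen=True)
class Task:
    task_id: str      # hypothetical identifier format
    state: TaskState
    datacenter: str   # hypothetical site name


def summarize_tasks(tasks):
    """Count tasks per (datacenter, state) pair so sites stay distinguishable."""
    return Counter((t.datacenter, t.state) for t in tasks)


tasks = [
    Task("t-1", TaskState.RUNNING, "dc-a"),
    Task("t-2", TaskState.FAILED, "dc-a"),
    Task("t-3", TaskState.FAILED, "dc-b"),
    Task("t-4", TaskState.FAILED, "dc-b"),
]
summary = summarize_tasks(tasks)
```

Keying the summary by site as well as state keeps the task view consistent with the datacenter view instead of producing one global total.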
Control actions
The operator surface must also support controlled operational actions, such as:
- inspect a task
- retry eligible failed work
- cancel long-running work where supported
- inspect dead-lettered work
- approve or reject sensitive inventory changes
- inspect site-specific degradation and recent failover activity
Not every action must be available in every interface, but the supported operator workflows must be coherent and auditable.
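One possible way to make "eligible" and "where supported" explicit is a state-to-action table, sketched below. The table contents and the `supports_cancel` flag are assumptions for illustration; the real eligibility rules would come from the workflow ADR.

```python
# Hypothetical mapping from task state to the actions an operator may take.
ELIGIBLE_ACTIONS = {
    "running": {"inspect", "cancel"},
    "queued": {"inspect", "cancel"},
    "failed": {"inspect", "retry"},
    "dead_lettered": {"inspect"},
    "cancelled": {"inspect"},
}


def eligible_actions(task_state: str, supports_cancel: bool = True) -> set[str]:
    """Return the actions allowed for a task in the given state.

    Unknown states only allow inspection, so new states fail safe.
    """
    actions = set(ELIGIBLE_ACTIONS.get(task_state, {"inspect"}))
    if not supports_cancel:
        # Cancellation is only offered where the workflow supports it.
        actions.discard("cancel")
    return actions
```

Every interface (API, CLI, UI) could consult the same table, which is one way to keep the offered actions coherent across tools.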
Runbook model
Important operational conditions should map to explicit runbook categories rather than leaving operators to improvise from raw telemetry.
The first operator model should distinguish at least:
- Automatic remediation: the platform may retry, reconcile, or recover without waiting for a human, while keeping the action visible and auditable.
- Operator-approved remediation: the platform can prepare or validate a recovery action, but a human must approve execution because the risk or blast radius is too high.
- Manual intervention required: the platform can detect and explain the condition, but a human must perform or coordinate the response.
Examples that may require operator approval or manual action include:
- failover between datacenters
- sensitive network topology changes
- destructive recovery or restore operations
- certificate or trust repair with cross-service impact
- inventory trust conflicts that affect shared infrastructure
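The three runbook categories can be read as a small decision rule about who is allowed to execute a recovery action. The sketch below is a minimal illustration under that reading; the enum, the function name, and the `approved_by` parameter are hypothetical.

```python
from enum import Enum
from typing import Optional


class RemediationMode(Enum):
    AUTOMATIC = "automatic"
    OPERATOR_APPROVED = "operator_approved"
    MANUAL = "manual"


def platform_may_execute(mode: RemediationMode, approved_by: Optional[str]) -> bool:
    """Decide whether the platform itself may run a prepared recovery action."""
    if mode is RemediationMode.AUTOMATIC:
        return True
    if mode is RemediationMode.OPERATOR_APPROVED:
        # A named human approver is required before execution.
        return bool(approved_by)
    # MANUAL: the platform detects and explains, a human performs the response.
    return False
```

Under this sketch a datacenter failover classified as `OPERATOR_APPROVED` would stay prepared but unexecuted until an approver is recorded.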
Runbook requirements
For important incidents and workflows, the platform should define:
- triggering conditions
- expected severity or escalation class
- whether automatic remediation is allowed
- which API or CLI actions the operator may take
- what evidence or state must be reviewed first
- how the outcome is recorded in audit and task history
Runbooks may be implemented as documentation, API-linked procedures, or UI guidance, but they must be tied to supported operator actions rather than free-form tribal knowledge.
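To show what "tied to supported operator actions" could mean in practice, the sketch below models a runbook as a record with the fields listed above and flags any action the platform does not support. The field names, the action names in `SUPPORTED_ACTIONS`, and the example runbook are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical set of operator actions the platform supports.
SUPPORTED_ACTIONS = {
    "inspect_task", "retry_task", "cancel_task", "approve_change", "reject_change",
}


@dataclass
class Runbook:
    name: str
    triggering_conditions: list[str]
    escalation_class: str
    automatic_remediation_allowed: bool
    operator_actions: list[str]
    evidence_to_review: list[str] = field(default_factory=list)
    audit_record_fields: list[str] = field(default_factory=list)


def unsupported_actions(runbook: Runbook) -> list[str]:
    """Return runbook actions that do not map to a supported operator action."""
    return [a for a in runbook.operator_actions if a not in SUPPORTED_ACTIONS]


example = Runbook(
    name="dead-lettered workflow",
    triggering_conditions=["task moved to dead-letter state"],
    escalation_class="operator-action-required",
    automatic_remediation_allowed=False,
    operator_actions=["inspect_task", "retry_task", "ssh_and_fix"],
    evidence_to_review=["task history", "last error"],
)
```

A check like `unsupported_actions(example)` would surface the free-form `ssh_and_fix` step as something the platform cannot audit.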
Escalation model
The operator surface should support a simple escalation model that distinguishes at least:
- informational conditions
- operator-action-required conditions
- urgent conditions affecting control-plane safety or multi-site availability
The exact on-call tooling can be chosen later, but the platform must expose enough context to route incidents and explain why escalation occurred.
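A minimal sketch of the three-level escalation model follows. It returns a reason alongside the class, since the platform is expected to explain why escalation occurred. The input flags and the reason strings are assumptions, not a defined schema.

```python
from enum import Enum


class Escalation(Enum):
    INFORMATIONAL = 1
    OPERATOR_ACTION_REQUIRED = 2
    URGENT = 3


def classify(needs_operator: bool, threatens_control_plane: bool,
             threatens_multi_site: bool) -> tuple[Escalation, str]:
    """Return the escalation class and a human-readable reason."""
    if threatens_control_plane:
        return Escalation.URGENT, "control-plane safety is affected"
    if threatens_multi_site:
        return Escalation.URGENT, "multi-site availability is affected"
    if needs_operator:
        return Escalation.OPERATOR_ACTION_REQUIRED, "a human decision or action is pending"
    return Escalation.INFORMATIONAL, "no operator action is needed"
```

The urgent checks run first so that a condition needing approval and threatening multi-site availability is routed as urgent rather than as routine operator work.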
API and CLI relationship
The operator control surface should be built on supported APIs.
That means:
- operator actions should map to explicit API operations
- CLI workflows should target those APIs
- dashboards or UI should not rely on hidden backend behavior unavailable elsewhere
This keeps operator workflows automatable and auditable.
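The relationship above can be sketched as a single operation registry that the CLI resolves against, so a CLI command cannot exist without a matching API operation. The operation names, HTTP methods, and paths below are purely illustrative; this ADR does not define them.

```python
# Hypothetical operation registry: names, methods, and paths are illustrative only.
API_OPERATIONS = {
    "task.inspect": ("GET", "/operator/tasks/{task_id}"),
    "task.retry": ("POST", "/operator/tasks/{task_id}/retry"),
    "task.cancel": ("POST", "/operator/tasks/{task_id}/cancel"),
}


def build_request(operation: str, **params: str) -> tuple[str, str]:
    """Resolve an operation name to the HTTP method and path of the API call."""
    # A KeyError here means the operation has no supported API behind it.
    method, template = API_OPERATIONS[operation]
    return method, template.format(**params)


def cli(argv: list[str]) -> tuple[str, str]:
    """Translate `task retry <id>` style arguments into an API request."""
    resource, verb, task_id = argv
    return build_request(f"{resource}.{verb}", task_id=task_id)
```

For example, `cli(["task", "retry", "t-42"])` resolves to a `POST` on the hypothetical retry path, while an unregistered verb fails instead of reaching hidden backend behavior.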
Correlation requirements
The operator surface must correlate:
- task IDs
- correlation IDs
- causation IDs
- resource identifiers
- tenant/project/environment context where relevant
- datacenter identity
Operators should not have to manually stitch together incidents from raw logs alone.
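As a rough sketch of what this correlation could look like, the code below groups events by correlation ID and walks causation IDs back to a root event. The `Event` fields mirror the identifiers listed above; the assumption that a causation ID holds the `event_id` of the causing event, and all example values, are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    event_id: str
    correlation_id: str
    causation_id: Optional[str]  # assumed: event_id of the event that caused this one
    task_id: Optional[str]
    resource_id: Optional[str]
    datacenter: str
    message: str


def group_by_correlation(events):
    """Collect all events that share a correlation ID."""
    groups = defaultdict(list)
    for e in events:
        groups[e.correlation_id].append(e)
    return dict(groups)


def causal_chain(events, event_id):
    """Walk causation IDs from an event back to its root cause, root first."""
    by_id = {e.event_id: e for e in events}
    chain, seen = [], set()
    current = by_id.get(event_id)
    while current is not None and current.event_id not in seen:
        seen.add(current.event_id)  # guards against cyclic causation data
        chain.append(current)
        current = by_id.get(current.causation_id) if current.causation_id else None
    return list(reversed(chain))


events = [
    Event("e1", "c1", None, "t-1", "vm-9", "dc-a", "provision started"),
    Event("e2", "c1", "e1", "t-1", "vm-9", "dc-a", "network attach failed"),
    Event("e3", "c1", "e2", "t-1", "vm-9", "dc-a", "task dead-lettered"),
    Event("e4", "c2", None, "t-2", "vm-3", "dc-b", "unrelated event"),
]
```

With these example events, the chain for `e3` leads back through `e2` to `e1`, which is the kind of stitching operators would otherwise do by hand from logs.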
Multi-datacenter requirements
Because Dataverket is multi-datacenter, the operator surface must show:
- which datacenter a task or failure belongs to
- whether an issue is site-local or cross-site
- current inter-datacenter communication health
- recent failover or recovery actions relevant to the operator
Any operator surface that collapses all sites into one undifferentiated status view is insufficient.
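One simple way to keep sites differentiated is to derive the scope of each issue from the set of datacenters it affects and to present issues per site. The sketch below assumes issues arrive as a mapping from issue name to affected sites; the names used are hypothetical.

```python
def issue_scope(affected_datacenters: set[str]) -> str:
    """Classify an issue by how many sites it touches."""
    if not affected_datacenters:
        raise ValueError("an issue must belong to at least one datacenter")
    return "site-local" if len(affected_datacenters) == 1 else "cross-site"


def per_site_view(issues: dict[str, set[str]]) -> dict[str, list[tuple[str, str]]]:
    """Map each datacenter to the (issue, scope) pairs that affect it."""
    view: dict[str, list[tuple[str, str]]] = {}
    for issue, sites in sorted(issues.items()):
        scope = issue_scope(sites)
        for site in sorted(sites):
            view.setdefault(site, []).append((issue, scope))
    return view


issues = {
    "disk-full": {"dc-a"},
    "replication-lag": {"dc-a", "dc-b"},
}
view = per_site_view(issues)
```

Here `dc-a` shows both issues while `dc-b` shows only the cross-site one, so an operator can tell at a glance which problems are confined to a single site.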
Access and audit requirements
Operator visibility must still respect authorization boundaries.
That means:
- operator-only infrastructure views remain separate from tenant-facing views
- sensitive actions require proper authorization
- operator actions are auditable
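The sketch below illustrates how authorization and audit could be combined so that every attempted action leaves a record, including denied ones. The roles, grants, and record fields are hypothetical; the operator RBAC model is not decided by this ADR.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical role-to-action grants.
ROLE_GRANTS = {
    "operator": {"inspect_task", "retry_task"},
    "senior_operator": {"inspect_task", "retry_task", "approve_change"},
}


@dataclass(frozen=True)
class AuditRecord:
    at: datetime
    actor: str
    role: str
    action: str
    target: str
    allowed: bool


def perform(actor: str, role: str, action: str, target: str, audit_log: list) -> bool:
    """Check authorization and record the attempt, whether allowed or denied."""
    allowed = action in ROLE_GRANTS.get(role, set())
    audit_log.append(
        AuditRecord(datetime.now(timezone.utc), actor, role, action, target, allowed)
    )
    return allowed
```

Recording denied attempts as well as successful ones is a design choice in this sketch; it keeps the audit trail useful when investigating why an operator could not act.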
Explicit non-decisions for now
This ADR intentionally does not yet choose:
- a specific UI product
- a specific dashboard technology
- the exact split between CLI and web-based operational tooling
Those can be decided later as long as they satisfy this operator model.
Consequences
- observability signals must now be assembled into real operator workflows
- task, drift, and site status become part of one operational story instead of isolated subsystems
- later tooling choices will be constrained by the need for coherent cross-domain visibility
Decision Outcome
Proposed. This ADR records the current preferred direction and still needs acceptance before it becomes binding.
Related Decisions
- This ADR builds on 017-observability-and-operations-baseline.md.
- Task visibility and replay concerns must align with 019-workflow-retry-dead-letter-and-reconciliation.md.
- Inventory approvals and drift visibility must align with 021-inventory-bootstrap-and-drift-management.md.
- Public API and operator actions must align with 008-public-api-style.md.
More Information
Follow-up topics that this ADR leaves to later work:
- operator task inspection API
- operator approval workflow API
- dashboard or UI implementation choice
- operator RBAC model
Audit
- 2026-03-14: ADR proposed.