Operator visibility and control surface SENTRAL 022

Tags: Proposed, Architecture, Operations, Visibility, Control surface, Runbooks, Operator experience

Defines the unified operator-facing visibility and control surface for tasks, health, drift, incidents, and remediation.

Author: Lars Solem
Updated: 2026-03-14

Status

Proposed on 2026-03-14 by Lars Solem.

Context

Dataverket now has separate decisions for observability, workflow handling, inventory drift, and public APIs.

Still missing is an explicit operator-facing model for how humans gain situational awareness across the platform. Operators need one coherent way to understand:

what the platform is doing
what failed
what is degraded
what needs approval or intervention
what is happening in each datacenter

Without this, observability remains a collection of signals rather than a usable operational surface.

Decision

Dataverket provides a unified operator control surface for platform visibility and operational action.

This control surface may be composed of:

API endpoints
CLI workflows
dashboards or an operator UI

However it is composed, it must behave as one coherent operator view rather than a scattered set of unrelated tools.

The operator model must cover not only visibility, but also how operators respond: runbooks, escalation, and the boundary between automated remediation and manual intervention.

Minimum operator questions the platform must answer

The operator control surface must make it possible to answer:

what tasks are currently running
what failed recently
what retries or dead-lettered workflows exist
what services or sites are degraded
what inventory drift or pending approvals exist
what resources are affected by an incident
what actions are safe to retry, cancel, or approve
which runbook or response path applies
whether the platform is attempting automatic remediation or waiting for a human decision

Core operator views

The first version should provide at least these views:

Task view: Running, queued, failed, retried, cancelled, and dead-lettered work.
Event and incident view: Recent important platform events, correlated by task, resource, and site.
Health view: Service, dependency, and site-level health signals.
Inventory and drift view: Current declared inventory, discovered changes, trust level, and pending approvals.
Datacenter view: Site-local and cross-site status, including communication and failover-relevant signals.
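
As an illustration of the task view, the following minimal sketch counts tasks per datacenter and state. The state names come from the task view description; the class names, field names, and datacenter labels are assumptions made for this example, not part of any decided API.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class TaskState(Enum):
    # States listed in the task view description.
    QUEUED = "queued"
    RUNNING = "running"
    FAILED = "failed"
    RETRIED = "retried"
    CANCELLED = "cancelled"
    DEAD_LETTERED = "dead_lettered"


@dataclass(frozen=True)
class TaskSummary:
    # Hypothetical field names, chosen for illustration only.
    task_id: str
    datacenter: str
    state: TaskState


def task_view(tasks: list[TaskSummary]) -> dict[str, dict[str, int]]:
    """Count tasks per datacenter and state, keeping sites separate."""
    counts = Counter((t.datacenter, t.state.value) for t in tasks)
    view: dict[str, dict[str, int]] = {}
    for (datacenter, state), n in counts.items():
        view.setdefault(datacenter, {})[state] = n
    return view
```

Keying the result by datacenter first keeps site identity visible in the summary instead of merging all sites into one total.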

Control actions

The operator surface must also support controlled operational actions, such as:

inspect a task
retry eligible failed work
cancel long-running work where supported
inspect dead-lettered work
approve or reject sensitive inventory changes
inspect site-specific degradation and recent failover activity

Not every action must be available in every interface, but the supported operator workflows must be coherent and auditable.
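
One way to keep actions controlled is to derive the permitted actions from task state and per-task capabilities. The sketch below is a hypothetical policy: the state-to-action table and the `retryable` and `cancellable` flags are assumptions for illustration, standing in for "eligible failed work" and "where supported".

```python
from dataclasses import dataclass

# Hypothetical policy table: which actions each task state permits.
ALLOWED_ACTIONS: dict[str, frozenset[str]] = {
    "queued": frozenset({"inspect", "cancel"}),
    "running": frozenset({"inspect", "cancel"}),
    "failed": frozenset({"inspect", "retry"}),
    "dead_lettered": frozenset({"inspect"}),
    "cancelled": frozenset({"inspect"}),
}


@dataclass(frozen=True)
class Task:
    task_id: str
    state: str
    retryable: bool = True    # assumed flag: is the failed work eligible for retry
    cancellable: bool = True  # assumed flag: does the workflow support cancel


def allowed_actions(task: Task) -> frozenset[str]:
    """Return the actions an operator may take on this task right now."""
    # Unknown states fall back to inspect-only.
    actions = set(ALLOWED_ACTIONS.get(task.state, frozenset({"inspect"})))
    if not task.retryable:
        actions.discard("retry")
    if not task.cancellable:
        actions.discard("cancel")
    return frozenset(actions)
```

Every interface could call the same function, so the CLI and a UI would offer the same actions for the same task.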

Runbook model

Important operational conditions should map to explicit runbook categories rather than leaving operators to improvise from raw telemetry.

The first operator model should distinguish at least:

Automatic remediation: The platform may retry, reconcile, or recover without waiting for a human, while keeping the action visible and auditable.
Operator-approved remediation: The platform can prepare or validate a recovery action, but a human must approve execution because the risk or blast radius is too high.
Manual intervention required: The platform can detect and explain the condition, but a human must perform or coordinate the response.

Examples that may require operator approval or manual action include:

failover between datacenters
sensitive network topology changes
destructive recovery or restore operations
certificate or trust repair with cross-service impact
inventory trust conflicts that affect shared infrastructure
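
The three runbook categories can be read as a small decision rule. The sketch below assumes two boolean inputs, `platform_can_act` and `high_risk`, which are simplifications introduced here; a real classification would likely weigh more factors.

```python
from enum import Enum


class RemediationMode(Enum):
    AUTOMATIC = "automatic"
    OPERATOR_APPROVED = "operator_approved"
    MANUAL = "manual"


def remediation_mode(platform_can_act: bool, high_risk: bool) -> RemediationMode:
    """Pick a runbook category from two assumed inputs.

    platform_can_act: the platform can prepare or execute the recovery itself.
    high_risk: the risk or blast radius is too high to proceed unattended.
    """
    if not platform_can_act:
        # The platform can only detect and explain; a human performs the response.
        return RemediationMode.MANUAL
    if high_risk:
        # The platform prepares the action; a human approves execution.
        return RemediationMode.OPERATOR_APPROVED
    return RemediationMode.AUTOMATIC
```

Under this rule, a datacenter failover that the platform can prepare but that has a large blast radius would land in `OPERATOR_APPROVED`.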

Runbook requirements

For important incidents and workflows, the platform should define:

triggering conditions
expected severity or escalation class
whether automatic remediation is allowed
which API or CLI actions the operator may take
what evidence or state must be reviewed first
how the outcome is recorded in audit and task history

Runbooks may be implemented as documentation, API-linked procedures, or UI guidance, but they must be tied to supported operator actions rather than free-form tribal knowledge.
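
The runbook requirements above could be captured as a structured record, which makes it possible to check mechanically that a runbook only references supported actions. This is a sketch under assumptions: the field names mirror the listed requirements, while the action names in `SUPPORTED_ACTIONS` are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical set of actions the operator API supports.
SUPPORTED_ACTIONS = frozenset(
    {"inspect_task", "retry_task", "cancel_task", "approve_change", "reject_change"}
)


@dataclass(frozen=True)
class Runbook:
    name: str
    triggering_conditions: tuple[str, ...]
    escalation_class: str
    automatic_remediation_allowed: bool
    operator_actions: tuple[str, ...]    # API or CLI actions the operator may take
    evidence_to_review: tuple[str, ...]  # state to inspect before acting
    audit_recording: str                 # how the outcome is recorded


def unsupported_actions(runbook: Runbook) -> list[str]:
    """List runbook actions that no supported operator action backs."""
    return sorted(set(runbook.operator_actions) - SUPPORTED_ACTIONS)
```

A non-empty result would flag a runbook that depends on free-form steps rather than supported operator actions.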

Escalation model

The operator surface should support a simple escalation model that distinguishes at least:

informational conditions
operator-action-required conditions
urgent conditions affecting control-plane safety or multi-site availability

The exact on-call tooling can be chosen later, but the platform must expose enough context to route incidents and explain why escalation occurred.
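
A minimal sketch of the three escalation classes follows. It returns a reason alongside the class so the surface can explain why escalation occurred. The `Condition` fields and the reason strings are assumptions for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum


class Escalation(IntEnum):
    INFORMATIONAL = 0
    ACTION_REQUIRED = 1
    URGENT = 2


@dataclass(frozen=True)
class Condition:
    # Hypothetical inputs describing a detected condition.
    summary: str
    needs_operator: bool = False
    affects_control_plane_safety: bool = False
    affects_multi_site_availability: bool = False


def classify(condition: Condition) -> tuple[Escalation, str]:
    """Return the escalation class and a human-readable reason."""
    if condition.affects_control_plane_safety:
        return Escalation.URGENT, "control-plane safety is affected"
    if condition.affects_multi_site_availability:
        return Escalation.URGENT, "multi-site availability is affected"
    if condition.needs_operator:
        return Escalation.ACTION_REQUIRED, "an operator decision or action is required"
    return Escalation.INFORMATIONAL, "no operator action is required"
```

Because `Escalation` is an `IntEnum`, classes can be compared and sorted by severity when routing incidents.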

API and CLI relationship

The operator control surface should be built on supported APIs.

That means:

operator actions should map to explicit API operations
CLI workflows should target those APIs
dashboards or UI should not rely on hidden backend behavior unavailable elsewhere

This keeps operator workflows automatable and auditable.
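
To illustrate the mapping from operator actions to explicit API operations, the sketch below keeps one registry that both a CLI and a UI could consult. The action names and URL paths are hypothetical; this ADR does not define the actual endpoints.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ApiOperation:
    method: str
    path_template: str


# Hypothetical registry: every operator action is backed by one API operation.
OPERATIONS: dict[str, ApiOperation] = {
    "inspect-task": ApiOperation("GET", "/operator/tasks/{task_id}"),
    "retry-task": ApiOperation("POST", "/operator/tasks/{task_id}/retry"),
    "cancel-task": ApiOperation("POST", "/operator/tasks/{task_id}/cancel"),
}


def build_request(action: str, **params: str) -> tuple[str, str]:
    """Translate an operator action into an HTTP method and path."""
    op = OPERATIONS.get(action)
    if op is None:
        # An action without a backing API operation is rejected outright.
        raise ValueError(f"no API operation backs action {action!r}")
    return op.method, op.path_template.format(**params)
```

A CLI subcommand would call `build_request("retry-task", task_id=...)` and send the result, so it cannot perform anything the API does not expose.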

Correlation requirements

The operator surface must correlate:

task IDs
correlation IDs
causation IDs
resource identifiers
tenant/project/environment context where relevant
datacenter identity

Operators should not have to manually stitch together incidents from raw logs alone.
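
The sketch below shows how the identifiers listed above could be used to reconstruct the chain of events behind an incident. It assumes that each event carries an `event_id` and that `causation_id` holds the `event_id` of the event that caused it; both are assumptions for this example, and the tenant, project, and environment context is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    event_id: str
    correlation_id: str
    causation_id: Optional[str]  # assumed: event_id of the causing event
    task_id: str
    resource_id: str
    datacenter: str


def causal_chain(events: list[Event], event_id: str) -> list[Event]:
    """Walk causation links back from one event, returning root cause first."""
    by_id = {e.event_id: e for e in events}
    chain: list[Event] = []
    seen: set[str] = set()
    current = by_id.get(event_id)
    # The seen set guards against accidental cycles in the causation links.
    while current is not None and current.event_id not in seen:
        seen.add(current.event_id)
        chain.append(current)
        current = by_id.get(current.causation_id) if current.causation_id else None
    chain.reverse()
    return chain
```

An incident view could show this chain directly, so the operator sees which task and datacenter the failure started in.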

Multi-datacenter requirements

Because Dataverket is multi-datacenter, the operator surface must show:

which datacenter a task or failure belongs to
whether an issue is site-local or cross-site
current inter-datacenter communication health
recent failover or recovery actions relevant to the operator

Any operator surface that collapses all sites into one undifferentiated status view is insufficient.
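
As a minimal sketch of the site-local versus cross-site distinction, the function below labels each incident by the set of datacenters its failures touch. The `Failure` record and the notion of an `incident_id` grouping key are assumptions for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Failure:
    # Hypothetical record linking a failed task to a site and an incident.
    task_id: str
    datacenter: str
    incident_id: str


def scope_by_incident(failures: list[Failure]) -> dict[str, tuple[str, list[str]]]:
    """Label each incident as site-local or cross-site, listing its sites."""
    sites: dict[str, set[str]] = defaultdict(set)
    for f in failures:
        sites[f.incident_id].add(f.datacenter)
    return {
        incident: ("site-local" if len(dcs) == 1 else "cross-site", sorted(dcs))
        for incident, dcs in sites.items()
    }
```

The result keeps the list of affected datacenters next to the label, so the view never reduces an incident to a single undifferentiated status.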

Access and audit requirements

Operator visibility must still respect authorization boundaries.

That means:

operator-only infrastructure views remain separate from tenant-facing views
sensitive actions require proper authorization
operator actions are auditable
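
The following sketch combines the authorization and audit requirements: sensitive actions need an extra role, and every attempt is recorded whether or not it was allowed. The role name `inventory-approver`, the action names, and the choice to audit denied attempts are assumptions for this example; the operator RBAC model is not defined by this ADR.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical set of actions that need an extra role.
SENSITIVE_ACTIONS = frozenset({"approve_change", "reject_change"})
APPROVER_ROLE = "inventory-approver"  # assumed role name


@dataclass
class AuditLog:
    entries: list[dict] = field(default_factory=list)

    def record(self, actor: str, action: str, target: str, allowed: bool) -> None:
        self.entries.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "actor": actor,
            "action": action,
            "target": target,
            "allowed": allowed,
        })


def authorize(actor: str, roles: set[str], action: str, target: str,
              log: AuditLog) -> bool:
    """Check authorization for an action and audit the attempt either way."""
    allowed = action not in SENSITIVE_ACTIONS or APPROVER_ROLE in roles
    log.record(actor, action, target, allowed)
    return allowed
```

Recording denied attempts as well as allowed ones gives the audit trail a complete picture of what operators tried to do.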

Explicit non-decisions for now

This ADR intentionally does not yet choose:

a specific UI product
a specific dashboard technology
the exact split between CLI and web-based operational tooling

Those can be decided later as long as they satisfy this operator model.

Consequences

observability signals must now be assembled into real operator workflows
task, drift, and site status become part of one operational story instead of isolated subsystems
later tooling choices will be constrained by the need for coherent cross-domain visibility

Decision Outcome

Proposed. This ADR records the current preferred direction and still needs acceptance before it becomes binding.

This ADR builds on 017-observability-and-operations-baseline.md.
Task visibility and replay concerns must align with 019-workflow-retry-dead-letter-and-reconciliation.md.
Inventory approvals and drift visibility must align with 021-inventory-bootstrap-and-drift-management.md.
Public API and operator actions must align with 008-public-api-style.md.

More Information

Related topics still to be specified:

operator task inspection API
operator approval workflow API
dashboard or UI implementation choice
operator RBAC model

Audit

2026-03-14: ADR proposed.