Operator visibility and control surface SENTRAL 022

Tags: Architecture, Operations, Visibility, Control surface, Runbooks, Operator experience

Defines the unified operator-facing visibility and control surface for tasks, health, drift, incidents, and remediation.

Author: Lars Solem

Status

Proposed on 2026-03-14 by Lars Solem.

Context

Dataverket now has separate decisions for observability, workflow handling, inventory drift, and public APIs.

What is still missing is an explicit operator-facing model for how humans actually get situational awareness across the platform. Operators need one coherent way to understand:

  • what the platform is doing
  • what failed
  • what is degraded
  • what needs approval or intervention
  • what is happening in each datacenter

Without this, observability remains a collection of signals rather than a usable operational surface.

Decision

Dataverket provides a unified operator control surface for platform visibility and operational action.

This control surface may be composed of:

  • API endpoints
  • CLI workflows
  • dashboards or an operator UI

However it is composed, it must behave as one coherent operator view rather than a scattered set of unrelated tools.

The operator model must cover not only visibility, but also how operators respond: runbooks, escalation, and the boundary between automated remediation and manual intervention.

Minimum operator questions the platform must answer

The operator control surface must make it possible to answer:

  • what tasks are currently running
  • what failed recently
  • what retries or dead-lettered workflows exist
  • what services or sites are degraded
  • what inventory drift or pending approvals exist
  • what resources are affected by an incident
  • what actions are safe to retry, cancel, or approve
  • which runbook or response path applies
  • whether the platform is attempting automatic remediation or waiting for a human decision

Core operator views

The first version should provide at least these views:

  1. Task view: running, queued, failed, retried, cancelled, and dead-lettered work.

  2. Event and incident view: recent important platform events, correlated by task, resource, and site.

  3. Health view: service, dependency, and site-level health signals.

  4. Inventory and drift view: current declared inventory, discovered changes, trust level, and pending approvals.

  5. Datacenter view: site-local and cross-site status, including communication and failover-relevant signals.

Control actions

The operator surface must also support controlled operational actions, such as:

  • inspect a task
  • retry eligible failed work
  • cancel long-running work where supported
  • inspect dead-lettered work
  • approve or reject sensitive inventory changes
  • inspect site-specific degradation and recent failover activity

Not every action must be available in every interface, but the supported operator workflows must be coherent and auditable.
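
The rule that actions apply only to eligible work can be sketched as a small function that derives the permitted actions from task state. This is a minimal illustration, not a defined Dataverket API: the `Task` fields, the `retryable` and `cancellable` flags, and the action names are assumptions made for the example. The task states are taken from the task view above.

```python
from dataclasses import dataclass
from enum import Enum


class TaskState(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    FAILED = "failed"
    RETRIED = "retried"
    CANCELLED = "cancelled"
    DEAD_LETTERED = "dead_lettered"


@dataclass(frozen=True)
class Task:
    task_id: str
    state: TaskState
    datacenter: str
    retryable: bool = False    # hypothetical flag: the workflow marks the failure as safe to retry
    cancellable: bool = False  # hypothetical flag: cancellation is supported for this workflow


def allowed_actions(task: Task) -> list[str]:
    """Return the control actions an operator may take on a task."""
    actions = ["inspect"]  # inspection is always available, including for dead-lettered work
    if task.state is TaskState.FAILED and task.retryable:
        actions.append("retry")
    if task.state in (TaskState.QUEUED, TaskState.RUNNING) and task.cancellable:
        actions.append("cancel")
    return actions
```

Deriving the list in one place would let the API, CLI, and any UI offer the same actions for the same task, which is one way to keep the workflows coherent across interfaces.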

Runbook model

Important operational conditions should map to explicit runbook categories rather than leaving operators to improvise from raw telemetry.

The first operator model should distinguish at least:

  • Automatic remediation: the platform may retry, reconcile, or recover without waiting for a human, while keeping the action visible and auditable.

  • Operator-approved remediation: the platform can prepare or validate a recovery action, but a human must approve execution because the risk or blast radius is too high.

  • Manual intervention required: the platform can detect and explain the condition, but a human must perform or coordinate the response.

Examples that may require operator approval or manual action include:

  • failover between datacenters
  • sensitive network topology changes
  • destructive recovery or restore operations
  • certificate or trust repair with cross-service impact
  • inventory trust conflicts that affect shared infrastructure
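
The three runbook categories imply a simple gate in front of any remediation. The sketch below shows one possible shape for that gate. The enum, the function name, and the returned step names are illustrative assumptions, not part of this ADR.

```python
from enum import Enum


class RemediationMode(Enum):
    AUTOMATIC = "automatic"
    OPERATOR_APPROVED = "operator_approved"
    MANUAL = "manual"


def next_step(mode: RemediationMode, approved: bool = False) -> str:
    """Decide what the platform does next for a detected condition."""
    if mode is RemediationMode.AUTOMATIC:
        return "execute"  # still expected to be visible and audited
    if mode is RemediationMode.OPERATOR_APPROVED:
        # The platform has prepared the action but may not run it unapproved.
        return "execute" if approved else "await_approval"
    # Manual intervention: the platform only detects and explains.
    return "hand_off_to_operator"
```

For example, a datacenter failover classified as `OPERATOR_APPROVED` would stay in `await_approval` until a human approves it.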

Runbook requirements

For important incidents and workflows, the platform should define:

  • triggering conditions
  • expected severity or escalation class
  • whether automatic remediation is allowed
  • which API or CLI actions the operator may take
  • what evidence or state must be reviewed first
  • how the outcome is recorded in audit and task history

Runbooks may be implemented as documentation, API-linked procedures, or UI guidance, but they must be tied to supported operator actions rather than free-form tribal knowledge.
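
One way to keep runbooks tied to supported actions is to store them as structured records and reject any runbook that names an action the platform does not expose. The following is a sketch under assumptions: the field names mirror the list above, while `SUPPORTED_ACTIONS` and the action names in it are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical set of operator actions backed by supported API operations.
SUPPORTED_ACTIONS = frozenset(
    {"inspect_task", "retry_task", "cancel_task", "approve_change", "reject_change"}
)


@dataclass(frozen=True)
class Runbook:
    name: str
    triggering_conditions: tuple[str, ...]
    escalation_class: str
    automatic_remediation_allowed: bool
    operator_actions: tuple[str, ...]
    evidence_to_review: tuple[str, ...]
    outcome_recording: str  # how the outcome reaches audit and task history

    def __post_init__(self) -> None:
        # Reject runbooks that depend on actions outside the supported surface.
        unknown = set(self.operator_actions) - SUPPORTED_ACTIONS
        if unknown:
            raise ValueError(
                f"runbook {self.name!r} references unsupported actions: {sorted(unknown)}"
            )
```

Validation at construction time means a runbook that relies on an unsupported step fails when it is defined rather than during an incident.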

Escalation model

The operator surface should support a simple escalation model that distinguishes at least:

  • informational conditions
  • operator-action-required conditions
  • urgent conditions affecting control-plane safety or multi-site availability

The exact on-call tooling can be chosen later, but the platform must expose enough context to route incidents and explain why escalation occurred.
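
The escalation classes can be sketched as an ordered enum with a classifier that also returns the reason, so the surface can explain why escalation occurred. The input flags and reason strings below are assumptions for illustration. A real classifier would derive them from health, task, and drift signals.

```python
from enum import IntEnum


class Escalation(IntEnum):
    INFORMATIONAL = 0
    OPERATOR_ACTION_REQUIRED = 1
    URGENT = 2


def classify(
    *,
    needs_operator_action: bool,
    affects_control_plane_safety: bool,
    affects_multi_site_availability: bool,
) -> tuple[Escalation, str]:
    """Return the escalation class together with a human-readable reason."""
    if affects_control_plane_safety:
        return Escalation.URGENT, "control-plane safety is affected"
    if affects_multi_site_availability:
        return Escalation.URGENT, "multi-site availability is affected"
    if needs_operator_action:
        return Escalation.OPERATOR_ACTION_REQUIRED, "an operator decision or action is pending"
    return Escalation.INFORMATIONAL, "no operator action is required"
```

Because `IntEnum` members are ordered, whatever on-call tooling is chosen later could route on a threshold such as `level >= Escalation.OPERATOR_ACTION_REQUIRED`.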

API and CLI relationship

The operator control surface should be built on supported APIs.

That means:

  • operator actions should map to explicit API operations
  • CLI workflows should target those APIs
  • dashboards or UI should not rely on hidden backend behavior unavailable elsewhere

This keeps operator workflows automatable and auditable.
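
As an illustration of the mapping, the sketch below keeps one table from operator action to API operation that both a CLI and a UI could consult. The paths and action names are hypothetical. This ADR does not define the actual operator API.

```python
# Hypothetical operator API routes; real paths are not specified by this ADR.
OPERATIONS: dict[str, tuple[str, str]] = {
    "inspect_task": ("GET", "/operator/v1/tasks/{task_id}"),
    "retry_task": ("POST", "/operator/v1/tasks/{task_id}/retry"),
    "cancel_task": ("POST", "/operator/v1/tasks/{task_id}/cancel"),
    "approve_change": ("POST", "/operator/v1/inventory/changes/{change_id}/approve"),
}


def build_request(action: str, **params: str) -> tuple[str, str]:
    """Resolve an operator action to the HTTP method and path that back it."""
    if action not in OPERATIONS:
        # An action with no API operation behind it is not part of the surface.
        raise KeyError(f"no API operation backs operator action {action!r}")
    method, template = OPERATIONS[action]
    return method, template.format(**params)
```

A CLI subcommand and a dashboard button for "retry" would then both resolve to the same request, for example `build_request("retry_task", task_id="t-42")`.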

Correlation requirements

The operator surface must correlate:

  • task IDs
  • correlation IDs
  • causation IDs
  • resource identifiers
  • tenant/project/environment context where relevant
  • datacenter identity

Operators should not have to manually stitch together incidents from raw logs alone.
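
To show what correlation could look like in practice, the sketch below groups events by correlation ID and walks causation IDs back to a root event. The `Event` shape is an assumption built from the identifiers listed above. Tenant, project, and environment context is left out for brevity.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    event_id: str
    correlation_id: str
    causation_id: Optional[str]  # event_id of the event that caused this one
    task_id: Optional[str]
    resource_id: Optional[str]
    datacenter: str
    message: str


def group_by_correlation(events: list[Event]) -> dict[str, list[Event]]:
    """Collect every event that belongs to the same operation."""
    groups: dict[str, list[Event]] = defaultdict(list)
    for event in events:
        groups[event.correlation_id].append(event)
    return dict(groups)


def causation_chain(events: list[Event], event_id: str) -> list[Event]:
    """Return the chain from the root cause down to the given event."""
    by_id = {event.event_id: event for event in events}
    chain: list[Event] = []
    seen: set[str] = set()
    current = by_id.get(event_id)
    while current is not None and current.event_id not in seen:
        seen.add(current.event_id)  # guards against accidental cycles
        chain.append(current)
        current = by_id.get(current.causation_id) if current.causation_id else None
    chain.reverse()
    return chain
```

With these two helpers an incident view could show the full story of one operation and, for any failure in it, the sequence of events that led there.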

Multi-datacenter requirements

Because Dataverket is multi-datacenter, the operator surface must show:

  • which datacenter a task or failure belongs to
  • whether an issue is site-local or cross-site
  • current inter-datacenter communication health
  • recent failover or recovery actions relevant to the operator

Any operator surface that collapses all sites into one undifferentiated status view is insufficient.
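
A minimal sketch of the site-local versus cross-site distinction, assuming each failure record carries the datacenter it belongs to. The function names and the `dc-a`/`dc-b` identifiers used in the usage note are hypothetical.

```python
from collections import Counter


def issue_scope(affected_datacenters: set[str]) -> str:
    """Classify an issue by how many sites it touches."""
    if not affected_datacenters:
        raise ValueError("an issue must be attributed to at least one datacenter")
    return "site-local" if len(affected_datacenters) == 1 else "cross-site"


def failures_by_site(failures: list[tuple[str, str]]) -> dict[str, int]:
    """Count failures per datacenter from (task_id, datacenter) pairs."""
    return dict(Counter(datacenter for _task_id, datacenter in failures))
```

Here `failures_by_site([("t1", "dc-a"), ("t2", "dc-a"), ("t3", "dc-b")])` keeps the two sites separate instead of reporting a single total, which is the property the requirement above asks for.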

Access and audit requirements

Operator visibility must still respect authorization boundaries.

That means:

  • operator-only infrastructure views remain separate from tenant-facing views
  • sensitive actions require proper authorization
  • operator actions are auditable
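
The pairing of authorization and audit can be sketched as a single check that records every attempt. The role names, the `REQUIRED_ROLE` table, and the decision to audit denied attempts as well as allowed ones are assumptions made for this example. The operator RBAC model itself is not defined here.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical mapping from operator action to the role that may perform it.
REQUIRED_ROLE: dict[str, str] = {
    "inspect_task": "operator",
    "retry_task": "operator",
    "approve_change": "operator-approver",
}


@dataclass(frozen=True)
class AuditEntry:
    at: datetime
    actor: str
    action: str
    allowed: bool


def authorize(actor: str, roles: set[str], action: str, audit_log: list[AuditEntry]) -> bool:
    """Check the actor's roles for an action and record the attempt."""
    required = REQUIRED_ROLE.get(action)
    # Unknown actions are denied rather than silently allowed.
    allowed = required is not None and required in roles
    audit_log.append(AuditEntry(datetime.now(timezone.utc), actor, action, allowed))
    return allowed
```

Writing the audit entry inside the authorization check means no supported action can be attempted without leaving a record.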

Explicit non-decisions for now

This ADR intentionally does not yet choose:

  • a specific UI product
  • a specific dashboard technology
  • the exact split between CLI and web-based operational tooling

Those can be decided later as long as they satisfy this operator model.

Consequences

  • observability signals must now be assembled into real operator workflows
  • task, drift, and site status become part of one operational story instead of isolated subsystems
  • later tooling choices will be constrained by the need for coherent cross-domain visibility

Decision Outcome

Proposed. This ADR records the current preferred direction and still needs acceptance before it becomes binding.

More Information

  • operator task inspection API
  • operator approval workflow API
  • dashboard or UI implementation choice
  • operator RBAC model

Audit

  • 2026-03-14: ADR proposed.