Input Telemetry and Actuation Interfaces

This page describes runtime IO contracts:

  • how Joulie reads telemetry inputs,
  • how Joulie sends control intents.

If you want the CRD-level summary first, read CRD and Policy Model. This page is the detailed runtime reference for the telemetry and control contract.

It is not the /metrics exposition contract. For exported metrics, see Metrics Reference.

Why this abstraction exists

Joulie must run in two worlds with the same control logic:

  • real hardware clusters,
  • simulator/KWOK clusters.

For that reason, agent and operator logic depends on provider interfaces rather than directly on sysfs paths or the shape of the simulator's HTTP API.

Backend selection in one sentence

Environment variables tell the agent which telemetry backend to read from and which control backend to write to.

Conceptually:

  • NodeHardware defines discovered node capability
  • NodeTwin.spec defines the target
  • Agent environment variables define the wiring

Backend configuration should not carry hardware identity. Its job is only backend routing. In particular:

  • do not conflate backend selection with CPU/GPU model or inventory identity,
  • do not use it as a substitute for NodeHardware,
  • do use it to route simulator HTTP or host backends.
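The wiring step can be sketched as below. This is an illustration only: the variable names EXAMPLE_TELEMETRY_BACKEND and EXAMPLE_CONTROL_BACKEND are hypothetical placeholders (this page does not name the real variables), and defaulting to none is an assumption. The accepted backend names are the ones listed in the provider model sections on this page.

```python
import os
from dataclasses import dataclass

# Hypothetical variable names, used for illustration only.
TELEMETRY_ENV = "EXAMPLE_TELEMETRY_BACKEND"
CONTROL_ENV = "EXAMPLE_CONTROL_BACKEND"

# Backend names as listed in the provider model sections of this page.
TELEMETRY_BACKENDS = {"host", "http", "prometheus", "none"}
CONTROL_BACKENDS = {"host", "http", "none"}


@dataclass(frozen=True)
class BackendWiring:
    telemetry: str
    control: str


def load_wiring(env=None) -> BackendWiring:
    """Read backend routing from the environment; carries no hardware identity."""
    env = os.environ if env is None else env
    # Assumption: an unset variable means "none".
    telemetry = env.get(TELEMETRY_ENV, "none")
    control = env.get(CONTROL_ENV, "none")
    if telemetry not in TELEMETRY_BACKENDS:
        raise ValueError(f"unknown telemetry backend: {telemetry!r}")
    if control not in CONTROL_BACKENDS:
        raise ValueError(f"unknown control backend: {control!r}")
    return BackendWiring(telemetry=telemetry, control=control)
```

Note that the resulting object holds routing only. CPU/GPU model and inventory stay in NodeHardware.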

Telemetry provider model

Telemetry input can come from:

  • host (real node interfaces),
  • http (simulator endpoint),
  • prometheus (cluster-level inputs, future expansion),
  • none.

Quality handling should distinguish fresh, stale, and missing samples so policy can degrade safely.
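A minimal sketch of that distinction, assuming each sample carries a timestamp and that staleness is decided by a simple age threshold. The names SampleQuality, classify_sample, and max_age_s, and the 10-second default, are illustrative and not taken from Joulie.

```python
from enum import Enum
from typing import Optional


class SampleQuality(Enum):
    FRESH = "fresh"
    STALE = "stale"
    MISSING = "missing"


def classify_sample(sample_ts: Optional[float], now: float,
                    max_age_s: float = 10.0) -> SampleQuality:
    """Classify a telemetry sample by age (timestamps in seconds).

    Assumption: a sample older than max_age_s counts as stale.
    """
    if sample_ts is None:
        return SampleQuality.MISSING
    if now - sample_ts > max_age_s:
        return SampleQuality.STALE
    return SampleQuality.FRESH
```

A policy could then, for example, keep acting on fresh samples and fall back to a conservative default on stale or missing ones.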

Control provider model

Control outputs are routed by backend:

  • host: write real interfaces (RAPL/cpufreq),
  • http: send intents to simulator,
  • none.

Result semantics:

  • applied
  • blocked
  • error
  • none (no intent selected for that control path)

These results make it possible to audit desired versus applied behavior for each control path.
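As a rough sketch of such an audit, the helper below compares the intent selected per control path with the reported result. The path keys (rapl, dvfs, gpu) and the function name are hypothetical; only the four result values come from this page.

```python
from typing import Dict, List, Optional

VALID_RESULTS = {"applied", "blocked", "error", "none"}


def find_mismatches(desired: Dict[str, Optional[str]],
                    results: Dict[str, str]) -> List[str]:
    """Return one line per control path whose result does not match the intent.

    desired maps a control path to the selected intent action, or None
    when no intent was selected for that path.
    """
    mismatches = []
    for path, intent in sorted(desired.items()):
        result = results.get(path, "none")
        if result not in VALID_RESULTS:
            raise ValueError(f"unknown result {result!r} for {path}")
        # An intent should end up "applied"; no intent should report "none".
        expected = "applied" if intent is not None else "none"
        if result != expected:
            mismatches.append(
                f"{path}: wanted {intent or 'no intent'}, got {result}")
    return mismatches
```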

Current CPU control intents

Supported intent actions:

  • rapl.set_power_cap_watts
  • dvfs.set_throttle_pct

DVFS intent is normalized as throttlePct (0..100) to stay portable across heterogeneous CPUs. Backend-specific frequency writes remain implementation details.
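To illustrate why a percentage is portable, here is one possible backend-side translation. The linear mapping between a CPU's minimum and maximum frequency is an assumption made for this sketch; the mapping actually used by any backend is an implementation detail that this page does not specify.

```python
def throttle_to_max_freq_khz(throttle_pct: float,
                             min_khz: int, max_khz: int) -> int:
    """Map throttlePct (0..100) to a frequency ceiling in kHz.

    Assumption: 0 means no throttling (max_khz), 100 means the lowest
    ceiling (min_khz), with linear interpolation in between.
    """
    if not 0 <= throttle_pct <= 100:
        raise ValueError("throttle_pct must be within 0..100")
    if min_khz > max_khz:
        raise ValueError("min_khz must not exceed max_khz")
    span = max_khz - min_khz
    return int(round(max_khz - span * throttle_pct / 100))
```

The same throttlePct of 20 yields different ceilings on CPUs with different frequency ranges, which is what keeps the intent itself hardware-neutral.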

GPU control intent

Supported GPU intent action:

  • gpu.set_power_cap_watts

Payload:

  • capWattsPerGpu

Semantics:

  • node-level intent is translated to per-device enforcement,
  • same cap is applied to all GPUs on the node,
  • result is reported as applied|blocked|error in NodeTwin.status.controlStatus.gpu.
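The fan-out from a node-level intent to per-device enforcement might look like the sketch below. The set_cap callback stands in for a vendor-specific backend and is hypothetical, as is the rule for folding per-device outcomes into a single node result (error wins over blocked, blocked wins over applied, and an empty device list is treated as blocked).

```python
from typing import Callable, List


def apply_gpu_cap(cap_watts_per_gpu: float,
                  device_ids: List[str],
                  set_cap: Callable[[str, float], str]) -> str:
    """Apply the same cap to every GPU and return applied|blocked|error.

    set_cap(device_id, watts) is a hypothetical per-device backend call
    that returns "applied", "blocked", or "error".
    """
    if not device_ids:
        # Assumption: nothing to enforce on is reported as blocked.
        return "blocked"
    results = [set_cap(dev, cap_watts_per_gpu) for dev in device_ids]
    if "error" in results:
        return "error"
    if "blocked" in results:
        return "blocked"
    return "applied"
```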

Current HTTP contracts

Telemetry endpoint:

  • GET /telemetry/{node}

Accepted minimal payloads include:

{ "packagePowerWatts": 245.3 }

or

{ "cpu": { "packagePowerWatts": 245.3 } }

The simulator exports a richer payload than this minimal contract. Important optional fields currently include:

  • top level:
    • packagePowerWatts
    • instantPackagePowerWatts
  • cpu.*:
    • packagePowerWatts
    • instantPowerWatts
    • utilization
    • memoryIntensity
    • ioIntensity
    • freqScale
    • temperatureC
    • thermalThrottlePct
  • gpu.*:
    • powerWattsTotal
    • avgPowerWattsTotal
    • utilization
    • memoryIntensity
    • cpuFeedIntensity
    • devices

Controllers should treat these richer fields as additive observability, not as a replacement for the minimal contract above.
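A consumer that honors only the minimal contract could be sketched as follows. It accepts both payload shapes shown above and ignores the richer optional fields. Preferring the nested cpu value when both shapes are present is an assumption of this sketch, and the function name is illustrative.

```python
import json
from typing import Optional


def package_power_watts(body: str) -> Optional[float]:
    """Extract packagePowerWatts from either minimal payload shape.

    Returns None when the field is absent, so the caller can treat the
    sample as missing instead of guessing a value.
    """
    payload = json.loads(body)
    if not isinstance(payload, dict):
        return None
    cpu = payload.get("cpu")
    # Assumption: the nested cpu.* value wins if both shapes are present.
    if isinstance(cpu, dict) and "packagePowerWatts" in cpu:
        return float(cpu["packagePowerWatts"])
    if "packagePowerWatts" in payload:
        return float(payload["packagePowerWatts"])
    return None
```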

Control endpoint:

  • POST /control/{node}

Example request:

{
  "node": "worker-01",
  "action": "dvfs.set_throttle_pct",
  "throttlePct": 20,
  "ts": "2026-02-27T00:00:00Z"
}

Expected response includes:

  • result (applied|blocked|error)
  • message
  • state (best-effort post-state)
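The request and response shapes above can be exercised with a small helper pair like the one below. This sketch only builds and parses JSON bodies; it does not perform the HTTP call. The function names are illustrative, and treating message and state as optional in the response is an assumption.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict, Optional, Tuple

RESPONSE_RESULTS = {"applied", "blocked", "error"}


def build_dvfs_request(node: str, throttle_pct: int,
                       now: Optional[datetime] = None) -> Dict[str, Any]:
    """Build the body for POST /control/{node} with a DVFS throttle intent."""
    if not 0 <= throttle_pct <= 100:
        raise ValueError("throttle_pct must be within 0..100")
    now = datetime.now(timezone.utc) if now is None else now.astimezone(timezone.utc)
    return {
        "node": node,
        "action": "dvfs.set_throttle_pct",
        "throttlePct": throttle_pct,
        "ts": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }


def parse_control_response(body: str) -> Tuple[str, str, Any]:
    """Return (result, message, state) from a control response body."""
    data = json.loads(body)
    result = data.get("result")
    if result not in RESPONSE_RESULTS:
        raise ValueError(f"unexpected result: {result!r}")
    # Assumption: message and state may be absent from a response.
    return result, data.get("message", ""), data.get("state")
```

The dictionary returned by build_dvfs_request can be serialized with json.dumps and sent by whatever HTTP client the caller prefers.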

Current host contracts

Agent host mode uses Linux interfaces:

  • RAPL energy/power cap files
  • cpufreq files for observed and enforced frequency bounds

GPU host control backends:

  • NVIDIA path: NVIDIA tooling/NVML-compatible power-limit controls
  • AMD path: ROCm SMI power-limit controls where supported

If a GPU backend is unavailable on a node, the result is reported as blocked rather than as a silent success.

Current deployment convention mounts host /sys into container /host-sys.
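Under that convention, a host sysfs path has to be rewritten before the agent opens it from inside the container. A minimal sketch of the translation is shown below. The helper name is hypothetical, and the example paths in the test notes are standard Linux powercap and cpufreq locations, not a statement of which files Joulie actually touches.

```python
from pathlib import PurePosixPath

# Mount convention described above: host /sys appears at /host-sys.
HOST_SYS_ROOT = PurePosixPath("/sys")
CONTAINER_SYS_ROOT = PurePosixPath("/host-sys")


def container_path(host_path: str) -> PurePosixPath:
    """Translate a host /sys path to its location inside the container.

    Raises ValueError if host_path is not under /sys.
    """
    relative = PurePosixPath(host_path).relative_to(HOST_SYS_ROOT)
    return CONTAINER_SYS_ROOT / relative
```

For example, /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq on the host would be read at /host-sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq in the container.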

CRD integration

Current runtime responsibilities:

  • agent publishes NodeHardware for discovered hardware/capability state,
  • operator writes NodeTwin.spec targets,
  • agent reads NodeTwin.spec for desired state,
  • agent selects telemetry/control backends via environment variables,
  • agent writes control status under NodeTwin.status.controlStatus.

Today, the stable documented status contract is NodeTwin.status.controlStatus. If additional diagnostic snapshots are present in some environments, treat them as auxiliary rather than as part of the core API contract.

Next step

To see simulator-side algorithm details using these interfaces, read: