Input Telemetry and Actuation Interfaces
This page describes runtime IO contracts:
- how Joulie reads telemetry inputs,
- how Joulie sends control intents.
If you want the CRD-level summary first, read CRD and Policy Model. This page is the detailed runtime reference for the telemetry and control contract.
It is not the /metrics exposition contract.
For exported metrics, see Metrics Reference.
Why this abstraction exists
Joulie must run in two worlds with the same control logic:
- real hardware clusters,
- simulator/KWOK clusters.
So agent/operator logic depends on provider interfaces, not directly on sysfs or simulator HTTP shape.
Backend selection in one sentence
Environment variables tell the agent which telemetry backend to read from and which control backend to write to.
Conceptually:
NodeHardwaredefines discovered node capabilityNodeTwin.specdefines the target- Agent environment variables define the wiring
Backend configuration should not carry hardware identity. Its job is only backend routing. In particular:
- do not conflate backend selection with CPU/GPU model or inventory identity,
- do not use it as a substitute for
NodeHardware, - do use it to route simulator HTTP or host backends.
Telemetry provider model
Telemetry input can come from:
host(real node interfaces),http(simulator endpoint),prometheus(cluster-level inputs, future expansion),none.
Quality handling should distinguish fresh, stale, and missing samples so policy can degrade safely.
Control provider model
Control outputs are routed by backend:
host: write real interfaces (RAPL/cpufreq),http: send intents to simulator,none.
Result semantics:
appliedblockederrornone(no intent selected for that control path)
This allows desired vs applied behavior auditing.
Current CPU control intents
Supported intent actions:
rapl.set_power_cap_wattsdvfs.set_throttle_pct
DVFS intent is normalized as throttlePct (0..100) to stay portable across heterogeneous CPUs.
Backend-specific frequency writes remain implementation details.
GPU control intent
Supported GPU intent action:
gpu.set_power_cap_watts
Payload:
capWattsPerGpu
Semantics:
- node-level intent is translated to per-device enforcement,
- same cap is applied to all GPUs on the node,
- result is reported as
applied|blocked|errorinNodeTwin.status.controlStatus.gpu.
Current HTTP contracts
Telemetry endpoint:
GET /telemetry/{node}
Accepted minimal payloads include:
{ "packagePowerWatts": 245.3 }
or
{ "cpu": { "packagePowerWatts": 245.3 } }
The simulator exports a richer payload than this minimal contract. Important optional fields currently include:
- top level:
packagePowerWattsinstantPackagePowerWatts
cpu.*:packagePowerWattsinstantPowerWattsutilizationmemoryIntensityioIntensityfreqScaletemperatureCthermalThrottlePct
gpu.*:powerWattsTotalavgPowerWattsTotalutilizationmemoryIntensitycpuFeedIntensitydevices
Controllers should treat these richer fields as additive observability, not as a replacement for the minimal contract above.
Control endpoint:
POST /control/{node}
Example request:
{
"node": "worker-01",
"action": "dvfs.set_throttle_pct",
"throttlePct": 20,
"ts": "2026-02-27T00:00:00Z"
}
Expected response includes:
result(applied|blocked|error)messagestate(best-effort post-state)
Current host contracts
Agent host mode uses Linux interfaces:
- RAPL energy/power cap files
- cpufreq files for observed and enforced frequency bounds
GPU host control backends:
- NVIDIA path: NVIDIA tooling/NVML-compatible power-limit controls
- AMD path: ROCm SMI power-limit controls where supported
If a GPU backend is unavailable on a node, result is blocked (not silent success).
Current deployment convention mounts host /sys into container /host-sys.
CRD integration
Current runtime responsibilities:
- agent publishes
NodeHardwarefor discovered hardware/capability state, - operator writes
NodeTwin.spectargets, - agent reads
NodeTwin.specfor desired state, - agent selects telemetry/control backends via environment variables,
- agent writes control status under
NodeTwin.status.controlStatus.
Today, the stable documented status contract is NodeTwin.status.controlStatus.
If additional diagnostic snapshots are present in some environments, treat them as auxiliary rather than core API contract.
Next step
To see simulator-side algorithm details using these interfaces, read: