GPU Support (NVIDIA + AMD)

Joulie supports node-level GPU power-cap intents for NVIDIA and AMD GPUs.

Validation status

GPU support has been validated in simulator mode only, because no bare-metal GPU access has been available yet. The host code paths are designed to work on bare-metal NVIDIA and AMD nodes, but they remain untested there until GPU nodes are available.

Contract model

NodeTwin.spec.gpu.powerCap defines a per-GPU cap intent:

scope: perGpu
capWattsPerGpu (absolute, optional)
capPctOfMax (percentage, optional)

Precedence:

capWattsPerGpu if present
otherwise capPctOfMax

The same cap is applied uniformly to all GPUs on the node.
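The precedence rule can be sketched as follows. This is a minimal illustration, not Joulie's implementation: the function names and the max_watts_per_gpu argument (the GPU's maximum power, needed to turn a percentage into watts) are assumptions introduced here.

```python
from typing import List, Optional


def resolve_cap_watts(
    cap_watts_per_gpu: Optional[float],
    cap_pct_of_max: Optional[float],
    max_watts_per_gpu: float,  # assumed input: the GPU's maximum power in watts
) -> Optional[float]:
    """Return the effective per-GPU cap in watts, or None if no cap is set."""
    # An absolute cap takes precedence whenever it is present.
    if cap_watts_per_gpu is not None:
        return cap_watts_per_gpu
    # Otherwise fall back to a percentage of the GPU's maximum power.
    if cap_pct_of_max is not None:
        return max_watts_per_gpu * cap_pct_of_max / 100.0
    return None


def caps_for_node(gpu_count: int, cap_watts: float) -> List[float]:
    """Apply the same resolved cap uniformly to every GPU on the node."""
    return [cap_watts] * gpu_count
```

With both fields set, the absolute value wins: resolve_cap_watts(250.0, 60, 400.0) returns 250.0, while resolve_cap_watts(None, 60, 400.0) returns 240.0.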

Heterogeneous nodes

Joulie supports heterogeneous GPU fleets by expressing profiles as percentages of each GPU's maximum power, so the same profile applies across GPU models:

performance profile: capPctOfMax=100
eco profile: capPctOfMax lower than 100 (for example 60)

Optional deterministic mode (simulator-oriented): the operator can resolve percentages to absolute watts using a per-model mapping (GPU_MODEL_CAPS_JSON) and write the result to capWattsPerGpu.
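The percentage-to-watts resolution might look like the sketch below. The schema of GPU_MODEL_CAPS_JSON is not specified here, so this example assumes a flat JSON object mapping a GPU model name to its maximum power in watts. The model names, wattages, and function name are hypothetical.

```python
import json


def resolve_pct_to_watts(
    model_caps_json: str, gpu_model: str, cap_pct_of_max: float
) -> float:
    """Resolve a percentage cap to absolute watts for one GPU model.

    Assumes model_caps_json is a JSON object of the form
    {"<model name>": <max watts>}; the real schema may differ.
    """
    caps = json.loads(model_caps_json)
    if gpu_model not in caps:
        raise KeyError(f"no max-power entry for GPU model {gpu_model!r}")
    return caps[gpu_model] * cap_pct_of_max / 100.0


# Hypothetical mapping: two made-up models with different maximum power.
EXAMPLE_CAPS = '{"model-a": 400, "model-b": 300}'
```

Under this assumed mapping, an eco profile of 60 percent resolves to 240.0 W on model-a and 180.0 W on model-b, which is the value the operator would then write to capWattsPerGpu for each node.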

Agent host backends

NVIDIA: host backend uses NVIDIA tooling (power-limit set per device).
AMD: host backend uses ROCm SMI tooling (rocm-smi) where supported.

When a capability is unavailable or unsupported on a host, the agent reports the control status as blocked instead of failing the whole reconcile loop.
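The blocked-status behavior can be illustrated with a small sketch. It is not the agent's code: CapabilityUnavailable, ControlStatus, and the control names are hypothetical, and the "applied" state name is an assumption (only "blocked" comes from the text above).

```python
from dataclasses import dataclass
from typing import Callable, Dict


class CapabilityUnavailable(Exception):
    """Hypothetical error raised when a host cannot apply a control."""


@dataclass
class ControlStatus:
    state: str        # "applied" (assumed name) or "blocked"
    reason: str = ""


def apply_control(apply: Callable[[], None]) -> ControlStatus:
    """Run one control and turn a missing capability into a blocked status."""
    try:
        apply()
    except CapabilityUnavailable as exc:
        # Record why the control could not be applied; do not re-raise.
        return ControlStatus(state="blocked", reason=str(exc))
    return ControlStatus(state="applied")


def reconcile(controls: Dict[str, Callable[[], None]]) -> Dict[str, ControlStatus]:
    """Apply every control; a blocked one does not stop the others."""
    return {name: apply_control(apply) for name, apply in controls.items()}
```

The design point is that only the capability error is caught, so a node without a working GPU tool still reconciles its remaining controls, while unexpected errors surface normally.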

Simulator mode

The simulator control endpoint accepts a gpu.set_power_cap_watts action that carries capWattsPerGpu.
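A request body for this action could be built as in the sketch below. Only the action name gpu.set_power_cap_watts and the capWattsPerGpu field come from this page. The "action" envelope key and the function name are assumptions, and the endpoint URL is left out because it is not documented here.

```python
import json


def build_set_power_cap_request(cap_watts_per_gpu: float) -> str:
    """Serialize a gpu.set_power_cap_watts request body as JSON.

    The "action" key is a hypothetical envelope; check the simulator's
    actual request schema before relying on this shape.
    """
    body = {
        "action": "gpu.set_power_cap_watts",
        "capWattsPerGpu": cap_watts_per_gpu,
    }
    return json.dumps(body, sort_keys=True)
```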

Simulator telemetry includes:

gpu.present
gpu.vendor
gpu.count
gpu.powerWattsTotal
gpu.avgPowerWattsTotal
gpu.capWattsPerGpuApplied
gpu.utilization

Per-device entries also expose:

temperatureC
thermalThrottle
avgPowerWatts

This makes it easier to compare:

internal power-cap dynamics,
exported averaged telemetry,
and sustained thermal behavior during long runs.

Scheduling guidance

Workload placement intent is expressed via the joulie.io/workload-class pod annotation:

performance: must run on full-power nodes (extender hard-rejects eco nodes)
standard: default, can run on any node; adaptive scoring steers toward eco when performance nodes are congested
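The hard-reject rule for the two classes can be sketched as a filter predicate. This is an illustration under assumptions, not the extender's code: the function names and the representation of a node's profile as the string "performance" or "eco" are hypothetical, and the adaptive scoring step is not modeled.

```python
from typing import Dict

WORKLOAD_CLASS_ANNOTATION = "joulie.io/workload-class"


def workload_class(pod_annotations: Dict[str, str]) -> str:
    """Read the workload class, defaulting to "standard" when unset."""
    return pod_annotations.get(WORKLOAD_CLASS_ANNOTATION, "standard")


def node_allowed(pod_annotations: Dict[str, str], node_profile: str) -> bool:
    """Return True if the pod may be placed on a node with this profile."""
    # performance pods are hard-rejected on eco nodes.
    if workload_class(pod_annotations) == "performance":
        return node_profile != "eco"
    # standard pods may run anywhere; scoring (not shown) picks among nodes.
    return True
```

For example, a pod annotated with joulie.io/workload-class: performance is rejected for an eco node, while a pod with no annotation is treated as standard and passes the filter on either profile.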

GPU resource requests (nvidia.com/gpu, amd.com/gpu) are independent from Joulie workload classes. Joulie GPU capping is node-level and not a GPU slicing API.

Example

See:

Example 07: simulator-gpu-powercaps