GPU Support (NVIDIA + AMD)
Joulie supports node-level GPU power-cap intents for NVIDIA and AMD.
Validation status
GPU support has been validated in simulator mode only (no bare-metal GPU access yet). The host code paths are designed to work on bare metal (NVIDIA + AMD) when GPU nodes are available.
Contract model
NodePowerProfile.spec.gpu.powerCap defines a per-GPU cap intent:
- scope: perGpu
- capWattsPerGpu (absolute, optional)
- capPctOfMax (percentage, optional)
Precedence:
- capWattsPerGpu if present
- otherwise capPctOfMax
The same cap is applied uniformly to all GPUs on the node.
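The precedence rule can be sketched as a small resolver. This is an illustrative sketch, not Joulie's implementation: the function name resolve_cap_watts and the max_watts parameter are assumptions, and only the field names capWattsPerGpu and capPctOfMax come from the contract above.

```python
from typing import Optional


def resolve_cap_watts(
    max_watts: float,
    cap_watts_per_gpu: Optional[float] = None,
    cap_pct_of_max: Optional[float] = None,
) -> Optional[float]:
    """Return the per-GPU cap in watts, or None when no cap intent is set."""
    # An absolute cap (capWattsPerGpu) wins whenever it is present.
    if cap_watts_per_gpu is not None:
        return float(cap_watts_per_gpu)
    # Otherwise fall back to the percentage of the device maximum (capPctOfMax).
    if cap_pct_of_max is not None:
        return max_watts * cap_pct_of_max / 100.0
    return None


# The same resolved cap is applied uniformly to every GPU on the node.
# 400 W is a placeholder maximum, not a value from the Joulie docs.
caps = [resolve_cap_watts(400.0, cap_pct_of_max=60) for _ in range(4)]
print(caps)
```

When both fields are set, the percentage is ignored, which keeps the outcome independent of the GPU model.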
Heterogeneous nodes
Joulie supports heterogeneous GPU fleets by profile percentages:
- performance profile: capPctOfMax=100
- eco profile: capPctOfMax lower than 100 (for example 60)
Optional deterministic mode (simulator-oriented): the operator can resolve percentages to absolute watts using a model mapping (GPU_MODEL_CAPS_JSON) and write capWattsPerGpu.
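A minimal sketch of that resolution step follows. The schema of GPU_MODEL_CAPS_JSON is not documented here, so the flat "model name to max watts" object, the model names, and the wattages below are all assumptions made for illustration.

```python
import json

# Hypothetical value for GPU_MODEL_CAPS_JSON: model name -> max watts.
# The real schema may differ; names and numbers are placeholders.
EXAMPLE_MODEL_CAPS_JSON = '{"example-gpu-a": 400, "example-gpu-b": 300}'


def pct_to_watts(model_caps_json: str, gpu_model: str, cap_pct_of_max: float) -> float:
    """Resolve a percentage intent to absolute watts for a known GPU model."""
    model_caps = json.loads(model_caps_json)
    if gpu_model not in model_caps:
        raise KeyError(f"no max-watts entry for GPU model {gpu_model!r}")
    return float(model_caps[gpu_model]) * cap_pct_of_max / 100.0


# The same eco percentage resolves to different absolute caps per model.
for model in ("example-gpu-a", "example-gpu-b"):
    print(model, pct_to_watts(EXAMPLE_MODEL_CAPS_JSON, model, 60))
```

The resolved value would then be written as capWattsPerGpu, which takes precedence over capPctOfMax.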
Agent host backends
- NVIDIA: host backend uses NVIDIA tooling (power-limit set per device).
- AMD: host backend uses ROCm SMI tooling (rocm-smi) where supported.
When capabilities are unavailable or unsupported, the agent reports the status as blocked instead of failing the whole reconcile loop.
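The "blocked, not failed" behavior could look roughly like the sketch below. Every name in it (CapResult, apply_gpu_cap, the "applied" status string, the use of NotImplementedError) is hypothetical; only the "blocked" status comes from the text above.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CapResult:
    status: str        # "applied" or "blocked" ("applied" is an assumed label)
    reason: str = ""


def apply_gpu_cap(backend: Optional[Callable[[float], None]], cap_watts: float) -> CapResult:
    """Try to apply a cap; report 'blocked' instead of raising."""
    if backend is None:
        return CapResult("blocked", "no supported GPU backend on this host")
    try:
        backend(cap_watts)
    except NotImplementedError as exc:
        # The capability is missing: record it and let the loop continue.
        return CapResult("blocked", str(exc))
    return CapResult("applied")


def unsupported_backend(cap_watts: float) -> None:
    raise NotImplementedError("power-limit control not supported on this device")


applied = []  # stands in for a working vendor backend
results = {
    "working": apply_gpu_cap(applied.append, 240.0),
    "unsupported": apply_gpu_cap(unsupported_backend, 240.0),
    "missing": apply_gpu_cap(None, 240.0),
}
for name, result in results.items():
    print(name, result.status, result.reason)
```

Returning a status value rather than raising lets the rest of the reconcile pass proceed even when GPU control is unavailable.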
Simulator mode
The simulator control endpoint accepts gpu.set_power_cap_watts with capWattsPerGpu.
Simulator telemetry includes:
- gpu.present
- gpu.vendor
- gpu.count
- gpu.powerWattsTotal
- gpu.avgPowerWattsTotal
- gpu.capWattsPerGpuApplied
- gpu.utilization
Per-device entries also expose:
- temperatureC
- thermalThrottle
- avgPowerWatts
This makes it easier to compare:
- internal power-cap dynamics,
- exported averaged telemetry,
- and sustained thermal behavior during long runs.
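One way to make such a comparison concrete is sketched below. The telemetry payload shape is not specified here, so the flat dictionary keyed by the dotted field names, the list of per-device entries, and all sample values are assumptions.

```python
def cap_headroom_watts(sample: dict) -> float:
    """Node-level cap budget minus the averaged total GPU power."""
    budget = sample["gpu.capWattsPerGpuApplied"] * sample["gpu.count"]
    return budget - sample["gpu.avgPowerWattsTotal"]


def throttled_devices(devices: list) -> list:
    """Indices of per-device entries that report thermal throttling."""
    return [i for i, dev in enumerate(devices) if dev.get("thermalThrottle")]


# Placeholder sample; the values are invented for illustration.
sample = {
    "gpu.count": 4,
    "gpu.capWattsPerGpuApplied": 240.0,
    "gpu.avgPowerWattsTotal": 900.0,
}
devices = [
    {"temperatureC": 71.0, "thermalThrottle": False, "avgPowerWatts": 230.0},
    {"temperatureC": 88.0, "thermalThrottle": True, "avgPowerWatts": 205.0},
]

print(cap_headroom_watts(sample))
print(throttled_devices(devices))
```

A small or negative headroom alongside throttled devices would suggest that the averaged power is tracking the cap while thermals limit individual GPUs.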
Scheduling guidance
Keep workload intent guidance unchanged:
- performance-sensitive pods: prefer NotIn ["eco"]
- eco-only (advanced): In ["eco"] and optionally joulie.io/draining NotIn ["true"]
GPU resource requests (nvidia.com/gpu, amd.com/gpu) are orthogonal to Joulie power-profile labels.
Joulie GPU capping is node-level and not a GPU slicing API.
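The affinity guidance above can be sketched as Kubernetes nodeAffinity fragments built in Python. The profile label key joulie.io/power-profile is an assumption (this page does not name it), and reading "prefer" as a soft preferredDuringSchedulingIgnoredDuringExecution rule is also an interpretation. The joulie.io/draining key is taken from the guidance itself.

```python
import json

# Assumed label key for this sketch; this page does not name it.
PROFILE_LABEL = "joulie.io/power-profile"


def performance_affinity(weight: int = 100) -> dict:
    """Soft preference for nodes that are not labeled eco."""
    expr = {"key": PROFILE_LABEL, "operator": "NotIn", "values": ["eco"]}
    return {"nodeAffinity": {"preferredDuringSchedulingIgnoredDuringExecution": [
        {"weight": weight, "preference": {"matchExpressions": [expr]}},
    ]}}


def eco_only_affinity(avoid_draining: bool = True) -> dict:
    """Hard requirement for eco nodes, optionally skipping draining nodes."""
    exprs = [{"key": PROFILE_LABEL, "operator": "In", "values": ["eco"]}]
    if avoid_draining:
        exprs.append(
            {"key": "joulie.io/draining", "operator": "NotIn", "values": ["true"]}
        )
    return {"nodeAffinity": {"requiredDuringSchedulingIgnoredDuringExecution": {
        "nodeSelectorTerms": [{"matchExpressions": exprs}],
    }}}


# Either fragment would sit under spec.affinity in a pod template.
print(json.dumps(eco_only_affinity(), indent=2))
```

GPU resource requests such as nvidia.com/gpu stay in the container's resources section and are unaffected by these affinity rules.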
Example
See: