Heterogeneous GPU Cluster Benchmark

This page reports results from the heterogeneous GPU cluster benchmark experiment.

Scope

The benchmark compares three baselines on a heterogeneous cluster mixing 5 distinct GPU hardware families plus CPU-only nodes:

  • A: simulator only (Joulie-free)
  • B: Joulie with static partition policy
  • C: Joulie with queue-aware policy

Experimental setup

Cluster and nodes

  • kind control-plane + worker (real control plane)
  • 41 managed KWOK nodes: 33 GPU nodes + 8 CPU-only nodes
  • Workload pods target KWOK nodes via selector + toleration

Node inventory - detailed cluster composition

This is a heterogeneous GPU cluster mixing 5 distinct GPU hardware families across 33 GPU nodes, plus 8 CPU-only nodes.

GPU nodes (33 total, 188 GPUs)

| Node prefix | Count | GPU model | GPUs/node | GPU cap range | Host CPU | CPU cores/node |
| --- | --- | --- | --- | --- | --- | --- |
| kwok-h100-nvl | 12 | NVIDIA H100 NVL | 8 | 200–400 W | AMD EPYC 9654 96-Core | 192 |
| kwok-h100-sxm | 6 | NVIDIA H100 80GB HBM3 | 4 | 350–700 W | Intel Xeon Gold 6530 | 64 |
| kwok-l40s | 7 | NVIDIA L40S | 4 | 200–350 W | AMD EPYC 9534 64-Core | 128 |
| kwok-mi300x | 2 | AMD Instinct MI300X | 8 | 350–750 W | AMD EPYC 9534 64-Core | 128 |
| kwok-w7900 | 6 | AMD Radeon PRO W7900 | 4 | 200–295 W | AMD EPYC 9534 64-Core | 128 |

GPU count summary: 96 + 24 + 28 + 16 + 24 = 188 GPUs total across NVIDIA and AMD families.

CPU-only nodes (8 total)

| Node prefix | Count | CPU model | CPU cores/node | RAM/node |
| --- | --- | --- | --- | --- |
| kwok-cpu-highcore | 2 | AMD EPYC 9965 192-Core | 384 (2×192) | 1536 GiB |
| kwok-cpu-highfreq | 2 | AMD EPYC 9375F 32-Core | 64 (2×32) | 770 GiB |
| kwok-cpu-intensive | 4 | AMD EPYC 9655 96-Core | 192 (2×96) | 1536 GiB |

Total: 41 nodes, 188 GPUs (5 families), ~6,270 CPU cores.
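The totals can be cross-checked directly from the two node tables above:

```python
# Cross-check the inventory totals from the two node tables.
gpu_nodes = [  # (node_count, gpus_per_node, cpu_cores_per_node)
    (12, 8, 192),  # kwok-h100-nvl
    (6,  4, 64),   # kwok-h100-sxm
    (7,  4, 128),  # kwok-l40s
    (2,  8, 128),  # kwok-mi300x
    (6,  4, 128),  # kwok-w7900
]
cpu_nodes = [(2, 384), (2, 64), (4, 192)]  # (node_count, cores_per_node)

total_gpus = sum(n * g for n, g, _ in gpu_nodes)
total_cores = sum(n * c for n, _, c in gpu_nodes) + sum(n * c for n, c in cpu_nodes)
print(total_gpus, total_cores)  # 188 GPUs, 6272 cores
```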

Hardware models in simulator

GPU power per device at load fraction g:

P_gpu(g) = IdleW + (PeakW - IdleW) * g^computeGamma

Per-GPU-family physics parameters:

| GPU family | IdleW (W) | PeakW (W) | computeGamma | GPU cap range |
| --- | --- | --- | --- | --- |
| NVIDIA H100 NVL | 80 | 400 | 1.50 | 200–400 W |
| NVIDIA H100 80GB HBM3 | 120 | 700 | 1.50 | 350–700 W |
| NVIDIA L40S | 60 | 350 | 1.40 | 200–350 W |
| AMD Instinct MI300X | 100 | 750 | 0.85 | 350–750 W |
| AMD Radeon PRO W7900 | 40 | 295 | 1.20 | 200–295 W |

computeGamma controls cap sensitivity: higher gamma = more throughput retained under capping. At 80% GPU cap: H100 NVL loses ~13.5%, MI300X loses ~22.7% throughput.
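As a minimal sketch, the power curve and the cap-sensitivity figures can be reproduced in a few lines. The throughput mapping used here (retained throughput ≈ capFraction^(1/computeGamma)) is an assumption that approximately matches the quoted losses, not necessarily the simulator's exact formula:

```python
# Sketch of the per-device power model plus an ASSUMED cap-sensitivity mapping.
def p_gpu(g, idle_w, peak_w, gamma):
    """Power draw at load fraction g, per the formula above."""
    return idle_w + (peak_w - idle_w) * g ** gamma

def retained_throughput(cap_frac, gamma):
    """ASSUMPTION: throughput retained under a power cap ~= cap_frac^(1/gamma)."""
    return cap_frac ** (1 / gamma)

# H100 NVL at full load: 80 + (400 - 80) * 1.0 = 400 W
print(p_gpu(1.0, 80, 400, 1.50))             # 400.0
# At an 80% cap: H100 NVL (gamma 1.50) vs MI300X (gamma 0.85)
print(1 - retained_throughput(0.8, 1.50))     # ~0.138 loss (doc quotes ~13.5%)
print(1 - retained_throughput(0.8, 0.85))     # ~0.231 loss (doc quotes ~22.7%)
```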

Full power-model details: Power Simulator

Run configuration

  • Seeds: 3
  • Mean inter-arrival: 0.12 s
  • Time scale: 60×
  • Timeout: 14400 s
  • Perf ratio: 15%, eco ratio: 0%, GPU ratio: 45%
  • Workload types: debug_eval, single_gpu_training, cpu_preprocess, cpu_analytics
    • Note: distributed_training and parameter_server_training were present in the archived run but removed from future benchmarks (require a gang scheduler to avoid deadlock)
  • Policy caps: CPU eco at 80%, GPU eco at 80% of peak

Algorithms used

Controller policies

  • static_partition:
    • hpCount = round(N * 0.45) → ~18 performance nodes, ~23 eco nodes
  • queue_aware_v1:
    • baseCount = round(N * 0.50), dynamic adjustment from live perf-pod count
    • hpCount = clamp(max(baseCount, queueNeed), 2, 10, N)
  • Downgrade guard: performance → eco deferred while performance-sensitive pods run on node
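The static_partition sizing works out as follows (a sketch; `hp_count_static` is an illustrative name, not the controller's actual function):

```python
# static_partition: a fixed fraction of managed nodes get the performance profile.
def hp_count_static(n_nodes, hp_fraction=0.45):
    return round(n_nodes * hp_fraction)

n = 41  # managed KWOK nodes in this cluster
hp = hp_count_static(n)
print(hp, n - hp)  # 18 performance nodes, 23 eco nodes
```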

Results summary

Per-seed results

| Baseline | Seed | Wall (s) | Throughput (jobs/sim-hr) | Energy (kWh sim) | Avg power (W) | Status |
| --- | --- | --- | --- | --- | --- | --- |
| A | 1 | 14522 | 11.24 | – | – | INCOMPLETE (gang deadlock) |
| A | 2 | 1847.4 | 90.39 | 1639.85 | 53260 | completed |
| A | 3 | 2149.7 | 76.81 | 2132.43 | 59518 | completed |
| B | 1 | 14521 | 11.24 | – | – | INCOMPLETE (gang deadlock) |
| B | 2 | 1978.6 | 84.39 | 1758.23 | 53316 | completed |
| B | 3 | 2148.3 | 76.86 | 2081.34 | 58129 | completed |
| C | 1 | 2040.9 | 80.00 | 1874.63 | 55113 | completed |
| C | 2 | 1980.5 | 84.31 | 1754.29 | 53147 | completed |
| C | 3 | 2031.5 | 81.28 | 1967.31 | 58105 | completed |

Baseline means (seeds 2+3 fair comparison)

| Baseline | Mean wall (s) | Mean throughput (jobs/sim-hr) | Mean energy (kWh sim) | Mean cluster power (W) |
| --- | --- | --- | --- | --- |
| A | 1998.5 | 83.60 | 1886.1 | 56389 |
| B | 2063.5 | 80.63 | 1919.8 | 55723 |
| C | 2017.6 | 81.86 | 1865.4 | 55455 |

Note: the A and B rows average seeds 2 and 3 only (seed 1 deadlocked); the C row averages all three completed seeds. The deltas below use C's seeds 2 and 3 so the comparison stays like-for-like.

Relative to A (seeds 2+3):

  • B: energy +1.8%, throughput −3.6%
  • C: energy −1.3%, throughput −1.0%
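The deltas above can be reproduced from the per-seed table (seeds 2 and 3 for every baseline, so the comparison is like-for-like):

```python
# Recompute the "relative to A" deltas from the per-seed results (seeds 2 and 3).
energy = {"A": [1639.85, 2132.43], "B": [1758.23, 2081.34], "C": [1754.29, 1967.31]}
thr = {"A": [90.39, 76.81], "B": [84.39, 76.86], "C": [84.31, 81.28]}

def mean(xs):
    return sum(xs) / len(xs)

def delta_pct(metric, baseline):
    """Percentage change of a baseline's mean relative to baseline A."""
    return 100 * (mean(metric[baseline]) / mean(metric["A"]) - 1)

for b in ("B", "C"):
    print(f"{b}: energy {delta_pct(energy, b):+.1f}%, "
          f"throughput {delta_pct(thr, b):+.1f}%")
# B: energy +1.8%, throughput -3.6%
# C: energy -1.3%, throughput -1.0%
```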

Plot commentary

Runtime distribution

Runtime Distribution by Baseline
  • Seeds 2 and 3 show near-overlapping distributions; seed 1 deadlocks excluded.

Energy vs makespan

Energy vs Makespan
  • Energy differences are small relative to the large absolute values (~1900 kWh sim), which are dominated by GPU idle power.
  • C has lower variance (all 3 seeds completed).

Baseline means

Baseline Means
  • Throughput and wall-time bars show modest inter-baseline differences.
  • Energy bars are nearly flat: B slightly above A, C marginally below.

Relative tradeoff vs A

Relative Tradeoff vs A
  • Per-seed scatter of energy delta vs throughput delta relative to A.
  • C seeds cluster near the origin; B seeds show energy increase with throughput loss.

Relative tradeoff bars vs A

Relative Tradeoff Bars vs A
  • Mean energy and throughput deltas: B at +1.8% energy / -3.6% throughput, C at -1.3% / -1.0%.

Hardware family tradeoff vs A

Hardware Family Tradeoff vs A
  • H100 NVL (the dominant energy contributor) does not see the expected energy reduction: CPU throttling on GPU nodes extends job duration.
  • MI300X is more sensitive to capping (lower computeGamma).

Hardware family rankings - baseline B

Hardware Family Rankings Baseline B
  • Per-family energy and throughput under B policy relative to A.
  • MI300X shows the largest percentage throughput loss due to its lower computeGamma.

Hardware family rankings - baseline C

Hardware Family Rankings Baseline C
  • C shows better outcomes than B across most families, especially H100 NVL.

Workload type tradeoff vs A

Workload Type Tradeoff vs A
  • GPU-heavy jobs are most affected; CPU-only jobs on CPU-only nodes show negligible impact.

Workload type rankings - baseline B

Workload Type Rankings Baseline B
  • single_gpu_training shows the highest slowdown: CPU throttling on GPU nodes limits data-feed throughput.

Completion summary

Completion Summary
  • C achieves 100% completion; A and B each have 1 failed seed from gang deadlock.

Gang scheduling deadlock (seed 1)

Baselines A and B both timed out in seed 1 with 1000+ pods permanently stuck. Root cause: multi-pod jobs without a gang scheduler create a circular partial allocation: each job holds some nodes partially occupied while waiting for pods that can never land. Baseline C avoided the deadlock via incidental pod evictions from operator reconcile cycles.
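A toy illustration of the circular partial allocation (not the simulator's scheduler, just the failure shape): two 2-pod gang jobs interleave on a 2-node cluster, each lands one pod, and neither job can ever complete.

```python
# Two jobs, each needing 2 pods, arrive pod-by-pod on a 2-node cluster.
nodes = [None, None]  # which job holds each node

# Interleaved arrival order: A's first pod, B's first pod, then the second pods.
for job, _pod in [("A", 0), ("B", 0), ("A", 1), ("B", 1)]:
    free = [i for i, holder in enumerate(nodes) if holder is None]
    if free:
        nodes[free[0]] = job  # greedy placement with no gang awareness

print(nodes)  # ['A', 'B'] -> each job holds one node and waits forever
```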

Multi-pod job types have been removed from all future benchmarks.

Interpretation

Why does Joulie not save energy on GPU clusters?

  1. GPU idle power dominates (~75–85% of total): H100 NVL alone consumes 80 W/GPU × 96 GPUs = 7680 W idle floor. Any job duration extension accumulates proportionally more idle energy.

  2. CPU cap slows GPU jobs: Joulie’s eco profile applies CPU frequency throttling to GPU nodes. The throttled CPU cannot feed the GPU fast enough (cpuFeedFactor mechanism), reducing GPU effective speed and extending job duration. This outweighs CPU power savings.

  3. Wrong control axis: The energy-efficient lever on GPU nodes is GPU power cap, not CPU frequency reduction.
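The idle floor behind point 1 follows directly from the physics-parameter table: with all 188 GPUs idle, the cluster still draws ~14.8 kW, 7680 W of it from the H100 NVL pool alone.

```python
# GPU idle-power floor per family: node_count x gpus_per_node x IdleW.
families = {  # family: (node_count, gpus_per_node, idle_w)
    "H100 NVL": (12, 8, 80),
    "H100 SXM": (6, 4, 120),
    "L40S":     (7, 4, 60),
    "MI300X":   (2, 8, 100),
    "W7900":    (6, 4, 40),
}
idle_floor = {f: n * g * w for f, (n, g, w) in families.items()}
print(idle_floor["H100 NVL"], sum(idle_floor.values()))  # 7680 W, 14800 W total
```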

Queue-aware (C) partially mitigates this by reducing eco-node count during GPU-heavy phases, keeping more nodes uncapped.

Best-fit use case

  • queue_aware_v1 achieves a marginal −1.3% energy saving on heterogeneous GPU clusters.
  • static_partition increases energy by +1.8% due to indiscriminate CPU throttling on GPU nodes.
  • Future work: workload-type-aware policy - apply CPU caps only on CPU-only nodes, GPU caps selectively on GPU nodes.
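The proposed workload-type-aware rule could be as simple as a per-node dispatch. This is a hypothetical sketch (names and structure are illustrative; the 0.8 fractions mirror the current eco caps):

```python
def eco_action(node):
    """Hypothetical workload-type-aware eco policy:
    cap the axis that actually saves energy on each node type."""
    if node["gpus"] > 0:
        # GPU node: cap GPU power; leave the host CPU uncapped so it can feed the GPUs.
        return {"gpu_cap_frac": 0.8}
    # CPU-only node: CPU frequency throttling is safe here.
    return {"cpu_freq_frac": 0.8}

print(eco_action({"gpus": 8}))  # {'gpu_cap_frac': 0.8}
print(eco_action({"gpus": 0}))  # {'cpu_freq_frac': 0.8}
```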

Implementation details and scripts