Heterogeneous GPU Cluster Benchmark

This page reports results from the heterogeneous GPU cluster benchmark experiment.

Scope

The benchmark compares three baselines on a heterogeneous cluster mixing 5 distinct GPU hardware families plus CPU-only nodes:

  • A: simulator only (Joulie-free)
  • B: Joulie with static partition policy
  • C: Joulie with queue-aware policy

Experimental setup

Cluster and nodes

  • kind control-plane + worker (real control plane)
  • 41 managed KWOK nodes: 33 GPU nodes + 8 CPU-only nodes
  • Workload pods target KWOK nodes via selector + toleration

Node inventory - detailed cluster composition

This is a heterogeneous GPU cluster mixing 5 distinct GPU hardware families across 33 GPU nodes, plus 8 CPU-only nodes.

GPU nodes (33 total, 188 GPUs)

| Node prefix | Count | GPU model | GPUs/node | GPU cap range | Host CPU | CPU cores/node |
|---|---|---|---|---|---|---|
| kwok-h100-nvl | 12 | NVIDIA H100 NVL | 8 | 200-400 W | AMD EPYC 9654 96-Core | 192 |
| kwok-h100-sxm | 6 | NVIDIA H100 80GB HBM3 | 4 | 350-700 W | Intel Xeon Gold 6530 | 64 |
| kwok-l40s | 7 | NVIDIA L40S | 4 | 200-350 W | AMD EPYC 9534 64-Core | 128 |
| kwok-mi300x | 2 | AMD Instinct MI300X | 8 | 350-750 W | AMD EPYC 9534 64-Core | 128 |
| kwok-w7900 | 6 | AMD Radeon PRO W7900 | 4 | 200-295 W | AMD EPYC 9534 64-Core | 128 |

GPU count summary: 96 + 24 + 28 + 16 + 24 = 188 GPUs total across NVIDIA and AMD families.

CPU-only nodes (8 total)

| Node prefix | Count | CPU model | CPU cores/node | RAM/node |
|---|---|---|---|---|
| kwok-cpu-highcore | 2 | AMD EPYC 9965 192-Core | 384 (2x192) | 1536 GiB |
| kwok-cpu-highfreq | 2 | AMD EPYC 9375F 32-Core | 64 (2x32) | 770 GiB |
| kwok-cpu-intensive | 4 | AMD EPYC 9655 96-Core | 192 (2x96) | 1536 GiB |

Total: 41 nodes, 188 GPUs (5 families), 6272 CPU cores (4608 on GPU nodes + 1664 on CPU-only nodes).

Hardware models in simulator

GPU power per device at load fraction g:

P_gpu(g) = IdleW + (PeakW - IdleW) * g^computeGamma
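The curve can be evaluated directly. A minimal sketch (illustrative, not the simulator's actual code), using the H100 NVL parameters from the physics table below:

```python
# Sketch of the per-device GPU power curve: power rises from IdleW toward
# PeakW as the load fraction g approaches 1, shaped by computeGamma.
def p_gpu(g: float, idle_w: float, peak_w: float, gamma: float) -> float:
    assert 0.0 <= g <= 1.0, "g is a load fraction"
    return idle_w + (peak_w - idle_w) * g ** gamma

# NVIDIA H100 NVL: IdleW=80, PeakW=400, computeGamma=1.50
print(p_gpu(0.0, 80, 400, 1.5))  # idle: 80.0 W
print(p_gpu(1.0, 80, 400, 1.5))  # peak: 400.0 W
```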

Per-GPU-family physics parameters:

| GPU family | IdleW (W) | PeakW (W) | computeGamma | GPU cap range |
|---|---|---|---|---|
| NVIDIA H100 NVL | 80 | 400 | 1.50 | 200-400 W |
| NVIDIA H100 80GB HBM3 | 120 | 700 | 1.50 | 350-700 W |
| NVIDIA L40S | 60 | 350 | 1.40 | 200-350 W |
| AMD Instinct MI300X | 100 | 750 | 0.85 | 350-750 W |
| AMD Radeon PRO W7900 | 40 | 295 | 1.20 | 200-295 W |

computeGamma controls cap sensitivity: the higher the gamma, the more throughput a family retains under a given power cap. At a 65% GPU cap, the H100 NVL loses ~24.7% of throughput while the MI300X loses ~38.2%.
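These sensitivity numbers can be approximated by inverting the power curve: if a cap limits the dynamic (above-idle) power to a fraction f of the idle-to-peak range, the sustainable load fraction is g = f^(1/gamma). The sketch below is one plausible reading, not the simulator's exact accounting (it yields ~25.0% and ~39.8% losses rather than the reported 24.7% and 38.2%, so the simulator evidently includes additional terms):

```python
def throughput_loss_at_cap(cap_frac: float, gamma: float) -> float:
    """Approximate throughput loss when dynamic GPU power is capped at
    cap_frac of the idle-to-peak range, assuming throughput scales with
    the load fraction g and dynamic power scales with g**gamma."""
    g = cap_frac ** (1.0 / gamma)  # invert g**gamma = cap_frac
    return 1.0 - g

# Higher gamma retains more throughput under the same 65% cap:
print(f"H100 NVL (gamma=1.50): {throughput_loss_at_cap(0.65, 1.50):.1%} lost")  # 25.0%
print(f"MI300X  (gamma=0.85): {throughput_loss_at_cap(0.65, 0.85):.1%} lost")  # 39.8%
```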

Full power-model details: Power Simulator

Run configuration

  • Seeds: 3
  • Jobs: 200
  • Mean inter-arrival: 0.30 s
  • Time scale: 60x
  • Timeout: 3600 s
  • Perf ratio: 25%, GPU ratio: 35%
  • Workload types: debug_eval, single_gpu_training, cpu_preprocess, cpu_analytics
  • Policy caps: CPU eco at 65%, GPU eco at 65% of peak
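For reference, the run configuration above can be captured as a single settings object (key names here are illustrative, not the harness's actual schema):

```python
# Hypothetical encoding of the run configuration; values are taken from
# the list above, key names are illustrative.
run_config = {
    "seeds": 3,
    "jobs": 200,
    "mean_interarrival_s": 0.30,
    "time_scale": 60,          # 60x sim-time acceleration
    "timeout_s": 3600,
    "perf_ratio": 0.25,        # performance-sensitive share of jobs
    "gpu_ratio": 0.35,         # GPU share of jobs
    "workloads": ["debug_eval", "single_gpu_training",
                  "cpu_preprocess", "cpu_analytics"],
    "cpu_eco_cap": 0.65,       # fraction of peak on eco nodes
    "gpu_eco_cap": 0.65,
}
```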

Algorithms used

Controller policies

  • static_partition:
    • hpCount = round(N * 0.40) -> ~16 performance nodes, ~25 eco nodes
  • queue_aware_v1:
    • baseCount = round(N * 0.40), dynamic adjustment from live perf-pod count
    • hpCount = clamp(max(baseCount, queueNeed), 2, 20, N)
  • Downgrade guard: a node's performance -> eco downgrade is deferred while performance-sensitive pods are still running on it
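The partition sizing above can be sketched as follows. This is one reading of the pseudocode: `clamp(x, 2, 20, N)` is interpreted as bounding the result to [2, min(20, N)], and queueNeed is taken to be the live perf-pod count (both are assumptions):

```python
def static_partition(n_nodes: int, hp_frac: float = 0.40) -> int:
    """Fixed split: ~40% of nodes stay uncapped (performance), rest eco."""
    return round(n_nodes * hp_frac)

def queue_aware_v1(n_nodes: int, perf_pods: int, hp_frac: float = 0.40) -> int:
    """Grow the performance partition when live perf-pod demand exceeds
    the static base; result bounded to [2, min(20, n_nodes)]."""
    base = round(n_nodes * hp_frac)
    want = max(base, perf_pods)          # queueNeed read as live perf-pod count
    return max(2, min(want, 20, n_nodes))

print(static_partition(41))    # -> 16 performance nodes (25 eco)
print(queue_aware_v1(41, 19))  # -> 19: demand pushes past the base of 16
print(queue_aware_v1(41, 50))  # -> 20: hard upper bound
```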

Results summary

Per-seed results

| Baseline | Seed | Wall (s) | Throughput (jobs/sim-hr) | Energy (kWh sim) | Avg power (W) |
|---|---|---|---|---|---|
| A | 1 | 392.78 | 27.04 | 65.55 | 10013 |
| A | 2 | 218.80 | 49.36 | 35.52 | 9741 |
| A | 3 | 437.64 | 24.27 | 70.60 | 9679 |
| B | 1 | 393.65 | 26.98 | 60.78 | 9265 |
| B | 2 | 216.65 | 49.85 | 33.86 | 9377 |
| B | 3 | 438.61 | 24.21 | 66.41 | 9085 |
| C | 1 | 391.76 | 27.11 | 60.15 | 9213 |
| C | 2 | 217.53 | 49.65 | 34.41 | 9493 |
| C | 3 | 438.16 | 24.24 | 66.34 | 9084 |

Baseline means (3 seeds, all completed)

| Baseline | Mean wall (s) | Mean throughput (jobs/sim-hr) | Mean energy (kWh sim) | Mean cluster power (W) |
|---|---|---|---|---|
| A | 349.7 | 33.56 | 57.22 | 9811 |
| B | 349.6 | 33.68 | 53.68 | 9242 |
| C | 349.1 | 33.67 | 53.63 | 9263 |

Relative to A:

  • B: energy -6.2%, throughput +0.4% (negligible)
  • C: energy -6.3%, throughput +0.3% (negligible)
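These deltas are straightforward arithmetic over the baseline-means table above:

```python
means = {  # baseline -> (mean throughput jobs/sim-hr, mean energy kWh sim)
    "A": (33.56, 57.22),
    "B": (33.68, 53.68),
    "C": (33.67, 53.63),
}

def deltas_vs_a(baseline: str) -> tuple[float, float]:
    """Percent change in (energy, throughput) relative to baseline A."""
    tp_a, e_a = means["A"]
    tp, e = means[baseline]
    return 100 * (e - e_a) / e_a, 100 * (tp - tp_a) / tp_a

for b in ("B", "C"):
    d_energy, d_tp = deltas_vs_a(b)
    print(f"{b}: energy {d_energy:+.1f}%, throughput {d_tp:+.1f}%")
# B: energy -6.2%, throughput +0.4%
# C: energy -6.3%, throughput +0.3%
```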

Plot commentary

Runtime distribution

Runtime Distribution by Baseline
  • All three baselines complete within similar wall-time windows across all seeds.
  • No incomplete runs - all 9 runs (3 baselines x 3 seeds) completed successfully.

Energy vs makespan

Energy vs Makespan
  • B and C are consistently shifted to lower energy vs A, with near-identical makespan.

Baseline means

Baseline Means
  • Throughput and wall-time bars are indistinguishable across baselines.
  • Energy bars clearly show B and C both below A by ~6%.

Relative tradeoff vs A

Relative Tradeoff vs A
  • Per-seed scatter of energy delta vs throughput delta relative to A.
  • B and C cluster in the lower-energy region with minimal throughput change.

Relative tradeoff bars vs A

Relative Tradeoff Bars vs A
  • Mean energy and throughput deltas: B at -6.2% energy / +0.4% throughput, C at -6.3% / +0.3%.

Hardware family tradeoff vs A

Hardware Family Tradeoff vs A
  • Per-hardware-family energy and throughput tradeoff under Joulie policies.

Hardware family rankings - baseline B

Hardware Family Rankings Baseline B
  • Per-family energy and throughput under B policy relative to A.

Hardware family rankings - baseline C

Hardware Family Rankings Baseline C
  • C shows similar outcomes to B across hardware families.

Completion summary

Completion Summary
  • All baselines achieve 100% completion across all 3 seeds.

Interpretation

Why does Joulie save ~6% energy on GPU clusters?

The combination of CPU and GPU eco caps at 65% achieves meaningful energy reduction:

  1. GPU power caps directly reduce GPU power draw: at 65% cap, each GPU on an eco node draws significantly less power. With 188 GPUs across 33 nodes, even a modest per-GPU savings compounds at scale.
  2. CPU caps further reduce host power: eco nodes draw less CPU power in addition to GPU savings.
  3. Throughput preserved: the scheduler distributes performance-sensitive jobs to uncapped performance nodes. The net throughput impact is negligible.

Both policies achieve nearly identical results (~6.2-6.3%) because the workload mix does not create sufficient demand spikes to differentiate queue-aware from static partition.

Best-fit use case

  • Joulie achieves -6.2% energy (static) / -6.3% energy (queue-aware) on heterogeneous GPU clusters with negligible throughput impact.
  • The key enabler is applying GPU power caps in addition to CPU caps on eco nodes.

Implementation details and scripts