CPU-Only Benchmark

This page reports results from the CPU-only cluster benchmark experiment.

Scope

The benchmark compares three baselines on a pure CPU cluster:

  • A: simulator only (Joulie-free)
  • B: Joulie with static partition policy
  • C: Joulie with queue-aware policy

It evaluates energy and throughput under real Kubernetes scheduling with KWOK nodes and simulated power control.

Experimental setup

Cluster and nodes

  • kind control-plane + worker (real control plane)
  • 8 managed KWOK nodes - CPU only, no GPUs
  • Workload pods target KWOK nodes via selector + toleration

Node inventory

Node prefix        | Count | CPU model              | CPU cores   | RAM
kwok-cpu-highcore  | 2     | AMD EPYC 9965 192-Core | 384 (2x192) | 1536 GiB
kwok-cpu-highfreq  | 2     | AMD EPYC 9375F 32-Core | 64 (2x32)   | 770 GiB
kwok-cpu-intensive | 4     | AMD EPYC 9655 96-Core  | 192 (2x96)  | 1536 GiB

Total: 8 nodes, 2304 CPU cores, 0 GPUs.

Hardware models in simulator

CPU power per node:

P(u, f) = IdleW + (PeakW - IdleW) * u^AlphaUtil * f^BetaFreq

CPU family             | IdleW (W) | PeakW (W) | AlphaUtil | BetaFreq
AMD EPYC 9965 192-Core | 120       | 960       | 1.15      | 1.30
AMD EPYC 9375F 32-Core | 60        | 480       | 1.10      | 1.25
AMD EPYC 9655 96-Core  | 95        | 760       | 1.12      | 1.28
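
The power formula and per-family parameters above can be sketched in a few lines of Python. This is an illustrative sketch, not the simulator's code: the class name `CpuPowerModel`, the `MODELS` dictionary, and the assumption that both u (utilization) and f (frequency factor) are normalized to the range [0, 1] are not stated on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CpuPowerModel:
    """Per-node CPU power parameters, taken from the table above."""
    idle_w: float
    peak_w: float
    alpha_util: float
    beta_freq: float

    def power(self, u: float, f: float) -> float:
        """P(u, f) = IdleW + (PeakW - IdleW) * u^AlphaUtil * f^BetaFreq.

        Assumption: u and f are fractions in [0, 1].
        """
        if not (0.0 <= u <= 1.0 and 0.0 <= f <= 1.0):
            raise ValueError("u and f must be in [0, 1]")
        dynamic_range = self.peak_w - self.idle_w
        return self.idle_w + dynamic_range * u ** self.alpha_util * f ** self.beta_freq


# Hypothetical lookup table keyed by CPU family name.
MODELS = {
    "AMD EPYC 9965 192-Core": CpuPowerModel(120, 960, 1.15, 1.30),
    "AMD EPYC 9375F 32-Core": CpuPowerModel(60, 480, 1.10, 1.25),
    "AMD EPYC 9655 96-Core": CpuPowerModel(95, 760, 1.12, 1.28),
}

if __name__ == "__main__":
    for family, model in MODELS.items():
        full = model.power(u=1.0, f=1.0)
        capped = model.power(u=1.0, f=0.65)
        print(f"{family}: full load {full:.0f} W, at f=0.65 {capped:.0f} W")
```

With these parameters the model returns IdleW at zero utilization and PeakW at full utilization and full frequency, so the exponents only shape the curve between those two endpoints.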

Full power-model details: Power Simulator

Run configuration

  • Seeds: 3
  • Jobs: 300
  • Mean inter-arrival: 0.20 s
  • Time scale: 60x
  • Timeout: 1800 s
  • Perf ratio: 20%, GPU ratio: 0%
  • Workload types: cpu_preprocess, cpu_analytics
  • Policy caps: CPU eco at 65% of peak

Algorithms used

Controller policies

  • static_partition:
    • hpCount = round(N * 0.30) -> 2 performance nodes, 6 eco nodes
  • queue_aware_v1:
    • baseCount = round(N * 0.30); queueNeed is derived dynamically from the live perf-pod count
    • hpCount = clamp(max(baseCount, queueNeed), 2, 15, N)
  • Downgrade guard: performance -> eco deferred while performance-sensitive pods still run on node
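
The node-count rules above can be sketched as follows. This is a minimal illustration under stated assumptions, not the controller's implementation: the function names are hypothetical, the four-argument clamp is read here as "at least 2, at most min(15, N)", and queueNeed is taken as an input because its derivation is not given on this page. Python's `round` rounds halves to even, which may differ from the controller's rounding in edge cases.

```python
def static_partition_hp_count(n_nodes: int, hp_fraction: float = 0.30) -> int:
    """Number of performance nodes under static_partition: round(N * 0.30)."""
    return round(n_nodes * hp_fraction)


def queue_aware_hp_count(
    n_nodes: int,
    queue_need: int,
    hp_fraction: float = 0.30,
    hp_min: int = 2,   # assumed meaning of the "2" in the clamp
    hp_max: int = 15,  # assumed meaning of the "15" in the clamp
) -> int:
    """Number of performance nodes under queue_aware_v1.

    Assumption: clamp(x, 2, 15, N) bounds x to [2, min(15, N)].
    """
    base_count = round(n_nodes * hp_fraction)
    wanted = max(base_count, queue_need)
    upper = min(hp_max, n_nodes)
    lower = min(hp_min, upper)  # keeps the bounds ordered on tiny clusters
    return max(lower, min(wanted, upper))


def may_downgrade_to_eco(running_perf_pods_on_node: int) -> bool:
    """Downgrade guard: defer performance -> eco while perf pods still run."""
    return running_perf_pods_on_node == 0


if __name__ == "__main__":
    n = 8
    hp = static_partition_hp_count(n)
    print(f"static_partition: {hp} performance nodes, {n - hp} eco nodes")
    for need in (0, 5, 20):
        print(f"queue_aware_v1 with queueNeed={need}: {queue_aware_hp_count(n, need)}")
```

For the 8-node cluster in this benchmark, both rules give 2 performance nodes when queue demand is low; the queue-aware rule only diverges once queueNeed exceeds the base count.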

Results summary

Per-seed results

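Baseline | Seed | Wall (s) | Throughput (jobs/sim-hr) | Energy (kWh sim) | Avg power (W)
A        | 1    | 317.98   | 113.21                   | 17.63            | 3326
A        | 2    | 276.18   | 130.35                   | 15.01            | 3261
A        | 3    | 239.74   | 150.17                   | 13.25            | 3315
B        | 1    | 330.14   | 109.04                   | 12.22            | 2221
B        | 2    | 275.86   | 130.50                   | 10.10            | 2197
B        | 3    | 240.20   | 149.87                   | 8.98             | 2242
C        | 1    | 328.92   | 109.45                   | 12.25            | 2235
C        | 2    | 275.26   | 130.78                   | 9.99             | 2177
C        | 3    | 239.66   | 150.21                   | 9.02             | 2259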
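
As a consistency check, the per-seed energy figures appear to follow from average power multiplied by simulated duration, where simulated duration is wall time scaled by the 60x time scale from the run configuration. The sketch below recomputes energy that way; the relationship is inferred from the numbers rather than documented here, and the names `RUNS` and `simulated_energy_kwh` are illustrative.

```python
TIME_SCALE = 60  # "Time scale: 60x" from the run configuration

# (baseline, seed, wall_s, avg_power_w, reported_energy_kwh) from the per-seed table
RUNS = [
    ("A", 1, 317.98, 3326, 17.63),
    ("A", 2, 276.18, 3261, 15.01),
    ("A", 3, 239.74, 3315, 13.25),
    ("B", 1, 330.14, 2221, 12.22),
    ("B", 2, 275.86, 2197, 10.10),
    ("B", 3, 240.20, 2242, 8.98),
    ("C", 1, 328.92, 2235, 12.25),
    ("C", 2, 275.26, 2177, 9.99),
    ("C", 3, 239.66, 2259, 9.02),
]


def simulated_energy_kwh(wall_s: float, avg_power_w: float,
                         time_scale: float = TIME_SCALE) -> float:
    """Energy in kWh over the simulated duration (assumed: wall time * time scale)."""
    sim_hours = wall_s * time_scale / 3600.0
    return avg_power_w * sim_hours / 1000.0


if __name__ == "__main__":
    for baseline, seed, wall_s, power_w, reported in RUNS:
        estimate = simulated_energy_kwh(wall_s, power_w)
        print(f"{baseline}{seed}: estimated {estimate:.2f} kWh, reported {reported:.2f} kWh")
```

Small differences between the estimate and the reported value are expected because the table rounds average power to whole watts.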

Baseline means (3 seeds, all completed)

Baseline | Mean wall (s) | Mean throughput (jobs/sim-hr) | Mean energy (kWh sim) | Mean cluster power (W)
A        | 278.0         | 131.24                        | 15.30                 | 3301
B        | 282.1         | 129.80                        | 10.43                 | 2220
C        | 281.3         | 130.15                        | 10.42                 | 2224

Relative to A:

  • B: energy -31.8%, throughput -1.1% (negligible)
  • C: energy -31.9%, throughput -0.8% (negligible)
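
The relative figures above can be reproduced from the per-seed values with a short script. This is a sketch for checking the arithmetic; the dictionary and function names are illustrative, and the data is copied from the per-seed results table.

```python
from statistics import mean

# Per-seed values (seeds 1-3) copied from the per-seed results table.
ENERGY_KWH = {
    "A": [17.63, 15.01, 13.25],
    "B": [12.22, 10.10, 8.98],
    "C": [12.25, 9.99, 9.02],
}
THROUGHPUT_JOBS_PER_SIM_HR = {
    "A": [113.21, 130.35, 150.17],
    "B": [109.04, 130.50, 149.87],
    "C": [109.45, 130.78, 150.21],
}


def relative_change_pct(value: float, reference: float) -> float:
    """Percentage change of value relative to reference."""
    return (value / reference - 1.0) * 100.0


def compare_to_baseline(metric: dict, reference: str = "A") -> dict:
    """Relative change of each baseline's seed mean versus the reference mean."""
    ref_mean = mean(metric[reference])
    return {
        name: relative_change_pct(mean(values), ref_mean)
        for name, values in metric.items()
        if name != reference
    }


energy_delta = compare_to_baseline(ENERGY_KWH)
throughput_delta = compare_to_baseline(THROUGHPUT_JOBS_PER_SIM_HR)

if __name__ == "__main__":
    for name in ("B", "C"):
        print(f"{name}: energy {energy_delta[name]:+.1f}%, "
              f"throughput {throughput_delta[name]:+.1f}%")
```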

Plot commentary

Runtime distribution

[Figure: Runtime Distribution by Baseline]

  • All three baselines complete in nearly identical wall-time windows.
  • Run-to-run seed jitter is larger than any inter-baseline difference.

Energy vs makespan

[Figure: Energy vs Makespan]

  • B and C are consistently lower-energy than A with near-identical makespan across all 3 seeds.
  • Both Joulie baselines cluster tightly together.

Baseline means

[Figure: Baseline Mean Metrics]

  • Energy is the main differentiator; throughput and wall-time bars are indistinguishable.

Completion summary

[Figure: Completion Summary]

  • All 3 seeds completed for all baselines; no timeouts or gang-scheduling issues.

Interpretation

Joulie reduces energy by ~32% with a negligible throughput penalty (about 1%) on a CPU-only cluster because:

  1. The cluster is over-provisioned (2304 cores, lightweight jobs) - eco nodes have spare CPU cores to compensate for throttled frequency.
  2. CPU sensitivity for cpu_preprocess/cpu_analytics is moderate (0.7-0.9): a 35% frequency reduction causes 25-32% per-job slowdown, but overall completion time stays flat because the scheduler redistributes load.
  3. Eco nodes draw significantly less power for the same simulated duration -> energy falls without extending makespan.
  4. The aggressive 65% eco cap maximizes power savings on eco nodes compared to milder caps.
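
The 25-32% slowdown range in point 2 is consistent with a simple linear model in which per-job slowdown equals CPU sensitivity multiplied by the fractional frequency reduction. The sketch below works through that arithmetic; the linear form and the function name `per_job_slowdown` are assumptions made for illustration, and the simulator's actual slowdown formula is not shown on this page.

```python
ECO_CAP = 0.65  # "CPU eco at 65% of peak" from the run configuration


def per_job_slowdown(sensitivity: float, freq_fraction: float) -> float:
    """Assumed linear model: slowdown = sensitivity * (1 - freq_fraction)."""
    if not 0.0 <= freq_fraction <= 1.0:
        raise ValueError("freq_fraction must be in [0, 1]")
    return sensitivity * (1.0 - freq_fraction)


if __name__ == "__main__":
    for sensitivity in (0.7, 0.9):
        slowdown = per_job_slowdown(sensitivity, ECO_CAP)
        print(f"sensitivity {sensitivity}: about {slowdown * 100:.1f}% slower per job")
```

Under this assumption the two ends of the sensitivity range give slowdowns of 24.5% and 31.5%, which round to the quoted 25-32%.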

Best-fit use case

The strongest observed benefit is:

  • energy reduction (-31.8% static, -31.9% queue-aware) with negligible throughput penalty in CPU-only mixed workload clusters.

Both policies perform equivalently on CPU-only clusters. static_partition is simpler to configure; queue_aware_v1 becomes more valuable when the performance-sensitive fraction is larger or more bursty.

Implementation details and scripts