Workload Distributions
This page documents the statistical distributions and priors behind the current workload generator.
Use it together with:
What this page is for
The generator is no longer just a flat random-job emitter. It now uses explicit priors for:
- arrival timing,
- GPU-count skew,
- duration shape,
- utilization,
- memory pressure,
- multi-pod workload structure.
This page makes those priors visible and explains why they are reasonable.
1. Arrival model
The current implementation uses a lightweight process that resembles a non-homogeneous Poisson process (NHPP):
- a baseline exponential inter-arrival sampler,
- modulated by an hourly multiplier,
- with optional burst windows.
That is not yet a full trace-fit pipeline, but it is designed to capture the main structure reported in Helios and in scheduler-evaluation work:
- workday-heavy submission activity,
- visible midday dips,
- lower submission activity overnight,
- and occasional bursty periods.
Current hourly multipliers
- 00:00-07:59: 0.70
- 09:00-11:59: 1.20
- 12:00: 0.85
- 13:00-17:59: 1.15
- 18:00: 0.85
- 20:00-22:59: 1.00
- other hours: 0.95
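As a minimal sketch of the hourly modulation, the snippet below scales the rate of an exponential inter-arrival sampler by the multiplier of the current hour. The function names, the `base_rate_per_s` parameter, and the reading of "12:00" and "18:00" as single hours are assumptions made for illustration; the rate is evaluated at the start of each gap, which is a simplification rather than exact NHPP thinning, and the burst overlay is not modeled.

```python
import random

# Hourly multipliers from the list above; hours not covered by a range use 0.95.
# "12:00" and "18:00" are read here as the single hours 12 and 18 (an assumption).
OTHER_HOURS = 0.95
HOURLY = {h: 0.70 for h in range(0, 8)}
HOURLY.update({h: 1.20 for h in range(9, 12)})
HOURLY[12] = 0.85
HOURLY.update({h: 1.15 for h in range(13, 18)})
HOURLY[18] = 0.85
HOURLY.update({h: 1.00 for h in range(20, 23)})


def multiplier(t_seconds: float) -> float:
    """Return the rate multiplier for the hour of day containing t_seconds."""
    hour = int(t_seconds // 3600) % 24
    return HOURLY.get(hour, OTHER_HOURS)


def sample_arrivals(base_rate_per_s: float, horizon_s: float, rng: random.Random) -> list:
    """Draw arrival times from exponential gaps whose rate follows the hourly multiplier."""
    arrivals = []
    t = 0.0
    while True:
        # Effective rate at the current time; a higher multiplier shortens the expected gap.
        rate = base_rate_per_s * multiplier(t)
        t += rng.expovariate(rate)
        if t >= horizon_s:
            return arrivals
        arrivals.append(t)
```

For example, `sample_arrivals(1 / 60, 24 * 3600, random.Random(0))` draws one simulated day at a hypothetical baseline of one job per minute.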
Burst overlay
Optional burst parameters:
- --burst-day-probability
- --burst-multiplier
- --burst-mean-jobs
This is inspired by Blox-style simulator methodology, where trace-derived rates are combined with controlled spikes for stress testing.
2. GPU-count prior
The generator follows the trace-backed pattern that:
- single-GPU jobs dominate job count,
- larger jobs dominate GPU-time.
Current default categorical prior
- P(G=1) = 0.80
- P(G=2) = 0.10
- P(G=4) = 0.07
- P(G=8) = 0.03
This is a practical HEP-oriented prior derived from the shape reported in Helios rather than a literal replay of one production cluster.
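The sketch below draws GPU counts from this prior and computes how requested GPUs split across the classes. The function names are hypothetical, and the share calculation assumes equal durations across classes, which the generator does not claim.

```python
import random

# Default categorical prior over GPU count, as listed above.
GPU_PRIOR = {1: 0.80, 2: 0.10, 4: 0.07, 8: 0.03}


def sample_gpu_count(rng: random.Random) -> int:
    """Draw one GPU count from the categorical prior."""
    counts = list(GPU_PRIOR)
    weights = list(GPU_PRIOR.values())
    return rng.choices(counts, weights=weights, k=1)[0]


def gpu_share_by_count() -> dict:
    """Share of requested GPUs per class, assuming equal durations across classes."""
    expected_gpus = sum(g * p for g, p in GPU_PRIOR.items())
    return {g: g * p / expected_gpus for g, p in GPU_PRIOR.items()}
```

With these numbers the expected GPU count per job is 1.52, so multi-GPU jobs account for 20% of jobs but about 47% of requested GPUs. Longer durations for larger jobs, which this sketch does not model, would push their GPU-time share higher.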
3. Duration model
The generator uses heavy-tailed duration priors, because public AI cluster traces consistently show that durations are not well represented by narrow or symmetric distributions.
Current family-level approximations
- debug_eval
  - short-biased lognormal
  - clamped to 30s .. 1000s
- single_gpu_training
  - long-tail lognormal
  - clamped to 15min .. 7d
- distributed_training
  - longer-tail lognormal
  - clamped to 20min .. 14d
- parameter_server_training
  - clamped to 15min .. 7d
- hpo_experiment
  - clamped to 20min .. 3d
- cpu_preprocess
  - clamped to 2min .. 8h
- cpu_analytics
  - clamped to 5min .. 24h
These choices are intended to preserve the Helios-style pattern:
- many short exploratory/evaluation jobs,
- fewer long-running training jobs,
- a strong long tail extending far beyond the median.
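A clamped lognormal draw can be sketched as follows. The clamp bounds come from the family list above; the `median_s` and `sigma` shape parameters and the function name are hypothetical, since this page does not state the generator's lognormal parameters.

```python
import math
import random

# Clamp bounds in seconds, taken from the family list above.
MINUTE, HOUR, DAY = 60, 3600, 86400
DURATION_BOUNDS = {
    "debug_eval": (30, 1000),
    "single_gpu_training": (15 * MINUTE, 7 * DAY),
    "distributed_training": (20 * MINUTE, 14 * DAY),
    "parameter_server_training": (15 * MINUTE, 7 * DAY),
    "hpo_experiment": (20 * MINUTE, 3 * DAY),
    "cpu_preprocess": (2 * MINUTE, 8 * HOUR),
    "cpu_analytics": (5 * MINUTE, 24 * HOUR),
}


def sample_duration(family: str, median_s: float, sigma: float, rng: random.Random) -> float:
    """Draw a lognormal duration in seconds and clamp it to the family's bounds.

    median_s and sigma are hypothetical shape parameters, not the generator's values.
    """
    lo, hi = DURATION_BOUNDS[family]
    # The median of a lognormal is exp(mu), so mu = log(median_s).
    raw = rng.lognormvariate(math.log(median_s), sigma)
    return min(max(raw, lo), hi)
```

Clamping piles probability mass onto the two bounds, so a wide `sigma` combined with tight bounds produces visible spikes at the minimum and maximum durations.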
4. Requested resources vs used resources
One of the most important modeling choices in the generator is that it separates:
- requested resources
  - used by the scheduler,
- used resources / intensity profile
  - used by the physical model.
This separation matters because public studies report that:
- CPU and memory are often requested proportionally to GPUs,
- but host CPU compute utilization can still be low,
- while memory pressure can remain high,
- and GPU utilization is often substantially below 100% in production training clusters.
5. GPU utilization prior
The current implementation seeds GPU utilization from the ATC'19 Philly mean GPU-utilization values by GPU count.
Reference points used in the current generator
- 1 GPU: 0.5238
- 4 GPUs: 0.4518
- 8 GPUs: 0.5899
- 16 GPUs: 0.4039
The generator then interpolates this into a small-cluster prior and adjusts by workload family:
- debug_eval: lower effective GPU utilization
- distributed_training: lower effective GPU utilization than single-GPU training
- parameter_server_training: similar downward adjustment
This is meant to reflect the well-known difference between:
- GPU allocation,
- GPU utilization,
- and actual useful throughput.
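One way to turn the four reference points into a prior for any GPU count is sketched below. Linear interpolation in log2 of the GPU count and the family scale factors are assumptions chosen for illustration; this page only states that the values are interpolated and then adjusted downward for some families.

```python
import math

# Mean GPU utilization by GPU count, as listed above (Philly, ATC'19).
PHILLY_MEAN_UTIL = {1: 0.5238, 4: 0.4518, 8: 0.5899, 16: 0.4039}

# Hypothetical family scale factors; the page only gives the direction of the adjustment.
FAMILY_SCALE = {
    "single_gpu_training": 1.0,
    "debug_eval": 0.8,
    "distributed_training": 0.9,
    "parameter_server_training": 0.9,
}


def base_gpu_util(gpu_count: int) -> float:
    """Interpolate linearly in log2(gpu_count) between the reference points."""
    points = sorted(PHILLY_MEAN_UTIL.items())
    if gpu_count <= points[0][0]:
        return points[0][1]
    if gpu_count >= points[-1][0]:
        return points[-1][1]
    for (g0, u0), (g1, u1) in zip(points, points[1:]):
        if g0 <= gpu_count <= g1:
            frac = (math.log2(gpu_count) - math.log2(g0)) / (math.log2(g1) - math.log2(g0))
            return u0 + frac * (u1 - u0)
    raise ValueError("unreachable: gpu_count lies between the first and last reference points")


def gpu_util_prior(family: str, gpu_count: int) -> float:
    """Scale the interpolated base value by the family factor, capped at 1.0."""
    return min(1.0, base_gpu_util(gpu_count) * FAMILY_SCALE.get(family, 1.0))
```

Under these assumptions a 2-GPU job sits halfway between the 1-GPU and 4-GPU reference points on the log scale, giving a base value of 0.4878.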
6. CPU utilization prior
The generator samples CPU utilization independently of CPU request size.
Examples from the current implementation:
- cpu_preprocess: 0.50 .. 0.75
- cpu_analytics: 0.75 .. 0.95
- debug_eval: 0.20 .. 0.40
- single_gpu_training: 0.20 .. 0.45
- distributed_training: 0.15 .. 0.35
- parameter_server_training: 0.20 .. 0.40
This reflects the observed pattern that many accelerator-heavy jobs are not CPU-compute-bound even though they still require host CPU resources.
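The separation between request and use can be sketched as below. The ranges are the ones listed above; drawing uniformly within each range and the function names are assumptions, since this page does not state the sampling distribution.

```python
import random

# Per-family CPU utilization ranges, as listed above.
CPU_UTIL_RANGE = {
    "cpu_preprocess": (0.50, 0.75),
    "cpu_analytics": (0.75, 0.95),
    "debug_eval": (0.20, 0.40),
    "single_gpu_training": (0.20, 0.45),
    "distributed_training": (0.15, 0.35),
    "parameter_server_training": (0.20, 0.40),
}


def sample_cpu_util(family: str, rng: random.Random) -> float:
    """Draw a utilization fraction for the family (uniform draw is an assumption)."""
    lo, hi = CPU_UTIL_RANGE[family]
    return rng.uniform(lo, hi)


def used_cpu_cores(requested_cores: float, cpu_util: float) -> float:
    """Cores actually kept busy; the request itself is not changed by utilization."""
    return requested_cores * cpu_util
```

Because the utilization draw never looks at the request size, a distributed training pod that requests 16 cores would keep between 2.4 and 5.6 of them busy under this sketch.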
7. Memory / IO / CPU-feed priors
The current generator samples three explicit bottleneck signals:
- memoryIntensity
- ioIntensity
- cpuFeedIntensityGpu
These are the bridge from workload generation into the throttling model.
Interpretation
- high memoryIntensity
  - more memory-dominated behavior
  - softer slowdown under CPU throttling
- high ioIntensity
  - even softer CPU-side slowdown
- high cpuFeedIntensityGpu
  - GPU throughput becomes more sensitive to CPU throttling
This is deliberately aligned with the more realistic class-aware slowdown semantics described in Hardware Modeling.
8. Structure prior
The current generator samples a logical workload family first, then derives pod structure from it.
This is directly motivated by the Alibaba PAI trace and Kubernetes-native ML systems, where a single logical workload may span multiple pods.
Structure classes used today
- single-pod GPU job
- distributed training
- parameter-server style training
- HPO experiment
- CPU-only preprocessing / analytics jobs
Current expansion behavior
- distributed_training
  - launcher + worker pods
  - gang=true
- parameter_server_training
  - parameter-server pods + worker pods
  - gang=true
- hpo_experiment
  - controller + multiple trial pods
  - no gang requirement by default in the current simulator path
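The expansion from a logical family to pods can be sketched as a small function. The `Pod` type, the role names, the default pod counts, and the single-launcher and single-controller choices are hypothetical; only the family names and the gang settings come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pod:
    role: str   # hypothetical role label
    gang: bool  # True when the pod must be scheduled together with its peers


def expand_workload(family: str, workers: int = 2, servers: int = 1, trials: int = 4) -> list:
    """Derive a pod list from a workload family; the counts are illustrative defaults."""
    if family == "distributed_training":
        return [Pod("launcher", True)] + [Pod("worker", True)] * workers
    if family == "parameter_server_training":
        return [Pod("parameter-server", True)] * servers + [Pod("worker", True)] * workers
    if family == "hpo_experiment":
        # No gang requirement: trials can start independently of each other.
        return [Pod("controller", False)] + [Pod("trial", False)] * trials
    # Single-pod GPU jobs and CPU-only jobs map to one pod.
    return [Pod("main", False)]
```

Keeping the gang flag on each pod, rather than on the workload, is one possible design; it lets a scheduler read the requirement from any pod without looking up the parent workload.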
9. Throttling-sensitivity motivation
The workload report does not just motivate arrival and resource distributions; it also motivates class-aware slowdown.
That is why Joulie models compute-bound, memory-bound, and mixed regimes rather than using a single slowdown curve for every job.
The core literature-backed idea is:
- compute-bound workloads should slow down more under reduced compute roof,
- memory-bound workloads should often slow down less,
- but the exact attenuation depends on how bandwidth, clocks, and control surfaces interact.
The workload generator therefore emits priors that support this model rather than fighting it.
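The compute-bound versus memory-bound distinction can be illustrated with the basic Roofline relation, where attainable throughput is the minimum of the compute roof and bandwidth times arithmetic intensity. The sketch below is not Joulie's slowdown model: the function names and units are hypothetical, and it holds memory bandwidth fixed while the compute roof is scaled, which is exactly the simplification the last bullet above warns about.

```python
def attainable(peak_compute: float, bandwidth: float, intensity: float) -> float:
    """Roofline bound: min(compute roof, bandwidth * arithmetic intensity)."""
    return min(peak_compute, bandwidth * intensity)


def slowdown(peak_compute: float, bandwidth: float, intensity: float, compute_scale: float) -> float:
    """Ratio of throughput before and after scaling the compute roof by compute_scale."""
    before = attainable(peak_compute, bandwidth, intensity)
    after = attainable(peak_compute * compute_scale, bandwidth, intensity)
    return before / after
```

With a roof of 100 and a bandwidth of 10, a kernel at intensity 20 is compute-bound and slows down by 2x when the roof is halved, while a kernel at intensity 2 is memory-bound and does not slow down at all. The memory-bound kernel only starts to slow once the lowered roof drops below its bandwidth limit of 20.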
10. What is implemented vs not yet implemented
Implemented today
- day-shaped arrivals with bursts
- heavy-tailed durations
- GPU-count skew
- workload-family-based generation
- multi-pod logical workloads
- gang metadata for distributed/PS workloads
- shared workload intensity profiles
- explicit resource-vs-utilization separation
Not yet implemented
- direct fitting from Helios / Philly / Alibaba traces inside workloadgen
- profile-bundle loading from a fitted YAML/JSON file
- phased time-series workload profiles
- explicit network and disk models
- calibrated per-framework workload templates
So the current implementation is research-informed and structure-aware, but not yet a full trace-fitting platform.
References
- [WD1] HeliosData repository
  https://github.com/S-Lab-System-Group/HeliosData
- [WD2] Tianwei Zhang et al., Characterization and Prediction of Deep Learning Workloads in Large-Scale GPU Datacenters (SC'21)
  https://tianweiz07.github.io/Papers/21-sc.pdf
- [WD3] Philly traces repository
  https://github.com/msr-fiddle/philly-traces
- [WD4] Myeongjae Jeon et al., Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads (USENIX ATC'19)
  https://www.usenix.org/system/files/atc19-jeon.pdf
- [WD5] Alibaba cluster-trace-gpu-v2020 README
  https://github.com/alibaba/clusterdata/blob/master/cluster-trace-gpu-v2020/README.md
- [WD6] Samuel Williams et al., Roofline: an insightful visual performance model for multicore architectures
  https://dl.acm.org/doi/10.1145/1498765.1498785
- [WD7] Blox arrival/burst evaluation reference
  https://arxiv.org/html/2312.12621v1
- [WD8] David Meisner et al., Memory Performance at Reduced CPU Clock Speeds, HotPower'12
  https://www.usenix.org/system/files/conference/hotpower12/hotpower12-final21.pdf
- [WD9] Characterizing the Impact of GPU Power Management on HPC Applications, PMBS 2025 preprint used in the workload report
  https://arxiv.org/abs/2501.16371
- [WD10] Tapasya Patki et al., Comparing GPU Power and Frequency Capping for Energy Savings in Scientific Applications, SC Workshops 2019
  https://ieeexplore.ieee.org/document/8944989