Simulator on Joulie

Installation

This page covers how to install the Joulie simulator in a Kubernetes cluster.

Prerequisites

A running Kubernetes cluster (a real cluster, or kind for local development)
kubectl configured for the target cluster
helm v3+ (for Helm installation)

Install via Helm (recommended)

The simulator is published as an OCI Helm chart. Install it with:

helm install joulie-sim oci://registry.cern.ch/mbunino/joulie/joulie-sim \
 -n joulie-system --create-namespace

To customize values, download the default values first:

helm show values oci://registry.cern.ch/mbunino/joulie/joulie-sim > values.yaml

Then install with overrides:

Workload and Power Simulator

The Joulie simulator lets you run full control-loop experiments on virtual clusters without real hardware. It keeps Kubernetes scheduling real while simulating hardware telemetry, power dynamics, and thermal behavior per node.

This page covers the simulator’s architecture, HTTP API, and integration points. Detailed subsystems are documented on dedicated pages linked throughout.

Architecture at a glance

The simulator extends the same control path used on real nodes:

Node labels define the simulated hardware identity.
The operator resolves hardware from NodeHardware when available, and otherwise falls back to labels or inventory.
The operator writes the desired node profile (NodeTwin.spec).
The agent reads the desired state and sends control intents.
The simulator emulates telemetry and control behavior per node and exposes HTTP endpoints.
The next reconcile loop reacts to the updated simulated state.

The diagram shows the end-to-end loop:

Workload Generation

This page documents how Joulie generates realistic AI workload traces for the simulator.

It is separate from the Workload Simulator page:

this page explains how traces are generated,
the workload-simulator page explains how those traces are consumed at runtime.

Scope

The current generator is designed to be realistic for:

AI-oriented Kubernetes clusters,
CPU + GPU workloads,
memory-pressure-sensitive jobs,
multi-pod logical workloads such as distributed training and HPO-style experiments.

The current generator does not explicitly model:

Workload Distributions

This page documents the statistical distributions and priors behind the current workload generator.

Use it together with:

What this page is for

The generator is no longer just a flat random-job emitter. It now uses explicit priors for:

arrival timing,
GPU-count skew,
duration shape,
utilization,
memory pressure,
multi-pod workload structure.

This page makes those priors visible and explains why they are reasonable.

1. Arrival model

The current implementation uses a lightweight NHPP-like process:
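
The rate function itself is not reproduced in this section. As a generic illustration of what an NHPP (non-homogeneous Poisson process) arrival model involves, the sketch below samples arrival times by thinning: candidates are drawn at a constant peak rate and each one is kept with probability rate(t) / peak rate. This is not Joulie's generator code. The function names, the diurnal rate shape, and every parameter value are assumptions chosen only for illustration.

```python
import math
import random


def sample_nhpp_arrivals(rate_fn, rate_max, horizon_s, rng):
    """Sample arrival times on [0, horizon_s) by thinning.

    rate_fn(t) must never exceed rate_max on the horizon.
    """
    t = 0.0
    arrivals = []
    while True:
        # Candidate arrivals come from a homogeneous process at the peak rate.
        t += rng.expovariate(rate_max)
        if t >= horizon_s:
            return arrivals
        # Keep each candidate with probability rate_fn(t) / rate_max.
        if rng.random() * rate_max <= rate_fn(t):
            arrivals.append(t)


def diurnal_rate(t_s, base=0.01, amplitude=0.5, period_s=86400.0):
    """Hypothetical day/night rate in jobs per second (not taken from Joulie)."""
    return base * (1.0 + amplitude * math.sin(2.0 * math.pi * t_s / period_s))


rng = random.Random(0)  # fixed seed so a generated trace is reproducible
arrivals = sample_nhpp_arrivals(diurnal_rate, rate_max=0.015, horizon_s=86400.0, rng=rng)
```

The peak rate passed as rate_max (0.015) equals base * (1 + amplitude), which is the bound thinning needs in order to be correct.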

Kubernetes AI Workloads

This page explains how the logical workload structures used by Joulie map onto common Kubernetes-native AI workload patterns.

It is mainly a documentation page today. The current simulator generator emits the structure metadata and pod-expanded jobs, but it does not yet render PyTorchJob, MPIJob, or Katib Experiment manifests directly.

Why this page exists

The workload-generation report makes an important point:

realistic AI workloads are often not single pods,
and a single logical workload may map to:
- a launcher + workers,
- parameter servers + workers,
- or a controller + many HPO trial pods.

That distinction matters even in a simulator, because power and slowdown should often be understood at the logical workload level, not only at the pod level.
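
One way to picture the distinction is a small data model in which a logical workload owns its pods, so that quantities such as GPU demand can be aggregated at the workload level. The sketch below covers only the launcher + workers pattern. The class names, field names, and pod naming scheme are hypothetical and are not the metadata schema emitted by the Joulie generator.

```python
from dataclasses import dataclass, field


@dataclass
class PodSpec:
    name: str
    role: str  # e.g. "launcher" or "worker"
    gpus: int = 0


@dataclass
class LogicalWorkload:
    name: str
    kind: str  # hypothetical tag, e.g. "launcher-workers"
    pods: list = field(default_factory=list)

    def total_gpus(self):
        # Aggregate at the logical workload level, not per pod.
        return sum(p.gpus for p in self.pods)


def launcher_plus_workers(name, workers, gpus_per_worker):
    """Expand one logical workload into a CPU-only launcher and N GPU workers."""
    pods = [PodSpec(f"{name}-launcher", "launcher")]
    pods += [
        PodSpec(f"{name}-worker-{i}", "worker", gpus_per_worker)
        for i in range(workers)
    ]
    return LogicalWorkload(name, "launcher-workers", pods)


job = launcher_plus_workers("train-a", workers=4, gpus_per_worker=2)
```

Here `job` expands to five pods, while its GPU demand is reported once for the whole workload via `job.total_gpus()`.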

Workload Simulator

This page documents the workload-side simulation model.

Trace generation methodology, statistical priors, multi-pod workload structure, and workload-generation references are documented in Workload Generation.

Scope

The workload simulator handles:

trace/job ingestion,
pod creation and placement via the real Kubernetes scheduler,
per-job progress updates,
completion and pod deletion,
class inference from scheduling constraints.

Power/control dynamics are documented separately in:

Power Simulator

Trace-driven workload model

Enable with:

SIM_WORKLOAD_TRACE_PATH=/path/to/trace.jsonl

The simulator loads type=job records and schedules pods over time according to submit offsets.
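
The filtering step can be sketched as follows: read the JSONL file line by line, keep only records whose type is job, and order them by submit offset. Only the type field comes from the text above. The field names submitOffsetSec and jobId, and the sample records, are assumptions made for illustration and may not match the real trace schema.

```python
import io
import json


def load_job_records(lines, offset_key="submitOffsetSec"):
    """Return type=job records sorted by submit offset.

    offset_key is a hypothetical field name; adjust it to the real schema.
    """
    jobs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the trace
        record = json.loads(line)
        if record.get("type") != "job":
            continue  # skip any non-job record
        jobs.append(record)
    jobs.sort(key=lambda r: r.get(offset_key, 0.0))
    return jobs


# In-memory stand-in for the file named by SIM_WORKLOAD_TRACE_PATH.
sample = io.StringIO(
    '{"type": "meta", "version": 1}\n'
    '{"type": "job", "jobId": "b", "submitOffsetSec": 30}\n'
    '{"type": "job", "jobId": "a", "submitOffsetSec": 5}\n'
)
jobs = load_job_records(sample)
```

With a real trace, the same function accepts an open file handle in place of the in-memory sample.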

Power Simulator

This page describes the simulator runtime mechanics (control/state/energy paths).

The canonical physical model, provenance, and hardware assumptions are documented in:

Hardware Modeling

For workload progression semantics:

Workload Simulator

Scope

The power simulator runtime is responsible for:

keeping per-node control state (CPU cap, DVFS throttle, GPU cap),
applying control actions from /control/{node},
updating dynamics with settling/ramp behavior,
exposing power telemetry on /telemetry/{node},
integrating energy over time (/debug/energy).
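
To make the last item concrete, the sketch below integrates a series of power samples into energy with the trapezoidal rule. The integration scheme used by the simulator is not stated here, so the trapezoidal rule, the function name, and the sample values are assumptions.

```python
def integrate_energy_joules(samples):
    """Integrate time-ordered (t_seconds, watts) samples into joules."""
    total = 0.0
    prev = None
    for t, watts in samples:
        if prev is not None:
            prev_t, prev_watts = prev
            # Area of the trapezoid between two consecutive samples.
            total += 0.5 * (prev_watts + watts) * (t - prev_t)
        prev = (t, watts)
    return total


# 100 W held for 10 s, then a linear ramp from 100 W to 200 W over 10 s.
energy_j = integrate_energy_joules([(0.0, 100.0), (10.0, 100.0), (20.0, 200.0)])
energy_wh = energy_j / 3600.0  # 1 Wh = 3600 J
```

The first interval contributes 1000 J and the ramp contributes 1500 J, so `energy_j` is 2500 J.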

Runtime state and controls

Main node state includes:

Hardware Modeling

This simulator section now treats hardware modeling as a shared concept rather than a simulator-only detail.

The canonical page is:

Hardware Modeling and Physical Power Model

Use that page for:

CPU and GPU model provenance
physical assumptions behind caps and slowdown
heterogeneous-node semantics
current limitations and calibration status

From the simulator point of view, the important relationship is simple:

the simulator implements the modeling assumptions documented there
the agent relies on the same hardware assumptions when interpreting caps and backend limits
simulator runtime pages describe how those models are exercised in experiments

For simulator-specific flow, continue with:

Simulator Metrics

This page documents Prometheus metrics exposed by the simulator (simulator/cmd/simulator/main.go).

Endpoint:

path: /metrics
address: simulator HTTP listen address (SIM_ADDR, default :18080)

Related debug endpoints (non-Prometheus):

/debug/nodes
/debug/events
/debug/energy
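
A minimal client sketch for these endpoints, using only the Python standard library. It assumes the simulator is reachable at http://localhost:18080 (for example through a port-forward), which is an assumption about the deployment rather than something this page specifies. The helper names are hypothetical. Only /debug/energy is documented as JSON, so the fetch helper is meant for that path.

```python
import json
import urllib.parse
import urllib.request

# Assumed base URL; the default SIM_ADDR of :18080 only fixes the port.
DEFAULT_BASE = "http://localhost:18080"


def endpoint_url(path, base=DEFAULT_BASE):
    """Join the base URL and an endpoint path without doubling slashes."""
    return base.rstrip("/") + "/" + path.lstrip("/")


def telemetry_url(node, base=DEFAULT_BASE):
    """Build the /telemetry/{node} URL, escaping the node name."""
    return endpoint_url("/telemetry/" + urllib.parse.quote(node, safe=""), base)


def fetch_json(path, base=DEFAULT_BASE, timeout=5.0):
    """GET an endpoint and decode its body as JSON (e.g. /debug/energy)."""
    with urllib.request.urlopen(endpoint_url(path, base), timeout=timeout) as resp:
        return json.load(resp)
```

For example, `fetch_json("/debug/energy")` would return the decoded energy totals once the simulator is running; the structure of that payload is not described on this page.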

HTTP/request metrics

joulie_sim_requests_total{route,method,status} (counter)
- total HTTP requests by route/method/status
joulie_sim_request_duration_seconds{route,method} (histogram)
- request latency

Control-path metrics

joulie_sim_controls_total{node,action} (counter)
- received control actions by node/action
joulie_sim_control_actions_total{node,action,result} (counter)
- control action outcomes
- result: applied|blocked|error

Per-node simulated state metrics

joulie_sim_node_cap_watts{node} (gauge)
- current simulated effective cap
joulie_sim_node_rapl_cap_watts{node} (gauge)
- simulated RAPL cap value
joulie_sim_node_throttle_pct{node} (gauge)
- simulated DVFS throttle percent
joulie_sim_node_power_watts{node} (gauge)
- simulated exported node power
joulie_sim_node_cpu_util{node} (gauge)
- simulated CPU utilization
joulie_sim_node_freq_scale{node} (gauge)
- simulated frequency scale
joulie_sim_node_running_pods{node} (gauge)
- running pods observed on the node
joulie_sim_node_class_info{node,class} (gauge)
- class assignment marker (value 1 for the node's active class)
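
The per-node gauges can be read from a /metrics scrape with a small parser for the Prometheus text exposition format. The sketch below handles only simple `name{labels} value` lines and is not a full parser. The node names and values in the example text are made up, and the function name is hypothetical.

```python
import re

# name, optional {labels}, then the sample value
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text, metric):
    """Return (labels, value) pairs for one metric name."""
    out = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        m = _SAMPLE.match(line)
        if not m or m.group(1) != metric:
            continue
        labels = dict(_LABEL.findall(m.group(2) or ""))
        out.append((labels, float(m.group(3))))
    return out


# Made-up scrape excerpt; real output contains many more series.
example = '''# TYPE joulie_sim_node_power_watts gauge
joulie_sim_node_power_watts{node="sim-node-0"} 212.5
joulie_sim_node_power_watts{node="sim-node-1"} 187.0
joulie_sim_node_cap_watts{node="sim-node-0"} 250
'''
power = {
    labels["node"]: value
    for labels, value in parse_samples(example, "joulie_sim_node_power_watts")
}
```

The same call with `joulie_sim_node_cap_watts` yields the cap per node, which can be compared against `power` to estimate headroom.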

Workload/job metrics

joulie_sim_job_submitted_total{class} (counter)
- jobs submitted by class
joulie_sim_job_completed_total{class,node} (counter)
- jobs completed by class and node
joulie_sim_job_completion_seconds (histogram)
- job completion latency distribution

Notes

Prometheus metrics capture online simulator state and request/control behavior.
Integrated node/cluster energy totals are exposed through /debug/energy (JSON), not as Prometheus time series in the current implementation.
Richer thermal and averaged-vs-instantaneous details are currently exposed through the HTTP telemetry/debug endpoints rather than as separate Prometheus gauges.
In particular, fields such as instantPackagePowerWatts, cpu.temperatureC, cpu.thermalThrottlePct, and per-device GPU averaged power live in /telemetry/{node} and /debug/nodes.