Joulie

Core Concepts

Mon, 01 Jan 0001 00:00:00 +0000

Before installing Joulie, understand the control model.

Problem Joulie addresses

Clusters running AI/scientific workloads need better power control:

reduce energy use and power spikes,
keep workload performance predictable,
provide a path to greener operation (power envelope and carbon-aware strategies).

Joulie is currently a proof of concept (PoC) focused on Kubernetes-native control loops and simulation.

Main components

Operator (cmd/operator): cluster-level policy brain
- decides desired node power profile/cap assignments
- resolves discovered hardware against the inventory
- writes desired state as NodePowerProfile
Agent (cmd/agent): node-level actuator
- discovers local CPU/GPU hardware and capability
- reads desired state and telemetry configuration
- enforces power controls (CPU + GPU)
- publishes discovered hardware as NodeHardware
- exports metrics/status
Simulator (simulator/): digital-twin execution environment
- keeps scheduling real, simulates telemetry/control behavior
- enables repeatable experiments without requiring real hardware writes

Key CRDs

NodeHardware (joulie.io/v1alpha1)
- discovered CPU/GPU identity, capability, and cap-range visibility for one node
NodePowerProfile (joulie.io/v1alpha1)
- desired node policy state (performance / eco, optional power cap)
TelemetryProfile (joulie.io/v1alpha1)
- where telemetry/control inputs come from (host, http, …), and how controls are sent

Policy states and intent

Node supply is represented through joulie.io/power-profile:

Quickstart

This page is the fastest path to run Joulie. For conceptual context first, read Core Concepts.

Prerequisites

Kubernetes cluster with worker nodes
Node Feature Discovery (NFD) deployed
Optional for real enforcement: nodes exposing writable power interfaces
- RAPL power limit files, or
- cpufreq sysfs interfaces

Install from release (recommended)

Install directly from the OCI chart release:

helm upgrade --install joulie oci://registry.cern.ch/mbunino/joulie/joulie \
 --version <version> \
 -n joulie-system \
 --create-namespace \
 -f values/joulie.yaml

Label nodes managed by the operator

Important: Joulie targets only nodes that carry a specific label and ignores all others. By default, the installation does not auto-select nodes. The default expected selector value is:

Pod Compatibility for Joulie

Joulie uses Kubernetes scheduling constraints as the single source of truth for workload placement intent.

Power-profile supply is exposed through node labels:

joulie.io/power-profile=performance
joulie.io/power-profile=eco
joulie.io/draining=true|false (independent transition flag)

Workload behavior:

performance workload (recommended): require joulie.io/power-profile NotIn ["eco"]
eco workload: require joulie.io/power-profile=eco
unconstrained workload: no power-profile affinity, can run on either profile
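
The three workload behaviors above can be sketched as a small helper that builds the pod affinity stanza for each class. This is an illustrative sketch, not code from Joulie: the function name and the class strings are hypothetical, while the affinity field names are the standard Kubernetes node-affinity fields.

```python
POWER_PROFILE_KEY = "joulie.io/power-profile"


def required_node_affinity(workload_class):
    """Return a pod `affinity` dict for a workload class, or None if unconstrained.

    The class names ("performance", "eco", "unconstrained") are labels chosen
    for this sketch only.
    """
    if workload_class == "performance":
        # Recommended pattern: exclude eco nodes instead of pinning to performance.
        expr = {"key": POWER_PROFILE_KEY, "operator": "NotIn", "values": ["eco"]}
    elif workload_class == "eco":
        expr = {"key": POWER_PROFILE_KEY, "operator": "In", "values": ["eco"]}
    elif workload_class == "unconstrained":
        # No power-profile affinity: the pod can run on either profile.
        return None
    else:
        raise ValueError(f"unknown workload class: {workload_class!r}")
    return {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [{"matchExpressions": [expr]}]
            }
        }
    }
```

The returned dict can be placed under spec.affinity of a pod manifest. Note that in Kubernetes a NotIn requirement also matches nodes that do not carry the label at all.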

Best-effort Pod (unconstrained, starting point)

This is the default and recommended starting spec. Do not set power-profile affinity: Kubernetes can schedule the pod on either eco or performance nodes.

Agent Runtime Modes

The agent supports two runtime modes:

daemonset: real-hardware mode, one pod per real node.
pool: simulation mode, one pod hosts many logical per-node controllers.

Chart templates:

charts/joulie/templates/agent-daemonset.yaml
charts/joulie/templates/agent-statefulset.yaml

DaemonSet mode (real hardware)

Required runtime settings

securityContext.privileged: true
Host mount:
- host path /sys -> container path /host-sys
Env:
- NODE_NAME from spec.nodeName
- AGENT_MODE=daemonset (default)
- optional RECONCILE_INTERVAL (default 20s)
- optional SIMULATE_ONLY=true (skip host writes, log requested actions)
- optional METRICS_ADDR (default :8080)

Pool mode (KWOK / simulation)

Pool mode preserves per-node semantics but shards logical node controllers across replicas.
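
One way to picture the sharding is a stable hash of the node name modulo the replica count. The sketch below is an assumption made for illustration: the page does not say how Joulie assigns logical nodes to replicas, and the function names are hypothetical.

```python
import hashlib


def shard_for_node(node_name, replicas):
    """Map a node name to a replica index with a stable hash (assumed scheme)."""
    digest = hashlib.sha256(node_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % replicas


def owned_nodes(all_nodes, replica_index, replicas):
    """Return the logical nodes that one pool replica would control."""
    return [n for n in all_nodes if shard_for_node(n, replicas) == replica_index]
```

With this scheme every node is owned by exactly one replica, so each logical per-node controller still sees a single node, which is what preserving per-node semantics requires.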

CPU Support and Power Capping

Joulie supports node-level CPU power capping through NodePowerProfile intents enforced by the agent.

Contract model

CPU intent is defined in NodePowerProfile.spec.cpu:

packagePowerCapWatts (optional absolute cap)
packagePowerCapPctOfMax (optional normalized profile intent)

Precedence:

packagePowerCapWatts if present
otherwise packagePowerCapPctOfMax

Policy behavior

Operator profile assignment remains performance vs eco. CPU cap values are generated per profile and written into NodePowerProfile:

performance profile typically maps to a higher cap (often 100%)
eco profile maps to a lower cap

For heterogeneous nodes, percentage-based intent remains useful because each node resolves normalized intent using node-local capabilities. If percentage intent cannot be converted to watts (for example missing RAPL range), the agent applies a DVFS percent fallback path when possible.
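
The precedence and fallback rules above can be sketched as follows. This is a simplified illustration, not the agent's code: the function name, the return shape, and the assumption that the percentage is expressed on a 0 to 100 scale are all hypothetical.

```python
def resolve_cpu_cap(cpu_spec, max_package_watts=None):
    """Resolve NodePowerProfile.spec.cpu into a (kind, value) pair.

    cpu_spec: dict with optional packagePowerCapWatts / packagePowerCapPctOfMax.
    max_package_watts: node-local maximum (for example from the RAPL range),
    or None when the range is not known.
    """
    watts = cpu_spec.get("packagePowerCapWatts")
    if watts is not None:
        # An absolute cap takes precedence when present.
        return ("watts", float(watts))
    pct = cpu_spec.get("packagePowerCapPctOfMax")
    if pct is None:
        return ("none", None)
    if max_package_watts is not None:
        # Normalized intent is resolved against node-local capability.
        return ("watts", max_package_watts * pct / 100.0)
    # The percentage cannot be converted to watts: DVFS percent fallback.
    return ("dvfs_pct", float(pct))
```

For example, the same 60% intent resolves to 240 W on a node with a 400 W maximum and to 150 W on a node with a 250 W maximum, which is why percentage intent remains useful on heterogeneous clusters.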

CRD and Policy Model

This page defines Joulie’s core contract:

demand comes from pod scheduling constraints,
supply is exposed by node power-profile labels,
discovered hardware is published through NodeHardware,
desired state is published through NodePowerProfile.

APIs

Group/version:

joulie.io/v1alpha1

CRDs:

NodeHardware (nodehardwares, cluster-scoped)
NodePowerProfile (nodepowerprofiles, cluster-scoped)
TelemetryProfile (telemetryprofiles, cluster-scoped)

CRD definitions live in:

config/crd/bases/joulie.io_nodehardwares.yaml
config/crd/bases/joulie.io_nodepowerprofiles.yaml
config/crd/bases/joulie.io_telemetryprofiles.yaml

Demand model (workloads)

Workload class is inferred from Kubernetes scheduling constraints on the key:

joulie.io/power-profile

Classification:

performance demand:
- pod excludes eco in required scheduling constraints (recommended pattern: NotIn ["eco"])
- compatibility path: explicit nodeSelector joulie.io/power-profile=performance
eco demand:
- pod requires joulie.io/power-profile=eco
- advanced pattern: also exclude joulie.io/draining=true with NotIn ["true"]
general demand:
- no explicit power-profile requirement (unconstrained)

Classification source is affinity/selector, not a custom intent label.
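
The classification rules above can be sketched as a function over a pod spec. This is a simplified illustration under stated assumptions: the function name is hypothetical, it only inspects nodeSelector and required node affinity, and it treats the first matching expression as decisive instead of evaluating the OR semantics of nodeSelectorTerms. The operator's real logic may differ.

```python
POWER_PROFILE_KEY = "joulie.io/power-profile"


def classify_pod(pod_spec):
    """Classify a pod spec (dict) as 'performance', 'eco', or 'general'."""
    # Compatibility path: explicit nodeSelector on the power-profile key.
    selector = pod_spec.get("nodeSelector") or {}
    if selector.get(POWER_PROFILE_KEY) in ("performance", "eco"):
        return selector[POWER_PROFILE_KEY]

    affinity = pod_spec.get("affinity") or {}
    node_affinity = affinity.get("nodeAffinity") or {}
    required = node_affinity.get("requiredDuringSchedulingIgnoredDuringExecution") or {}
    for term in required.get("nodeSelectorTerms") or []:
        for expr in term.get("matchExpressions") or []:
            if expr.get("key") != POWER_PROFILE_KEY:
                continue
            operator = expr.get("operator")
            values = expr.get("values") or []
            if operator == "NotIn" and "eco" in values:
                return "performance"  # pod excludes eco
            if operator == "In" and values == ["eco"]:
                return "eco"  # pod requires eco
            if operator == "In" and values == ["performance"]:
                return "performance"
    # No explicit power-profile requirement: unconstrained.
    return "general"
```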

GPU Support (NVIDIA + AMD)

Joulie supports node-level GPU power-cap intents for NVIDIA and AMD.

Validation status

GPU support has been validated in simulator mode only (no bare-metal GPU access yet). The host code paths are designed to work on bare metal (NVIDIA + AMD) when GPU nodes are available.

Contract model

NodePowerProfile.spec.gpu.powerCap defines a per-GPU cap intent:

scope: perGpu
capWattsPerGpu (absolute, optional)
capPctOfMax (percentage, optional)

Precedence:

capWattsPerGpu if present
otherwise capPctOfMax

The same cap is applied uniformly to all GPUs on the node.

Workload and Power Simulator

This document defines the Joulie simulator design and how it integrates with Joulie.

Architecture at a glance

The simulator extends the same control path used on real nodes:

Node labels define simulated hardware identity.
Operator resolves hardware from NodeHardware when available, otherwise from labels/inventory fallback.
Operator writes desired node profile (NodePowerProfile).
Agent reads desired state and sends control intents.
Simulator emulates telemetry/control behavior per node and exposes HTTP endpoints.
Next reconcile loop reacts to updated simulated state.

The diagram shows the end-to-end loop:

Workload Generation

This page documents how Joulie generates realistic AI workload traces for the simulator.

It is separate from Workload Simulator:

this page explains how traces are generated,
the workload-simulator page explains how those traces are consumed at runtime.

Scope

The current generator is designed to be realistic for:

AI-oriented Kubernetes clusters,
CPU + GPU workloads,
memory-pressure-sensitive jobs,
multi-pod logical workloads such as distributed training and HPO-style experiments.

The current generator does not explicitly model:

Workload Distributions

This page documents the statistical distributions and priors behind the current workload generator.

Use it together with:

What this page is for

The generator is no longer just a flat random-job emitter. It now uses explicit priors for:

arrival timing,
GPU-count skew,
duration shape,
utilization,
memory pressure,
multi-pod workload structure.

This page makes those priors visible and explains why they are reasonable.

1. Arrival model

The current implementation uses a lightweight NHPP-like process:

Kubernetes AI Workloads

This page explains how the logical workload structures used by Joulie map onto common Kubernetes-native AI workload patterns.

It is mainly a documentation page today. The current simulator generator emits the structure metadata and pod-expanded jobs, but it does not yet render PyTorchJob, MPIJob, or Katib Experiment manifests directly.

Why this page exists

The workload-generation report makes an important point:

realistic AI workloads are often not single pods,
and a single logical workload may map to:
- a launcher + workers,
- parameter servers + workers,
- or a controller + many HPO trial pods.

That distinction matters even in a simulator, because power and slowdown should often be understood at the logical workload level, not only at the pod level.

Joulie Operator

The operator is Joulie’s cluster-level decision engine.

It does not write host power interfaces directly. Instead, it decides desired node states and publishes them through Kubernetes objects and labels.

In practice, the operator answers one question over and over: which nodes should currently supply performance capacity, and which can safely supply eco capacity?

Responsibilities

At each reconcile tick, the operator:

selects eligible managed nodes,
reads NodeHardware when available and falls back to node labels when it is not,
resolves hardware identity against the shared inventory,
classifies workload demand from pod scheduling constraints,
runs a policy algorithm to compute a plan,
applies transition guards for safe downgrades,
writes desired node targets (NodePowerProfile) and node supply labels.

The agent then enforces those targets node-by-node.

Workload Simulator

This page documents the workload-side simulation model.

Trace generation methodology, statistical priors, multi-pod workload structure, and workload-generation references are documented in Workload Generation.

Scope

The workload simulator handles:

trace/job ingestion,
pod creation and placement via real scheduler,
per-job progress updates,
completion and pod deletion,
class inference from scheduling constraints.

Power/control dynamics are documented separately in:

Power Simulator

Trace-driven workload model

Enable with:

SIM_WORKLOAD_TRACE_PATH=/path/to/trace.jsonl

The simulator loads type=job records and schedules pods over time according to submit offsets.
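
A minimal sketch of that loading step, assuming one JSON object per line. Only the type=job filter comes from this page; the submit-offset field name used here (submitOffsetSec) is a hypothetical placeholder, so check the real trace schema before relying on it.

```python
import json


def load_jobs(lines, offset_field="submitOffsetSec"):
    """Parse JSONL lines, keep type=job records, and order them by submit offset.

    offset_field is a hypothetical field name used only for this sketch.
    """
    jobs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        if record.get("type") == "job":
            jobs.append(record)
    jobs.sort(key=lambda record: record.get(offset_field, 0))
    return jobs
```

In practice the lines would come from the file named by SIM_WORKLOAD_TRACE_PATH, for example by passing an open file object to load_jobs.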

Hardware Modeling and Physical Power Model

This page documents how Joulie models CPUs and GPUs across the project using a mix of:

official vendor specifications and management APIs,
public measured power curves, and
explicit proxy models where public exact curves are not yet available.

It serves two closely related purposes:

for the agent, it describes the hardware assumptions used to resolve caps, interpret device limits, and reason about how throttling affects attainable performance
for the simulator, it describes the physical model used to turn utilization and control actions into simulated power and slowdown

Quick summary

If you want the short version before the details:

Joulie Agent

The agent is Joulie’s node-side enforcement component.

It consumes desired state and applies node-local controls through configured backends.

If the operator decides “this node should now behave like eco” or “this node should stay performance”, the agent is the component that turns that intent into concrete control actions.

Responsibilities

At each reconcile tick, the agent:

identifies its node scope (single node in daemonset mode, sharded set in pool mode),
discovers local CPU/GPU hardware and runtime control capability,
publishes NodeHardware for each owned node,
reads desired target (NodePowerProfile) for each owned node,
reads telemetry/control routing (TelemetryProfile),
applies controls (host or HTTP),
exports metrics and status.

Inputs and outputs

Inputs:

Power Simulator

This page describes the simulator runtime mechanics (control/state/energy paths).

The canonical physical model, provenance, and hardware assumptions are documented in:

Hardware Modeling

For workload progression semantics:

Workload Simulator

Scope

The power simulator runtime is responsible for:

keeping per-node control state (CPU cap, DVFS throttle, GPU cap),
applying control actions from /control/{node},
updating dynamics with settling/ramp behavior,
exposing power telemetry on /telemetry/{node},
integrating energy over time (/debug/energy).
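
To make "integrating energy over time" concrete, here is a minimal sketch that integrates sampled power with the trapezoidal rule. The choice of rule and the function name are assumptions for illustration; this page does not state which integration scheme the simulator uses.

```python
def integrate_energy_joules(samples):
    """Integrate (time_seconds, power_watts) samples into joules.

    samples must be sorted by time; the trapezoidal rule is an assumed scheme.
    """
    energy = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        energy += 0.5 * (p0 + p1) * (t1 - t0)
    return energy
```

For example, 100 W held for 10 s followed by a linear ramp from 100 W to 200 W over the next 10 s integrates to 1000 J + 1500 J = 2500 J. Dividing joules by 3.6e6 gives kWh.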

Runtime state and controls

Main node state includes:

Hardware Modeling

This simulator section now treats hardware modeling as a concept shared across the project rather than a simulator-only detail.

The canonical page is:

Hardware Modeling and Physical Power Model

Use that page for:

CPU and GPU model provenance
physical assumptions behind caps and slowdown
heterogeneous-node semantics
current limitations and calibration status

From the simulator point of view, the important relationship is simple:

the simulator implements the modeling assumptions documented there
the agent relies on the same hardware assumptions when interpreting caps and backend limits
simulator runtime pages describe how those models are exercised in experiments

For simulator-specific flow, continue with:

Policy Algorithms

This page documents the controller policy algorithms implemented in cmd/operator/main.go.

Use this page after:

Classification Input

Policy demand classification is derived from pod scheduling constraints on joulie.io/power-profile:

performance-only: pod excludes eco in required scheduling constraints.
eco-only: pod can run only on eco; advanced eco-only placement should also exclude joulie.io/draining=true.
general (implicit unconstrained): no explicit power-profile constraint, or both profiles allowed.

Shared Reconcile Flow

Each reconcile tick:

Simulator Metrics

This page documents Prometheus metrics exposed by the simulator (simulator/cmd/simulator/main.go).

Endpoint:

path: /metrics
address: simulator HTTP listen address (SIM_ADDR, default :18080)

Related debug endpoints (non-Prometheus):

/debug/nodes
/debug/events
/debug/energy

HTTP/request metrics

joulie_sim_requests_total{route,method,status} (counter)
- total HTTP requests by route/method/status
joulie_sim_request_duration_seconds{route,method} (histogram)
- request latency

Control-path metrics

joulie_sim_controls_total{node,action} (counter)
- received control actions by node/action
joulie_sim_control_actions_total{node,action,result} (counter)
- control action outcomes
- result: applied|blocked|error

Per-node simulated state metrics

joulie_sim_node_cap_watts{node} (gauge)
- current simulated effective cap
joulie_sim_node_rapl_cap_watts{node} (gauge)
- simulated RAPL cap value
joulie_sim_node_throttle_pct{node} (gauge)
- simulated DVFS throttle percent
joulie_sim_node_power_watts{node} (gauge)
- simulated exported node power
joulie_sim_node_cpu_util{node} (gauge)
- simulated CPU utilization
joulie_sim_node_freq_scale{node} (gauge)
- simulated frequency scale
joulie_sim_node_running_pods{node} (gauge)
- running pods observed on the node
joulie_sim_node_class_info{node,class} (gauge)
- class assignment marker (1 on active class)

Workload/job metrics

joulie_sim_job_submitted_total{class} (counter)
- jobs submitted by class
joulie_sim_job_completed_total{class,node} (counter)
- jobs completed by class and node
joulie_sim_job_completion_seconds (histogram)
- job completion latency distribution

Notes

Prometheus metrics capture online simulator state and request/control behavior.
Integrated node/cluster energy totals are exposed through /debug/energy (JSON), not as Prometheus time series in the current implementation.
Richer thermal and averaged-vs-instantaneous details are currently exposed through the HTTP telemetry/debug endpoints rather than as separate Prometheus gauges.
In particular, fields such as instantPackagePowerWatts, cpu.temperatureC, cpu.thermalThrottlePct, and per-device GPU averaged power live in /telemetry/{node} and /debug/nodes.
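
As a usage illustration, the sketch below extracts one per-node gauge from a /metrics response body in the Prometheus text exposition format. The parser is deliberately simplistic (it does not handle escaped quotes or timestamps), and the node names and values in SAMPLE are made up for the example.

```python
import re

# Matches lines such as: metric_name{label="value"} 12.5
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

# Illustrative scrape body; node names and values are invented.
SAMPLE = """\
# TYPE joulie_sim_node_power_watts gauge
joulie_sim_node_power_watts{node="kwok-a"} 310.5
joulie_sim_node_power_watts{node="kwok-b"} 120
joulie_sim_node_cap_watts{node="kwok-a"} 400
"""


def gauge_by_node(exposition_text, metric):
    """Return {node: value} for one gauge in a Prometheus text scrape."""
    values = {}
    for line in exposition_text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if not match or match.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels")))
        values[labels.get("node", "")] = float(match.group("value"))
    return values


power = gauge_by_node(SAMPLE, "joulie_sim_node_power_watts")
cluster_power_watts = sum(power.values())
```

This gives an instantaneous cluster power figure from Prometheus data; integrated energy totals still have to be read from /debug/energy, as noted above.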

Input Telemetry and Actuation Interfaces

This page describes runtime IO contracts:

how Joulie reads telemetry inputs,
how Joulie sends control intents.

If you want the CRD-level summary first, read CRD and Policy Model. This page is the detailed runtime reference for the TelemetryProfile contract.

It is not the /metrics exposition contract. For exported metrics, see Metrics Reference.

Why this abstraction exists

Joulie must run in two worlds with the same control logic:

real hardware clusters,
simulator/KWOK clusters.

So agent/operator logic depends on provider interfaces, not directly on sysfs or simulator HTTP shape.

Metrics Reference

Joulie exposes Prometheus metrics from multiple components.

This page covers operator + agent metrics. Simulator metrics are documented separately in:

Simulator Metrics

For telemetry/control input interfaces (host/http routing), see:

Input Telemetry and Actuation Interfaces

Endpoints by component

Agent:
- path: /metrics
- default address: :8080
- env override: METRICS_ADDR
Operator:
- path: /metrics
- default address: :8081
- env override: METRICS_ADDR

Agent metrics

Backend and selected cap

joulie_backend_mode{node,mode} (gauge)
- mode: none|rapl|dvfs
- active mode is 1, others 0
joulie_policy_cap_watts{node,policy} (gauge)
- current selected policy cap in watts

RAPL power/energy

joulie_rapl_energy_uj{node,zone} (gauge)
- latest raw RAPL energy counter in microjoules
joulie_rapl_estimated_power_watts{node,zone} (gauge)
- per-zone estimated power from energy deltas
joulie_rapl_package_total_power_watts{node} (gauge)
- sum of package-level estimated power
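
Estimating power from energy deltas amounts to dividing the counter difference by the elapsed time. The sketch below illustrates the arithmetic only and is not the agent's code: the function name is hypothetical, and the wraparound handling assumes the counter wraps at a known maximum (on Linux, the powercap max_energy_range_uj value), which is an approximation.

```python
def estimate_power_watts(prev_uj, curr_uj, dt_seconds, max_range_uj=None):
    """Estimate average power from two energy counter readings in microjoules.

    Returns None when the counter went backwards and no wrap range is known.
    """
    if dt_seconds <= 0:
        raise ValueError("dt_seconds must be positive")
    delta_uj = curr_uj - prev_uj
    if delta_uj < 0:
        # Assumed wraparound handling: add the counter range once.
        if max_range_uj is None:
            return None
        delta_uj += max_range_uj
    return delta_uj / 1e6 / dt_seconds  # microjoules -> joules -> watts
```

For example, a counter that advances by 2,000,000 µJ over one second corresponds to 2 W for that zone. Summing the package-level zones gives the node total.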

DVFS controller

joulie_dvfs_observed_power_watts{node} (gauge)
- observed package power used by DVFS controller
joulie_dvfs_ema_power_watts{node} (gauge)
- EMA-smoothed power used for decisions
joulie_dvfs_throttle_pct{node} (gauge)
- current throttle percentage
joulie_dvfs_above_trip_count{node} (gauge)
- consecutive above-threshold samples
joulie_dvfs_below_trip_count{node} (gauge)
- consecutive below-threshold samples
joulie_dvfs_actions_total{node,action} (counter)
- action: throttle_up|throttle_down
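
The DVFS metrics above suggest a controller that smooths observed power with an EMA and acts only after several consecutive samples on one side of a threshold. The sketch below shows that general pattern under loudly stated assumptions: the class, its parameters (alpha, trip_count, step_pct), the use of the cap as the single threshold, and the reading of throttle_up as "increase throttling" are all hypothetical and are not taken from Joulie's implementation.

```python
class DvfsControllerSketch:
    """EMA plus trip-counter hysteresis (illustrative; parameters are assumed)."""

    def __init__(self, cap_watts, alpha=0.3, trip_count=3, step_pct=10):
        self.cap_watts = cap_watts
        self.alpha = alpha            # EMA smoothing factor in (0, 1]
        self.trip_count = trip_count  # consecutive samples needed before acting
        self.step_pct = step_pct      # throttle change per action
        self.ema = None
        self.above = 0
        self.below = 0
        self.throttle_pct = 0

    def observe(self, power_watts):
        """Feed one power sample; return 'throttle_up', 'throttle_down', or None."""
        if self.ema is None:
            self.ema = power_watts
        else:
            self.ema = self.alpha * power_watts + (1 - self.alpha) * self.ema
        if self.ema > self.cap_watts:
            self.above += 1
            self.below = 0
        else:
            self.below += 1
            self.above = 0
        if self.above >= self.trip_count and self.throttle_pct < 100:
            self.throttle_pct = min(100, self.throttle_pct + self.step_pct)
            self.above = 0
            return "throttle_up"
        if self.below >= self.trip_count and self.throttle_pct > 0:
            self.throttle_pct = max(0, self.throttle_pct - self.step_pct)
            self.below = 0
            return "throttle_down"
        return None
```

Requiring several consecutive samples before acting keeps a single power spike from changing the throttle, at the cost of a slower reaction to sustained changes.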

CPU frequency observability

joulie_dvfs_cpu_cur_freq_khz{node,cpu} (gauge)
- current CPU/policy frequency in kHz
joulie_dvfs_cpu_max_freq_khz{node,cpu} (gauge)
- enforced max frequency cap in kHz

Reliability

joulie_reconcile_errors_total{node} (counter)
- reconcile-loop errors

Operator metrics

FSM state and profile label

joulie_operator_node_state{node,state} (gauge)
- state: ActivePerformance|DrainingPerformance|ActiveEco
- active state is 1, others 0
joulie_operator_node_profile_label{node,profile} (gauge)
- operator-applied node label view
- profile: performance|eco
- active profile is 1, others 0

Transition accounting

joulie_operator_state_transitions_total{node,from_state,to_state,result} (counter)
- transition events emitted by operator
- result:
  - applied: transition committed
  - deferred: transition blocked/deferred by safeguards

Heterogeneous planning

joulie_operator_node_compute_density{node,component} (gauge)
- normalized per-node density signal used for heterogeneous planning
- component: cpu|gpu
- higher values mean the operator considers that node relatively denser for that subsystem

Notes

Metrics are pull-based; values depend on scrape interval.
Highest cardinality is usually per-CPU frequency series.

CPU-Only Benchmark

This page reports results from the CPU-only cluster benchmark experiment:

experiments/01-cpu-only-benchmark/

Scope

The benchmark compares three baselines on a pure CPU cluster:

A: simulator only (Joulie-free)
B: Joulie with static partition policy
C: Joulie with queue-aware policy

It evaluates energy and throughput under real Kubernetes scheduling with KWOK nodes and simulated power control.

Experimental setup

Cluster and nodes

kind control-plane + worker (real control plane)
8 managed KWOK nodes - CPU only, no GPUs
Workload pods target KWOK nodes via selector + toleration

Node inventory

Node prefix	Count	CPU model	CPU cores	RAM
kwok-cpu-highcore	2	AMD EPYC 9965 192-Core	384 (2×192)	1536 GiB
kwok-cpu-highfreq	2	AMD EPYC 9375F 32-Core	64 (2×32)	770 GiB
kwok-cpu-intensive	4	AMD EPYC 9655 96-Core	192 (2×96)	1536 GiB

Total: 8 nodes, 1664 CPU cores (2×384 + 2×64 + 4×192), 0 GPUs.

Heterogeneous GPU Cluster Benchmark

This page reports results from the heterogeneous GPU cluster benchmark experiment:

experiments/02-heterogeneous-benchmark/

Scope

The benchmark compares three baselines on a heterogeneous cluster mixing 5 distinct GPU hardware families plus CPU-only nodes:

A: simulator only (Joulie-free)
B: Joulie with static partition policy
C: Joulie with queue-aware policy

Experimental setup

Cluster and nodes

kind control-plane + worker (real control plane)
41 managed KWOK nodes: 33 GPU nodes + 8 CPU-only nodes
Workload pods target KWOK nodes via selector + toleration

Node inventory - detailed cluster composition

This is a heterogeneous GPU cluster mixing 5 distinct GPU hardware families across 33 GPU nodes, plus 8 CPU-only nodes.

Homogeneous H100 NVL Benchmark

This page reports results from the homogeneous H100 NVL cluster benchmark experiment:

experiments/03-homogeneous-h100-benchmark/

Scope

The benchmark compares three baselines on a homogeneous cluster of NVIDIA H100 NVL GPU nodes plus CPU-only nodes:

A: simulator only (Joulie-free)
B: Joulie with static partition policy
C: Joulie with queue-aware policy

This experiment is designed for a direct comparison with the heterogeneous benchmark: same 41 total nodes, same workload configuration, but all GPU nodes are a single family (H100 NVL) instead of 5 different families.