Digital Twin
The digital twin is Joulie’s core predictive engine. It is a lightweight O(1) parametric model that predicts the impact of scheduling and power-cap decisions on node thermal and power state, without running a full simulation for each scheduling decision.
What the digital twin computes
For each managed node, the twin produces three scores stored in NodeTwin.status:
| Signal | Range | Meaning |
|---|---|---|
| Power headroom | 0-100 | Remaining power budget before hitting thermal or PSU limits. Higher is better for new workload placement. |
| CoolingStress | 0-100 | Predicted percentage of cooling capacity in use. High values indicate the node is near its thermal limit. |
| PSUStress | 0-100 | Predicted percentage of PDU/rack power capacity in use. High values indicate the rack is near its power supply limit. |
The twin also computes:
- SchedulableClass: performance, eco, or draining (transition state). The scheduler extender uses this to filter and score nodes.
- HardwareDensityScore: normalized compute density proxy used for heterogeneous planning.
- PowerMeasurement: a block of measured and derived power values consumed directly by the scheduler for projected headroom scoring.
PowerMeasurement output
The powerMeasurement block in NodeTwin.status provides the scheduler with the real-time power data it needs for projected headroom scoring:
| Field | Unit | Description |
|---|---|---|
| source | string | Measurement source: kepler (direct), utilization (model-based), or static (estimate from caps) |
| measuredNodePowerW | watts | Current total node power draw |
| cpuCappedPowerW | watts | CPU power budget (cap percentage × max CPU watts) |
| gpuCappedPowerW | watts | GPU power budget (cap percentage × max GPU watts) |
| nodeCappedPowerW | watts | Total node power budget (CPU + GPU capped power) |
| cpuTdpW | watts | CPU thermal design power (max possible) |
| gpuTdpW | watts | GPU thermal design power (max possible) |
| nodeTdpW | watts | Total node TDP (CPU + GPU) |
| powerTrendWPerMin | watts/min | Rolling derivative of node power draw. Positive = rising, negative = falling. |
The scheduler uses measuredNodePowerW + pod marginal power to compute projected headroom relative to nodeCappedPowerW. The powerTrendWPerMin feeds the ±10 point trend bonus. See Scheduler Extender for details.
CoolingStress formula
CoolingStress is a per-node metric. It answers: “how close is this node to its cooling limit?”
Step 1: estimate node power draw
For this estimate, the twin does not read live telemetry. It derives node power from hardware discovery (NodeHardware) and the current cap percentages (NodeTwin.spec):
nodePower = (cpuMaxWattsPerSocket * sockets * cpuCapPct/100)
+ (gpuMaxWatts * gpuCount * gpuCapPct/100)
- cpuMaxWattsPerSocket and gpuMaxWatts come from NodeHardware.status.cpu.capRange and NodeHardware.status.gpu.capRange.
- cpuCapPct and gpuCapPct come from the resolved NodeTwin.spec intent (defaulting to 100 if unset).
This means the twin predicts power based on what the node could draw at its current cap setting, not what it is actually drawing right now. This is intentional: the twin is a planning model, not a monitoring dashboard.
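As a minimal sketch of the estimate above, the following Go function applies the formula directly. The function name and its flat parameter list are illustrative assumptions, not the signature used in the twin package.

```go
package main

import "fmt"

// estimateNodePowerW is a hypothetical helper that mirrors the documented
// formula: each component's maximum power scaled by its cap percentage.
func estimateNodePowerW(cpuMaxWattsPerSocket float64, sockets int, cpuCapPct float64,
	gpuMaxWatts float64, gpuCount int, gpuCapPct float64) float64 {
	cpuW := cpuMaxWattsPerSocket * float64(sockets) * cpuCapPct / 100
	gpuW := gpuMaxWatts * float64(gpuCount) * gpuCapPct / 100
	return cpuW + gpuW
}

func main() {
	// 2 sockets at 400 W and 4 GPUs at 400 W, both capped at 60%.
	fmt.Printf("%.0f W\n", estimateNodePowerW(400, 2, 60, 400, 4, 60))
}
```

Because the inputs are static hardware limits and cap settings, the result changes only when the caps or the discovered hardware change.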
Step 2: compute cooling stress
The default LinearCoolingModel applies:
coolingStress = (nodePower / referenceNodePower) * 80 + max(0, temp - 20) * 0.5
| Term | Default | Rationale |
|---|---|---|
| referenceNodePower | 4000 W | A fully loaded 2-socket EPYC 9654 + 8x H100 NVL reference node. A node drawing the reference power scores 80 at 20C, leaving 20 points of headroom for temperature. |
| * 80 | | Scales power into the 0-80 range so that temperature can push the score above 80 toward 100. A node at 100% of reference power is stressed but not yet at capacity. |
| max(0, temp - 20) * 0.5 | baseline 20C | Each degree above 20C adds 0.5 points. At 40C ambient, temperature alone contributes 10 points. This models the reduced effectiveness of air-side cooling in warmer climates or seasons. |
The result is clamped to [0, 100].
Example: a 2-socket EPYC node with 4x H100 at eco (60% cap), 25C ambient:
- CPU: 400 W/socket * 2 * 0.6 = 480 W
- GPU: 400 W * 4 * 0.6 = 960 W
- nodePower = 1440 W
- coolingStress = (1440 / 4000) * 80 + (25 - 20) * 0.5 = 28.8 + 2.5 = 31.3
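The worked example can be reproduced with a short Go sketch of the linear formula, including the clamp to [0, 100]. The function and constant names here are assumptions chosen for illustration; only the coefficients come from the table above.

```go
package main

import (
	"fmt"
	"math"
)

const (
	referenceNodePowerW = 4000.0 // documented default reference node power
	baselineTempC       = 20.0   // no temperature penalty at or below this value
)

// linearCoolingStress is a hypothetical standalone version of the documented
// formula: a power term scaled into 0-80 plus 0.5 points per degree above 20C.
func linearCoolingStress(nodePowerW, ambientTempC float64) float64 {
	powerTerm := nodePowerW / referenceNodePowerW * 80
	tempTerm := math.Max(0, ambientTempC-baselineTempC) * 0.5
	return math.Min(100, math.Max(0, powerTerm+tempTerm))
}

func main() {
	fmt.Printf("%.1f\n", linearCoolingStress(1440, 25)) // the eco example above
	fmt.Printf("%.1f\n", linearCoolingStress(6000, 40)) // exceeds 100 before clamping
}
```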
Why this model
The LinearCoolingModel is an algebraic proxy. It avoids CFD or thermal RC simulation and runs in O(1) per node. It is deliberately conservative (overestimates stress relative to real cooling capacity) because its main job is to provide a ranking signal for the scheduler, not an exact thermal prediction. Exact thermal models can be plugged in via the CoolingModel interface.
PSUStress formula
PSUStress is a cluster-level metric. It answers: “how close is this rack to its power supply limit?”
psuStress = clusterTotalPower / referenceRackCapacity * 100
| Term | Default | Rationale |
|---|---|---|
clusterTotalPower | (sum of all node power) | Total cluster power draw in watts, passed in by the operator from aggregated telemetry. |
referenceRackCapacity | 50,000 W (50 kW) | A typical single-rack PDU capacity. This is a placeholder; in production, actual PDU readings would replace it. |
The result is clamped to [0, 100].
Because this is a cluster-level signal, all nodes on the same rack see the same PSU stress score. This is intentional: a rack PDU brownout affects every node in the rack, not just the one drawing the most power.
Example: 8 nodes drawing a total of 30 kW:
- psuStress = 30000 / 50000 * 100 = 60
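A minimal Go sketch of the same calculation follows. It assumes the cluster total is obtained by summing a slice of per-node power values; in the text the total is passed in by the operator, so the summation and the function name are illustrative only.

```go
package main

import "fmt"

const referenceRackCapacityW = 50000.0 // documented 50 kW placeholder

// psuStress is a hypothetical helper: it sums per-node power, expresses the
// total as a percentage of the reference rack capacity, and clamps to [0, 100].
func psuStress(nodePowersW []float64) float64 {
	total := 0.0
	for _, p := range nodePowersW {
		total += p
	}
	stress := total / referenceRackCapacityW * 100
	if stress < 0 {
		return 0
	}
	if stress > 100 {
		return 100
	}
	return stress
}

func main() {
	// Eight nodes at 3750 W each give the 30 kW total from the example.
	nodes := make([]float64, 8)
	for i := range nodes {
		nodes[i] = 3750
	}
	fmt.Printf("%.0f\n", psuStress(nodes))
}
```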
Power headroom
Power headroom combines cap state and cooling stress into a single “room for more work” score:
capFactor = (cpuCapPct + gpuCapPct) / 200 // 0 = fully capped, 1 = uncapped
coolingFactor = 1 - coolingStress / 100 // 0 = cooling at capacity, 1 = cool
headroom = capFactor * coolingFactor * 100
The scheduler uses headroom as the primary scoring signal (70% weight). A node with high caps and low cooling stress gets the highest headroom and attracts new workloads.
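To make the interaction of the two factors concrete, here is a hedged Go sketch of the headroom formula. The function name is an assumption; the arithmetic follows the three lines above.

```go
package main

import "fmt"

// powerHeadroom is a hypothetical helper implementing the documented formula.
// Cap percentages and coolingStress are all expected in the 0-100 range.
func powerHeadroom(cpuCapPct, gpuCapPct, coolingStress float64) float64 {
	capFactor := (cpuCapPct + gpuCapPct) / 200 // average cap as a 0-1 fraction
	coolingFactor := 1 - coolingStress/100     // 1 = cool, 0 = cooling at capacity
	return capFactor * coolingFactor * 100
}

func main() {
	// Eco node from the cooling example: 60% caps, coolingStress 31.3.
	fmt.Printf("%.2f\n", powerHeadroom(60, 60, 31.3))
	// Uncapped node with no cooling stress.
	fmt.Printf("%.2f\n", powerHeadroom(100, 100, 0))
}
```

Since the factors multiply, either a tight cap or high cooling stress alone is enough to pull headroom down.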
CoolingModel interface
The CoolingModel interface is pluggable:
type CoolingModel interface {
	CoolingStress(nodePowerW, ambientTempC float64) float64
}
The default implementation is LinearCoolingModel, an algebraic proxy suitable for initial deployments and simulation. A future implementation will use OpenModelica reduced-order thermal simulation via the same interface for higher-fidelity predictions.
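One way to picture the pluggability is a wrapper that takes any CoolingModel and returns another. The sketch below is illustrative only: CoolingModelFunc and withMargin are hypothetical names that do not come from the Joulie codebase, and the base model is a stand-in that uses only the power term.

```go
package main

import "fmt"

// CoolingModel mirrors the interface shown above.
type CoolingModel interface {
	CoolingStress(nodePowerW, ambientTempC float64) float64
}

// CoolingModelFunc is a hypothetical adapter that lets a plain function
// satisfy CoolingModel.
type CoolingModelFunc func(nodePowerW, ambientTempC float64) float64

func (f CoolingModelFunc) CoolingStress(nodePowerW, ambientTempC float64) float64 {
	return f(nodePowerW, ambientTempC)
}

// withMargin is a hypothetical decorator: it adds a fixed safety margin to
// another model's output and clamps the result to [0, 100].
func withMargin(inner CoolingModel, marginPoints float64) CoolingModel {
	return CoolingModelFunc(func(p, t float64) float64 {
		s := inner.CoolingStress(p, t) + marginPoints
		if s < 0 {
			return 0
		}
		if s > 100 {
			return 100
		}
		return s
	})
}

func main() {
	// Stand-in base model (power term only, ignores temperature).
	base := CoolingModelFunc(func(p, _ float64) float64 { return p / 4000 * 80 })
	var model CoolingModel = withMargin(base, 5)
	fmt.Printf("%.1f\n", model.CoolingStress(2000, 20))
}
```

Callers depend only on the interface, so a higher-fidelity model could be substituted without changing the code that consumes the stress value.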
How it feeds the scheduler
The twin controller runs in the operator on each reconcile tick (~1 minute) and writes NodeTwin.status per managed node. The scheduler extender caches these NodeTwin CRs with a 30-second TTL and uses them in its filter and score logic:
twin controller (operator)
→ writes NodeTwin.status
→ scheduler extender cache (30s TTL)
→ filter: rejects eco nodes for performance pods
→ score: headroomScore*0.7 + (100-coolingStress)*0.15 + trendBonus + profileBonus + pressureRelief
This keeps scheduling decisions lightweight (one cache lookup per node per scheduling attempt) while reflecting the latest thermal and power state of the cluster.
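The score line in the flow above can be sketched in Go as follows. The struct and function names are assumptions, and the three bonus terms are taken as precomputed inputs because their definitions live in the Scheduler Extender documentation. Whether the real extender clamps or normalizes the sum is not stated here, so this sketch returns it unmodified.

```go
package main

import "fmt"

// scoreInputs is a hypothetical container for the terms in the score formula.
type scoreInputs struct {
	HeadroomScore  float64 // 0-100
	CoolingStress  float64 // 0-100
	TrendBonus     float64 // precomputed elsewhere
	ProfileBonus   float64 // precomputed elsewhere
	PressureRelief float64 // precomputed elsewhere
}

// nodeScore combines the terms with the documented 0.7 and 0.15 weights.
func nodeScore(in scoreInputs) float64 {
	return in.HeadroomScore*0.7 +
		(100-in.CoolingStress)*0.15 +
		in.TrendBonus + in.ProfileBonus + in.PressureRelief
}

func main() {
	s := nodeScore(scoreInputs{HeadroomScore: 50, CoolingStress: 40, TrendBonus: 5})
	fmt.Printf("%.2f\n", s)
}
```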
How it feeds the operator
The twin also drives operator decisions:
- Transition guard: when a node is transitioning from performance to eco, the twin sets schedulableClass to draining until all performance pods have completed or been drained.
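The transition guard amounts to a small decision rule, sketched below in Go. The function name, the string-typed profile, and the pod-count parameter are all assumptions made for illustration; the actual operator may track transitions differently.

```go
package main

import "fmt"

// resolveSchedulableClass is a hypothetical helper: a node targeting eco
// reports "draining" while performance pods remain, then its target class.
func resolveSchedulableClass(targetProfile string, runningPerformancePods int) string {
	if targetProfile == "eco" && runningPerformancePods > 0 {
		return "draining"
	}
	return targetProfile
}

func main() {
	fmt.Println(resolveSchedulableClass("eco", 3))         // still draining
	fmt.Println(resolveSchedulableClass("eco", 0))         // transition complete
	fmt.Println(resolveSchedulableClass("performance", 3)) // no guard needed
}
```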
Implementation
The twin is implemented in pkg/operator/twin/twin.go. Key types:
- Input: all inputs needed to compute twin state for one node (hardware, profile, cap percentages, workloads, facility signals).
- Output: the computed NodeTwin.status fields.
- CoolingModel: pluggable interface for cooling stress computation.
- LinearCoolingModel: default algebraic proxy implementation.
- Compute(Input) Output: the main computation function.
What to read next
- Scheduler Extender
- Joulie Operator
- CRD and Policy Model
- Hardware Modeling: reference power profiles used by NodeHardware and the twin’s power estimation