Energy-Aware Scheduling
Joulie’s scheduler extender makes placement decisions informed by real-time energy telemetry, workload characteristics, and facility-level power conditions. This page describes the full pipeline from metrics collection through scoring and optional rescheduling.
End-to-end pipeline
The energy-aware scheduling pipeline has six stages:
Kepler/eBPF + RAPL/NVML telemetry
-> Prometheus (scrape & store)
-> Classifier (WorkloadProfile)
-> Digital twin (NodeTwin.status)
-> Scheduler extender (filter + score)
-> Placement decision
Each stage runs independently and communicates through Kubernetes CRDs or Prometheus queries. There is no monolithic scheduling engine; each component does one thing and feeds the next.
Stage 1: Telemetry collection
Kepler instruments the Linux kernel via eBPF to produce per-container energy counters. It reads hardware energy interfaces (Intel RAPL for CPU/DRAM, NVML/DCGM for GPUs) and attributes energy consumption to individual containers using kernel tracepoints.
The agent also reads RAPL and NVML directly to obtain node-level power draw for cap enforcement and twin input.
Together, these produce three categories of signal:
| Source | Granularity | Metrics |
|---|---|---|
| Kepler/eBPF | Per-container | kepler_container_package_joules_total, kepler_container_dram_joules_total, kepler_container_gpu_joules_total |
| RAPL | Per-socket | CPU package power, DRAM power |
| NVML/DCGM | Per-GPU | GPU power draw, temperature, utilization |
Stage 2: Prometheus aggregation
All telemetry is scraped into Prometheus. The operator and classifier query Prometheus over configurable windows (default 10 minutes for classification, 30 seconds for twin updates). This decouples data collection from decision-making and lets each consumer query at its own cadence.
Stage 3: Workload classification
The classifier (see Workload Classification) reads Prometheus metrics and pod annotations to produce a WorkloadProfile CR for each observed workload. The profile captures:
- Criticality class (
performance/standard): drives hard scheduling constraints. - CPU and GPU intensity (
high/medium/low): used by the twin for demand weighting. - Cap sensitivity: how much the workload would suffer under power caps.
- Confidence score: how much data backs the classification.
Profiles are reclassified every 15 minutes as workload behavior evolves.
Stage 4: Digital twin computation
The operator’s twin controller ingests NodeHardware (static capabilities) and Prometheus telemetry to compute three scores per node, written to NodeTwin.status:
- Power headroom (0-100): remaining power budget before hitting thermal or PSU limits.
- CoolingStress (0-100): predicted fraction of cooling capacity in use.
- PSUStress (0-100): predicted fraction of rack power capacity in use.
The twin also incorporates WorkloadProfile data to weight demand from high-intensity workloads more strongly. A node running several GPU-intensive training jobs counts more heavily toward performance demand than one running idle monitoring containers.
Stage 5: Scheduler scoring
The scheduler extender reads NodeTwin.status (cached with a 30-second TTL) and scores nodes using a formula that weights power headroom and facility stress:
score = headroom * 0.4 + (100 - coolingStress) * 0.3 + (100 - psuStress) * 0.3
Additional adjustments apply based on workload class, performance-node pressure, and GPU presence. See Scheduler Extender for the full scoring logic.
Stage 6: Placement
The Kubernetes scheduler combines the extender’s scores with its own resource-fit checks and places the pod. The extender never overrides Kubernetes resource accounting; it only participates in the filter and prioritize hooks.
PUE-weighted marginal scoring
Power Usage Effectiveness (PUE) measures the ratio of total facility power to IT equipment power. A PUE of 1.4 means the facility consumes 40% overhead for cooling, power distribution, and lighting. Joulie’s scoring accounts for this overhead.
Why PUE matters for scheduling
Two nodes drawing identical IT power can have different total energy costs if one is in a rack with worse cooling efficiency. Scheduling purely on IT power ignores the facility multiplier and can lead to suboptimal placement.
How Joulie incorporates PUE
The CoolingStress score in NodeTwin.status serves as a proxy for marginal PUE impact. When cooling stress is high on a node, placing additional workloads there increases the cooling system’s marginal power draw disproportionately. The scoring formula penalizes high cooling stress:
coolingPenalty = (100 - coolingStress) * 0.3
At coolingStress = 80, this contributes only 6 points (vs. 30 at coolingStress = 0). The effect is that nodes near their cooling capacity become less attractive even if they have spare compute headroom. This steers workloads toward nodes where the marginal PUE impact is lower.
The PSUStress score captures a similar dynamic at the rack power-distribution level: placing work on a rack near its PDU capacity risks triggering power management responses that degrade all workloads on that rack.
Together, the two facility-stress terms (0.3 weight each) give facility-level energy efficiency 60% of the scoring influence, while raw power headroom contributes 40%. This balance reflects the insight from energy-proportional computing research that facility overhead often dominates total cost at high utilization.
Active rescheduling (opt-in)
Scheduling decisions are made once at pod creation. As cluster conditions change, an initially good placement can become suboptimal. Joulie supports opt-in rescheduling for workloads that tolerate restarts.
How it works
- The operator’s migration controller monitors
NodeTwin.statusfor nodes where CoolingStress or PSUStress exceeds a threshold (default 70). - For pods on stressed nodes, it checks whether the associated
WorkloadProfilehasmigratability.reschedulable: true. - If so, it writes a reschedule recommendation to
NodeTwin.status.migrationRecommendations. - The recommendation identifies which pods should move and why.
Opting in
Pods opt into rescheduling with an annotation:
metadata:
annotations:
joulie.io/reschedulable: "true"
Without this annotation, pods are never recommended for rescheduling regardless of node stress. Performance-class pods are never rescheduled.
Active eviction (opt-in)
When ENABLE_ACTIVE_RESCHEDULING=true is set on the operator, the rescheduler acts on recommendations by evicting misplaced pods via the Kubernetes Eviction API (policy/v1). Before eviction, it annotates the pod’s owner with eviction context so the scheduler avoids re-placing the replacement pod in the same situation. See Joulie Operator for configuration details.
What rescheduling does not do
- It does not reschedule performance-class workloads.
- It does not trigger during normal operation; only when facility stress exceeds the configured threshold.
State of the art
Joulie’s energy-aware scheduling builds on several lines of research. This section places the design in context and identifies the relevant prior work.
Energy measurement
Kepler (Amaral et al., 2023) provides the per-container energy attribution that underpins Joulie’s classification pipeline. By attaching eBPF probes to kernel scheduling events and reading hardware energy counters, Kepler disaggregates node-level power into per-container contributions without requiring hardware modifications or hypervisor-level instrumentation.
RAPL (David et al., 2010) is the Intel hardware interface for reading and capping CPU and DRAM energy consumption. RAPL’s Running Average Power Limit model exposes per-socket energy counters and power capping through MSRs. Joulie’s agent reads RAPL for node-level telemetry and writes RAPL power limits for cap enforcement.
H. David, E. Gorbatov, U.R. Hanebutte, R. Khanna, and C. Le. “RAPL: Memory Power Estimation and Capping.” Proceedings of the International Symposium on Computer Architecture (ISCA), 2010.
M. Amaral et al. “Kepler: A Framework for Energy-Efficient Kubernetes Clusters.” Proceedings of the ACM/SPEC International Conference on Performance Engineering, 2023.
Data center energy modeling
PUE modeling has been studied extensively in the data center literature. Dayarathna et al. (2016) survey energy consumption models for data centers, covering thermal models, workload-dependent cooling, and facility-level power distribution. Joulie’s CoolingStress and PSUStress scores are simplified parametric models in this tradition, designed for O(1) per-node evaluation rather than full CFD simulation.
M. Dayarathna, Y. Wen, and R. Fan. “Data Center Energy Consumption Modeling: A Survey.” IEEE Communications Surveys & Tutorials, vol. 18, no. 1, pp. 732-794, 2016.
Energy-proportional computing
Barroso and Holzle (2007) argued that servers should consume power proportional to their utilization, noting that real servers are far from energy-proportional: an idle server still draws 50-60% of peak power. This observation motivates Joulie’s approach of consolidating workloads onto fewer, more-utilized nodes and capping idle nodes, rather than spreading load uniformly.
L.A. Barroso and U. Holzle. “The Case for Energy-Proportional Computing.” IEEE Computer, vol. 40, no. 12, pp. 33-37, 2007.
Power-aware scheduling
Fan, Weber, and Barroso (2007) demonstrated that aggregate power consumption in warehouse-scale computers can be managed through power budgeting at the cluster level. Their work on power provisioning showed that statistical multiplexing of power across many machines allows significant oversubscription of power capacity. Joulie’s PSUStress scoring applies a similar principle: it penalizes nodes that would push rack power consumption toward the provisioned limit.
X. Fan, W.-D. Weber, and L.A. Barroso. “Power Provisioning for a Warehouse-Sized Computer.” Proceedings of the International Symposium on Computer Architecture (ISCA), 2007.
Workload-aware cluster management
Tirmazi et al. (2020) describe Google’s next-generation Borg scheduler, which uses workload characterization to improve bin-packing and reduce resource stranding. Joulie’s WorkloadProfile classification serves an analogous role: by understanding whether a workload is CPU-bound, memory-bound, or GPU-intensive, the scheduler can make better placement decisions than resource-request-based scheduling alone.
M. Tirmazi, A. Barker, N. Deng, M.E. Haque, Z.G. Qin, S. Hand, M. Harchol-Balter, and J. Wilkes. “Borg: the Next Generation.” Proceedings of the European Conference on Computer Systems (EuroSys), 2020.
Digital twin for data centers
The concept of a digital twin for thermal-aware provisioning was introduced by Patel, Bash, and Sharma (2003), who proposed using thermal models to guide server placement in data centers. Joulie extends this idea to Kubernetes: the digital twin is a lightweight parametric model embedded in the operator, continuously updated from telemetry, and consumed by the scheduler in real time.
C.D. Patel, C.E. Bash, and R. Sharma. “Thermal Considerations in Cooling Large Scale High Compute Density Data Centers.” Proceedings of the International Symposium on High Performance Computer Architecture (HPCA), 2003.
Kubernetes scheduling
Burns et al. (2016) trace the evolution from Borg through Omega to Kubernetes, describing how scheduling evolved from a monolithic model to an extensible, API-driven architecture. Joulie’s scheduler extender leverages this extensibility: it participates in the standard scheduling cycle through HTTP hooks without forking or replacing the Kubernetes scheduler.
B. Burns, B. Grant, D. Oppenheimer, E. Brewer, and J. Wilkes. “Borg, Omega, and Kubernetes.” ACM Queue, vol. 14, no. 1, 2016.
What Joulie adds
Prior work has addressed energy measurement, power-aware scheduling, and data center thermal modeling independently. Joulie’s contribution is integrating these into a single Kubernetes-native feedback loop:
- Kepler + RAPL + NVML telemetry provides both per-container attribution and node-level power measurement.
- Automated workload classification turns raw metrics into actionable profiles without manual labeling.
- A digital twin predicts the facility-level impact (cooling, PSU) of placement decisions, not just IT power.
- PUE-weighted scoring makes the scheduler aware of marginal facility energy cost, not just compute availability.
- DRA advisory integration extends the twin’s recommendations to GPU slicing without runtime disruption.
No existing Kubernetes scheduler plugin combines real-time eBPF energy telemetry, a digital twin with facility-stress modeling, and workload classification into a single scheduling pipeline.