Energy-Aware Scheduling

How Joulie combines Kepler telemetry, workload classification, digital twin predictions, and PUE-weighted scoring to make energy-aware scheduling decisions.

Joulie’s scheduler extender makes placement decisions informed by real-time energy telemetry, workload characteristics, and facility-level power conditions. This page describes the full pipeline from metrics collection through scoring and optional rescheduling.

End-to-end pipeline

The energy-aware scheduling pipeline has six stages:

Kepler/eBPF + RAPL/NVML telemetry
  -> Prometheus (scrape & store)
    -> Classifier (WorkloadProfile)
      -> Digital twin (NodeTwin.status)
        -> Scheduler extender (filter + score)
          -> Placement decision

Each stage runs independently and communicates through Kubernetes CRDs or Prometheus queries. There is no monolithic scheduling engine; each component does one thing and feeds the next.

Stage 1: Telemetry collection

Kepler instruments the Linux kernel via eBPF to produce per-container energy counters. It reads hardware energy interfaces (Intel RAPL for CPU/DRAM, NVML/DCGM for GPUs) and attributes energy consumption to individual containers using kernel tracepoints.

The Joulie agent also reads RAPL and NVML directly to obtain node-level power draw, which it uses for cap enforcement and as input to the digital twin.

Together, these produce three categories of signal:

Source	Granularity	Metrics
Kepler/eBPF	Per-container	`kepler_container_package_joules_total`, `kepler_container_dram_joules_total`, `kepler_container_gpu_joules_total`
RAPL	Per-socket	CPU package power, DRAM power
NVML/DCGM	Per-GPU	GPU power draw, temperature, utilization
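The Kepler metrics in the table are cumulative joule counters, so power has to be derived from the difference between two samples. The following is a minimal sketch of that arithmetic, not Kepler's or Joulie's code: the `CounterSample` type, the `average_power_watts` function, and the counter-reset handling are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CounterSample:
    """One reading of a cumulative energy counter (hypothetical helper type)."""
    timestamp_s: float  # sample time in seconds
    joules: float       # cumulative energy, e.g. kepler_container_package_joules_total


def average_power_watts(earlier: CounterSample, later: CounterSample) -> float:
    """Average power between two samples: delta energy / delta time (1 W = 1 J/s)."""
    dt = later.timestamp_s - earlier.timestamp_s
    if dt <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = later.joules - earlier.joules
    if delta < 0:
        # Assumption: a decrease means the counter restarted from zero,
        # so only the energy accumulated since the restart is counted.
        delta = later.joules
    return delta / dt
```

For example, a counter that grows from 100 J to 1000 J over 30 seconds corresponds to an average draw of 30 W over that interval.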

Stage 2: Prometheus aggregation

All telemetry is scraped into Prometheus. The operator and classifier query Prometheus over configurable windows (default 10 minutes for classification, 30 seconds for twin updates). This decouples data collection from decision-making and lets each consumer query at its own cadence.
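As a rough illustration of what a windowed query might look like, the sketch below builds a PromQL expression over a Kepler counter and the matching URL for Prometheus' instant-query endpoint (`/api/v1/query`). The helper names and the base URL are hypothetical, and the queries Joulie actually issues are not documented on this page.

```python
from urllib.parse import urlencode


def container_power_query(window: str = "10m") -> str:
    """PromQL for per-container CPU package power, in watts, over `window`."""
    # rate() over a joules counter yields joules per second, i.e. watts.
    return f"rate(kepler_container_package_joules_total[{window}])"


def instant_query_url(base_url: str, promql: str) -> str:
    """URL for the Prometheus instant-query endpoint with the query encoded."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# Hypothetical Prometheus address; the two windows are the defaults named above.
classification_url = instant_query_url("http://prometheus:9090", container_power_query("10m"))
twin_url = instant_query_url("http://prometheus:9090", container_power_query("30s"))
```

Because each consumer builds its own query, the classifier and the twin can use different windows against the same stored series.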

Stage 3: Workload classification

The classifier (see Workload Classification) reads Prometheus metrics and pod annotations to produce a WorkloadProfile CR for each observed workload. The profile captures:

Criticality class (performance / standard): drives hard scheduling constraints.
CPU and GPU intensity (high / medium / low): used by the twin for demand weighting.
Cap sensitivity: how much the workload would suffer under power caps.
Confidence score: how much data backs the classification.

Profiles are reclassified every 15 minutes as workload behavior evolves.
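The profile contents listed above can be pictured as a small record. The sketch below is an illustrative stand-in, not the actual WorkloadProfile CRD schema: the field names, the numeric representation of cap sensitivity and confidence, and the `is_due_for_reclassification` helper are assumptions. Only the class values and the 15-minute interval come from the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

RECLASSIFY_INTERVAL = timedelta(minutes=15)  # interval stated in the text
CRITICALITY_CLASSES = {"performance", "standard"}
INTENSITY_LEVELS = {"high", "medium", "low"}


@dataclass
class WorkloadProfile:
    """Illustrative stand-in for the WorkloadProfile CR (field names assumed)."""
    criticality: str        # "performance" or "standard"
    cpu_intensity: str      # "high", "medium", or "low"
    gpu_intensity: str      # "high", "medium", or "low"
    cap_sensitivity: float  # numeric representation is an assumption
    confidence: float       # numeric representation is an assumption
    classified_at: datetime

    def __post_init__(self) -> None:
        if self.criticality not in CRITICALITY_CLASSES:
            raise ValueError(f"unknown criticality class: {self.criticality}")
        for level in (self.cpu_intensity, self.gpu_intensity):
            if level not in INTENSITY_LEVELS:
                raise ValueError(f"unknown intensity level: {level}")

    def is_due_for_reclassification(self, now: datetime) -> bool:
        """True once the reclassification interval has elapsed."""
        return now - self.classified_at >= RECLASSIFY_INTERVAL
```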

Stage 4: Digital twin computation

The operator’s twin controller ingests NodeHardware (static capabilities) and Prometheus telemetry to compute three scores per node, written to NodeTwin.status:

Power headroom (0-100): remaining power budget before hitting thermal or PSU limits.
CoolingStress (0-100): predicted fraction of cooling capacity in use.
PSUStress (0-100): predicted fraction of rack power capacity in use.

The twin also incorporates WorkloadProfile data to weight demand from high-intensity workloads more strongly. A node running several GPU-intensive training jobs counts more heavily toward performance demand than one running idle monitoring containers.
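To make the 0-100 scales concrete, here is a deliberately simple sketch that treats each score as a clamped linear percentage. This linear model, the function names, and the watt-based inputs are assumptions for illustration; the twin's actual model, including how it weights demand by workload intensity, is not specified here and is left out.

```python
def clamp_score(value: float) -> float:
    """Clamp a value to the 0-100 range used by NodeTwin scores."""
    return max(0.0, min(100.0, value))


def stress_score(predicted_load_w: float, capacity_w: float) -> float:
    """Percentage of capacity in use (assumed linear model), clamped to 0-100."""
    if capacity_w <= 0:
        raise ValueError("capacity must be positive")
    return clamp_score(100.0 * predicted_load_w / capacity_w)


def power_headroom(current_draw_w: float, limit_w: float) -> float:
    """Percentage of the power budget still unused (assumed linear model)."""
    if limit_w <= 0:
        raise ValueError("limit must be positive")
    return clamp_score(100.0 * (1.0 - current_draw_w / limit_w))
```

Under these assumptions, a rack drawing 750 W against a 1000 W capacity has a stress of 75, and a node drawing 300 W against a 400 W limit has a headroom of 25.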

Stage 5: Scheduler scoring

The scheduler extender reads NodeTwin.status (cached with a 30-second TTL) and scores nodes using a formula that weights power headroom and facility stress:

score = headroom * 0.4 + (100 - coolingStress) * 0.3 + (100 - psuStress) * 0.3

Additional adjustments apply based on workload class, performance-node pressure, and GPU presence. See Scheduler Extender for the full scoring logic.
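The base formula can be written out directly. The sketch below implements only the weighted sum shown above and omits the workload-class, performance-node pressure, and GPU adjustments; the function names and the example nodes are hypothetical.

```python
# Weights taken from the scoring formula above.
HEADROOM_WEIGHT = 0.4
COOLING_WEIGHT = 0.3
PSU_WEIGHT = 0.3


def base_score(headroom: float, cooling_stress: float, psu_stress: float) -> float:
    """Base node score before class, pressure, and GPU adjustments."""
    for name, value in (("headroom", headroom),
                        ("cooling_stress", cooling_stress),
                        ("psu_stress", psu_stress)):
        if not 0 <= value <= 100:
            raise ValueError(f"{name} must be in [0, 100], got {value}")
    return (headroom * HEADROOM_WEIGHT
            + (100 - cooling_stress) * COOLING_WEIGHT
            + (100 - psu_stress) * PSU_WEIGHT)


def rank_nodes(twins: dict) -> list:
    """Node names sorted from highest to lowest base score."""
    return sorted(twins, key=lambda name: base_score(*twins[name]), reverse=True)


# Hypothetical nodes: (headroom, coolingStress, psuStress).
twins = {"node-a": (50, 80, 20), "node-b": (50, 20, 20)}
ranking = rank_nodes(twins)
```

Both example nodes have the same headroom, but node-a scores about 50 and node-b about 68, so node-b ranks first purely because its cooling stress is lower.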

Stage 6: Placement

The Kubernetes scheduler combines the extender’s scores with its own resource-fit checks and places the pod. The extender never overrides Kubernetes resource accounting; it only participates in the filter and prioritize hooks.

PUE-weighted marginal scoring

Power Usage Effectiveness (PUE) is the ratio of total facility power to IT equipment power. A PUE of 1.4 means the facility draws 1.4 W in total for every watt delivered to IT equipment, that is, 40% overhead for cooling, power distribution, and lighting. Joulie’s scoring accounts for this overhead.

Why PUE matters for scheduling

Two nodes drawing identical IT power can have different total energy costs if one is in a rack with worse cooling efficiency. Scheduling purely on IT power ignores the facility multiplier and can lead to suboptimal placement.
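A short worked example makes the multiplier visible. The sketch below applies the PUE definition to two nodes with the same IT draw; treating each rack as having its own effective PUE is a simplifying assumption for illustration (PUE is normally reported for a whole facility), and the numbers are made up.

```python
def pue(total_facility_w: float, it_w: float) -> float:
    """PUE = total facility power / IT equipment power."""
    if it_w <= 0:
        raise ValueError("IT power must be positive")
    return total_facility_w / it_w


def facility_power_w(it_w: float, effective_pue: float) -> float:
    """Total power attributable to an IT load under a given (assumed) PUE."""
    return it_w * effective_pue


# Two hypothetical nodes, each drawing 400 W of IT power.
well_cooled_rack_w = facility_power_w(400, 1.2)    # about 480 W in total
poorly_cooled_rack_w = facility_power_w(400, 1.6)  # about 640 W in total
extra_w = poorly_cooled_rack_w - well_cooled_rack_w
```

With identical IT power, the second placement costs roughly 160 W more at the facility level, which is the difference a scheduler looking only at IT power cannot see.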

How Joulie incorporates PUE

The CoolingStress score in NodeTwin.status serves as a proxy for marginal PUE impact. When cooling stress is high on a node, placing additional workloads there increases the cooling system’s marginal power draw disproportionately. The scoring formula penalizes high cooling stress:

coolingPenalty = (100 - coolingStress) * 0.3

At coolingStress = 80, this term contributes only 6 of a possible 30 points (the full 30 are awarded at coolingStress = 0). Nodes near their cooling capacity therefore become less attractive even if they have spare compute headroom, which steers workloads toward nodes where the marginal PUE impact is lower.

The PSUStress score captures a similar dynamic at the rack power-distribution level: placing work on a rack near its PDU capacity risks triggering power management responses that degrade all workloads on that rack.

Together, the two facility-stress terms (0.3 weight each) give facility-level energy efficiency 60% of the scoring influence, while raw power headroom contributes 40%. This balance reflects the insight from energy-proportional computing research that facility overhead often dominates total cost at high utilization.

Active rescheduling (opt-in)

Scheduling decisions are made once at pod creation. As cluster conditions change, an initially good placement can become suboptimal. Joulie supports opt-in rescheduling for workloads that tolerate restarts.

How it works

The operator’s migration controller monitors NodeTwin.status for nodes where CoolingStress or PSUStress exceeds a threshold (default 70).
For pods on stressed nodes, it checks whether the associated WorkloadProfile has migratability.reschedulable: true.
If so, it writes a reschedule recommendation to NodeTwin.status.migrationRecommendations.
The recommendation identifies which pods should move and why.

Opting in

Pods opt into rescheduling with an annotation:

metadata:
  annotations:
    joulie.io/reschedulable: "true"

Without this annotation, pods are never recommended for rescheduling regardless of node stress. Performance-class pods are never rescheduled.
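The selection rules described in this section can be summarized in a few lines. This is a sketch under stated assumptions rather than the migration controller's code: the `NodeStress`, `PodInfo`, and `Recommendation` types, the `recommend` function, and the wording of the reason string are all hypothetical, and the opt-in flag is modeled as a plain boolean.

```python
from dataclasses import dataclass
from typing import List

STRESS_THRESHOLD = 70  # default threshold from the text


@dataclass(frozen=True)
class NodeStress:
    name: str
    cooling_stress: float
    psu_stress: float

    def is_stressed(self) -> bool:
        """A node is stressed when either score exceeds the threshold."""
        return (self.cooling_stress > STRESS_THRESHOLD
                or self.psu_stress > STRESS_THRESHOLD)


@dataclass(frozen=True)
class PodInfo:
    name: str
    node: str
    criticality: str     # "performance" or "standard"
    reschedulable: bool  # opt-in flag (modeled as a boolean here)


@dataclass(frozen=True)
class Recommendation:
    pod: str
    reason: str


def recommend(nodes: List[NodeStress], pods: List[PodInfo]) -> List[Recommendation]:
    """Recommend opted-in, non-performance pods that sit on stressed nodes."""
    stressed = {n.name: n for n in nodes if n.is_stressed()}
    recommendations = []
    for pod in pods:
        node = stressed.get(pod.node)
        if node is None or not pod.reschedulable or pod.criticality == "performance":
            continue
        reason = f"node {node.name} exceeds stress threshold {STRESS_THRESHOLD}"
        recommendations.append(Recommendation(pod.name, reason))
    return recommendations
```

Note that the comparison is strict, so a node sitting exactly at the threshold is not treated as stressed in this sketch.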

Active eviction (opt-in)

When ENABLE_ACTIVE_RESCHEDULING=true is set on the operator, the rescheduler acts on recommendations by evicting misplaced pods via the Kubernetes Eviction API (policy/v1). Before eviction, it annotates the pod’s owner with eviction context so the scheduler avoids re-placing the replacement pod in the same situation. See Joulie Operator for configuration details.
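For orientation, the sketch below shows the shape of a request to the Kubernetes eviction subresource, gated on the flag named above. The function names are hypothetical, the exact-match check on `"true"` is an assumption about how the flag is parsed, and the owner annotation is omitted because its key is not given on this page.

```python
import os
from typing import Mapping, Optional, Tuple


def active_rescheduling_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """True when ENABLE_ACTIVE_RESCHEDULING is set to "true" (parsing assumed)."""
    env = os.environ if env is None else env
    return env.get("ENABLE_ACTIVE_RESCHEDULING", "") == "true"


def eviction_request(namespace: str, pod_name: str) -> Tuple[str, dict]:
    """Path and JSON body for a POST to the pod's eviction subresource."""
    path = f"/api/v1/namespaces/{namespace}/pods/{pod_name}/eviction"
    body = {
        "apiVersion": "policy/v1",
        "kind": "Eviction",
        "metadata": {"name": pod_name, "namespace": namespace},
    }
    return path, body


# Hypothetical pod; nothing is sent here, the request is only constructed.
if active_rescheduling_enabled({"ENABLE_ACTIVE_RESCHEDULING": "true"}):
    path, body = eviction_request("default", "trainer-0")
```

Going through the Eviction API rather than deleting the pod directly means the API server can refuse the request when it would violate a PodDisruptionBudget.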

What rescheduling does not do

It does not reschedule performance-class workloads.
It does not trigger during normal operation, only when facility stress exceeds the configured threshold.

State of the art

Joulie’s energy-aware scheduling builds on several lines of research. This section places the design in context and identifies the relevant prior work.

Energy measurement

Kepler (Amaral et al., 2023) provides the per-container energy attribution that underpins Joulie’s classification pipeline. By attaching eBPF probes to kernel scheduling events and reading hardware energy counters, Kepler disaggregates node-level power into per-container contributions without requiring hardware modifications or hypervisor-level instrumentation.

RAPL (Running Average Power Limit; David et al., 2010) is the Intel hardware interface for reading and capping CPU and DRAM energy consumption. It exposes per-socket energy counters and power limits through model-specific registers (MSRs). Joulie’s agent reads RAPL for node-level telemetry and writes RAPL power limits for cap enforcement.

H. David, E. Gorbatov, U.R. Hanebutte, R. Khanna, and C. Le. “RAPL: Memory Power Estimation and Capping.” Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED), 2010.

M. Amaral et al. “Kepler: A Framework to Calculate the Energy Consumption of Containerized Applications.” Proceedings of the IEEE International Conference on Cloud Computing (CLOUD), 2023.

Data center energy modeling

PUE modeling has been studied extensively in the data center literature. Dayarathna et al. (2016) survey energy consumption models for data centers, covering thermal models, workload-dependent cooling, and facility-level power distribution. Joulie’s CoolingStress and PSUStress scores are simplified parametric models in this tradition, designed for O(1) per-node evaluation rather than full CFD simulation.

M. Dayarathna, Y. Wen, and R. Fan. “Data Center Energy Consumption Modeling: A Survey.” IEEE Communications Surveys & Tutorials, vol. 18, no. 1, pp. 732-794, 2016.

Energy-proportional computing

Barroso and Hölzle (2007) argued that servers should consume power in proportion to their utilization, noting that real servers are far from energy-proportional: an idle server still draws 50-60% of peak power. This observation motivates Joulie’s approach of consolidating workloads onto fewer, more-utilized nodes and capping idle nodes, rather than spreading load uniformly.

L.A. Barroso and U. Hölzle. “The Case for Energy-Proportional Computing.” IEEE Computer, vol. 40, no. 12, pp. 33-37, 2007.

Power-aware scheduling

Fan, Weber, and Barroso (2007) demonstrated that aggregate power consumption in warehouse-scale computers can be managed through power budgeting at the cluster level. Their work on power provisioning showed that statistical multiplexing of power across many machines allows significant oversubscription of power capacity. Joulie’s PSUStress scoring applies a similar principle: it penalizes nodes that would push rack power consumption toward the provisioned limit.

X. Fan, W.-D. Weber, and L.A. Barroso. “Power Provisioning for a Warehouse-Sized Computer.” Proceedings of the International Symposium on Computer Architecture (ISCA), 2007.

Workload-aware cluster management

Tirmazi et al. (2020) analyze a 2019 trace from Google’s Borg clusters and compare it with the earlier 2011 trace, characterizing how workloads and scheduling behavior have evolved. Joulie’s WorkloadProfile classification applies workload characterization in a similar spirit: by understanding whether a workload is CPU-bound, memory-bound, or GPU-intensive, the scheduler can make better placement decisions than resource-request-based scheduling alone.

M. Tirmazi, A. Barker, N. Deng, M.E. Haque, Z.G. Qin, S. Hand, M. Harchol-Balter, and J. Wilkes. “Borg: the Next Generation.” Proceedings of the European Conference on Computer Systems (EuroSys), 2020.

Digital twin for data centers

An early precursor of the digital-twin approach to thermal-aware provisioning is the work of Patel, Bash, and Sharma (2003), who proposed using thermal models to guide server placement in data centers. Joulie extends this idea to Kubernetes: the digital twin is a lightweight parametric model embedded in the operator, continuously updated from telemetry, and consumed by the scheduler in real time.

C.D. Patel, C.E. Bash, and R. Sharma. “Thermal Considerations in Cooling Large Scale High Compute Density Data Centers.” Proceedings of the International Symposium on High Performance Computer Architecture (HPCA), 2003.

Kubernetes scheduling

Burns et al. (2016) trace the evolution from Borg through Omega to Kubernetes, describing how scheduling evolved from a monolithic model to an extensible, API-driven architecture. Joulie’s scheduler extender leverages this extensibility: it participates in the standard scheduling cycle through HTTP hooks without forking or replacing the Kubernetes scheduler.

B. Burns, B. Grant, D. Oppenheimer, E. Brewer, and J. Wilkes. “Borg, Omega, and Kubernetes.” ACM Queue, vol. 14, no. 1, 2016.

What Joulie adds

Prior work has addressed energy measurement, power-aware scheduling, and data center thermal modeling independently. Joulie’s contribution is integrating these into a single Kubernetes-native feedback loop:

Kepler + RAPL + NVML telemetry provides both per-container attribution and node-level power measurement.
Automated workload classification turns raw metrics into actionable profiles without manual labeling.
A digital twin predicts the facility-level impact (cooling, PSU) of placement decisions, not just IT power.
PUE-weighted scoring makes the scheduler aware of marginal facility energy cost, not just compute availability.
Dynamic Resource Allocation (DRA) advisory integration extends the twin’s recommendations to GPU slicing without runtime disruption.

To our knowledge, no existing Kubernetes scheduler plugin combines real-time eBPF energy telemetry, a digital twin with facility-stress modeling, and workload classification into a single scheduling pipeline.