Scheduler Extender
Joulie ships a scheduler extender that steers workloads toward appropriate nodes based on power profile, thermal stress, and hardware capabilities.
Why a scheduler component is needed
Kubernetes scheduling decisions are made before a pod is running. That is the right place to enforce Joulie’s power-profile intent, because:
- placing a performance workload on an eco (capped) node defeats energy savings and violates workload SLOs,
- placing any workload on a node under heavy thermal or PSU stress increases the risk of throttling and supply headroom violations,
- the joulie.io/workload-class annotation is the single source of truth for placement intent; the extender enforces it without requiring users to write complex scheduling rules.
What Joulie implements: scheduler extender
Joulie implements an HTTP-based scheduler extender, not an in-tree plugin.
The extender registers with kube-scheduler through a KubeSchedulerConfiguration extender block.
The scheduler calls the extender’s HTTP endpoints as part of the normal scheduling cycle:
- filter endpoint: rejects nodes that are incompatible with the pod,
- prioritize endpoint: ranks remaining nodes by suitability.
The extender does not replace the Kubernetes scheduler. It runs as a lightweight HTTP service and participates in the existing scheduling cycle.
The scheduler extender is always deployed as part of Joulie. Without it, pods run anywhere and get standard Kubernetes scheduling.
Filter logic
The extender applies one hard rule: performance pods are rejected from eco and draining nodes.
A pod is treated as performance if it carries joulie.io/workload-class: performance.
For such pods, the extender rejects any node whose NodeTwin.status has schedulableClass set to "eco" or "draining". A label-based fallback also rejects nodes with joulie.io/power-profile: eco when no NodeTwin status is present.
Standard pods (the default, or joulie.io/workload-class: standard) pass the filter unconditionally. Unknown nodes (no NodeTwin state) are allowed for all pod classes.
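The filter rule above can be sketched as a small decision function. This is an illustrative sketch only, not the extender's actual code: the function name, parameter names, and the use of None for "no NodeTwin status" are assumptions, and the eviction-history rule described later on this page is left out.

```python
from typing import Optional

# Classes that performance pods must not land on, per the rule above.
BLOCKED_CLASSES = {"eco", "draining"}


def node_passes_filter(
    workload_class: str,
    schedulable_class: Optional[str],
    power_profile_label: Optional[str] = None,
) -> bool:
    """Return True if the node is allowed for a pod of the given class.

    schedulable_class is None when no NodeTwin status exists (assumption);
    power_profile_label mirrors the joulie.io/power-profile node label.
    """
    if workload_class != "performance":
        # Standard pods (the default) pass unconditionally.
        return True
    if schedulable_class is not None:
        return schedulable_class not in BLOCKED_CLASSES
    # Label-based fallback when no NodeTwin status is present.
    return power_profile_label != "eco"
```

With this sketch, a performance pod is rejected from an eco or draining node, while a node with neither NodeTwin status nor an eco label is allowed for every pod class.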
Score logic
After filtering, the extender scores remaining nodes.
Base formula
score = headroom * 0.4 + (100 - coolingStress) * 0.3 + (100 - psuStress) * 0.3
Where:
- headroom: available power headroom on the node (0-100), from NodeTwin.status.predictedPowerHeadroomScore.
- coolingStress: predicted cooling stress (0-100), from NodeTwin.status.predictedCoolingStressScore.
- psuStress: predicted PSU stress (0-100), from NodeTwin.status.predictedPsuStressScore.
Higher scores are better. A node with high headroom and low facility stress receives the highest score.
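As a worked example of the base formula, the following sketch computes the score for a hypothetical node. The function name and the sample values are illustrative assumptions; only the weights come from the formula above.

```python
def base_score(headroom: float, cooling_stress: float, psu_stress: float) -> float:
    """Base score from the three NodeTwin inputs, each on a 0-100 scale."""
    return (
        headroom * 0.4
        + (100 - cooling_stress) * 0.3
        + (100 - psu_stress) * 0.3
    )


# Hypothetical node: 80 headroom, 20 cooling stress, 10 PSU stress.
# 80*0.4 + 80*0.3 + 90*0.3 = 32 + 24 + 27, i.e. roughly 83.
example = base_score(80, 20, 10)
```

Because the weights sum to 1.0, the base score stays within 0-100 whenever all three inputs do.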
Stale twin fallback
If the NodeTwin’s lastUpdated timestamp is older than 5 minutes (configurable via TWIN_STALENESS_THRESHOLD), the node receives a neutral score of 50. This keeps outdated data from influencing placement when the operator may have stopped updating the twin. Nodes with no lastUpdated timestamp at all are also treated as stale.
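A minimal sketch of the staleness check, assuming lastUpdated has already been parsed into a timezone-aware datetime. The helper names are hypothetical, and the threshold is hard-coded to the documented default rather than read from TWIN_STALENESS_THRESHOLD.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

STALENESS_THRESHOLD = timedelta(minutes=5)  # documented default
NEUTRAL_SCORE = 50.0


def is_stale(
    last_updated: Optional[datetime],
    now: datetime,
    threshold: timedelta = STALENESS_THRESHOLD,
) -> bool:
    """A twin is stale if it has no timestamp or is older than the threshold."""
    if last_updated is None:
        return True
    return now - last_updated > threshold


def score_or_neutral(
    score: float, last_updated: Optional[datetime], now: datetime
) -> float:
    """Replace the computed score with the neutral score for stale twins."""
    return NEUTRAL_SCORE if is_stale(last_updated, now) else score
```

Passing now explicitly, instead of calling datetime.now() inside the helper, keeps the check deterministic in tests.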
Adaptive performance pressure relief
For standard pods on performance nodes, a pressure penalty is applied:
if workloadClass == "standard" AND schedulableClass == "performance":
    score -= perfPressure * 0.3
Where perfPressure is computed once per scoring batch as the average congestion across all performance nodes:
perfPressure = average(100 - headroom) across all non-stale performance nodes
At full saturation (perfPressure = 100), this subtracts up to 30 points from the score on performance nodes. The effect steers standard pods toward eco nodes when performance nodes are congested, preserving performance capacity for performance-class workloads.
When performance nodes are idle (perfPressure = 0), there is no penalty and standard pods spread normally.
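The batch-level computation can be sketched as follows. The dictionary keys and function names are illustrative, and returning 0 when there are no non-stale performance nodes is an assumption; the text above does not say what the extender does in that case.

```python
from typing import Iterable, Mapping


def perf_pressure(nodes: Iterable[Mapping]) -> float:
    """Average congestion (100 - headroom) over non-stale performance nodes."""
    congestion = [
        100 - n["headroom"]
        for n in nodes
        if n["schedulable_class"] == "performance" and not n["stale"]
    ]
    if not congestion:
        return 0.0  # assumption: no usable performance nodes means no penalty
    return sum(congestion) / len(congestion)


def apply_pressure_penalty(
    score: float, workload_class: str, schedulable_class: str, pressure: float
) -> float:
    """Subtract the pressure penalty for standard pods on performance nodes."""
    if workload_class == "standard" and schedulable_class == "performance":
        return score - pressure * 0.3
    return score
```

For two performance nodes with headroom 20 and 40, the congestion values are 80 and 60, so perfPressure is 70 and a standard pod loses about 21 points on each performance node.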
CPU-only pod GPU penalty
CPU-only pods (those not requesting nvidia.com/gpu, amd.com/gpu, or gpu.intel.com/i915) receive a -30 score penalty on GPU nodes. GPU presence is detected from cached NodeHardware CRs. This discourages CPU-only workloads from occupying GPU nodes where they waste GPU idle power.
Pods that request GPU resources do not receive this penalty.
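One way to detect a CPU-only pod is to scan container resources for the three GPU resource names. The sketch below works on a pod represented as a plain dictionary in Kubernetes JSON shape; checking both requests and limits, ignoring init containers, and the function names are all assumptions rather than the extender's documented behavior.

```python
GPU_RESOURCES = ("nvidia.com/gpu", "amd.com/gpu", "gpu.intel.com/i915")
GPU_NODE_PENALTY = -30


def requests_gpu(pod: dict) -> bool:
    """True if any container names one of the GPU resources."""
    for container in pod.get("spec", {}).get("containers", []):
        resources = container.get("resources") or {}
        for section in ("requests", "limits"):
            declared = resources.get(section) or {}
            if any(name in declared for name in GPU_RESOURCES):
                return True
    return False


def gpu_node_penalty(pod: dict, node_has_gpu: bool) -> int:
    """Penalty applied to a CPU-only pod on a GPU node, otherwise 0."""
    if node_has_gpu and not requests_gpu(pod):
        return GPU_NODE_PENALTY
    return 0
```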
PUE-weighted marginal power estimation
When facility metrics are enabled (ENABLE_FACILITY_METRICS=true), the operator computes PUE from real data-center metrics and writes NodeTwin.status.estimatedPUE. The scheduler extender uses this to weight marginal power estimates:
if estimatedPUE > 1.0:
    deltaCPUWatts *= estimatedPUE
    deltaGPUWatts *= estimatedPUE
    deltaTotalWatts *= estimatedPUE
This means a pod placed on a node with PUE 1.6 is treated as costing 60% more energy than one with PUE 1.0. The effect is that the scheduler prefers nodes in more efficiently cooled parts of the facility, reducing total energy consumption including cooling overhead.
Without facility metrics, PUE defaults to 1.0 and the multiplier has no effect.
Eviction history awareness
When the active rescheduler evicts a pod, it annotates the pod’s owner (ReplicaSet or StatefulSet) with eviction context:
- joulie.io/last-eviction-from-class: the schedulableClass of the node the pod was evicted from (e.g., eco)
- joulie.io/last-eviction-reason: the eviction reason (e.g., cooling_stress)
- joulie.io/last-eviction-time: RFC3339 timestamp
The scheduler reads these annotations when placing the replacement pod:
- Filter: if a pod’s owner was evicted from an eco node, eco and draining nodes are rejected (same as performance pod filtering). This prevents a standard pod from being re-placed on an eco node where it was previously throttled.
- Score: nodes matching the evicted-from class receive a -25 score penalty.
Eviction context expires after EVICTION_HISTORY_TTL (default 30 minutes). After expiry, the scheduler schedules normally again.
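The TTL handling and the two effects above can be sketched like this. The annotation keys come from this page; the function names, the handling of malformed timestamps, and the hard-coded 30-minute default are assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

EVICTION_HISTORY_TTL = timedelta(minutes=30)  # documented default
EVICTION_PENALTY = -25


def active_eviction_class(
    annotations: dict, now: datetime, ttl: timedelta = EVICTION_HISTORY_TTL
) -> Optional[str]:
    """Return the evicted-from class if the eviction context has not expired."""
    cls = annotations.get("joulie.io/last-eviction-from-class")
    raw = annotations.get("joulie.io/last-eviction-time")
    if not cls or not raw:
        return None
    try:
        # Older Python versions do not accept a trailing "Z" in fromisoformat.
        evicted_at = datetime.fromisoformat(raw.replace("Z", "+00:00"))
    except ValueError:
        return None  # assumption: malformed timestamps are ignored
    if evicted_at.tzinfo is None or now - evicted_at > ttl:
        return None
    return cls


def history_rejects(node_class: str, evicted_from: Optional[str]) -> bool:
    """Filter: owners evicted from eco are kept off eco and draining nodes."""
    return evicted_from == "eco" and node_class in {"eco", "draining"}


def history_penalty(node_class: str, evicted_from: Optional[str]) -> int:
    """Score: nodes matching the evicted-from class are penalized."""
    return EVICTION_PENALTY if evicted_from == node_class else 0
```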
Score clamping
All scores are clamped to [0, 100]. Nodes with no NodeTwin state receive a neutral score of 50.
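A short sketch of how clamping bounds the combined result. It assumes the penalties described in the previous sections are simply added to the base score before clamping; the function names are illustrative.

```python
from typing import Iterable


def clamp_score(score: float) -> float:
    """Clamp a score to the [0, 100] range."""
    return max(0.0, min(100.0, score))


def final_score(base: float, penalties: Iterable[float] = ()) -> float:
    """Add penalties (negative numbers) to the base score, then clamp."""
    return clamp_score(base + sum(penalties))
```

For instance, a base score of 40 with a -30 GPU penalty and a -25 eviction penalty would sum to -15 and be reported as 0.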
Data sources
The extender reads two types of Joulie CRs, both cached with a 30-second TTL to avoid hitting the API server on every scheduling decision:
- NodeTwin CRs - provide schedulableClass, predictedPowerHeadroomScore, predictedCoolingStressScore, predictedPsuStressScore, and lastUpdated for filter and score decisions.
- NodeHardware CRs - provide GPU presence information for the CPU-only GPU penalty.
NodeTwin.status is populated by the operator’s twin controller, which runs the digital twin model using telemetry from Prometheus and NodeHardware.
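The 30-second cache can be pictured as a snapshot that is reloaded only once it has expired. The class below is a generic sketch with a hypothetical loader callback standing in for the API server list call; it is not the extender's implementation.

```python
import time
from typing import Callable, Dict, Optional


class TTLCache:
    """Caches the result of loader() and refreshes it after ttl_seconds."""

    def __init__(
        self,
        loader: Callable[[], Dict[str, dict]],
        ttl_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._snapshot: Dict[str, dict] = {}
        self._loaded_at: Optional[float] = None

    def get(self) -> Dict[str, dict]:
        """Return the cached snapshot, reloading it if it has expired."""
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._ttl:
            self._snapshot = self._loader()
            self._loaded_at = now
        return self._snapshot
```

With a cache like this, scoring a batch of nodes costs one dictionary lookup per node, and the loader runs at most once per TTL window.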
Summary
| Condition | Effect |
|---|---|
| Performance pod + eco/draining node | Hard reject (filter) |
| Standard pod + any node | Allowed (no filter) |
| Unknown node (no NodeTwin) + any pod | Allowed, neutral score (50) |
| High headroom, low stress | High score |
| Standard pod + performance node under pressure | Score penalty (up to -30) |
| CPU-only pod + GPU node | Score penalty (-30) |
| Stale or missing NodeTwin | Neutral score (50) |
| Pod owner evicted from eco class | Hard reject eco/draining (filter) + score penalty (-25) |
What the extender does not do
- It does not execute a full digital twin simulation per scheduling decision.
- It does not perform live pod migration or eviction.
- It does not override Kubernetes resource fits; it only participates in the extender filter/prioritize hooks.
- It does not make admission decisions for already-running pods.
Decisions are lightweight: one cache lookup per node per scheduling attempt.
How to deploy
The scheduler extender is deployed as part of the Joulie Helm chart.
KubeSchedulerConfiguration extender block
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
extenders:
  - urlPrefix: "http://joulie-scheduler-extender.joulie-system.svc.cluster.local:9876"
    filterVerb: "filter"
    prioritizeVerb: "prioritize"
    weight: 1
    enableHTTPS: false
    nodeCacheCapable: false
    ignorable: true
Setting ignorable: true means the scheduler proceeds normally if the extender is temporarily unreachable.
Testing
The extender exposes a /healthz endpoint.
To verify filter decisions without a running scheduler:
curl -s -X POST \
  http://localhost:9876/filter \
  -H 'Content-Type: application/json' \
  -d '{"pod": {...}, "nodes": {"items": [...]}}'
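The same request can be assembled programmatically. The sketch below builds a minimal payload in the shape used by the curl command, with a pod carrying the workload-class annotation and a list of bare node objects. The pod name, namespace, image, and node names are placeholders, the helper names are hypothetical, and the fields the extender actually requires may differ.

```python
import json
import urllib.request


def build_filter_payload(pod_name: str, workload_class: str, node_names: list) -> dict:
    """Build a minimal filter request body (placeholder values throughout)."""
    pod = {
        "metadata": {
            "name": pod_name,
            "namespace": "default",
            "annotations": {"joulie.io/workload-class": workload_class},
        },
        "spec": {"containers": [{"name": "app", "image": "busybox"}]},
    }
    nodes = {"items": [{"metadata": {"name": name}} for name in node_names]}
    return {"pod": pod, "nodes": nodes}


def post_filter(payload: dict, base_url: str = "http://localhost:9876") -> dict:
    """POST the payload to the filter endpoint and return the decoded JSON reply."""
    request = urllib.request.Request(
        f"{base_url}/filter",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return json.load(response)
```

Calling post_filter(build_filter_payload("demo", "performance", ["node-a", "node-b"])) against a locally port-forwarded extender should return the filter result; its exact structure is not described on this page, so it is not shown here.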