Policy Algorithms
This page documents the controller policy algorithms implemented in pkg/operator/policy/.
Classification Input
Policy demand classification is derived from the `joulie.io/workload-class` pod annotation:
- `performance`: pod carries `joulie.io/workload-class: performance`.
- `standard` (default): no annotation, or `joulie.io/workload-class: standard`.
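The classification rule can be sketched as a small Go function. This is a minimal illustration, not the operator's actual API; the function name and the plain map signature are assumptions.

```go
package main

import "fmt"

// workloadClassAnnotation is the pod annotation the operator inspects.
const workloadClassAnnotation = "joulie.io/workload-class"

// classifyPod returns the demand class for a pod given its annotations:
// "performance" only when the annotation explicitly says so, otherwise
// "standard" (the default, covering missing or "standard" values).
func classifyPod(annotations map[string]string) string {
	if annotations[workloadClassAnnotation] == "performance" {
		return "performance"
	}
	return "standard"
}

func main() {
	fmt.Println(classifyPod(map[string]string{workloadClassAnnotation: "performance"}))
	fmt.Println(classifyPod(nil)) // no annotation falls back to standard
}
```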
Shared Reconcile Flow
Each reconcile tick:
- Select eligible nodes from `NODE_SELECTOR`, excluding reserved and unschedulable nodes.
- Build a hardware view from `NodeHardware` when available, otherwise from node labels/inventory fallback.
- Sort eligible nodes by normalized compute density (highest first).
- Preserve at least one performance-capable node per discovered hardware family whenever the requested HP count allows it.
- Build a desired plan with the selected policy.
- Apply the downgrade guard (sets `NodeTwin.status.schedulableClass` to `draining` while blocking pods still run).
- Write `NodeTwin.spec` and update the `joulie.io/power-profile` node label.
In other words, policies still decide how many high-performance nodes are needed, but the density-aware ordering influences which nodes get those assignments.
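The density-aware selection with per-family reservation can be sketched as follows. This is illustrative code under assumed names (`node`, `selectHP`); the real controller derives its hardware view from `NodeHardware` or label fallback as described above.

```go
package main

import (
	"fmt"
	"sort"
)

// node is a simplified view of an eligible node (names are illustrative).
type node struct {
	name    string
	family  string  // hardware family, e.g. GPU model or CPU model
	density float64 // normalized compute density
}

// selectHP picks hpCount nodes for the performance profile: densest first,
// but reserving one slot per hardware family while the budget allows it.
func selectHP(nodes []node, hpCount int) map[string]bool {
	sorted := append([]node(nil), nodes...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].density > sorted[j].density })

	hp := map[string]bool{}
	seenFamily := map[string]bool{}
	// First pass: one performance node per family (its densest representative).
	for _, n := range sorted {
		if len(hp) >= hpCount {
			break
		}
		if !seenFamily[n.family] {
			seenFamily[n.family] = true
			hp[n.name] = true
		}
	}
	// Second pass: fill any remaining slots purely by density order.
	for _, n := range sorted {
		if len(hp) >= hpCount {
			break
		}
		hp[n.name] = true
	}
	return hp
}

func main() {
	nodes := []node{
		{"a", "A100", 1.0}, {"b", "A100", 0.9}, {"c", "EPYC", 0.2},
	}
	// With a budget of 2, the EPYC node is reserved despite its lower density.
	fmt.Println(selectHP(nodes, 2))
}
```

Note the ordering: family reservation runs before the pure density fill, so a low-density family keeps one performance node whenever the HP budget permits.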
static_partition
Goal: deterministic fixed HP/LP split.
Inputs:
- `N`: number of eligible nodes.
- `STATIC_HP_FRAC`: target fraction of high-performance nodes.
Algorithm:
- `hp_count = round(N * STATIC_HP_FRAC)`.
- Clamp `hp_count` to `[0, N]`.
- Sort eligible nodes by compute density descending.
- Reserve at least one performance node per hardware family (GPU model for GPU nodes, CPU model for CPU-only nodes).
- Fill the remaining performance slots by density order.
- Remaining nodes -> `eco`.
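The sizing step can be sketched in a few lines of Go; `staticHPCount` is an illustrative name, not the function in `pkg/operator/policy/`.

```go
package main

import (
	"fmt"
	"math"
)

// staticHPCount computes the fixed high-performance node budget for
// static_partition: round(N * STATIC_HP_FRAC), clamped to [0, N].
func staticHPCount(n int, frac float64) int {
	hp := int(math.Round(float64(n) * frac))
	if hp < 0 {
		hp = 0
	}
	if hp > n {
		hp = n
	}
	return hp
}

func main() {
	fmt.Println(staticHPCount(10, 0.3)) // round(10 * 0.3) = 3
	fmt.Println(staticHPCount(10, 1.5)) // clamped to N = 10
}
```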
Properties:
- deterministic,
- stable over time unless the node set changes,
- keeps at least some performance supply across heterogeneous hardware families.
This policy is exercised in the CPU-Only Benchmark and Heterogeneous GPU Cluster Benchmark.
queue_aware_v1
Goal: adapt HP count to current performance-only pressure.
Inputs:
- `N`: number of eligible nodes.
- `P`: count of active performance-sensitive pods cluster-wide.
- `QUEUE_HP_BASE_FRAC`: baseline fraction of nodes kept high-performance.
- `QUEUE_HP_MIN`, `QUEUE_HP_MAX`: bounds on the high-performance node count.
- `QUEUE_PERF_PER_HP_NODE`: performance pods each HP node is expected to absorb.
Algorithm:
- `base = round(N * QUEUE_HP_BASE_FRAC)`.
- `need = ceil(P / QUEUE_PERF_PER_HP_NODE)`.
- `hp_count = max(base, need)`.
- Clamp `hp_count` to `[QUEUE_HP_MIN, QUEUE_HP_MAX]`.
- Clamp again to `[0, N]`.
- Reserve at least one performance node per hardware family.
- Fill the remaining performance slots by density order.
- Remaining nodes -> `eco`.
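The sizing steps above can be sketched as a single function. `queueHPCount` and its parameter names are illustrative; only the formula mirrors the algorithm.

```go
package main

import (
	"fmt"
	"math"
)

// queueHPCount implements the queue_aware_v1 sizing: the larger of a
// baseline fraction and the demand implied by active performance pods,
// clamped first to [hpMin, hpMax] and then to [0, n].
func queueHPCount(n, p, hpMin, hpMax int, baseFrac float64, perfPerHPNode int) int {
	base := int(math.Round(float64(n) * baseFrac))
	need := (p + perfPerHPNode - 1) / perfPerHPNode // ceil(p / perfPerHPNode)
	hp := base
	if need > hp {
		hp = need
	}
	if hp < hpMin {
		hp = hpMin
	}
	if hp > hpMax {
		hp = hpMax
	}
	if hp < 0 {
		hp = 0
	}
	if hp > n {
		hp = n
	}
	return hp
}

func main() {
	// 10 nodes, 7 performance pods, 2 pods per HP node:
	// base = round(10*0.2) = 2, need = ceil(7/2) = 4 -> hp_count = 4.
	fmt.Println(queueHPCount(10, 7, 1, 6, 0.2, 2))
}
```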
Properties:
- deterministic for a fixed `(N, P)`,
- monotonic in pressure `P`,
- bounded by the min/max limits,
- heterogeneity-aware: denser nodes are preferred first while each family keeps some performance capacity.
This policy is exercised in the CPU-Only Benchmark and Heterogeneous GPU Cluster Benchmark.
rule_swap_v1 (debug policy)
Goal: force visible state transitions for debugging.
Algorithm:
- Compute a phase from wall-clock time and `RECONCILE_INTERVAL`.
- Alternate which of the first nodes is assigned `eco`.
- The others remain `performance`.
This policy is intended for debugging only, not as default production behavior.
Downgrade Guard
When the planned profile is `eco` on a node currently in `performance`:
- Count active performance pods on that node.
- If count > 0:
  - keep the desired profile as `eco`,
  - set `NodeTwin.status.schedulableClass` to `draining`,
  - record the transition as deferred in operator FSM/metrics.
- If count == 0:
  - keep the desired profile `eco`,
  - set `NodeTwin.status.schedulableClass` to `eco`.
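The guard decision can be sketched as a pure function of the pod count. The `guardResult` type and field names are illustrative stand-ins for what the operator writes to `NodeTwin`.

```go
package main

import "fmt"

// guardResult mirrors what the guard writes back: the desired profile is
// always kept as eco, but schedulableClass only follows once the node has
// drained its active performance pods. Names here are illustrative.
type guardResult struct {
	desiredProfile   string
	schedulableClass string
	deferred         bool // recorded in operator FSM/metrics when true
}

// downgradeGuard applies the eco-downgrade rule for a node currently in
// performance, given its count of active performance pods.
func downgradeGuard(activePerfPods int) guardResult {
	if activePerfPods > 0 {
		return guardResult{desiredProfile: "eco", schedulableClass: "draining", deferred: true}
	}
	return guardResult{desiredProfile: "eco", schedulableClass: "eco", deferred: false}
}

func main() {
	fmt.Println(downgradeGuard(3)) // still draining while pods run
	fmt.Println(downgradeGuard(0)) // safe to complete the downgrade
}
```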
The scheduler extender reads `schedulableClass` and applies a -20 score penalty to `draining` nodes, discouraging new workload placement during transitions.
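The penalty itself is a simple score adjustment; a minimal sketch, assuming a hypothetical `scoreNode` helper and base-score input (only the -20 constant comes from the text above):

```go
package main

import "fmt"

// drainingPenalty is the score adjustment applied to nodes whose
// schedulableClass is draining, per the behavior described above.
const drainingPenalty = -20

// scoreNode sketches how an extender base score would be adjusted;
// the base-score source and the function name are assumptions.
func scoreNode(base int, schedulableClass string) int {
	if schedulableClass == "draining" {
		return base + drainingPenalty
	}
	return base
}

func main() {
	fmt.Println(scoreNode(50, "draining")) // penalized during transition
	fmt.Println(scoreNode(50, "eco"))      // unchanged
}
```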