# Joulie Operator
The operator is Joulie’s cluster-level decision engine.
It does not write host power interfaces directly. Instead, it decides desired node states and publishes them through Kubernetes objects and labels.
In practice, the operator answers one question over and over:
which nodes should currently supply performance capacity, and which can safely supply eco capacity?
## Responsibilities
At each reconcile tick, the operator:
- selects eligible managed nodes,
- reads `NodeHardware` when available and falls back to node labels when it is not,
- resolves hardware identity against the shared inventory,
- classifies workload demand from pod scheduling constraints,
- runs a policy algorithm to compute a plan,
- applies transition guards for safe downgrades,
- writes desired node targets (`NodePowerProfile`) and node supply labels.
The agent then enforces those targets node-by-node.
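The per-tick shape of these responsibilities can be sketched as follows. All names, types, and the toy policy here are illustrative, not the operator's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    labels: dict = field(default_factory=dict)
    density: float = 0.0  # normalized CPU+GPU compute-density score

def reconcile_tick(nodes, perf_pods_by_node, policy):
    """One illustrative pass: select, sort by density, plan, guard."""
    # Select eligible managed nodes; skip unschedulable ones.
    eligible = [n for n in nodes
                if n.labels.get("joulie.io/managed") == "true"
                and n.labels.get("unschedulable") != "true"]
    # Denser nodes first, so they are preferred for performance supply.
    eligible.sort(key=lambda n: n.density, reverse=True)
    plan = policy(eligible)  # {node name: "performance" | "eco"}
    targets = {}
    for node in eligible:
        desired = plan[node.name]
        # Downgrade guard: a performance -> eco move keeps draining=True
        # while performance-sensitive pods are still on the node.
        draining = (desired == "eco"
                    and node.labels.get("joulie.io/power-profile") == "performance"
                    and perf_pods_by_node.get(node.name, 0) > 0)
        targets[node.name] = {"profile": desired, "draining": draining}
    return targets

def top1_policy(nodes):
    """Toy policy: the densest node supplies performance, the rest eco."""
    return {n.name: "performance" if i == 0 else "eco"
            for i, n in enumerate(nodes)}
```

The returned targets map is what would then be persisted for the agent to enforce.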
## Control boundary with the agent
- the operator decides what each node should be,
- the agent decides how to apply the corresponding controls on that node.
This separation keeps policy logic portable while actuator details stay node-local.
## Reconcile flow
- Read nodes matching `NODE_SELECTOR` (chart default: `joulie.io/managed=true`). Ignore reserved/unschedulable nodes.
- Build a normalized hardware view:
  - prefer `NodeHardware`,
  - otherwise derive hardware identity from node labels / allocatable resources,
  - resolve CPU/GPU models against the inventory,
  - compute per-node CPU/GPU density signals.
- Build a demand view from active pods:
  - performance-constrained,
  - eco-constrained,
  - unconstrained.
- Sort eligible nodes by normalized compute density (CPU + GPU), highest first.
- Run policy (`static_partition`, `queue_aware_v1`, or the debug `rule_swap_v1`).
- For planned `performance -> eco` transitions, run the downgrade guard:
  - publish `profile=eco` as the desired state,
  - keep `joulie.io/draining=true` while performance-sensitive pods are still present.
- Persist desired state through `NodePowerProfile` and update the node labels `joulie.io/power-profile` and `joulie.io/draining`.
The important distinction is:

- `NodePowerProfile` expresses the desired target state for enforcement,
- node labels express scheduler-facing supply state during transitions.
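As an illustration, a desired-state object and its matching labels might look like this. The CRD group/version and most field names here are guesses; only `capPctOfMax`, `capWattsPerGpu`, and the label keys come from this document:

```yaml
apiVersion: joulie.io/v1alpha1        # hypothetical group/version
kind: NodePowerProfile
metadata:
  name: worker-3
spec:
  profile: eco                        # desired target state for enforcement
  gpu:
    capPctOfMax: 60
    capWattsPerGpu: 180               # only when model-based mapping resolved
# Meanwhile, scheduler-facing supply state lives on the Node object:
#   joulie.io/power-profile: eco
#   joulie.io/draining: "true"        # still guarded during the transition
```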
## Power intent configuration knobs
Operator intent emission is controlled by environment variables:
- CPU:
  - `CPU_WRITE_ABSOLUTE_CAPS` (true|false)
  - `CPU_PERFORMANCE_CAP_PCT_OF_MAX`
  - `CPU_ECO_CAP_PCT_OF_MAX`
  - `PERFORMANCE_CAP_WATTS`
  - `ECO_CAP_WATTS`
- GPU:
  - `GPU_PERFORMANCE_CAP_PCT_OF_MAX`
  - `GPU_ECO_CAP_PCT_OF_MAX`
  - `GPU_WRITE_ABSOLUTE_CAPS` (true|false)
  - `GPU_MODEL_CAPS_JSON`
  - `GPU_PRODUCT_LABEL_KEYS`
High-level behavior:
- CPU:
  - when `CPU_WRITE_ABSOLUTE_CAPS=false`, the operator writes normalized percentage intent,
  - when `CPU_WRITE_ABSOLUTE_CAPS=true`, the operator writes absolute watts intent.
- GPU:
  - when `GPU_WRITE_ABSOLUTE_CAPS=false`, the operator writes percentage intent,
  - when `GPU_WRITE_ABSOLUTE_CAPS=true`, the operator may write a resolved `capWattsPerGpu` in addition to `capPctOfMax` when model-based mapping is available.
This is why GPU `NodePowerProfile` objects may contain both normalized intent and resolved absolute caps at the same time.
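A minimal sketch of how the GPU knobs could combine, assuming the absolute cap is derived as a percentage of the model's max watts (the actual resolution rule is not specified here, and the helper shape is an assumption):

```python
def gpu_intent(env, gpu_model, model_caps, profile="eco"):
    """Sketch: combine the GPU env knobs above into emitted intent.

    env        -- mapping of the environment variables listed above
    gpu_model  -- the node's resolved GPU model, if any
    model_caps -- {model: max watts}, e.g. parsed from GPU_MODEL_CAPS_JSON
    """
    pct = float(env.get(f"GPU_{profile.upper()}_CAP_PCT_OF_MAX", "100"))
    intent = {"capPctOfMax": pct}
    # With absolute caps enabled, additionally resolve watts when the
    # model-based mapping knows this GPU; otherwise stay percentage-only.
    if env.get("GPU_WRITE_ABSOLUTE_CAPS") == "true" and gpu_model in model_caps:
        intent["capWattsPerGpu"] = model_caps[gpu_model] * pct / 100.0
    return intent
```

This is one way a single profile object could carry both normalized and absolute caps at once.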
## Heterogeneous planning
The operator is now inventory-aware.
Its first heterogeneous-planning input is a normalized compute-density score built from:
- recognized CPU model + socket/core shape
- recognized GPU model + GPU count
This score is used to order eligible nodes before policy assignment.
For the same policy parameters, denser nodes are therefore the first choice for performance supply.
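The scoring formula itself is not given here; one plausible normalization, with the reference shapes and per-model weights as pure assumptions, would be:

```python
def compute_density(cpu_cores, cpu_weight, gpu_count, gpu_weight,
                    cpu_ref=128.0, gpu_ref=8.0):
    """Illustrative normalized compute-density score: per-model weights
    scale raw core/GPU counts, then each term is normalized against a
    large reference node so CPU and GPU contributions are comparable."""
    return (cpu_cores * cpu_weight) / cpu_ref + (gpu_count * gpu_weight) / gpu_ref

# Ordering eligible nodes, densest first:
nodes = [("cpu-box", compute_density(64, 1.0, 0, 0.0)),
         ("gpu-box", compute_density(32, 1.0, 4, 1.0))]
nodes.sort(key=lambda t: t[1], reverse=True)
# gpu-box (0.75) now sorts ahead of cpu-box (0.5)
```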
If `NodeHardware` is not available yet, the operator derives a best-effort hardware view:

- from labels such as `joulie.io/hw.cpu-model`, `joulie.io/hw.gpu-model`, and `joulie.io/hw.gpu-count`,
- and from allocatable extended resources (`nvidia.com/gpu`, `amd.com/gpu`).
That keeps simulator-first and bootstrap scenarios working without making `NodeHardware` a hand-authored prerequisite.
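The fallback derivation might look roughly like this; the returned dict shape is an assumption, while the label and resource keys are the ones listed above:

```python
def hardware_view_from_labels(labels, allocatable):
    """Best-effort hardware identity when NodeHardware is absent."""
    gpu_count = labels.get("joulie.io/hw.gpu-count")
    if gpu_count is None:
        # Fall back to allocatable extended resources.
        gpu_count = (allocatable.get("nvidia.com/gpu", 0)
                     + allocatable.get("amd.com/gpu", 0))
    return {
        "cpu_model": labels.get("joulie.io/hw.cpu-model"),
        "gpu_model": labels.get("joulie.io/hw.gpu-model"),
        "gpu_count": int(gpu_count),
    }
```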
## Node state model
Joulie models two scheduler-facing supply states:
- `performance`
- `eco`
`DrainingPerformance` is an internal operator FSM state, tracked while `profile=eco` and `joulie.io/draining=true`.
That state means:
- the operator wants the node to end up in eco,
- the transition is still guarded because performance-sensitive pods are present,
- advanced eco-only placement can avoid the node until draining clears by excluding `joulie.io/draining=true`.
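The guard can be read as a tiny state machine over the two labels. A sketch of the mechanics (this label handling is an assumption, not the operator's code):

```python
def publish(desired_profile, perf_pods_present, labels):
    """Advance one guarded step: eco is published immediately as the
    desired profile, but draining stays set while performance-sensitive
    pods remain (the internal DrainingPerformance state)."""
    labels = dict(labels)  # don't mutate the caller's view
    labels["joulie.io/power-profile"] = desired_profile
    if desired_profile == "eco" and perf_pods_present:
        labels["joulie.io/draining"] = "true"
    else:
        labels.pop("joulie.io/draining", None)  # guard cleared
    return labels
```

Repeated ticks converge: once the last performance-sensitive pod leaves, the next publish drops the draining label and the node is plain `eco`.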
## Why this model
- scheduler gets clear supply signal from node labels,
- policy can evolve independently of host control implementation,
- transitions are auditable and safer than instant downgrade.