Operator Notes
Target concept
Joulie should evolve into a centralized operator that owns the global optimization loop.
At each control step (for example every minute), the operator:
- Reads cluster-wide context.
- Decides node-to-power-profile assignments.
- Writes desired per-node state.
- Monitors outcomes and re-plans.
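The four steps above can be sketched as a single function that is called once per control step. This is a minimal illustration, not the operator's actual code: `Profile`, `ClusterContext`, and `Step` are hypothetical names, and the read/decide/write functions are injected so the loop stays independent of any particular policy.

```go
package main

import (
	"fmt"
	"sort"
)

// Profile is a hypothetical type for a node power profile.
type Profile string

const (
	Performance Profile = "performance"
	Eco         Profile = "eco"
)

// ClusterContext is an assumed, heavily simplified cluster-wide snapshot.
type ClusterContext struct {
	Nodes []string
}

// Step runs one control step: read context, decide assignments, write
// desired per-node state. It returns how many nodes were written.
func Step(
	read func() ClusterContext,
	decide func(ClusterContext) map[string]Profile,
	write func(node string, p Profile),
) int {
	desired := decide(read())
	names := make([]string, 0, len(desired))
	for n := range desired {
		names = append(names, n)
	}
	sort.Strings(names) // deterministic write order
	for _, n := range names {
		write(n, desired[n])
	}
	return len(names)
}

func main() {
	read := func() ClusterContext {
		return ClusterContext{Nodes: []string{"node-a", "node-b"}}
	}
	// Toy decision: first node gets performance, the rest get eco.
	decide := func(c ClusterContext) map[string]Profile {
		out := map[string]Profile{}
		for i, n := range c.Nodes {
			if i == 0 {
				out[n] = Performance
			} else {
				out[n] = Eco
			}
		}
		return out
	}
	write := func(node string, p Profile) {
		fmt.Printf("desired %s -> %s\n", node, p)
	}

	// A real operator would drive this from a time.Ticker (for example one
	// tick per minute) and inspect outcomes before re-planning.
	for i := 0; i < 2; i++ {
		Step(read, decide, write)
	}
}
```

Injecting the three functions keeps the loop testable without a cluster; monitoring and re-planning would hang off the same tick.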
States start simple:
- ActivePerformance (mapped to profile performance): unconstrained / HPC-oriented.
- ActiveEco (mapped to profile eco): constrained / energy-saving.
Control responsibility boundary
Operator is the control-plane brain. Agent is an actuator/telemetry component.
- Operator decides what should happen: profile assignments, transitions, safety rules, timing.
- Agent decides only how to apply on host interfaces and reports result (applied, blocked, error).
This keeps policy evolution independent from device-specific enforcement details.
Transition state machine (design baseline)
To avoid contract violations during ActivePerformance -> ActiveEco moves, use two-phase downgrade:
- ActivePerformance
- DrainingPerformance (keep performance cap, stop admitting new performance workloads)
- ActiveEco (commit eco cap when safe condition is met)
If safe condition never occurs, policy controls escalation (hold, timeout, force, or drain/evict strategy).
The current implementation includes a basic guard: when the target is ActiveEco but the node still runs pods classified as performance-only by their scheduling constraints (nodeSelector or required nodeAffinity on joulie.io/power-profile), the downgrade is deferred and the node remains in the performance profile.
Pods with no power-profile scheduling constraint are classified as implicit unconstrained (general), not performance-only.
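The two-phase downgrade and the pod classification guard could look roughly like the following. This is a sketch of the design baseline, not the current implementation (which defers the downgrade and keeps the node in the performance profile): `NodeState`, `Pod`, `RequiredProfile`, and `NextState` are hypothetical names, and the constraint value "performance" is an assumption about the label value.

```go
package main

import "fmt"

// NodeState is a hypothetical type for the states named in this section.
type NodeState string

const (
	ActivePerformance   NodeState = "ActivePerformance"
	DrainingPerformance NodeState = "DrainingPerformance"
	ActiveEco           NodeState = "ActiveEco"
)

// Pod carries only what this sketch needs: the value of a required
// joulie.io/power-profile scheduling constraint, or "" when there is none.
type Pod struct {
	Name            string
	RequiredProfile string
}

// performanceOnly reports whether a pod is pinned to the performance profile.
// Pods without a constraint are implicit unconstrained (general).
func performanceOnly(p Pod) bool {
	return p.RequiredProfile == "performance" // assumed label value
}

// NextState advances a node one step toward ActiveEco.
func NextState(cur NodeState, pods []Pod) NodeState {
	switch cur {
	case ActivePerformance:
		// Phase 1: keep the performance cap, stop admitting new work.
		return DrainingPerformance
	case DrainingPerformance:
		for _, p := range pods {
			if performanceOnly(p) {
				return DrainingPerformance // safe condition not met; defer
			}
		}
		// Phase 2: safe condition met, commit the eco cap.
		return ActiveEco
	default:
		return cur
	}
}

func main() {
	pods := []Pod{
		{Name: "hpc-job", RequiredProfile: "performance"},
		{Name: "web"},
	}
	s := NextState(ActivePerformance, pods)
	fmt.Println(s)
	fmt.Println(NextState(s, pods))     // still blocked by hpc-job
	fmt.Println(NextState(s, pods[1:])) // only a general pod remains
}
```

Escalation (hold, timeout, force, drain/evict) is deliberately left out; it would wrap `NextState` rather than change it.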
Global inputs
The operator policy has a cluster-wide view and should support:
- static metadata: node location, rack/zone, reserved nodes to exclude from optimization.
- time-based rules: business-hour peak windows and fixed schedules.
- telemetry-driven rules: temperatures, PUE, hotspot indicators, power trends.
- future data-driven policies: Prometheus-fed models, external inference (for example KServe).
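Two of the simpler inputs listed above, reserved-node exclusion and a business-hour peak window, can be illustrated as follows. This is only a sketch under stated assumptions: `PeakWindow`, `ManagedNodes`, and `atHour` are hypothetical names, and the window ignores weekdays and time zones.

```go
package main

import (
	"fmt"
	"time"
)

// PeakWindow is a hypothetical time-based rule covering the hours
// [StartHour, EndHour) of a day.
type PeakWindow struct {
	StartHour, EndHour int
}

// Contains reports whether t falls inside the peak window.
func (w PeakWindow) Contains(t time.Time) bool {
	h := t.Hour()
	return h >= w.StartHour && h < w.EndHour
}

// ManagedNodes drops reserved nodes so the policy never reassigns them.
func ManagedNodes(all []string, reserved map[string]bool) []string {
	out := make([]string, 0, len(all))
	for _, n := range all {
		if !reserved[n] {
			out = append(out, n)
		}
	}
	return out
}

// atHour builds a fixed UTC timestamp for demonstration purposes.
func atHour(h int) time.Time {
	return time.Date(2024, time.January, 15, h, 0, 0, 0, time.UTC)
}

func main() {
	peak := PeakWindow{StartHour: 9, EndHour: 17}
	fmt.Println(peak.Contains(atHour(10)), peak.Contains(atHour(20)))
	fmt.Println(ManagedNodes([]string{"n1", "n2", "n3"}, map[string]bool{"n2": true}))
}
```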
Current policies
Current operator policy modules in cmd/operator/main.go:
- static_partition:
  - deterministic split of managed nodes into performance and eco;
  - controlled by STATIC_HP_FRAC (default 0.50 -> 50/50 split).
- queue_aware_v1:
  - starts from a base high-performance share (QUEUE_HP_BASE_FRAC);
  - raises high-performance node count when cluster-wide performance-only pod pressure grows (derived from scheduling constraints);
  - bounded by QUEUE_HP_MIN / QUEUE_HP_MAX and scaled by QUEUE_PERF_PER_HP_NODE.
- rule_swap_v1:
  - alternates eco/performance assignment across the first nodes on each reconcile tick;
  - kept only as a debugging policy to validate transitions and control-loop wiring.
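To make the queue_aware_v1 description concrete, here is one way the high-performance node count could be derived. The exact formula in cmd/operator/main.go is not reproduced here. This sketch assumes that QUEUE_HP_MIN and QUEUE_HP_MAX are node counts, that QUEUE_PERF_PER_HP_NODE is the number of performance-only pods one high-performance node is expected to absorb, and that the result is the larger of the base share and the pressure-derived need; `QueueAwareHPCount` is a hypothetical name.

```go
package main

import (
	"fmt"
	"math"
)

// QueueAwareHPCount returns how many nodes should run the performance
// profile. All parameter semantics are assumptions made for this sketch.
func QueueAwareHPCount(totalNodes, perfPods int, baseFrac float64, hpMin, hpMax, perfPerHPNode int) int {
	// Base share, analogous to QUEUE_HP_BASE_FRAC.
	hp := int(math.Round(baseFrac * float64(totalNodes)))

	// Nodes needed to absorb performance-only pods (ceiling division).
	if perfPerHPNode > 0 {
		needed := (perfPods + perfPerHPNode - 1) / perfPerHPNode
		if needed > hp {
			hp = needed
		}
	}

	// Clamp to the configured bounds, then to the cluster size.
	if hp < hpMin {
		hp = hpMin
	}
	if hp > hpMax {
		hp = hpMax
	}
	if hp > totalNodes {
		hp = totalNodes
	}
	if hp < 0 {
		hp = 0
	}
	return hp
}

func main() {
	for _, pods := range []int{0, 14, 40} {
		fmt.Printf("perfPods=%d -> hpNodes=%d\n",
			pods, QueueAwareHPCount(10, pods, 0.5, 1, 8, 2))
	}
}
```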
Defaults and fallback:
- default POLICY_TYPE is static_partition;
- default STATIC_HP_FRAC is 0.50 (50/50 split);
- unknown POLICY_TYPE falls back to static_partition (not swap).
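The fallback rule and the static split can be sketched together. The function names `ResolvePolicyType` and `StaticPartition` are hypothetical, and two details are assumptions rather than documented behavior: nodes are ordered by name to make the split deterministic, and the high-performance count is the rounded product of the fraction and the node count.

```go
package main

import (
	"fmt"
	"math"
	"os"
	"sort"
)

// ResolvePolicyType maps an empty or unknown value to static_partition.
func ResolvePolicyType(raw string) string {
	switch raw {
	case "static_partition", "queue_aware_v1", "rule_swap_v1":
		return raw
	default:
		return "static_partition"
	}
}

// StaticPartition assigns the first round(hpFrac*N) nodes, in name order,
// to performance and the rest to eco. Ordering and rounding are assumptions.
func StaticPartition(nodes []string, hpFrac float64) map[string]string {
	sorted := append([]string(nil), nodes...) // do not mutate the caller's slice
	sort.Strings(sorted)

	hp := int(math.Round(hpFrac * float64(len(sorted))))
	if hp < 0 {
		hp = 0
	}
	if hp > len(sorted) {
		hp = len(sorted)
	}

	out := make(map[string]string, len(sorted))
	for i, n := range sorted {
		if i < hp {
			out[n] = "performance"
		} else {
			out[n] = "eco"
		}
	}
	return out
}

func main() {
	policy := ResolvePolicyType(os.Getenv("POLICY_TYPE"))
	fmt.Println("policy:", policy)
	fmt.Println(StaticPartition([]string{"n3", "n1", "n4", "n2"}, 0.50))
}
```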
Extensibility model
Keep policy logic pluggable:
- a common policy interface (Evaluate/Plan) returning node assignments.
- one baseline rule-based module.
- optional telemetry/model adapters as separate modules.
The core operator loop remains stable while policy modules evolve independently.
Suggested interfaces:
- PolicyModule.Plan(context) -> node transitions
- ContextProvider.Snapshot() -> cluster context
- StateGuard.Check(node, transition) -> allowed/blocked(reason)
Input source and actuation abstraction details are defined in:
Future data-driven policies should use Prometheus (or other sources) through ContextProvider, not by changing agent APIs.
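The suggested interfaces could be expressed in Go roughly as follows. Only the interface and method names come from this section; the struct fields, the `Decision` type, the `Reconcile` helper, and the toy implementations are assumptions added to make the sketch runnable.

```go
package main

import "fmt"

// ClusterContext, Transition, and Decision are hypothetical data types.
type ClusterContext struct{ Nodes []string }

type Transition struct{ Node, From, To string }

type Decision struct {
	Allowed bool
	Reason  string // set when the transition is blocked
}

type ContextProvider interface{ Snapshot() ClusterContext }

type PolicyModule interface {
	Plan(ctx ClusterContext) []Transition
}

type StateGuard interface {
	Check(node string, t Transition) Decision
}

// Reconcile wires the three interfaces together and returns the allowed
// transitions plus the block reason for every rejected node.
func Reconcile(cp ContextProvider, pm PolicyModule, g StateGuard) ([]Transition, map[string]string) {
	var allowed []Transition
	blocked := map[string]string{}
	for _, t := range pm.Plan(cp.Snapshot()) {
		if d := g.Check(t.Node, t); d.Allowed {
			allowed = append(allowed, t)
		} else {
			blocked[t.Node] = d.Reason
		}
	}
	return allowed, blocked
}

// Toy implementations used only for the demonstration below.
type staticProvider struct{ nodes []string }

func (s staticProvider) Snapshot() ClusterContext { return ClusterContext{Nodes: s.nodes} }

type allEco struct{}

func (allEco) Plan(ctx ClusterContext) []Transition {
	out := make([]Transition, 0, len(ctx.Nodes))
	for _, n := range ctx.Nodes {
		out = append(out, Transition{Node: n, From: "performance", To: "eco"})
	}
	return out
}

type denyNode struct{ name string }

func (d denyNode) Check(node string, t Transition) Decision {
	if node == d.name {
		return Decision{Allowed: false, Reason: "performance-only pods present"}
	}
	return Decision{Allowed: true}
}

func main() {
	allowed, blocked := Reconcile(
		staticProvider{nodes: []string{"n1", "n2"}}, allEco{}, denyNode{name: "n2"})
	fmt.Println(allowed, blocked)
}
```

A Prometheus-backed or model-backed source would be another `ContextProvider` implementation, leaving `Reconcile` and the agent unchanged.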
Current control path
Current path is operator-driven:
- Operator computes node assignments.
- Operator writes node-scoped desired state (NodePowerProfile).
- Agent consumes only its node-scoped assignment.
- Policy plugins can evolve independently (rule-based first, telemetry/AI later).
Suggested deployment shape
- Operator Deployment in joulie-system.
- ServiceAccount + RBAC (read nodes/metrics, write desired-state CRs).
- Leader election.
- Operator metrics endpoint (decisions, reassignments, errors, loop latency).
Future operator metrics should also expose transition outcomes (blocked, forced, completed) to make policy behavior auditable in Grafana.