Architecture on Joulie

CRD and Policy Model

Mon, 01 Jan 0001 00:00:00 +0000

This page defines Joulie’s core contract:

demand comes from the joulie.io/workload-class pod annotation,
supply is exposed by node power-profile labels,
discovered hardware is published through NodeHardware,
desired state is published through NodeTwin.

APIs

Group/version:

joulie.io/v1alpha1

CRDs:

NodeHardware (nodehardwares, cluster-scoped)
NodeTwin (nodetwins, cluster-scoped)

CRD definitions live in:

config/crd/bases/joulie.io_nodehardwares.yaml
config/crd/bases/joulie.io_nodetwins.yaml

Demand model (workloads)

Workload class is determined from the joulie.io/workload-class pod annotation:

performance demand: pod carries joulie.io/workload-class: performance.
standard demand (default): no annotation, or joulie.io/workload-class: standard. Standard pods can run on any node; adaptive scoring steers them toward eco nodes when performance nodes are congested.
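
The classification rule above can be sketched in a few lines. This is a minimal illustration, not Joulie's implementation: the function name is hypothetical, and treating unrecognized annotation values as standard is an assumption the source does not state.

```python
WORKLOAD_CLASS_ANNOTATION = "joulie.io/workload-class"


def classify_workload(annotations):
    """Return "performance" or "standard" for a pod's annotation map."""
    value = (annotations or {}).get(WORKLOAD_CLASS_ANNOTATION, "standard")
    # Assumption: any value other than "performance" falls back to standard.
    return "performance" if value == "performance" else "standard"
```

A pod with no annotations at all, for example `classify_workload(None)`, is classified as standard under this sketch.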

Supply model (nodes)

Node supply is represented by:

Joulie Operator

The operator is Joulie’s cluster-level decision engine.

It does not write host power interfaces directly. Instead, it decides desired node states and publishes them through Kubernetes objects and labels.

In practice, the operator answers one question over and over: which nodes should currently supply performance capacity, and which can safely supply eco capacity?

Responsibilities

At each reconcile tick, the operator:

selects eligible managed nodes,
reads NodeHardware when available and falls back to node labels when it is not,
resolves hardware identity against the shared inventory,
classifies workload demand from the joulie.io/workload-class pod annotation,
runs a policy algorithm (pkg/operator/policy/) to compute a plan,
applies transition guards for safe downgrades,
writes desired node targets (NodeTwin.spec) and the joulie.io/power-profile node label.

The agent then enforces those targets node-by-node.

Joulie Agent

The agent is Joulie’s node-side enforcement component.

It consumes desired state and applies node-local controls through configured backends.

If the operator decides “this node should now behave like eco” or “this node should stay performance”, the agent is the component that turns that intent into concrete control actions.

Responsibilities

At each reconcile tick, the agent:

identifies its node scope (single node in daemonset mode, sharded set in pool mode),
discovers local CPU/GPU hardware and runtime control capability,
publishes NodeHardware for each owned node,
reads desired target (NodeTwin.spec) for each owned node,
resolves telemetry/control backend from environment variables (default: host),
applies controls (host or HTTP),
exports metrics and status.

Inputs and outputs

Inputs:

Digital Twin

The digital twin is Joulie’s core predictive engine. It is a lightweight O(1) parametric model that predicts the impact of scheduling and power-cap decisions on node thermal and power state, without running a full simulation for each scheduling decision.

What the digital twin computes

For each managed node, the twin produces three scores stored in NodeTwin.status:

Signal	Range	Meaning
Power headroom	0-100	Remaining power budget before hitting thermal or PSU limits. Higher is better for new workload placement.
CoolingStress	0-100	Predicted percentage of cooling capacity in use. High values indicate the node is near its thermal limit.
PSUStress	0-100	Predicted percentage of PDU/rack power capacity in use. High values indicate the rack is near its power supply limit.
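
The two stress signals are described as a percentage of capacity in use. The source does not give the formula, so the sketch below is only one plausible reading: a ratio of predicted load to capacity, clamped to the 0-100 range. The function name and the watt-based inputs are assumptions.

```python
def stress_pct(predicted_load_watts, capacity_watts):
    """Percentage of a capacity in use, clamped to the 0-100 range."""
    if capacity_watts <= 0:
        raise ValueError("capacity_watts must be positive")
    # Assumption: stress is a plain load/capacity ratio; the real twin
    # model may weight or smooth its inputs differently.
    return max(0.0, min(100.0, 100.0 * predicted_load_watts / capacity_watts))
```

Under these assumptions, a rack predicted to draw 450 W out of a 600 W budget would report a stress of 75.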

The twin also computes:

Policy Algorithms

This page documents the controller policy algorithms implemented in pkg/operator/policy/.

Use this page after:

Classification Input

Policy demand classification is derived from the joulie.io/workload-class pod annotation:

performance: pod carries joulie.io/workload-class: performance.
standard (default): no annotation or joulie.io/workload-class: standard.

Shared Reconcile Flow

Each reconcile tick:

Select eligible nodes from NODE_SELECTOR, excluding reserved and unschedulable nodes.
Build a hardware view from NodeHardware when available, otherwise from node labels/inventory fallback.
Sort eligible nodes by normalized compute density (highest first).
Preserve at least one performance-capable node per discovered hardware family whenever the requested high-performance (HP) node count allows it.
Build a desired plan with the selected policy.
Apply the downgrade guard, which sets NodeTwin.status.schedulableClass to draining while blocking pods are still running.
Write NodeTwin.spec and update the joulie.io/power-profile node label.

In other words, policies still decide how many high-performance nodes are needed, but the density-aware ordering influences which nodes get those assignments.
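
The selection, ordering, and family-preservation steps can be sketched as follows. This is an illustration under stated assumptions, not the code in pkg/operator/policy/: the `Node` type, its field names, and the tie-break by node name are hypothetical, and the requested high-performance count is taken as a given input because computing it is the job of the selected policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    family: str      # hypothetical hardware-family key
    density: float   # normalized compute density
    reserved: bool = False
    unschedulable: bool = False


def plan_profiles(nodes, hp_count):
    """Map each eligible node name to "performance" or "eco"."""
    eligible = [n for n in nodes if not n.reserved and not n.unschedulable]
    # Densest nodes first; the name tie-break keeps the plan deterministic.
    ranked = sorted(eligible, key=lambda n: (-n.density, n.name))

    chosen = []
    families_covered = set()
    # Pass 1: give each hardware family one performance node while the
    # requested count allows it.
    for node in ranked:
        if len(chosen) >= hp_count:
            break
        if node.family not in families_covered:
            families_covered.add(node.family)
            chosen.append(node.name)
    # Pass 2: fill any remaining slots strictly by density.
    for node in ranked:
        if len(chosen) >= hp_count:
            break
        if node.name not in chosen:
            chosen.append(node.name)

    return {n.name: "performance" if n.name in chosen else "eco" for n in ranked}
```

With two dense CPU nodes and one less dense GPU node, a requested count of 2 picks the densest CPU node and the GPU node, so each family keeps one performance node even though the second CPU node is denser than the GPU node.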

Scheduler Extender

Joulie ships a scheduler extender that steers workloads toward appropriate nodes based on power profile, thermal stress, and hardware capabilities.

How a pod gets scheduled (end-to-end)

When a new pod is created in the cluster, the following sequence occurs:

1. Pod created (e.g., kubectl apply, Job controller, Deployment rollout)
 |
2. kube-scheduler picks up the unscheduled pod
 |
3. kube-scheduler runs its default filters (resource fits, taints, affinity)
 |
4. kube-scheduler calls Joulie's /filter endpoint
 | - Sends: pod spec + candidate node list
 | - Joulie reads pod annotation joulie.io/workload-class
 | - Performance pods: reject nodes with schedulableClass = eco or draining
 | - Standard pods: pass all nodes
 | - Returns: filtered node list + rejection reasons
 |
5. kube-scheduler calls Joulie's /prioritize endpoint
 | - Sends: pod spec + surviving node list
 | - Joulie reads NodeTwin CRs (cached, 30s TTL) for power state
 | - Joulie reads NodeHardware CRs (cached, 30s TTL) for hardware specs
 | - Joulie extracts pod CPU/GPU requests for marginal power estimation
 | - Joulie scores each node 0-100 using the scoring formula
 | - Returns: list of (node, score) pairs
 |
6. kube-scheduler combines Joulie scores with its own plugin scores
 |
7. Pod is bound to the highest-scoring node

The extender participates in steps 4 and 5 only. It does not replace the Kubernetes scheduler; it extends it with energy-aware filter and scoring logic.
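
The filter rule from step 4 can be sketched as a pure function. This is a simplified model and not the extender's HTTP handler: the function name, the input shape (node name mapped to schedulableClass), and the wording of the rejection reasons are assumptions.

```python
def filter_nodes(workload_class, schedulable_class_by_node):
    """Return (passed node names, {rejected node name: reason})."""
    passed, rejected = [], {}
    for node, sched_class in schedulable_class_by_node.items():
        if workload_class == "performance" and sched_class in ("eco", "draining"):
            # Performance pods must not land on eco or draining nodes.
            rejected[node] = f"schedulableClass={sched_class} rejects performance pods"
        else:
            # Standard pods pass every node.
            passed.append(node)
    return passed, rejected
```

For a performance pod and the candidates `{"n1": "performance", "n2": "eco", "n3": "draining"}`, only `n1` survives; for a standard pod all three pass.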

Energy-Aware Scheduling

Joulie’s scheduler extender makes placement decisions informed by real-time energy telemetry, workload characteristics, and facility-level power conditions. This page describes the full pipeline from metrics collection through scoring and optional rescheduling.

End-to-end pipeline

The energy-aware scheduling pipeline has five stages:

Kepler + RAPL/NVML telemetry
 -> Prometheus (scrape & store)
 -> Digital twin (NodeTwin.status)
 -> Scheduler extender (filter + score)
 -> Placement decision

Each stage runs independently and communicates through Kubernetes CRDs or Prometheus queries. There is no monolithic scheduling engine; each component does one thing and feeds the next.

Input Telemetry and Actuation Interfaces

This page describes runtime IO contracts:

how Joulie reads telemetry inputs,
how Joulie sends control intents.

If you want the CRD-level summary first, read CRD and Policy Model. This page is the detailed runtime reference for the telemetry and control contract.

It is not the /metrics exposition contract. For exported metrics, see Metrics Reference.

Why this abstraction exists

Joulie must run in two worlds with the same control logic:

real hardware clusters,
simulator/KWOK clusters.

So agent/operator logic depends on provider interfaces, not directly on sysfs or simulator HTTP shape.
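
The idea of depending on a provider interface rather than on sysfs or a simulator's HTTP shape can be illustrated with a small sketch. All names here (`ControlProvider`, `InMemoryProvider`, `enforce_cap`) are hypothetical and do not come from the Joulie codebase; the in-memory class stands in for either a host or an HTTP backend.

```python
from typing import Protocol


class ControlProvider(Protocol):
    """What the control logic needs, regardless of the backend."""

    def read_power_watts(self) -> float: ...

    def set_cap_watts(self, watts: float) -> None: ...


class InMemoryProvider:
    """Stand-in backend; a host or HTTP provider would expose the same methods."""

    def __init__(self, power_watts):
        self.power_watts = power_watts
        self.cap_watts = None

    def read_power_watts(self):
        return self.power_watts

    def set_cap_watts(self, watts):
        self.cap_watts = watts


def enforce_cap(provider: ControlProvider, cap_watts: float) -> bool:
    """Apply the cap, then report whether measured power still exceeds it."""
    provider.set_cap_watts(cap_watts)
    return provider.read_power_watts() > cap_watts
```

Because `enforce_cap` only touches the two interface methods, the same function would work unchanged against real hardware and against a simulated node.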

Metrics Reference

Joulie exposes Prometheus metrics from multiple components.

This page covers operator + agent + scheduler extender metrics. Simulator metrics are documented separately in:

Simulator Metrics

For telemetry/control input interfaces (host/http routing), see:

Input Telemetry and Actuation Interfaces

Endpoints by component

Agent:
- path: /metrics
- default address: :8080
- env override: METRICS_ADDR
Operator:
- path: /metrics
- default address: :8081
- env override: METRICS_ADDR
Scheduler extender:
- path: /metrics
- default address: :9877
- env override: METRICS_ADDR
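
The default-plus-override behaviour listed above amounts to a small lookup. The sketch below is illustrative only: the component keys and the function name are hypothetical, and treating an empty METRICS_ADDR as unset is an assumption.

```python
import os

# Defaults as listed above; the dictionary keys are hypothetical labels.
DEFAULT_METRICS_ADDR = {
    "agent": ":8080",
    "operator": ":8081",
    "scheduler-extender": ":9877",
}


def metrics_addr(component, env=None):
    """Return the listen address for a component's /metrics endpoint."""
    env = os.environ if env is None else env
    # Assumption: an empty METRICS_ADDR is treated the same as an unset one.
    return env.get("METRICS_ADDR") or DEFAULT_METRICS_ADDR[component]
```

Since every component reads the same variable name, the override has to be set per container rather than once for the whole deployment.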

Agent metrics

Backend and selected cap

joulie_backend_mode{node,mode} (gauge)
- mode: none|rapl|dvfs
- active mode is 1, others 0
joulie_policy_cap_watts{node,policy} (gauge)
- current selected policy cap in watts

RAPL power/energy

joulie_rapl_energy_uj{node,zone} (gauge)
- latest raw RAPL energy counter in microjoules
joulie_rapl_estimated_power_watts{node,zone} (gauge)
- per-zone estimated power from energy deltas
joulie_rapl_package_total_power_watts{node} (gauge)
- sum of package-level estimated power
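
Estimating power from energy deltas means dividing the change in the microjoule counter by the sampling interval. The sketch below shows that arithmetic together with a simple wraparound correction; the function name and parameters are hypothetical, and the agent's actual sampling logic may differ. On Linux, the wrap point is typically read from the zone's max_energy_range_uj file.

```python
def rapl_power_watts(prev_uj, cur_uj, interval_s, max_range_uj):
    """Estimate average power in watts between two RAPL energy samples."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta_uj = cur_uj - prev_uj
    if delta_uj < 0:
        # The counter wrapped between samples; add back one full range.
        delta_uj += max_range_uj
    # Microjoules to joules, then joules per second gives watts.
    return delta_uj / 1e6 / interval_s
```

For example, a counter that moves from 1,000,000 µJ to 6,000,000 µJ in one second corresponds to 5 W. This correction assumes at most one wrap per interval, so the sampling period has to stay well below the time the counter takes to wrap.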

DVFS controller

joulie_dvfs_observed_power_watts{node} (gauge)
- observed package power used by DVFS controller
joulie_dvfs_ema_power_watts{node} (gauge)
- EMA-smoothed power used for decisions
joulie_dvfs_throttle_pct{node} (gauge)
- current throttle percentage
joulie_dvfs_above_trip_count{node} (gauge)
- consecutive above-threshold samples
joulie_dvfs_below_trip_count{node} (gauge)
- consecutive below-threshold samples
joulie_dvfs_actions_total{node,action} (counter)
- action: throttle_up|throttle_down
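
These gauges suggest a controller that smooths power with an EMA and acts only after several consecutive samples cross a threshold. The sketch below shows one way such a loop could fit together. It is not Joulie's controller: the class name and every parameter (EMA weight, trip count, step size, hysteresis band) are assumptions chosen for illustration, as is the reading of throttle_up as "throttle more".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DvfsSketch:
    cap_watts: float
    alpha: float = 0.3             # assumed EMA weight for new samples
    trip_samples: int = 3          # assumed consecutive samples before acting
    step_pct: int = 10             # assumed throttle step
    hysteresis_watts: float = 5.0  # assumed band below the cap
    ema_watts: Optional[float] = None
    above_trip_count: int = 0
    below_trip_count: int = 0
    throttle_pct: int = 0

    def observe(self, power_watts):
        """Feed one power sample; return an action name or None."""
        if self.ema_watts is None:
            self.ema_watts = power_watts
        else:
            self.ema_watts = (
                self.alpha * power_watts + (1 - self.alpha) * self.ema_watts
            )
        # Count consecutive samples on either side of the hysteresis band.
        if self.ema_watts > self.cap_watts:
            self.above_trip_count += 1
            self.below_trip_count = 0
        elif self.ema_watts < self.cap_watts - self.hysteresis_watts:
            self.below_trip_count += 1
            self.above_trip_count = 0
        else:
            self.above_trip_count = 0
            self.below_trip_count = 0

        if self.above_trip_count >= self.trip_samples and self.throttle_pct < 100:
            self.throttle_pct = min(100, self.throttle_pct + self.step_pct)
            self.above_trip_count = 0
            return "throttle_up"
        if self.below_trip_count >= self.trip_samples and self.throttle_pct > 0:
            self.throttle_pct = max(0, self.throttle_pct - self.step_pct)
            self.below_trip_count = 0
            return "throttle_down"
        return None
```

Requiring consecutive samples and leaving a band below the cap are common ways to keep such a controller from oscillating on short spikes.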

CPU frequency observability

joulie_dvfs_cpu_cur_freq_khz{node,cpu} (gauge)
- current CPU/policy frequency in kHz
joulie_dvfs_cpu_max_freq_khz{node,cpu} (gauge)
- enforced max frequency cap in kHz

Reliability

joulie_reconcile_errors_total{node} (counter)
- reconcile-loop errors

Operator metrics

FSM state and profile label

joulie_operator_node_state{node,state} (gauge)
- state: ActivePerformance|DrainingPerformance|ActiveEco
- active state is 1, others 0
joulie_operator_node_profile_label{node,profile} (gauge)
- operator-applied node label view
- profile: performance|eco
- active profile is 1, others 0
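
The "active is 1, others 0" convention is a one-hot encoding across label values. A minimal sketch of how the series for one node could be derived, assuming a hypothetical helper name:

```python
OPERATOR_STATES = ("ActivePerformance", "DrainingPerformance", "ActiveEco")


def one_hot(active, values):
    """Map each label value to 1 if it is the active one, else 0."""
    if active not in values:
        raise ValueError(f"unknown value: {active!r}")
    return {value: (1 if value == active else 0) for value in values}
```

With this encoding, a PromQL filter such as `joulie_operator_node_state == 1` keeps only the series for each node's current state.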

Transition accounting

joulie_operator_state_transitions_total{node,from_state,to_state,result} (counter)
- transition events emitted by operator
- result:
  - applied: transition committed
  - deferred: transition blocked/deferred by safeguards

Heterogeneous planning

joulie_operator_node_compute_density{node,component} (gauge)
- normalized per-node density signal used for heterogeneous planning
- component: cpu|gpu
- higher values mean the operator considers that node relatively denser for that subsystem

Scheduler extender metrics

Request counters

joulie_scheduler_filter_requests_total{workload_class} (counter)
- total filter requests by workload class
- workload_class: standard|performance
joulie_scheduler_prioritize_requests_total{workload_class} (counter)
- total prioritize (scoring) requests by workload class

Request latency

joulie_scheduler_filter_duration_seconds{workload_class} (histogram)
- time to process a filter request
joulie_scheduler_prioritize_duration_seconds{workload_class} (histogram)
- time to process a prioritize request

Scoring signals

joulie_scheduler_final_node_score{node,workload_class} (gauge)
- final scheduling score (0-100) for each node and workload class
- updated on every prioritize call; reflects the combined headroom + cooling + trend + bonus formula
joulie_scheduler_node_headroom_score{node} (gauge)
- power headroom score per node
- can go negative when projected power (measured + pod marginal) exceeds the capped budget
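
The source does not give the headroom formula, only that the score goes negative when projected power exceeds the capped budget. One formula consistent with that description is the remaining budget as a percentage of the cap. The sketch below uses it purely as an assumption, with hypothetical names.

```python
def headroom_score(cap_watts, measured_watts, pod_marginal_watts):
    """Remaining budget as a percentage of the cap; negative when over budget."""
    if cap_watts <= 0:
        raise ValueError("cap_watts must be positive")
    # Projected power is measured power plus the pod's estimated marginal draw.
    projected_watts = measured_watts + pod_marginal_watts
    # Assumption: linear headroom relative to the capped budget.
    return 100.0 * (cap_watts - projected_watts) / cap_watts
```

Under this assumed formula, a 500 W cap with 450 W measured and a 100 W pod gives a score of -10, which signals that placing the pod would push the node past its budget.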

Data freshness

joulie_scheduler_stale_twin_data{node} (gauge)
- 1 if the NodeTwin status is older than the staleness threshold (default 5m), 0 otherwise
- a node with stale data receives a neutral score (50) instead of its computed value
- useful for alerting when the operator has stopped updating twin status
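
The staleness rule can be sketched as a wrapper around the computed score. The 5-minute threshold and the neutral score of 50 come from the description above; the function name, the seconds-based age input, and the clamping of fresh scores to 0-100 are assumptions.

```python
NEUTRAL_SCORE = 50
STALENESS_THRESHOLD_S = 5 * 60  # default threshold of 5 minutes


def effective_score(computed_score, status_age_s, threshold_s=STALENESS_THRESHOLD_S):
    """Return (score to use, stale flag as 0 or 1)."""
    stale = status_age_s > threshold_s
    if stale:
        # Stale twin data: ignore the computed value and stay neutral.
        return NEUTRAL_SCORE, 1
    # Assumption: fresh scores are clamped to the documented 0-100 range.
    return max(0, min(100, computed_score)), 0
```

A neutral score keeps a node with stale data eligible without letting an outdated reading either attract or repel new pods.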

Notes

Metrics are pull-based; values depend on scrape interval.
Highest cardinality is usually per-CPU frequency series.