Configuration Reference
Complete reference for all Joulie environment variables. These are set via Helm values or directly in the Deployment/DaemonSet manifests.
Defaults listed below are the code defaults. The Helm chart (charts/joulie/values.yaml) overrides some of them — notably, the operator NODE_SELECTOR defaults to joulie.io/managed=true in the chart even though the code default is node-role.kubernetes.io/worker.
Agent
| Variable | Default | Description |
|---|
AGENT_MODE | daemonset | daemonset (one agent per node) or pool (shared agents with sharding) |
NODE_NAME | (required in daemonset mode) | Name of the node this agent manages |
RECONCILE_INTERVAL | 20s | How often the agent reconciles desired state |
METRICS_ADDR | :8080 | Address for the Prometheus metrics endpoint |
SIMULATE_ONLY | false | If true, agent discovers hardware but does not apply power caps |
HARDWARE_CATALOG_PATH | simulator/catalog/hardware.yaml | Path to the hardware inventory catalog YAML |
Agent pool mode
| Variable | Default | Description |
|---|
POOL_NODE_SELECTOR | joulie.io/managed=true | Label selector for nodes managed by pool agents |
POOL_SHARDS | 1 | Total number of shards for pool mode partitioning |
POOL_SHARD_ID | (from pod ordinal) | Shard ID for this agent instance |
Agent DVFS control
| Variable | Default | Description |
|---|
DVFS_EMA_ALPHA | 0.3 | Exponential moving average smoothing factor for power tracking |
DVFS_HIGH_MARGIN_W | 10.0 | Power above cap (watts) to trigger frequency reduction |
DVFS_LOW_MARGIN_W | 15.0 | Power below cap (watts) to trigger frequency increase |
DVFS_STEP_PCT | 10 | Frequency throttle step size (%) |
DVFS_COOLDOWN | 20s | Minimum duration between DVFS adjustments |
DVFS_TRIP_COUNT | 2 | Consecutive samples outside margin before acting |
DVFS_MIN_FREQ_KHZ | 1500000 | Floor frequency for DVFS throttling (kHz) |
Agent telemetry and control backends
| Variable | Default | Description |
|---|
TELEMETRY_CPU_SOURCE | host | CPU telemetry source: host, http, prometheus, none |
TELEMETRY_CPU_CONTROL | host | CPU control backend: host, http, none |
TELEMETRY_GPU_CONTROL | host | GPU control backend: host, http, none |
TELEMETRY_CPU_HTTP_ENDPOINT | (empty) | HTTP endpoint for CPU telemetry (e.g., http://sim:18080/telemetry/{node}) |
TELEMETRY_CPU_CONTROL_HTTP_ENDPOINT | (empty) | HTTP endpoint for CPU control (e.g., http://sim:18080/control/{node}) |
TELEMETRY_CPU_CONTROL_MODE | (empty) | CPU control mode override |
TELEMETRY_GPU_CONTROL_HTTP_ENDPOINT | (empty) | HTTP endpoint for GPU control (e.g., http://sim:18080/control/{node}) |
TELEMETRY_GPU_CONTROL_MODE | (empty) | GPU control mode override |
TELEMETRY_HTTP_TIMEOUT_SECONDS | 5 | HTTP client timeout for telemetry/control requests |
Operator
| Variable | Default | Description |
|---|
RECONCILE_INTERVAL | 1m | How often the operator reconciles cluster state |
METRICS_ADDR | :8081 | Address for the Prometheus metrics endpoint |
NODE_SELECTOR | node-role.kubernetes.io/worker | Label selector for managed nodes |
RESERVED_LABEL_KEY | joulie.io/reserved | Label key for nodes excluded from policy decisions |
POWER_PROFILE_LABEL | joulie.io/power-profile | Node label key for the active power profile |
OPERATOR_NODE_POWER_SOURCE | static | Node power data source: static, http, prometheus |
OPERATOR_NODE_POWER_HTTP_ENDPOINT | (empty) | HTTP endpoint for per-node power readings |
OPERATOR_NODE_POWER_PROMETHEUS_ADDRESS | (empty) | Prometheus address for per-node power queries |
OPERATOR_NODE_POWER_PROMETHEUS_QUERY | (empty) | PromQL query for per-node power readings |
Power cap configuration
| Variable | Default | Description |
|---|
PERFORMANCE_CAP_WATTS | 5000 | Absolute CPU power cap for performance nodes (watts) |
ECO_CAP_WATTS | 120 | Absolute CPU power cap for eco nodes (watts) |
CPU_PERFORMANCE_CAP_PCT_OF_MAX | 100 | CPU cap as percentage of max for performance nodes |
CPU_ECO_CAP_PCT_OF_MAX | 60 | CPU cap as percentage of max for eco nodes |
CPU_WRITE_ABSOLUTE_CAPS | false | If true, write absolute watts instead of percentage |
GPU_PERFORMANCE_CAP_PCT_OF_MAX | 100 | GPU cap as percentage of max for performance nodes |
GPU_ECO_CAP_PCT_OF_MAX | 60 | GPU cap as percentage of max for eco nodes |
GPU_WRITE_ABSOLUTE_CAPS | false | If true, write absolute GPU watts instead of percentage |
GPU_MODEL_CAPS_JSON | {} | JSON map of GPU model name to {"minCapWatts": N, "maxCapWatts": M} |
GPU_PRODUCT_LABEL_KEYS | joulie.io/gpu.product,... | Comma-separated node label keys to read GPU product name |
Policy configuration
| Variable | Default | Description |
|---|
POLICY_TYPE | static_partition | Policy algorithm: static_partition, queue_aware_v1, or rule_swap_v1 |
STATIC_HP_FRAC | 0.50 | Fraction of nodes allocated to performance in static_partition |
QUEUE_HP_BASE_FRAC | 0.60 | Base fraction of performance nodes in queue_aware_v1 |
QUEUE_HP_MIN | 1 | Minimum performance nodes in queue_aware_v1 |
QUEUE_HP_MAX | 1000000 | Maximum performance nodes in queue_aware_v1 |
QUEUE_PERF_PER_HP_NODE | 10 | Performance pods per performance node ratio in queue_aware_v1 |
Facility metrics
| Variable | Default | Description |
|---|
ENABLE_FACILITY_METRICS | false | Enable polling data-center-level metrics from Prometheus |
FACILITY_PROMETHEUS_ADDRESS | http://prometheus-operated.monitoring:9090 | Prometheus endpoint for facility metric queries |
FACILITY_POLL_INTERVAL | 30s | How often facility metrics are polled |
FACILITY_AMBIENT_TEMP_METRIC | datacenter_ambient_temperature_celsius | PromQL metric name for ambient temperature |
FACILITY_IT_POWER_METRIC | datacenter_total_it_power_watts | PromQL metric name for total IT power draw |
FACILITY_COOLING_POWER_METRIC | datacenter_cooling_power_watts | PromQL metric name for cooling infrastructure power |
FACILITY_ZONE_AMBIENT_METRIC_TEMPLATE | (empty) | PromQL template for per-zone ambient temperature, e.g. datacenter_ambient_temperature_celsius{zone="%s"}. Use %s as the zone name placeholder. Empty = disabled. (planned — not yet wired to env vars) |
FACILITY_RACK_POWER_METRIC_TEMPLATE | (empty) | PromQL template for per-rack power draw, e.g. datacenter_rack_power_watts{rack="%s"}. Use %s as the rack name placeholder. Empty = disabled. (planned — not yet wired to env vars) |
Node topology
Joulie supports optional per-rack PSU stress and per-zone cooling stress. This is activated by adding standard node labels:
joulie.io/rack: physical rack identifier (e.g., rack-1)joulie.io/cooling-zone: cooling zone identifier (e.g., zone-a)
When these labels are present, the operator computes PSU stress per-rack (sum of estimated node power within the rack) instead of cluster-wide, and uses per-zone ambient temperature from facility metrics instead of the global value. The twin model interfaces remain the same; topology just groups nodes for more accurate stress computation.
Nodes without topology labels fall back to cluster-wide stress computation.
Scheduler extender
| Variable | Default | Description |
|---|
EXTENDER_ADDR | :9876 | Bind address for the scheduler extender HTTP server |
METRICS_ADDR | :9877 | Address for the scheduler’s Prometheus metrics endpoint |
CACHE_TTL | 30s | TTL for the NodeTwin status cache |
TWIN_STALENESS_THRESHOLD | 5m | Duration after which NodeTwin data is considered stale |
kubectl plugin
The kubectl joulie plugin requires no configuration. It reads your current kubeconfig context.
# Install
go build -o kubectl-joulie ./cmd/kubectl-joulie
mv kubectl-joulie /usr/local/bin/
# Usage
kubectl joulie status # cluster energy overview