# Configuration Reference

Complete reference for all Joulie environment variables. These are set via Helm values or directly in the Deployment/DaemonSet manifests.

The defaults listed below are the code defaults. The Helm chart (`charts/joulie/values.yaml`) overrides some of them; notably, the operator `NODE_SELECTOR` defaults to `joulie.io/managed=true` in the chart even though the code default is `node-role.kubernetes.io/worker`.

## Agent

| Variable | Default | Description |
|---|---|---|
| `AGENT_MODE` | `daemonset` | `daemonset` (one agent per node) or `pool` (shared agents with sharding) |
| `NODE_NAME` | (required in `daemonset` mode) | Name of the node this agent manages |
| `RECONCILE_INTERVAL` | `20s` | How often the agent reconciles desired state |
| `METRICS_ADDR` | `:8080` | Address for the Prometheus metrics endpoint |
| `SIMULATE_ONLY` | `false` | If `true`, the agent discovers hardware but does not apply power caps |
| `HARDWARE_CATALOG_PATH` | `simulator/catalog/hardware.yaml` | Path to the hardware inventory catalog YAML |
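As an illustration, the core agent settings can be wired into a DaemonSet pod spec like this (a minimal sketch; the container name and image are hypothetical, and `NODE_NAME` is populated via the standard Kubernetes downward API):

```yaml
# Hypothetical excerpt from a Joulie agent DaemonSet pod spec.
containers:
  - name: joulie-agent            # illustrative container name
    image: joulie/agent:latest    # illustrative image
    env:
      - name: AGENT_MODE
        value: "daemonset"
      - name: NODE_NAME           # required in daemonset mode
        valueFrom:
          fieldRef:
            fieldPath: spec.nodeName   # downward API: the node this pod landed on
      - name: RECONCILE_INTERVAL
        value: "10s"              # override the 20s code default
```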

## Agent pool mode

| Variable | Default | Description |
|---|---|---|
| `POOL_NODE_SELECTOR` | `joulie.io/managed=true` | Label selector for nodes managed by pool agents |
| `POOL_SHARDS` | `1` | Total number of shards for pool-mode partitioning |
| `POOL_SHARD_ID` | (from pod ordinal) | Shard ID for this agent instance |
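Since `POOL_SHARD_ID` comes from the pod ordinal, pool agents fit naturally in a StatefulSet. One way to wire the ordinal in (a sketch; assumes Kubernetes 1.28+, where the `apps.kubernetes.io/pod-index` pod label is available):

```yaml
# Hypothetical excerpt from a pool-mode StatefulSet with 3 replicas.
env:
  - name: AGENT_MODE
    value: "pool"
  - name: POOL_SHARDS
    value: "3"                   # should match the StatefulSet replica count
  - name: POOL_SHARD_ID
    valueFrom:
      fieldRef:
        fieldPath: metadata.labels['apps.kubernetes.io/pod-index']
```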

## Agent DVFS control

| Variable | Default | Description |
|---|---|---|
| `DVFS_EMA_ALPHA` | `0.3` | Exponential moving average smoothing factor for power tracking |
| `DVFS_HIGH_MARGIN_W` | `10.0` | Power above the cap (watts) that triggers a frequency reduction |
| `DVFS_LOW_MARGIN_W` | `15.0` | Power below the cap (watts) that triggers a frequency increase |
| `DVFS_STEP_PCT` | `10` | Frequency throttle step size (%) |
| `DVFS_COOLDOWN` | `20s` | Minimum duration between DVFS adjustments |
| `DVFS_TRIP_COUNT` | `2` | Consecutive samples outside the margin before acting |
| `DVFS_MIN_FREQ_KHZ` | `1500000` | Floor frequency for DVFS throttling (kHz) |
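To make the margins concrete: with a 200 W cap and the defaults above, the agent steps frequency down by 10% only after the smoothed power exceeds 210 W for 2 consecutive samples, and steps it back up only once power falls below 185 W, never going under 1.5 GHz. A more conservative tuning might look like this (illustrative values, not recommendations):

```yaml
# Illustrative, more conservative DVFS tuning.
env:
  - name: DVFS_EMA_ALPHA
    value: "0.2"         # smooth power readings more heavily
  - name: DVFS_STEP_PCT
    value: "5"           # smaller frequency steps
  - name: DVFS_COOLDOWN
    value: "60s"         # wait longer between adjustments
  - name: DVFS_TRIP_COUNT
    value: "3"           # require three consecutive out-of-margin samples
```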

## Agent telemetry and control backends

| Variable | Default | Description |
|---|---|---|
| `TELEMETRY_CPU_SOURCE` | `host` | CPU telemetry source: `host`, `http`, `prometheus`, `none` |
| `TELEMETRY_CPU_CONTROL` | `host` | CPU control backend: `host`, `http`, `none` |
| `TELEMETRY_GPU_CONTROL` | `host` | GPU control backend: `host`, `http`, `none` |
| `TELEMETRY_CPU_HTTP_ENDPOINT` | (empty) | HTTP endpoint for CPU telemetry (e.g., `http://sim:18080/telemetry/{node}`) |
| `TELEMETRY_CPU_CONTROL_HTTP_ENDPOINT` | (empty) | HTTP endpoint for CPU control (e.g., `http://sim:18080/control/{node}`) |
| `TELEMETRY_CPU_CONTROL_MODE` | (empty) | CPU control mode override |
| `TELEMETRY_GPU_CONTROL_HTTP_ENDPOINT` | (empty) | HTTP endpoint for GPU control (e.g., `http://sim:18080/control/{node}`) |
| `TELEMETRY_GPU_CONTROL_MODE` | (empty) | GPU control mode override |
| `TELEMETRY_HTTP_TIMEOUT_SECONDS` | `5` | HTTP client timeout for telemetry/control requests |
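For example, to drive an agent from HTTP backends instead of the host (a sketch reusing the example `sim:18080` endpoints from the table; `{node}` is the per-node placeholder shown there):

```yaml
# Point CPU telemetry and CPU/GPU control at an HTTP backend such as the simulator.
env:
  - name: TELEMETRY_CPU_SOURCE
    value: "http"
  - name: TELEMETRY_CPU_HTTP_ENDPOINT
    value: "http://sim:18080/telemetry/{node}"
  - name: TELEMETRY_CPU_CONTROL
    value: "http"
  - name: TELEMETRY_CPU_CONTROL_HTTP_ENDPOINT
    value: "http://sim:18080/control/{node}"
  - name: TELEMETRY_GPU_CONTROL
    value: "http"
  - name: TELEMETRY_GPU_CONTROL_HTTP_ENDPOINT
    value: "http://sim:18080/control/{node}"
```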

## Operator

| Variable | Default | Description |
|---|---|---|
| `RECONCILE_INTERVAL` | `1m` | How often the operator reconciles cluster state |
| `METRICS_ADDR` | `:8081` | Address for the Prometheus metrics endpoint |
| `NODE_SELECTOR` | `node-role.kubernetes.io/worker` | Label selector for managed nodes |
| `RESERVED_LABEL_KEY` | `joulie.io/reserved` | Label key for nodes excluded from policy decisions |
| `POWER_PROFILE_LABEL` | `joulie.io/power-profile` | Node label key for the active power profile |
| `OPERATOR_NODE_POWER_SOURCE` | `static` | Node power data source: `static`, `http`, `prometheus` |
| `OPERATOR_NODE_POWER_HTTP_ENDPOINT` | (empty) | HTTP endpoint for per-node power readings |
| `OPERATOR_NODE_POWER_PROMETHEUS_ADDRESS` | (empty) | Prometheus address for per-node power queries |
| `OPERATOR_NODE_POWER_PROMETHEUS_QUERY` | (empty) | PromQL query for per-node power readings |
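For example, to read per-node power from Prometheus instead of static estimates (a sketch; the Prometheus address mirrors the facility-metrics default, and the `node_power_watts` metric name is a hypothetical placeholder for whatever your exporter publishes):

```yaml
env:
  - name: OPERATOR_NODE_POWER_SOURCE
    value: "prometheus"
  - name: OPERATOR_NODE_POWER_PROMETHEUS_ADDRESS
    value: "http://prometheus-operated.monitoring:9090"
  - name: OPERATOR_NODE_POWER_PROMETHEUS_QUERY
    value: 'sum by (node) (node_power_watts)'   # hypothetical metric; substitute your own
```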

## Power cap configuration

| Variable | Default | Description |
|---|---|---|
| `PERFORMANCE_CAP_WATTS` | `5000` | Absolute CPU power cap for performance nodes (watts) |
| `ECO_CAP_WATTS` | `120` | Absolute CPU power cap for eco nodes (watts) |
| `CPU_PERFORMANCE_CAP_PCT_OF_MAX` | `100` | CPU cap as a percentage of max for performance nodes |
| `CPU_ECO_CAP_PCT_OF_MAX` | `60` | CPU cap as a percentage of max for eco nodes |
| `CPU_WRITE_ABSOLUTE_CAPS` | `false` | If `true`, write absolute watts instead of a percentage |
| `GPU_PERFORMANCE_CAP_PCT_OF_MAX` | `100` | GPU cap as a percentage of max for performance nodes |
| `GPU_ECO_CAP_PCT_OF_MAX` | `60` | GPU cap as a percentage of max for eco nodes |
| `GPU_WRITE_ABSOLUTE_CAPS` | `false` | If `true`, write absolute GPU watts instead of a percentage |
| `GPU_MODEL_CAPS_JSON` | `{}` | JSON map of GPU model name to `{"minCapWatts": N, "maxCapWatts": M}` |
| `GPU_PRODUCT_LABEL_KEYS` | `joulie.io/gpu.product,...` | Comma-separated node label keys to read the GPU product name from |
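Per-model GPU cap bounds are supplied as a JSON string in the shape shown above; for instance (a sketch; the model name and wattages are hypothetical):

```yaml
env:
  - name: GPU_MODEL_CAPS_JSON
    # Hypothetical model name and wattages. Keys should match the GPU product
    # name read from the labels listed in GPU_PRODUCT_LABEL_KEYS.
    value: '{"NVIDIA A100-SXM4-80GB": {"minCapWatts": 100, "maxCapWatts": 400}}'
```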

## Policy configuration

| Variable | Default | Description |
|---|---|---|
| `POLICY_TYPE` | `static_partition` | Policy algorithm: `static_partition`, `queue_aware_v1`, or `rule_swap_v1` |
| `STATIC_HP_FRAC` | `0.50` | Fraction of nodes allocated to performance in `static_partition` |
| `QUEUE_HP_BASE_FRAC` | `0.60` | Base fraction of performance nodes in `queue_aware_v1` |
| `QUEUE_HP_MIN` | `1` | Minimum performance nodes in `queue_aware_v1` |
| `QUEUE_HP_MAX` | `1000000` | Maximum performance nodes in `queue_aware_v1` |
| `QUEUE_PERF_PER_HP_NODE` | `10` | Performance pods per performance node ratio in `queue_aware_v1` |
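Switching the operator to the queue-aware policy might look like this (illustrative values; the exact sizing formula is internal to `queue_aware_v1`):

```yaml
env:
  - name: POLICY_TYPE
    value: "queue_aware_v1"
  - name: QUEUE_HP_BASE_FRAC
    value: "0.40"        # start with 40% of nodes in the performance tier
  - name: QUEUE_HP_MIN
    value: "2"           # never shrink below two performance nodes
  - name: QUEUE_HP_MAX
    value: "20"          # cap growth at twenty performance nodes
  - name: QUEUE_PERF_PER_HP_NODE
    value: "10"          # ten performance pods per performance node
```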

## Facility metrics

| Variable | Default | Description |
|---|---|---|
| `ENABLE_FACILITY_METRICS` | `false` | Enable polling data-center-level metrics from Prometheus |
| `FACILITY_PROMETHEUS_ADDRESS` | `http://prometheus-operated.monitoring:9090` | Prometheus endpoint for facility metric queries |
| `FACILITY_POLL_INTERVAL` | `30s` | How often facility metrics are polled |
| `FACILITY_AMBIENT_TEMP_METRIC` | `datacenter_ambient_temperature_celsius` | PromQL metric name for ambient temperature |
| `FACILITY_IT_POWER_METRIC` | `datacenter_total_it_power_watts` | PromQL metric name for total IT power draw |
| `FACILITY_COOLING_POWER_METRIC` | `datacenter_cooling_power_watts` | PromQL metric name for cooling infrastructure power |
| `FACILITY_ZONE_AMBIENT_METRIC_TEMPLATE` | (empty) | PromQL template for per-zone ambient temperature, e.g. `datacenter_ambient_temperature_celsius{zone="%s"}`, with `%s` as the zone name placeholder; empty disables it (planned; not yet wired to env vars) |
| `FACILITY_RACK_POWER_METRIC_TEMPLATE` | (empty) | PromQL template for per-rack power draw, e.g. `datacenter_rack_power_watts{rack="%s"}`, with `%s` as the rack name placeholder; empty disables it (planned; not yet wired to env vars) |
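With the default metric names and Prometheus address, enabling facility metrics only requires flipping the flag (a sketch; the poll interval override is illustrative):

```yaml
env:
  - name: ENABLE_FACILITY_METRICS
    value: "true"
  - name: FACILITY_PROMETHEUS_ADDRESS
    value: "http://prometheus-operated.monitoring:9090"   # the code default, shown for clarity
  - name: FACILITY_POLL_INTERVAL
    value: "1m"          # poll less often than the 30s default
```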

## Node topology

Joulie supports optional per-rack PSU stress and per-zone cooling stress, enabled by adding node labels:

- `joulie.io/rack`: physical rack identifier (e.g., `rack-1`)
- `joulie.io/cooling-zone`: cooling zone identifier (e.g., `zone-a`)

When these labels are present, the operator computes PSU stress per rack (as the sum of estimated node power within the rack) instead of cluster-wide, and uses per-zone ambient temperature from facility metrics instead of the global value. The twin model interfaces remain the same; topology just groups nodes for more accurate stress computation.

Nodes without topology labels fall back to cluster-wide stress computation.
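The topology labels live in ordinary node metadata, for example (the node name is hypothetical):

```yaml
apiVersion: v1
kind: Node
metadata:
  name: worker-1                    # hypothetical node name
  labels:
    joulie.io/rack: rack-1          # PSU stress grouped with the other rack-1 nodes
    joulie.io/cooling-zone: zone-a  # ambient temperature taken from zone-a
```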

## Scheduler extender

| Variable | Default | Description |
|---|---|---|
| `EXTENDER_ADDR` | `:9876` | Bind address for the scheduler extender HTTP server |
| `METRICS_ADDR` | `:9877` | Address for the scheduler's Prometheus metrics endpoint |
| `CACHE_TTL` | `30s` | TTL for the NodeTwin status cache |
| `TWIN_STALENESS_THRESHOLD` | `5m` | Duration after which NodeTwin data is considered stale |
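The extender also has to be registered with kube-scheduler. A minimal registration might look like this (a sketch; the Service name and the `filter` verb path are assumptions about how Joulie exposes the extender, not confirmed by this reference, though the port matches the `EXTENDER_ADDR` default):

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
extenders:
  - urlPrefix: "http://joulie-extender:9876"  # hypothetical Service name
    filterVerb: "filter"                      # assumed filter endpoint path
    enableHTTPS: false
    nodeCacheCapable: false
```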

## kubectl plugin

The `kubectl joulie` plugin requires no configuration; it reads your current kubeconfig context.

```shell
# Install
go build -o kubectl-joulie ./cmd/kubectl-joulie
mv kubectl-joulie /usr/local/bin/

# Usage
kubectl joulie status      # cluster energy overview
```