Configuration Reference

Complete reference for all Joulie environment variables. These are set via Helm values or directly in the Deployment/DaemonSet manifests.

Agent

| Variable | Default | Description |
|---|---|---|
| AGENT_MODE | daemonset | daemonset (one agent per node) or pool (shared agents with sharding) |
| NODE_NAME | (required in daemonset mode) | Name of the node this agent manages |
| RECONCILE_INTERVAL | 20s | How often the agent reconciles desired state |
| METRICS_ADDR | :8080 | Address for the Prometheus metrics endpoint |
| SIMULATE_ONLY | false | If true, agent discovers hardware but does not apply power caps |
| HARDWARE_CATALOG_PATH | simulator/catalog/hardware.yaml | Path to the hardware inventory catalog YAML |
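As an illustration of how these variables and their defaults fit together, the sketch below resolves the agent settings from an environment mapping. It is not Joulie's own loader: the function names are hypothetical, and it assumes that intervals use Go-style duration strings such as 20s or 1m and that booleans are the literal strings true and false.

```python
import re

# Defaults copied from the Agent table above.
AGENT_DEFAULTS = {
    "AGENT_MODE": "daemonset",
    "RECONCILE_INTERVAL": "20s",
    "METRICS_ADDR": ":8080",
    "SIMULATE_ONLY": "false",
    "HARDWARE_CATALOG_PATH": "simulator/catalog/hardware.yaml",
}

_UNIT_SECONDS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}


def parse_duration(text):
    """Parse a simple Go-style duration ('20s', '1m30s') into seconds (assumed format)."""
    parts = re.findall(r"(\d+(?:\.\d+)?)(ms|s|m|h)", text)
    # Reject strings that contain anything besides number+unit pairs.
    if not parts or "".join(n + u for n, u in parts) != text:
        raise ValueError(f"invalid duration: {text!r}")
    return sum(float(n) * _UNIT_SECONDS[u] for n, u in parts)


def load_agent_config(env):
    """Resolve agent settings from a mapping such as os.environ (hypothetical helper)."""
    cfg = {key: env.get(key, default) for key, default in AGENT_DEFAULTS.items()}
    if cfg["AGENT_MODE"] not in ("daemonset", "pool"):
        raise ValueError(f"unknown AGENT_MODE: {cfg['AGENT_MODE']!r}")
    node_name = env.get("NODE_NAME", "")
    # NODE_NAME has no default; it is only mandatory in daemonset mode.
    if cfg["AGENT_MODE"] == "daemonset" and not node_name:
        raise ValueError("NODE_NAME is required in daemonset mode")
    return {
        "mode": cfg["AGENT_MODE"],
        "node_name": node_name,
        "reconcile_s": parse_duration(cfg["RECONCILE_INTERVAL"]),
        "metrics_addr": cfg["METRICS_ADDR"],
        "simulate_only": cfg["SIMULATE_ONLY"].lower() == "true",
        "catalog_path": cfg["HARDWARE_CATALOG_PATH"],
    }
```

Passing os.environ as env would resolve the settings of the current process; passing a plain dict is convenient for checking a set of Helm values before deploying.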

Agent pool mode

| Variable | Default | Description |
|---|---|---|
| POOL_NODE_SELECTOR | node-role.kubernetes.io/worker | Label selector for nodes managed by pool agents |
| POOL_SHARDS | 1 | Total number of shards for pool mode partitioning |
| POOL_SHARD_ID | (from pod ordinal) | Shard ID for this agent instance |
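The reference does not say how nodes are mapped to shards. The sketch below shows one common approach, purely as an assumption: hash the node name and take it modulo POOL_SHARDS, with the shard ID read from the trailing ordinal of a StatefulSet-style pod name. The pod name and all function names are hypothetical.

```python
import re
import zlib


def shard_id_from_pod_name(pod_name):
    """Read the trailing ordinal of a StatefulSet-style pod name (assumed naming)."""
    match = re.search(r"-(\d+)$", pod_name)
    if match is None:
        raise ValueError(f"no ordinal suffix in pod name: {pod_name!r}")
    return int(match.group(1))


def shard_for_node(node_name, total_shards):
    """Map a node to a shard with a stable hash (assumed scheme, not confirmed)."""
    # crc32 is deterministic across processes, unlike Python's built-in hash().
    return zlib.crc32(node_name.encode("utf-8")) % total_shards


def nodes_for_shard(nodes, shard_id, total_shards):
    """Return the subset of nodes that the agent with this shard ID would manage."""
    return [n for n in nodes if shard_for_node(n, total_shards) == shard_id]
```

With POOL_SHARDS left at its default of 1, every selected node lands in shard 0, so a single pool agent manages all of them.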

Agent DVFS control

| Variable | Default | Description |
|---|---|---|
| DVFS_EMA_ALPHA | 0.3 | Exponential moving average smoothing factor for power tracking |
| DVFS_UPPER_MARGIN_PCT | 5 | Power above cap threshold to trigger frequency reduction (%) |
| DVFS_LOWER_MARGIN_PCT | 10 | Power below cap threshold to trigger frequency increase (%) |
| DVFS_STEP_PCT | 5 | Frequency throttle step size (%) |
| DVFS_COOLDOWN_S | 5 | Minimum seconds between DVFS adjustments |
| DVFS_TRIP_ABOVE_THRESHOLD | 3 | Consecutive above-threshold samples before throttling |
| DVFS_TRIP_BELOW_THRESHOLD | 3 | Consecutive below-threshold samples before unthrottling |
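To show how these seven parameters could interact, here is a minimal sketch of a hysteresis controller. It is one reading of the table, not the agent's actual code: it assumes the thresholds are cap × (1 + upper margin) and cap × (1 − lower margin), that the EMA is seeded with the first sample, and that throttle is expressed as a percentage reduction from full frequency. The class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DvfsController:
    cap_watts: float
    ema_alpha: float = 0.3          # DVFS_EMA_ALPHA
    upper_margin_pct: float = 5.0   # DVFS_UPPER_MARGIN_PCT
    lower_margin_pct: float = 10.0  # DVFS_LOWER_MARGIN_PCT
    step_pct: float = 5.0           # DVFS_STEP_PCT
    cooldown_s: float = 5.0         # DVFS_COOLDOWN_S
    trip_above: int = 3             # DVFS_TRIP_ABOVE_THRESHOLD
    trip_below: int = 3             # DVFS_TRIP_BELOW_THRESHOLD
    throttle_pct: float = 0.0       # 0 = full frequency (assumed convention)
    ema_watts: Optional[float] = None
    _above: int = 0
    _below: int = 0
    _last_change_s: float = float("-inf")

    def observe(self, power_watts, now_s):
        """Feed one power sample and return the resulting throttle percentage."""
        # Smooth the raw sample with an exponential moving average.
        if self.ema_watts is None:
            self.ema_watts = power_watts
        else:
            a = self.ema_alpha
            self.ema_watts = a * power_watts + (1 - a) * self.ema_watts

        upper = self.cap_watts * (1 + self.upper_margin_pct / 100)
        lower = self.cap_watts * (1 - self.lower_margin_pct / 100)

        # Count consecutive samples outside the dead band; reset on reversal.
        if self.ema_watts > upper:
            self._above, self._below = self._above + 1, 0
        elif self.ema_watts < lower:
            self._above, self._below = 0, self._below + 1
        else:
            self._above = self._below = 0

        # Respect the cooldown between adjustments.
        if now_s - self._last_change_s < self.cooldown_s:
            return self.throttle_pct

        if self._above >= self.trip_above:
            self.throttle_pct = min(100.0, self.throttle_pct + self.step_pct)
            self._above, self._last_change_s = 0, now_s
        elif self._below >= self.trip_below and self.throttle_pct > 0:
            self.throttle_pct = max(0.0, self.throttle_pct - self.step_pct)
            self._below, self._last_change_s = 0, now_s
        return self.throttle_pct
```

The asymmetric margins (5% above, 10% below) create a dead band around the cap, and the trip counters plus cooldown keep short spikes from causing frequency oscillation.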

Operator

| Variable | Default | Description |
|---|---|---|
| RECONCILE_INTERVAL | 1m | How often the operator reconciles cluster state |
| METRICS_ADDR | :8081 | Address for the Prometheus metrics endpoint |
| NODE_SELECTOR | node-role.kubernetes.io/worker | Label selector for managed nodes |
| RESERVED_LABEL_KEY | joulie.io/reserved | Label key for nodes excluded from policy decisions |
| POWER_PROFILE_LABEL | joulie.io/power-profile | Node label key for the active power profile |

Power cap configuration

| Variable | Default | Description |
|---|---|---|
| PERFORMANCE_CAP_WATTS | 5000 | Absolute CPU power cap for performance nodes (watts) |
| ECO_CAP_WATTS | 120 | Absolute CPU power cap for eco nodes (watts) |
| CPU_PERFORMANCE_CAP_PCT_OF_MAX | 100 | CPU cap as percentage of max for performance nodes |
| CPU_ECO_CAP_PCT_OF_MAX | 60 | CPU cap as percentage of max for eco nodes |
| CPU_WRITE_ABSOLUTE_CAPS | false | If true, write absolute watts instead of percentage |
| GPU_PERFORMANCE_CAP_PCT_OF_MAX | 100 | GPU cap as percentage of max for performance nodes |
| GPU_ECO_CAP_PCT_OF_MAX | 60 | GPU cap as percentage of max for eco nodes |
| GPU_WRITE_ABSOLUTE_CAPS | false | If true, write absolute GPU watts instead of percentage |
| GPU_MODEL_CAPS_JSON | {} | JSON map of GPU model name to {"minCapWatts": N, "maxCapWatts": M} |
| GPU_PRODUCT_LABEL_KEYS | joulie.io/gpu.product,... | Comma-separated node label keys to read GPU product name |
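The following sketch illustrates how a percentage cap and the per-model limits in GPU_MODEL_CAPS_JSON might combine into a wattage. It assumes the per-model entry clamps the computed value to [minCapWatts, maxCapWatts], which the table does not state explicitly. The function name and the GPU model name used in the usage example are hypothetical.

```python
import json


def gpu_cap_watts(profile, gpu_model, gpu_max_watts, env):
    """Compute a GPU power cap in watts for a 'performance' or 'eco' node (sketch)."""
    if profile == "performance":
        pct = float(env.get("GPU_PERFORMANCE_CAP_PCT_OF_MAX", "100"))
    else:
        pct = float(env.get("GPU_ECO_CAP_PCT_OF_MAX", "60"))
    watts = gpu_max_watts * pct / 100

    # Optional per-model limits, keyed by GPU product name.
    model_caps = json.loads(env.get("GPU_MODEL_CAPS_JSON", "{}"))
    limits = model_caps.get(gpu_model)
    if limits:
        # Assumption: the limits clamp the percentage-derived value.
        watts = max(limits["minCapWatts"], min(limits["maxCapWatts"], watts))
    return watts
```

For a GPU with a 400 W maximum, the default eco setting yields 240 W. Setting GPU_MODEL_CAPS_JSON to {"example-gpu": {"minCapWatts": 250, "maxCapWatts": 350}} would, under the clamping assumption, raise that to 250 W for that model.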

Policy configuration

| Variable | Default | Description |
|---|---|---|
| POLICY_TYPE | static_partition | Policy algorithm: static_partition, queue_aware_v1, or rule_swap_v1 |
| STATIC_HP_FRAC | 0.50 | Fraction of nodes allocated to performance in static_partition |
| QUEUE_HP_BASE_FRAC | 0.60 | Base fraction of performance nodes in queue_aware_v1 |
| QUEUE_HP_MIN | 1 | Minimum performance nodes in queue_aware_v1 |
| QUEUE_HP_MAX | 1000000 | Maximum performance nodes in queue_aware_v1 |
| QUEUE_PERF_PER_HP_NODE | 10 | Performance pods per performance node ratio in queue_aware_v1 |
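The exact policy formulas are not given here, so the sketch below is only one plausible reading of how the parameters could determine the number of performance nodes. It assumes fractions are rounded down, that queue_aware_v1 takes the larger of the base share and the pod-driven demand, and that the result is clamped to [QUEUE_HP_MIN, QUEUE_HP_MAX] and to the cluster size. Both function names are hypothetical.

```python
import math


def static_partition_hp_count(total_nodes, hp_frac=0.50):
    """Performance-node count for static_partition (assumes rounding down)."""
    return min(total_nodes, max(0, math.floor(total_nodes * hp_frac)))


def queue_aware_hp_count(total_nodes, perf_pods, base_frac=0.60,
                         hp_min=1, hp_max=1000000, perf_per_hp=10):
    """Performance-node count for queue_aware_v1 (assumed combination rule)."""
    base = math.floor(total_nodes * base_frac)    # QUEUE_HP_BASE_FRAC share
    demand = math.ceil(perf_pods / perf_per_hp)   # nodes needed for queued perf pods
    wanted = max(base, demand)
    wanted = max(hp_min, min(hp_max, wanted))     # QUEUE_HP_MIN / QUEUE_HP_MAX
    return min(total_nodes, wanted)               # cannot exceed the cluster size
```

Under these assumptions, a 10-node cluster with the defaults keeps 6 performance nodes while idle and grows to 9 when 85 performance pods are pending.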

Workload classifier

| Variable | Default | Description |
|---|---|---|
| ENABLE_CLASSIFIER | true | Enable the workload classifier loop |
| CLASSIFY_INTERVAL | 30s | How often the classifier scans running pods |
| RECLASSIFY_INTERVAL | 15m | Minimum time before re-classifying an already-classified pod |
| CLASSIFY_METRICS_WINDOW | 10m | Prometheus query lookback window for dynamic metrics |
| PROMETHEUS_ADDRESS | http://prometheus-operated.monitoring:9090 | Prometheus endpoint for classifier queries |
| KEPLER_AVAILABLE | true | Whether Kepler energy metrics are available in Prometheus |
| CLASSIFY_MIN_CONFIDENCE | 0.5 | Minimum confidence score for a profile to influence scheduling |
| CLASSIFY_SIM_ANNOTATION_FALLBACK | false | When true, classifier reads sim.joulie.io/* pod annotations as fallback when Prometheus metrics are unavailable (simulator mode) |
| CLASSIFY_SIM_NOISE_PCT | 10 | Gaussian noise percentage added to sim-annotation utilization values before classification (simulates measurement error) |
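To make CLASSIFY_SIM_NOISE_PCT and CLASSIFY_MIN_CONFIDENCE concrete, here is a small sketch. It assumes the noise percentage is the standard deviation of a multiplicative Gaussian factor, that utilization is a 0 to 100 value, and that a profile carries a numeric confidence field. None of this is confirmed by the reference, and both helper names are hypothetical.

```python
import random


def add_sim_noise(utilization_pct, noise_pct=10.0, rng=None):
    """Perturb a simulated utilization value with Gaussian noise (assumed model)."""
    rng = rng or random.Random()
    # noise_pct is treated as the relative standard deviation, e.g. 10 -> sigma 0.10.
    noisy = utilization_pct * (1.0 + rng.gauss(0.0, noise_pct / 100.0))
    # Keep the result a valid percentage.
    return max(0.0, min(100.0, noisy))


def influences_scheduling(profile, min_confidence=0.5):
    """Return True if a classified profile meets CLASSIFY_MIN_CONFIDENCE."""
    return profile["confidence"] >= min_confidence
```

Passing a seeded random.Random instance as rng makes simulator runs reproducible.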

Active rescheduler

| Variable | Default | Description |
|---|---|---|
| ENABLE_ACTIVE_RESCHEDULING | false | Enable active pod eviction for misplaced workloads |
| RESCHEDULE_INTERVAL | 60s | How often the rescheduler evaluates recommendations |
| RESCHEDULE_MAX_EVICTIONS_PER_NODE | 1 | Maximum pod evictions per node per rescheduler cycle |
| RESCHEDULE_DRY_RUN | false | If true, log eviction decisions without executing them |
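The per-node eviction limit and dry-run flag can be pictured with the sketch below, which processes one rescheduler cycle. The shape of the recommendations (a priority-ordered list of node and pod pairs) and the evict callback are assumptions made for illustration; the function name is hypothetical.

```python
from collections import defaultdict


def plan_evictions(recommendations, max_per_node=1, dry_run=False, evict=None):
    """Apply RESCHEDULE_MAX_EVICTIONS_PER_NODE to one cycle of recommendations.

    recommendations: list of (node_name, pod_name) tuples in priority order.
    evict: callback performing the eviction; skipped when dry_run is true.
    """
    evicted_on_node = defaultdict(int)
    planned = []
    for node, pod in recommendations:
        if evicted_on_node[node] >= max_per_node:
            continue  # budget for this node is spent for this cycle
        evicted_on_node[node] += 1
        planned.append((node, pod))
        if dry_run or evict is None:
            print(f"dry-run: would evict {pod} from {node}")
        else:
            evict(node, pod)
    return planned
```

Running with dry_run=True first is a low-risk way to see which pods would be affected before setting ENABLE_ACTIVE_RESCHEDULING to true.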

Facility metrics

| Variable | Default | Description |
|---|---|---|
| ENABLE_FACILITY_METRICS | false | Enable polling data-center-level metrics from Prometheus |
| FACILITY_PROMETHEUS_ADDRESS | http://prometheus-operated.monitoring:9090 | Prometheus endpoint for facility metric queries |
| FACILITY_POLL_INTERVAL | 30s | How often facility metrics are polled |
| FACILITY_AMBIENT_TEMP_METRIC | datacenter_ambient_temperature_celsius | PromQL metric name for ambient temperature |
| FACILITY_IT_POWER_METRIC | datacenter_total_it_power_watts | PromQL metric name for total IT power draw |
| FACILITY_COOLING_POWER_METRIC | datacenter_cooling_power_watts | PromQL metric name for cooling infrastructure power |
| FACILITY_ZONE_AMBIENT_METRIC_TEMPLATE | (empty) | PromQL template for per-zone ambient temperature, e.g. datacenter_ambient_temperature_celsius{zone="%s"}. Use %s as the zone name placeholder. Empty = disabled. |
| FACILITY_RACK_POWER_METRIC_TEMPLATE | (empty) | PromQL template for per-rack power draw, e.g. datacenter_rack_power_watts{rack="%s"}. Use %s as the rack name placeholder. Empty = disabled. |
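The two template variables substitute a zone or rack name for the %s placeholder. A minimal sketch of that expansion, with a hypothetical helper name, could look like this:

```python
def render_metric_template(template, name):
    """Expand a FACILITY_*_METRIC_TEMPLATE value for one zone or rack.

    Returns None for an empty template, which the table defines as disabled.
    """
    if not template:
        return None
    # Plain replacement avoids surprises if the query contains other '%' characters.
    return template.replace("%s", name, 1)
```

For example, rendering datacenter_rack_power_watts{rack="%s"} with rack-1 produces datacenter_rack_power_watts{rack="rack-1"}, which can be sent to Prometheus as an instant query.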

Node topology

Joulie can optionally compute PSU stress per rack and cooling stress per zone. To enable this, add the following standard node labels:

  • joulie.io/rack: physical rack identifier (e.g., rack-1)
  • joulie.io/cooling-zone: cooling zone identifier (e.g., zone-a)

When these labels are present, the operator computes PSU stress per-rack (sum of estimated node power within the rack) instead of cluster-wide, and uses per-zone ambient temperature from facility metrics instead of the global value. The twin model interfaces remain the same; topology just groups nodes for more accurate stress computation.

Nodes without topology labels fall back to cluster-wide stress computation.
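The grouping rule for PSU stress can be sketched as follows: nodes carrying a joulie.io/rack label are summed per rack, and unlabeled nodes use the cluster-wide total. The input shape (dicts with name, labels, and power_watts) and the function name are assumptions for illustration; how the summed power is turned into a stress score is not covered here.

```python
RACK_LABEL = "joulie.io/rack"


def psu_power_basis(nodes):
    """Return, per node, the aggregate power its PSU stress would be based on.

    nodes: list of dicts with 'name', 'labels', and estimated 'power_watts'.
    """
    cluster_total = sum(n["power_watts"] for n in nodes)

    # Sum estimated power for each labeled rack.
    rack_totals = {}
    for n in nodes:
        rack = n["labels"].get(RACK_LABEL)
        if rack is not None:
            rack_totals[rack] = rack_totals.get(rack, 0.0) + n["power_watts"]

    # Labeled nodes use their rack total; unlabeled nodes fall back to the cluster.
    basis = {}
    for n in nodes:
        rack = n["labels"].get(RACK_LABEL)
        basis[n["name"]] = rack_totals[rack] if rack is not None else cluster_total
    return basis
```

The same pattern would apply to joulie.io/cooling-zone, with the per-zone ambient temperature replacing the global value for labeled nodes.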

Scheduler extender

| Variable | Default | Description |
|---|---|---|
| PORT | 9876 | HTTP port for the scheduler extender |
| CACHE_TTL | 30s | TTL for the NodeTwin status cache |
| EVICTION_HISTORY_TTL | 30m | How long eviction context influences scheduling decisions |
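CACHE_TTL bounds how stale a cached NodeTwin status may be before it is fetched again. The sketch below shows a generic time-to-live cache of that kind; it is not the extender's implementation, and the class name and fetch callback are hypothetical.

```python
import time


class TtlCache:
    """Cache values for ttl_s seconds, refetching once an entry expires."""

    def __init__(self, ttl_s=30.0, clock=time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock          # injectable clock, useful for tests
        self._entries = {}           # key -> (stored_at, value)

    def get(self, key, fetch):
        """Return the cached value for key, calling fetch(key) if missing or expired."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl_s:
            return entry[1]
        value = fetch(key)
        self._entries[key] = (now, value)
        return value
```

A shorter TTL makes scheduling decisions track power-state changes more closely at the cost of more API reads per scheduling cycle.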

kubectl plugin

The kubectl joulie plugin requires no configuration. It reads your current kubeconfig context.

# Install
go build -o kubectl-joulie ./cmd/kubectl-joulie
mv kubectl-joulie /usr/local/bin/

# Usage
kubectl joulie status      # cluster energy overview
kubectl joulie recommend   # GPU slicing and reschedule suggestions