Configuration Reference

Complete reference for all Joulie environment variables. These are set via Helm values or directly in the Deployment/DaemonSet manifests.

Agent

Variable	Default	Description
`AGENT_MODE`	`daemonset`	`daemonset` (one agent per node) or `pool` (shared agents with sharding)
`NODE_NAME`	(required in daemonset mode)	Name of the node this agent manages
`RECONCILE_INTERVAL`	`20s`	How often the agent reconciles desired state
`METRICS_ADDR`	`:8080`	Address for the Prometheus metrics endpoint
`SIMULATE_ONLY`	`false`	If `true`, agent discovers hardware but does not apply power caps
`HARDWARE_CATALOG_PATH`	`simulator/catalog/hardware.yaml`	Path to the hardware inventory catalog YAML

Agent pool mode

Variable	Default	Description
`POOL_NODE_SELECTOR`	`node-role.kubernetes.io/worker`	Label selector for nodes managed by pool agents
`POOL_SHARDS`	`1`	Total number of shards for pool mode partitioning
`POOL_SHARD_ID`	(from pod ordinal)	Shard ID for this agent instance

Agent DVFS control

Variable	Default	Description
`DVFS_EMA_ALPHA`	`0.3`	Exponential moving average smoothing factor for power tracking
`DVFS_UPPER_MARGIN_PCT`	`5`	Power above cap threshold to trigger frequency reduction (%)
`DVFS_LOWER_MARGIN_PCT`	`10`	Power below cap threshold to trigger frequency increase (%)
`DVFS_STEP_PCT`	`5`	Frequency throttle step size (%)
`DVFS_COOLDOWN_S`	`5`	Minimum seconds between DVFS adjustments
`DVFS_TRIP_ABOVE_THRESHOLD`	`3`	Consecutive above-threshold samples before throttling
`DVFS_TRIP_BELOW_THRESHOLD`	`3`	Consecutive below-threshold samples before unthrottling

Operator

Variable	Default	Description
`RECONCILE_INTERVAL`	`1m`	How often the operator reconciles cluster state
`METRICS_ADDR`	`:8081`	Address for the Prometheus metrics endpoint
`NODE_SELECTOR`	`node-role.kubernetes.io/worker`	Label selector for managed nodes
`RESERVED_LABEL_KEY`	`joulie.io/reserved`	Label key for nodes excluded from policy decisions
`POWER_PROFILE_LABEL`	`joulie.io/power-profile`	Node label key for the active power profile

Power cap configuration

Variable	Default	Description
`PERFORMANCE_CAP_WATTS`	`5000`	Absolute CPU power cap for performance nodes (watts)
`ECO_CAP_WATTS`	`120`	Absolute CPU power cap for eco nodes (watts)
`CPU_PERFORMANCE_CAP_PCT_OF_MAX`	`100`	CPU cap as percentage of max for performance nodes
`CPU_ECO_CAP_PCT_OF_MAX`	`60`	CPU cap as percentage of max for eco nodes
`CPU_WRITE_ABSOLUTE_CAPS`	`false`	If `true`, write absolute watts instead of percentage
`GPU_PERFORMANCE_CAP_PCT_OF_MAX`	`100`	GPU cap as percentage of max for performance nodes
`GPU_ECO_CAP_PCT_OF_MAX`	`60`	GPU cap as percentage of max for eco nodes
`GPU_WRITE_ABSOLUTE_CAPS`	`false`	If `true`, write absolute GPU watts instead of percentage
`GPU_MODEL_CAPS_JSON`	`{}`	JSON map of GPU model name to `{"minCapWatts": N, "maxCapWatts": M}`
`GPU_PRODUCT_LABEL_KEYS`	`joulie.io/gpu.product,...`	Comma-separated node label keys to read GPU product name

Policy configuration

Variable	Default	Description
`POLICY_TYPE`	`static_partition`	Policy algorithm: `static_partition`, `queue_aware_v1`, or `rule_swap_v1`
`STATIC_HP_FRAC`	`0.50`	Fraction of nodes allocated to performance in `static_partition`
`QUEUE_HP_BASE_FRAC`	`0.60`	Base fraction of performance nodes in `queue_aware_v1`
`QUEUE_HP_MIN`	`1`	Minimum performance nodes in `queue_aware_v1`
`QUEUE_HP_MAX`	`1000000`	Maximum performance nodes in `queue_aware_v1`
`QUEUE_PERF_PER_HP_NODE`	`10`	Performance pods per performance node ratio in `queue_aware_v1`

Workload classifier

Variable	Default	Description
`ENABLE_CLASSIFIER`	`true`	Enable the workload classifier loop
`CLASSIFY_INTERVAL`	`30s`	How often the classifier scans running pods
`RECLASSIFY_INTERVAL`	`15m`	Minimum time before re-classifying an already-classified pod
`CLASSIFY_METRICS_WINDOW`	`10m`	Prometheus query lookback window for dynamic metrics
`PROMETHEUS_ADDRESS`	`http://prometheus-operated.monitoring:9090`	Prometheus endpoint for classifier queries
`KEPLER_AVAILABLE`	`true`	Whether Kepler energy metrics are available in Prometheus
`CLASSIFY_MIN_CONFIDENCE`	`0.5`	Minimum confidence score for a profile to influence scheduling
`CLASSIFY_SIM_ANNOTATION_FALLBACK`	`false`	When `true`, classifier reads `sim.joulie.io/*` pod annotations as fallback when Prometheus metrics are unavailable (simulator mode)
`CLASSIFY_SIM_NOISE_PCT`	`10`	Gaussian noise percentage added to sim-annotation utilization values before classification (simulates measurement error)

Active rescheduler

Variable	Default	Description
`ENABLE_ACTIVE_RESCHEDULING`	`false`	Enable active pod eviction for misplaced workloads
`RESCHEDULE_INTERVAL`	`60s`	How often the rescheduler evaluates recommendations
`RESCHEDULE_MAX_EVICTIONS_PER_NODE`	`1`	Maximum pod evictions per node per rescheduler cycle
`RESCHEDULE_DRY_RUN`	`false`	If `true`, log eviction decisions without executing them

Facility metrics

Variable	Default	Description
`ENABLE_FACILITY_METRICS`	`false`	Enable polling data-center-level metrics from Prometheus
`FACILITY_PROMETHEUS_ADDRESS`	`http://prometheus-operated.monitoring:9090`	Prometheus endpoint for facility metric queries
`FACILITY_POLL_INTERVAL`	`30s`	How often facility metrics are polled
`FACILITY_AMBIENT_TEMP_METRIC`	`datacenter_ambient_temperature_celsius`	PromQL metric name for ambient temperature
`FACILITY_IT_POWER_METRIC`	`datacenter_total_it_power_watts`	PromQL metric name for total IT power draw
`FACILITY_COOLING_POWER_METRIC`	`datacenter_cooling_power_watts`	PromQL metric name for cooling infrastructure power
`FACILITY_ZONE_AMBIENT_METRIC_TEMPLATE`	(empty)	PromQL template for per-zone ambient temperature, e.g. `datacenter_ambient_temperature_celsius{zone="%s"}`. Use `%s` as the zone name placeholder. Empty = disabled.
`FACILITY_RACK_POWER_METRIC_TEMPLATE`	(empty)	PromQL template for per-rack power draw, e.g. `datacenter_rack_power_watts{rack="%s"}`. Use `%s` as the rack name placeholder. Empty = disabled.

Node topology

Joulie supports optional per-rack PSU stress and per-zone cooling stress. This is activated by adding standard node labels:

joulie.io/rack: physical rack identifier (e.g., rack-1)
joulie.io/cooling-zone: cooling zone identifier (e.g., zone-a)

When these labels are present, the operator computes PSU stress per-rack (sum of estimated node power within the rack) instead of cluster-wide, and uses per-zone ambient temperature from facility metrics instead of the global value. The twin model interfaces remain the same; topology just groups nodes for more accurate stress computation.

Nodes without topology labels fall back to cluster-wide stress computation.

Scheduler extender

Variable	Default	Description
`PORT`	`9876`	HTTP port for the scheduler extender
`CACHE_TTL`	`30s`	TTL for the NodeTwin status cache
`EVICTION_HISTORY_TTL`	`30m`	How long eviction context influences scheduling decisions

kubectl plugin

The kubectl joulie plugin requires no configuration. It reads your current kubeconfig context.

# Install
go build -o kubectl-joulie ./cmd/kubectl-joulie
mv kubectl-joulie /usr/local/bin/

# Usage
kubectl joulie status      # cluster energy overview
kubectl joulie recommend   # GPU slicing and reschedule suggestions