# Observability
This page describes the optional local observability stack for debugging and profiling Coleman4HCS experiments.
!!! note "Framework-first guarantee"
    `python main.py` works without Docker or any of these services.
    The observability stack is optional for local installs, but
    enabled automatically in the DevContainer.
## DevContainer: zero-step setup
If you develop inside the DevContainer, everything is already running. The container automatically:
- Installs the `telemetry` and `clickhouse` pip extras
- Starts OTel Collector + Prometheus + Grafana + ClickHouse via Docker Compose
- Enables `[telemetry] enabled = true` in `config.toml`
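For reference, the flag the DevContainer flips corresponds to this `config.toml` fragment (any further telemetry options are repository-specific and not shown here):

```toml
[telemetry]
enabled = true
```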
Just run your experiment as usual: `python main.py`.
The Grafana datasource and dashboard are provisioned automatically, and ClickHouse is available at http://localhost:8123.
Parallel execution is also telemetry-safe: each worker builds an isolated
environment and emits its own `execution_id` and `worker_id`, so concurrent
runs can be separated in Grafana and in persisted results.
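Because `execution_id` and `worker_id` also land in the persisted results, concurrent runs can be separated after the fact as well. A sketch in DuckDB, assuming the Parquet rows carry `execution_id`, `worker_id`, and `fitness` columns as in the query examples at the end of this page:

```sql
-- Per-run NAPFD summary, one row per (execution, worker) pair
SELECT execution_id, worker_id,
       COUNT(*)     AS cycles,
       AVG(fitness) AS avg_napfd
FROM read_parquet('./runs/**/*.parquet', hive_partitioning=1)
GROUP BY execution_id, worker_id
ORDER BY execution_id, worker_id;
```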
The default dashboard includes top-level filters for dataset, execution, and policy, so you can slice one live run without visually mixing it with the others.
The top row also includes a "Current Cycle By Active Experiment" panel next to "Active Runs", so you can watch optimization-stage progression while runs are still active.
There is also a dedicated Sched Time Ratio filter (label `time_ratio`), so you can isolate runs by available CI budget percentage.
For operational analysis, use the snapshot panels that focus on current iteration/stage instead of time-based curves:
- Current Cycle By Active Experiment
- Progress To Target Steps (%)
- Convergence Signal (Current NAPFD)
- Checkpoint Save Rate (last 5m)
The agent/system panels in the overview are also configured as instant snapshots (table/gauge/stat), which removes the time-axis dependency when the goal is to inspect current simulation stage.
Snapshot table panels now include an explicit current-step column (merged from
`cycles_total`) beside the metric value, so each metric row can be interpreted
in the context of the current simulation iteration.
## Live behavior vs. final results
Grafana answers: "what is happening right now?"
For final experiment results, use the persisted outputs instead:
- `./runs/` for partitioned Parquet datasets
- `coleman_results` in ClickHouse when the ClickHouse sink is enabled
- `./checkpoints/` to inspect resume/recovery progress files
The complete guided workflow is available in `workflow.py`.
## Local setup (without DevContainer)
```shell
# Base stack (OTel Collector + Prometheus + Grafana)
cd examples/observability
docker compose up -d

# Install telemetry extras
uv pip install coleman4hcs[telemetry]

# Enable telemetry in config.toml:
# [telemetry]
# enabled = true

# Run your experiment
uv run python main.py
```
## Endpoints
- Grafana: http://localhost:3000
- Prometheus UI: http://localhost:9090
- Collector exporter (scrape target): http://localhost:8889/metrics
## With ClickHouse
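How the ClickHouse service itself is brought up locally (extra compose service or profile) is repository-specific; on the Python side, install the `clickhouse` extra alongside `telemetry` (the extras names come from the DevContainer section above; combining them is plain pip syntax):

```shell
# Extras names as listed in the DevContainer section
uv pip install 'coleman4hcs[telemetry,clickhouse]'
```

Once the sink is enabled, final results land in the `coleman_results` table, served over the HTTP interface at http://localhost:8123.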
## Tear down
## Metric names
| Metric | Type | Description |
|---|---|---|
| `coleman.cycles_total` | Counter | Total experiment cycles processed |
| `coleman.bandit_update_latency` | Histogram (s) | Bandit arm-update latency |
| `coleman.prioritization_latency` | Histogram (s) | Test-case prioritization latency |
| `coleman.evaluation_latency` | Histogram (s) | Evaluation step latency |
| `coleman.napfd` | Histogram | NAPFD score distribution |
| `coleman.apfdc` | Histogram | APFDc score distribution |
| `coleman.process_memory_rss` | Histogram (MiB) | Resident memory sampled during execution |
| `coleman.process_memory_peak_rss` | Histogram (MiB) | Peak resident memory seen by the process |
| `coleman.process_cpu_utilization` | Histogram (%) | CPU utilization sampled during execution |
| `coleman.experiment_wall_time` | Histogram (s) | End-to-end elapsed time of one experiment |
| `coleman.experiment_cpu_time` | Histogram (s) | CPU time consumed by one experiment |
## Cardinality rules
- No `step` label in metrics (it would create unbounded cardinality).
- `execution_id` and `worker_id` are metric labels on purpose, so parallel runs stay separable.
- Per-step detail is available in traces (span attributes).
## Resource metrics and energy
The default stack now exposes CPU and memory cost directly in telemetry and in persisted results.
Energy is intentionally not emitted by default because, in containers and Codespaces, reliable hardware counters are often unavailable. If you want energy metrics, the recommended next step is to integrate host-specific sources such as Intel RAPL, NVIDIA NVML, or a node exporter running on the host.
## Example DuckDB queries over Parquet results
```sql
-- Average NAPFD per policy
SELECT policy, AVG(fitness) AS avg_napfd
FROM read_parquet('./runs/**/*.parquet', hive_partitioning=1)
GROUP BY policy
ORDER BY avg_napfd DESC;

-- Cost distribution per reward function
SELECT reward_function,
       PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY cost) AS median_cost
FROM read_parquet('./runs/**/*.parquet', hive_partitioning=1)
GROUP BY reward_function;
```