Coleman Workflow

This page covers the full operational loop for running and analyzing experiments. Each section maps to a cell in the interactive marimo notebook (docs/workflow.py). For a live, executable version, open it locally with marimo edit docs/workflow.py.

Download options for the notebook source:

Browser raw view: https://raw.githubusercontent.com/jacksonpradolima/coleman/main/docs/workflow.py
Direct download with curl:

curl -L https://raw.githubusercontent.com/jacksonpradolima/coleman/main/docs/workflow.py -o workflow.py

Direct download with wget:

wget https://raw.githubusercontent.com/jacksonpradolima/coleman/main/docs/workflow.py -O workflow.py

1 — Active Configuration

Read the runtime configuration from run.yaml:

import yaml
from pathlib import Path

config_path = Path("run.yaml")
with open(config_path, encoding="utf-8") as f:
    config = yaml.safe_load(f) or {}

results_cfg = config.get("results", {})
checkpoint_cfg = config.get("checkpoint", {})
telemetry_cfg = config.get("telemetry", {})
experiment_cfg = config.get("experiment", {})

Key settings you should inspect:

Setting	Where it comes from
`datasets`	`experiment.datasets`
`budget.mode`	`experiment.budget.mode`
`budget.values`	`experiment.budget.values`
`results.enabled`	`results.enabled`
`results.sink`	`results.sink` (parquet / clickhouse)
`results.out_dir`	`results.out_dir`
`checkpoint.enabled`	`checkpoint.enabled`
`checkpoint.base_dir`	`checkpoint.base_dir`
`telemetry.enabled`	`telemetry.enabled`
`telemetry.otlp_endpoint`	`telemetry.otlp_endpoint`
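
To review every setting from the table in one pass, you can walk the parsed config along dotted key paths. The sketch below is a minimal illustration and not part of Coleman: `lookup`, `summarize_settings`, and the sample values are hypothetical, and the only thing taken from this page is the list of key paths.

```python
from typing import Any

# Dotted key paths taken from the settings table above.
SETTING_PATHS = [
    "experiment.datasets",
    "experiment.budget.mode",
    "experiment.budget.values",
    "results.enabled",
    "results.sink",
    "results.out_dir",
    "checkpoint.enabled",
    "checkpoint.base_dir",
    "telemetry.enabled",
    "telemetry.otlp_endpoint",
]


def lookup(config: dict[str, Any], dotted_path: str, default: Any = None) -> Any:
    """Walk a nested dict along a dotted path; return default if a key is missing."""
    node: Any = config
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


def summarize_settings(config: dict[str, Any]) -> dict[str, Any]:
    """Return {dotted_path: value} for every setting in the table (None if unset)."""
    return {path: lookup(config, path) for path in SETTING_PATHS}


# Hypothetical sample standing in for the parsed run.yaml; values are illustrative only.
sample_config = {
    "experiment": {
        "datasets": ["demo"],
        "budget": {"mode": "example-mode", "values": [0.1, 0.5]},
    },
    "results": {"enabled": True, "sink": "parquet", "out_dir": "./runs"},
    "checkpoint": {"enabled": True},
}

for path, value in summarize_settings(sample_config).items():
    print(f"{path:28} {value!r}")
```

Settings that are absent from the file show up as `None`, which makes it easy to spot sections (here `telemetry`) that fall back to defaults.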

2 — Code Cost Evaluation

Coleman measures code cost as a multi-dimensional scorecard with four dimensions:

Dimension	What it measures	Tools
Structural	Maintainability, complexity, change risk	Radon, Xenon, Wily
Runtime	CPU time, hotspots, memory pressure	Scalene, py-spy
Energy	Estimated energy / carbon impact	CodeCarbon, pyRAPL
Operational	Infrastructure effort proxy	All of the above

Structural cost — CI gates

Two gates run in CI on every pull request:

# Xenon complexity gate
python -m xenon --max-absolute C --max-modules B --max-average A coleman/

# Radon maintainability index (MI) — lists modules below the threshold;
# CI fails if this command reports such modules or exits with an error
python -m radon mi -s -n B coleman/
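
The Radon comment above implies that CI inspects the command's output as well as its exit code. One way to express that rule is sketched below in Python. It assumes that Radon prints one line per listed module and nothing when no module is at or below the threshold, and that any listed module should fail the gate. The function names are hypothetical and this is not Coleman's actual CI script.

```python
import subprocess
import sys


def modules_reported(radon_stdout: str) -> list[str]:
    """Return the non-empty output lines, assumed to be one per reported module."""
    return [line.strip() for line in radon_stdout.splitlines() if line.strip()]


def radon_gate(package: str = "coleman/") -> int:
    """Run the Radon MI command shown above and return a CI-style exit code."""
    proc = subprocess.run(
        [sys.executable, "-m", "radon", "mi", "-s", "-n", "B", package],
        capture_output=True,
        text=True,
    )
    if proc.returncode != 0:
        # The command itself failed (for example, Radon is not installed).
        print(proc.stderr, file=sys.stderr)
        return proc.returncode
    reported = modules_reported(proc.stdout)
    for line in reported:
        print(f"below threshold: {line}")
    return 1 if reported else 0
```

In a CI step you would finish with `sys.exit(radon_gate())` so that either condition fails the job.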

Running code cost checks locally

# All structural checks (complexity + maintainability + xenon gate)
make cost-structural

# Runtime profiling with Scalene
make cost-profile-scalene

# Energy estimation with CodeCarbon
make cost-energy

# Complexity trend analysis with Wily
make cost-wily

See Code Cost Evaluation for full documentation.

3 — Live Observability

Grafana is where you inspect live execution behavior:

Grafana: http://localhost:3000
OTel collector endpoint: configured via telemetry.otlp_endpoint

Use it for throughput, latency, CPU, memory, worker isolation, dataset slicing, and execution separation while the run is active.

4 — Resume / Recovery State

Inspect checkpoint progress files used for resume and recovery:

import json
from pathlib import Path

checkpoint_root = Path(checkpoint_cfg.get("base_dir", "checkpoints"))
checkpoint_files = sorted(checkpoint_root.glob("**/progress_*.json"))

for f in checkpoint_files:
    payload = json.loads(f.read_text(encoding="utf-8"))
    print(f"  {f.parent.name}: step={payload.get('step_committed')}")
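
When a checkpoint directory holds several progress files, it can help to reduce them to the highest committed step per directory. The sketch below does that under two assumptions: `step_committed` is an integer, and files that cannot be parsed (for example, ones caught mid-write) can be skipped. The helper name and the demo file layout are made up for illustration.

```python
import json
import tempfile
from pathlib import Path


def latest_committed_steps(checkpoint_root: Path) -> dict[str, int]:
    """Map each checkpoint directory name to the highest step_committed found in it."""
    latest: dict[str, int] = {}
    for progress_file in sorted(checkpoint_root.glob("**/progress_*.json")):
        try:
            payload = json.loads(progress_file.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            continue  # skip unreadable or partially written files
        step = payload.get("step_committed")
        if isinstance(step, int):
            name = progress_file.parent.name
            latest[name] = max(step, latest.get(name, step))
    return latest


# Demo on a throwaway directory; directory and file names are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "exp_a").mkdir()
    (root / "exp_a" / "progress_0.json").write_text('{"step_committed": 12}', encoding="utf-8")
    (root / "exp_a" / "progress_1.json").write_text('{"step_committed": 40}', encoding="utf-8")
    (root / "exp_b").mkdir()
    (root / "exp_b" / "progress_0.json").write_text("{not json", encoding="utf-8")
    demo_steps = latest_committed_steps(root)
    print(demo_steps)
```

To use it on real data, pass the `checkpoint_root` path built in the snippet above.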

5 — Final Results Storage

Final experiment facts are stored in the results sink, not in Grafana.

Parquet root: configured via results.out_dir (default ./runs)
ClickHouse sink: enabled only when results.sink = "clickhouse"
Re-running experiments in the same runs directory appends new Parquet files by default. Existing result files are preserved.

Loading results with DuckDB

import duckdb

parquet_glob = "./runs/**/*.parquet"

summary_df = duckdb.sql(f"""
    SELECT scenario,
           execution_id,
           experiment,
           policy,
           reward_function,
           AVG(fitness) AS avg_napfd,
           AVG(cost) AS avg_apfdc,
           AVG(prioritization_time) AS avg_prioritization_time,
           AVG(process_memory_rss_mib) AS avg_rss_mib,
           AVG(process_cpu_utilization_percent) AS avg_cpu_pct,
           MAX(wall_time_seconds) AS wall_time_seconds
    FROM read_parquet('{parquet_glob}', hive_partitioning=1)
    GROUP BY scenario, execution_id, experiment, policy, reward_function
    ORDER BY avg_napfd DESC, avg_apfdc DESC
""").df()

6 — Export

Export the current summary as a CSV artifact for reports:

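from pathlib import Path

runs_root = Path(results_cfg.get("out_dir", "./runs"))  # results_cfg comes from section 1
export_dir = runs_root / "analysis"
export_dir.mkdir(parents=True, exist_ok=True)
summary_df.to_csv(export_dir / "summary.csv", index=False)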

7 — Analysis Plot

Plot average NAPFD per policy from the persisted final results:

import matplotlib.pyplot as plt
import seaborn as sns

top_policies = (
    summary_df
    .groupby("policy", as_index=False)["avg_napfd"]
    .mean()
    .sort_values("avg_napfd", ascending=False)
)

fig, ax = plt.subplots(figsize=(10, 4))
sns.barplot(data=top_policies, x="policy", y="avg_napfd", ax=ax)
ax.set_title("Average NAPFD by Policy")
ax.set_xlabel("Policy")
ax.set_ylabel("Average NAPFD")
ax.tick_params(axis="x", rotation=45)
plt.tight_layout()

8 — Runner Extensions (hooks + extensions)

When you need custom domain workflows without replacing Coleman orchestration, add hooks and extensions in YAML:

hooks:
  fail_fast: false
  plugins:
     - my_project.hooks.ForecastHook

extensions:
  my_domain:
     forecast_selection:
        policy: ThompsonSampling
        reward: Binary

Recommended lifecycle contract:

Run-level and dataset-level hooks run in the coordinator process.
Execution hooks run in the worker process.
Keep hook code idempotent and free of global mutable state.
Emit custom artifacts under each run directory when possible.

on_error is emitted for execution-start failures and environment construction failures, not only for failures inside the main run body.
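
To make the contract above concrete, here is a rough sketch of a hook that is idempotent and keeps no global state. This page does not show Coleman's hook base class or method signatures, so everything here is an assumption made for illustration: the method names other than `on_error`, all of the arguments (`run_dir`, `execution_id`, `summary`, `error`), and the artifact file names. Check extensibility.md for the real interface before adapting it.

```python
import json
import tempfile
from pathlib import Path


class ForecastHook:
    """Hypothetical hook; method names and arguments are assumptions, not Coleman's API."""

    def __init__(self, artifact_name: str = "forecast.json") -> None:
        # State lives on the instance only; nothing is kept at module level.
        self.artifact_name = artifact_name

    def _write(self, run_dir: Path, file_name: str, payload: dict) -> Path:
        target = Path(run_dir) / file_name
        target.parent.mkdir(parents=True, exist_ok=True)
        # Deterministic name and content: repeating the call rewrites the same file.
        target.write_text(json.dumps(payload, sort_keys=True), encoding="utf-8")
        return target

    def on_execution_end(self, run_dir: Path, execution_id: str, summary: dict) -> Path:
        return self._write(run_dir, f"{execution_id}_{self.artifact_name}", summary)

    def on_error(self, run_dir: Path, execution_id: str, error: BaseException) -> Path:
        return self._write(run_dir, f"{execution_id}_error.json", {"error": repr(error)})


# Calling the hook twice for the same execution leaves exactly one artifact behind.
with tempfile.TemporaryDirectory() as tmp:
    hook = ForecastHook()
    first = hook.on_execution_end(Path(tmp), "exec-1", {"policy": "ThompsonSampling"})
    second = hook.on_execution_end(Path(tmp), "exec-1", {"policy": "ThompsonSampling"})
    hook_demo = {
        "same_path": first == second,
        "files": sorted(p.name for p in Path(tmp).iterdir()),
    }
    print(hook_demo)
```

Because the file name depends only on the execution identifier, a retried or resumed execution overwrites its own artifact and does not accumulate duplicates.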

9 — Advanced Analyses You Can Run

Coleman now ships built-in analysis reports aligned with the analysis playbook.

# Quality ranking (NAPFD)
coleman analyze quality --input ./runs --format markdown --out reports/quality.md

# Stability (coefficient of variation)
coleman analyze stability --input ./runs --format csv --out reports/stability.csv

# Pareto frontier (quality vs cost)
coleman analyze pareto --input ./runs --format table

Supported report modules:

quality
cost
stability
pareto
sensitivity
resources

Use this checklist after generating enough runs:

Policy stability
- Compare mean, std, and coefficient of variation of NAPFD per policy.
Quality vs cost frontier
- Build a Pareto frontier using high NAPFD + low APFDc.
Budget sensitivity
- Compare policies per scenario / time-ratio group.
Execution variance
- Track variance between independent executions for the same policy/reward.
Operational footprint
- Compare memory and CPU metrics versus quality gains.
Custom extension impact
- Group by extension-related dimensions (from custom artifacts) and compare uplift.
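
The policy stability item in the checklist above can be computed by hand as well as with `coleman analyze stability`. The following is a minimal standard-library sketch, not the built-in report: it assumes one row per (policy, execution) with the `policy` and `avg_napfd` column names from the DuckDB summary, uses the sample standard deviation, and runs on made-up values.

```python
from collections import defaultdict
from statistics import mean, stdev


def policy_stability(rows: list[dict]) -> dict[str, dict[str, float]]:
    """Compute mean, sample std, and coefficient of variation of NAPFD per policy."""
    by_policy: dict[str, list[float]] = defaultdict(list)
    for row in rows:
        by_policy[row["policy"]].append(float(row["avg_napfd"]))

    report = {}
    for policy, values in by_policy.items():
        mu = mean(values)
        # stdev needs at least two points; a single execution has no measurable spread.
        sigma = stdev(values) if len(values) > 1 else 0.0
        report[policy] = {
            "mean": mu,
            "std": sigma,
            "cv": sigma / mu if mu else float("nan"),
        }
    return report


# Made-up values, one row per (policy, execution).
sample_rows = [
    {"policy": "A", "avg_napfd": 0.80},
    {"policy": "A", "avg_napfd": 0.90},
    {"policy": "B", "avg_napfd": 0.70},
    {"policy": "B", "avg_napfd": 0.70},
]
stability = policy_stability(sample_rows)
for name, stats in stability.items():
    print(name, {k: round(v, 4) for k, v in stats.items()})
```

With the DuckDB summary loaded, `policy_stability(summary_df.to_dict("records"))` feeds the same function. A lower coefficient of variation means the policy behaves more consistently across executions.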

See the complete guide in analysis-playbook.md.
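
For the quality vs cost frontier item in the checklist, a small dominance filter is often enough to get started. The sketch below follows the checklist wording: higher `avg_napfd` is better and lower `avg_apfdc` is better. If your cost metric is one where higher is better, flip the cost comparisons. It is an illustrative standard-library version on made-up rows, not the implementation behind `coleman analyze pareto`.

```python
def dominates(a: dict, b: dict) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    no_worse = a["avg_napfd"] >= b["avg_napfd"] and a["avg_apfdc"] <= b["avg_apfdc"]
    strictly_better = a["avg_napfd"] > b["avg_napfd"] or a["avg_apfdc"] < b["avg_apfdc"]
    return no_worse and strictly_better


def pareto_frontier(rows: list[dict]) -> list[dict]:
    """Keep only the rows that no other row dominates."""
    return [row for row in rows if not any(dominates(other, row) for other in rows)]


# Made-up policy summaries.
candidates = [
    {"policy": "P1", "avg_napfd": 0.90, "avg_apfdc": 0.50},
    {"policy": "P2", "avg_napfd": 0.80, "avg_apfdc": 0.30},
    {"policy": "P3", "avg_napfd": 0.70, "avg_apfdc": 0.40},  # dominated by P2
    {"policy": "P4", "avg_napfd": 0.90, "avg_apfdc": 0.60},  # dominated by P1
]
frontier = pareto_frontier(candidates)
print([row["policy"] for row in frontier])
```

The pairwise comparison is quadratic in the number of rows, which is fine for per-policy summaries but worth replacing with a sort-based sweep on very large tables.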

10 — Full Extensibility in Practice

Coleman supports full orchestration customization patterns. Use this quick map:

hooks + extensions
- Best option for domain logic without rewriting runner flow.
New Policy and Reward
- Native extension model through module exports + YAML selection.
Custom EvaluationMetric and Environment
- Source-level extension path for deeper runtime behavior changes.

Parallel-safe implementation guidance:

Keep execution hooks worker-local and idempotent.
Keep run/dataset hooks coordinator-local for aggregation and reporting.
Persist custom artifacts with run_id + execution_id so analyses can join safely across parallel runs.
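
The last point above is what makes joins safe across parallel runs. As a sketch, the helper below attaches artifact fields to result rows using the composite key. It assumes both sides carry `run_id` and `execution_id` fields; the result queries on this page only confirm `execution_id`, so treat the `run_id` column and the `forecast_hit` field as hypothetical.

```python
def join_on_run_and_execution(results: list[dict], artifacts: list[dict]) -> list[dict]:
    """Left-join artifact fields onto result rows keyed by (run_id, execution_id)."""
    index = {(a["run_id"], a["execution_id"]): a for a in artifacts}
    joined = []
    for row in results:
        extra = index.get((row["run_id"], row["execution_id"]), {})
        joined.append({**extra, **row})  # result columns win on name clashes
    return joined


# Made-up rows: two executions of one run, with an artifact for only the first.
result_rows = [
    {"run_id": "r1", "execution_id": "e1", "policy": "A", "avg_napfd": 0.8},
    {"run_id": "r1", "execution_id": "e2", "policy": "A", "avg_napfd": 0.7},
]
artifact_rows = [
    {"run_id": "r1", "execution_id": "e1", "forecast_hit": True},
]
joined_rows = join_on_run_and_execution(result_rows, artifact_rows)
print(joined_rows)
```

Keying on both identifiers means two workers that happen to reuse an execution identifier in different runs never get each other's artifacts.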

Complete implementation details and end-to-end examples:

extensibility.md

Query Snippets

DuckDB

SELECT scenario, execution_id, policy, AVG(fitness) AS avg_napfd
FROM read_parquet('./runs/**/*.parquet', hive_partitioning=1)
GROUP BY scenario, execution_id, policy
ORDER BY avg_napfd DESC;

ClickHouse

SELECT scenario, execution_id, policy, AVG(fitness) AS avg_napfd
FROM coleman_results
GROUP BY scenario, execution_id, policy
ORDER BY avg_napfd DESC;

Result Persistence Semantics

Parquet appends new files under ./runs/
ClickHouse appends new rows to coleman_results
execution_id is the safest way to isolate one run analytically
Checkpoints update the latest durable state for the same run and experiment
If you want a fresh analytical space, clean ./runs/ and optionally ./checkpoints/
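
Since results are appended and not replaced, starting from a clean slate means deleting the directories yourself. The helper below is a hypothetical convenience sketch, not a Coleman command. It removes the runs directory and, only when asked, the checkpoints directory, then reports what it deleted. The demo runs against a temporary directory so that nothing real is touched.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Optional


def reset_analysis_space(runs_dir: Path, checkpoints_dir: Optional[Path] = None) -> list[Path]:
    """Delete the runs directory (and optionally checkpoints); return what was removed."""
    removed = []
    for target in (runs_dir, checkpoints_dir):
        if target is not None and target.is_dir():
            shutil.rmtree(target)
            removed.append(target)
    return removed


# Demo layout in a throwaway directory; the partition directory name is made up.
with tempfile.TemporaryDirectory() as tmp:
    runs = Path(tmp) / "runs"
    checkpoints = Path(tmp) / "checkpoints"
    (runs / "scenario=demo").mkdir(parents=True)
    checkpoints.mkdir()
    removed_names = [p.name for p in reset_analysis_space(runs)]
    runs_gone = not runs.exists()
    checkpoints_kept = checkpoints.is_dir()
    print(removed_names, runs_gone, checkpoints_kept)
```

Leaving `checkpoints_dir` unset keeps resume state intact, which matches the "optionally" in the note above.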

Suggested Next Steps

Run coleman run --config run.yaml to generate fresh experiment data
Run make cost-structural to evaluate structural cost before and after changes
Run make cost-energy to compare energy impact of different implementations
Open Grafana to inspect live execution behavior while the run is active
Use the Parquet summary above for final comparisons and report export
Inspect ./checkpoints/ to verify resume and recovery progress
Switch to ClickHouse when you want a persistent analytical store instead of Parquet files

Interactive version

For an executable version of this workflow, run the marimo notebook:

marimo edit docs/workflow.py