Coleman Workflow

This page covers the full operational loop for running and analyzing experiments. Each section maps to a cell in the interactive marimo notebook (docs/workflow.py) — open it locally with marimo edit docs/workflow.py for a live, executable version.

Download options for the notebook source:

  • Browser raw view: https://raw.githubusercontent.com/jacksonpradolima/coleman/main/docs/workflow.py
  • Direct download with curl:
curl -L https://raw.githubusercontent.com/jacksonpradolima/coleman/main/docs/workflow.py -o workflow.py
  • Direct download with wget:
wget https://raw.githubusercontent.com/jacksonpradolima/coleman/main/docs/workflow.py -O workflow.py

1 — Active Configuration

Read the runtime configuration from run.yaml:

import yaml
from pathlib import Path

config_path = Path("run.yaml")
with open(config_path, encoding="utf-8") as f:
    config = yaml.safe_load(f) or {}

results_cfg = config.get("results", {})
checkpoint_cfg = config.get("checkpoint", {})
telemetry_cfg = config.get("telemetry", {})
experiment_cfg = config.get("experiment", {})

Key settings you should inspect:

Setting                 | Where it comes from
------------------------|------------------------------------------
datasets                | experiment.datasets
budget.mode             | experiment.budget.mode
budget.values           | experiment.budget.values
results.enabled         | results.enabled
results.sink            | results.sink (parquet / clickhouse)
results.out_dir         | results.out_dir
checkpoint.enabled      | checkpoint.enabled
checkpoint.base_dir     | checkpoint.base_dir
telemetry.enabled       | telemetry.enabled
telemetry.otlp_endpoint | telemetry.otlp_endpoint
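
To inspect these settings in one pass, a small helper can walk the dotted paths from the table. This is only a sketch: `lookup`, `SETTING_PATHS`, and `sample_config` are hypothetical names that are not part of Coleman, and the sample values are made up. In practice you would pass the `config` dictionary loaded from run.yaml.

```python
from typing import Any

# Dotted paths taken from the settings table.
SETTING_PATHS = [
    "experiment.datasets",
    "experiment.budget.mode",
    "experiment.budget.values",
    "results.enabled",
    "results.sink",
    "results.out_dir",
    "checkpoint.enabled",
    "checkpoint.base_dir",
    "telemetry.enabled",
    "telemetry.otlp_endpoint",
]


def lookup(config: dict, dotted_path: str, default: Any = None) -> Any:
    """Follow a dotted path through nested dicts; return default if a key is missing."""
    node: Any = config
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


# Stand-in for the parsed run.yaml (illustrative values only).
sample_config = {
    "experiment": {"datasets": ["demo"], "budget": {"mode": "ratio"}},
    "results": {"enabled": True, "sink": "parquet"},
}

for path in SETTING_PATHS:
    print(f"{path:28} {lookup(sample_config, path, default='<unset>')}")
```

Returning a default instead of raising keeps the report usable when a section such as telemetry is absent from the file.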

2 — Code Cost Evaluation

Coleman measures code cost as a multi-dimensional scorecard with four dimensions:

Dimension   | What it measures                         | Tools
------------|------------------------------------------|--------------------
Structural  | Maintainability, complexity, change risk | Radon, Xenon, Wily
Runtime     | CPU time, hotspots, memory pressure      | Scalene, py-spy
Energy      | Estimated energy / carbon impact         | CodeCarbon, pyRAPL
Operational | Infrastructure effort proxy              | All of the above

Structural cost — CI gates

Two gates run in CI on every pull request:

# Xenon complexity gate
python -m xenon --max-absolute C --max-modules B --max-average A coleman/

# Radon maintainability index (MI) — lists modules below the threshold;
# CI fails if this command reports such modules or exits with an error
python -m radon mi -s -n B coleman/
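
The three Xenon thresholds apply at different granularities: --max-absolute to each code block, --max-modules to each module, and --max-average to the whole codebase. The sketch below imitates that logic in plain Python using Radon's documented cyclomatic-complexity ranks (A: 1-5, B: 6-10, C: 11-20, D: 21-30, E: 31-40, F: 41+). It is a simplified illustration, not Xenon's implementation: it assumes each module is ranked by the average complexity of its blocks, and the function names and sample complexities are made up.

```python
def cc_rank(complexity: float) -> str:
    """Map a cyclomatic complexity score to a Radon-style rank letter."""
    for upper, rank in ((5, "A"), (10, "B"), (20, "C"), (30, "D"), (40, "E")):
        if complexity <= upper:
            return rank
    return "F"


def gate_violations(modules: dict,
                    max_absolute: str = "C",
                    max_modules: str = "B",
                    max_average: str = "A") -> list:
    """Return human-readable threshold violations (an empty list means pass)."""
    violations = []
    all_blocks = []
    for name, blocks in modules.items():
        all_blocks.extend(blocks)
        for cc in blocks:
            # Rank letters compare alphabetically, so "D" > "C" means "worse than C".
            if cc_rank(cc) > max_absolute:
                violations.append(f"{name}: block complexity {cc} is worse than {max_absolute}")
        if blocks and cc_rank(sum(blocks) / len(blocks)) > max_modules:
            violations.append(f"{name}: module average is worse than {max_modules}")
    if all_blocks and cc_rank(sum(all_blocks) / len(all_blocks)) > max_average:
        violations.append(f"codebase average is worse than {max_average}")
    return violations


# Hypothetical per-block complexities for two modules.
sample = {"policy.py": [3, 4, 22], "reward.py": [2, 2]}
for line in gate_violations(sample):
    print(line)
```

With the sample data, one block in policy.py is ranked D and the overall average falls into rank B, so both the absolute and the average thresholds are violated while the per-module threshold passes.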

Running code cost checks locally

# All structural checks (complexity + maintainability + xenon gate)
make cost-structural

# Runtime profiling with Scalene
make cost-profile-scalene

# Energy estimation with CodeCarbon
make cost-energy

# Complexity trend analysis with Wily
make cost-wily

See Code Cost Evaluation for full documentation.


3 — Live Observability

Grafana is where you inspect live execution behavior. While the run is active, use it to watch throughput, latency, CPU, and memory, and to separate the view by worker, dataset, and execution.


4 — Resume / Recovery State

Inspect checkpoint progress files used for resume and recovery:

import json
from pathlib import Path

checkpoint_root = Path(checkpoint_cfg.get("base_dir", "checkpoints"))
checkpoint_files = sorted(checkpoint_root.glob("**/progress_*.json"))

for f in checkpoint_files:
    payload = json.loads(f.read_text(encoding="utf-8"))
    print(f"  {f.parent.name}: step={payload.get('step_committed')}")

5 — Final Results Storage

Final experiment facts are stored in the results sink, not in Grafana.

  • Parquet root: configured via results.out_dir (default ./runs)
  • ClickHouse sink: enabled only when results.sink = "clickhouse"
  • Re-running experiments in the same runs directory appends new Parquet files by default. Existing result files are preserved.
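
Because the Parquet root is configurable, it can help to derive the glob from the configuration instead of hard-coding ./runs. The following is a minimal sketch: `parquet_glob_for` and `count_result_files` are hypothetical helpers, and `results_cfg` is assumed to be the `results` section of run.yaml.

```python
from pathlib import Path


def parquet_glob_for(results_cfg: dict) -> str:
    """Build a recursive Parquet glob from results.out_dir (default ./runs)."""
    out_dir = Path(results_cfg.get("out_dir", "./runs"))
    return (out_dir / "**" / "*.parquet").as_posix()


def count_result_files(results_cfg: dict) -> int:
    """Count Parquet files currently under the results directory."""
    out_dir = Path(results_cfg.get("out_dir", "./runs"))
    if not out_dir.is_dir():
        return 0
    return sum(1 for _ in out_dir.rglob("*.parquet"))


print(parquet_glob_for({}))
print(parquet_glob_for({"out_dir": "/data/experiments"}))
print(count_result_files({"out_dir": "./runs"}))
```

Since re-runs append files, comparing the file count before and after a run is a quick way to confirm that new results were written.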

Loading results with DuckDB

import duckdb

parquet_glob = "./runs/**/*.parquet"

summary_df = duckdb.sql(f"""
    SELECT scenario,
           execution_id,
           experiment,
           policy,
           reward_function,
           AVG(fitness) AS avg_napfd,
           AVG(cost) AS avg_apfdc,
           AVG(prioritization_time) AS avg_prioritization_time,
           AVG(process_memory_rss_mib) AS avg_rss_mib,
           AVG(process_cpu_utilization_percent) AS avg_cpu_pct,
           MAX(wall_time_seconds) AS wall_time_seconds
    FROM read_parquet('{parquet_glob}', hive_partitioning=1)
    GROUP BY scenario, execution_id, experiment, policy, reward_function
    ORDER BY avg_napfd DESC, avg_apfdc DESC
""").df()

6 — Export

Export the current summary as a CSV artifact for reports:

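from pathlib import Path

runs_root = Path(results_cfg.get("out_dir", "./runs"))  # results root, default ./runs
export_dir = runs_root / "analysis"
export_dir.mkdir(parents=True, exist_ok=True)
summary_df.to_csv(export_dir / "summary.csv", index=False)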

7 — Analysis Plot

Plot average NAPFD per policy from the persisted final results:

import matplotlib.pyplot as plt
import seaborn as sns

top_policies = (
    summary_df
    .groupby("policy", as_index=False)["avg_napfd"]
    .mean()
    .sort_values("avg_napfd", ascending=False)
)

fig, ax = plt.subplots(figsize=(10, 4))
sns.barplot(data=top_policies, x="policy", y="avg_napfd", ax=ax)
ax.set_title("Average NAPFD by Policy")
ax.set_xlabel("Policy")
ax.set_ylabel("Average NAPFD")
ax.tick_params(axis="x", rotation=45)
plt.tight_layout()

8 — Runner Extensions (hooks + extensions)

When you need custom domain workflows without replacing Coleman orchestration, add hooks and extensions in YAML:

hooks:
  fail_fast: false
  plugins:
     - my_project.hooks.ForecastHook

extensions:
  my_domain:
     forecast_selection:
        policy: ThompsonSampling
        reward: Binary

Recommended lifecycle contract:

  • Run and dataset hooks execute in the coordinator process
  • Execution hooks execute in the worker process
  • Keep hook code idempotent and free of global mutable state
  • Emit custom artifacts under each run directory when possible

on_error is emitted for execution-start failures and environment construction failures, not only for failures inside the main run body.
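
The contract above can be illustrated with a small hook that reacts to on_error. This is a sketch under loud assumptions: only the name on_error comes from this page, while the class name, the constructor, the method arguments, and the artifact layout are hypothetical. Check Coleman's hook interface before adapting it.

```python
import json
import tempfile
from pathlib import Path


class ErrorLogHook:
    """Hypothetical hook that records failures as one JSON file per execution.

    The arguments of on_error and the way hooks are instantiated are
    assumptions made for this sketch, not Coleman's documented API.
    """

    def __init__(self, run_dir: str) -> None:
        # Instance state only: no module-level (global) mutable state.
        self.run_dir = Path(run_dir)

    def on_error(self, execution_id: str, stage: str, error: Exception) -> Path:
        # The file name depends only on execution_id, so a repeated call
        # overwrites the same artifact instead of accumulating duplicates.
        target = self.run_dir / "hook_artifacts" / f"error_{execution_id}.json"
        target.parent.mkdir(parents=True, exist_ok=True)
        record = {"execution_id": execution_id, "stage": stage, "error": repr(error)}
        target.write_text(json.dumps(record, sort_keys=True), encoding="utf-8")
        return target


# Demonstration in a throwaway directory standing in for a run directory.
with tempfile.TemporaryDirectory() as tmp:
    hook = ErrorLogHook(tmp)
    first = hook.on_error("exec-001", "environment", RuntimeError("boom"))
    second = hook.on_error("exec-001", "environment", RuntimeError("boom"))
    print(first == second, len(list(first.parent.iterdir())))
```

Keying the artifact on the execution identifier keeps the hook idempotent: a retry after an environment construction failure rewrites the same file.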


9 — Advanced Analyses You Can Run

Coleman now ships built-in analysis reports aligned with the analysis playbook.

# Quality ranking (NAPFD)
coleman analyze quality --input ./runs --format markdown --out reports/quality.md

# Stability (coefficient of variation)
coleman analyze stability --input ./runs --format csv --out reports/stability.csv

# Pareto frontier (quality vs cost)
coleman analyze pareto --input ./runs --format table

Supported report modules:

  1. quality
  2. cost
  3. stability
  4. pareto
  5. sensitivity
  6. resources

Use this checklist after generating enough runs:

  1. Policy stability
    • Compare mean, std, and coefficient of variation of NAPFD per policy.
  2. Quality vs cost frontier
    • Build a Pareto frontier using high NAPFD + low APFDc.
  3. Budget sensitivity
    • Compare policies per scenario / time-ratio group.
  4. Execution variance
    • Track variance between independent executions for the same policy/reward.
  5. Operational footprint
    • Compare memory and CPU metrics versus quality gains.
  6. Custom extension impact
    • Group by extension-related dimensions (from custom artifacts) and compare uplift.
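
As an illustration of the policy stability item, the sketch below computes mean, sample standard deviation, and coefficient of variation per policy from per-execution NAPFD averages. It uses only the standard library; `stability_by_policy`, the policy names, and the numbers are hypothetical, and the input rows are assumed to have the `policy` and `avg_napfd` columns produced by the summary query.

```python
from collections import defaultdict
from statistics import mean, stdev


def stability_by_policy(rows):
    """Return {policy: (mean, std, cv)} for per-execution NAPFD averages."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["policy"]].append(row["avg_napfd"])
    report = {}
    for policy, values in grouped.items():
        mu = mean(values)
        # Sample standard deviation needs at least two executions.
        sd = stdev(values) if len(values) > 1 else 0.0
        cv = sd / mu if mu else float("nan")
        report[policy] = (mu, sd, cv)
    return report


# Made-up summary rows: one row per (policy, execution).
rows = [
    {"policy": "PolicyA", "avg_napfd": 0.80},
    {"policy": "PolicyA", "avg_napfd": 0.82},
    {"policy": "PolicyA", "avg_napfd": 0.78},
    {"policy": "PolicyB", "avg_napfd": 0.90},
    {"policy": "PolicyB", "avg_napfd": 0.60},
]

for policy, (mu, sd, cv) in sorted(stability_by_policy(rows).items()):
    print(f"{policy}: mean={mu:.3f} std={sd:.3f} cv={cv:.3f}")
```

A lower coefficient of variation indicates a policy whose quality is more consistent across independent executions, even when its mean is not the highest.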

See the complete guide in analysis-playbook.md.
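
For the quality vs cost frontier item, a Pareto frontier keeps every configuration that no other configuration beats on both metrics at once. The sketch below follows the checklist wording and treats `avg_napfd` as the metric to maximize and `avg_apfdc` as the metric to minimize; both are parameters so the direction can be changed. `pareto_front` and the sample values are hypothetical, and this is not the implementation behind coleman analyze pareto.

```python
def pareto_front(rows, maximize="avg_napfd", minimize="avg_apfdc"):
    """Return the rows that are not dominated by any other row."""

    def dominates(a, b):
        # a dominates b if it is no worse on both metrics and better on at least one.
        no_worse = a[maximize] >= b[maximize] and a[minimize] <= b[minimize]
        strictly_better = a[maximize] > b[maximize] or a[minimize] < b[minimize]
        return no_worse and strictly_better

    return [r for r in rows if not any(dominates(other, r) for other in rows)]


# Made-up per-policy summary values.
candidates = [
    {"policy": "A", "avg_napfd": 0.80, "avg_apfdc": 0.30},
    {"policy": "B", "avg_napfd": 0.70, "avg_apfdc": 0.20},
    {"policy": "C", "avg_napfd": 0.60, "avg_apfdc": 0.40},
    {"policy": "D", "avg_napfd": 0.80, "avg_apfdc": 0.50},
]

for row in pareto_front(candidates):
    print(row["policy"], row["avg_napfd"], row["avg_apfdc"])
```

The pairwise comparison is quadratic in the number of rows, which is acceptable for per-policy summaries but would need a sort-based approach for very large inputs.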


10 — Full Extensibility in Practice

Coleman supports full orchestration customization patterns. Use this quick map:

  1. hooks + extensions
    • Best option for domain logic without rewriting runner flow.
  2. New Policy and Reward
    • Native extension model through module exports + YAML selection.
  3. Custom EvaluationMetric and Environment
    • Source-level extension path for deeper runtime behavior changes.

Parallel-safe implementation guidance:

  1. Keep execution hooks worker-local and idempotent.
  2. Keep run/dataset hooks coordinator-local for aggregation and reporting.
  3. Persist custom artifacts with run_id + execution_id so analyses can join safely across parallel runs.
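
One way to follow the third guideline is to give every artifact a path and a payload that both carry run_id and execution_id, and to write it atomically so parallel workers never expose partial files. The sketch below assumes a hypothetical key=value directory layout; `write_artifact` and `load_artifacts` are illustrative helpers, not Coleman functions.

```python
import json
import os
import tempfile
from pathlib import Path


def write_artifact(root, run_id, execution_id, name, payload):
    """Atomically write one JSON artifact keyed by run_id and execution_id."""
    target_dir = Path(root) / f"run_id={run_id}" / f"execution_id={execution_id}"
    target_dir.mkdir(parents=True, exist_ok=True)
    # Join keys are stored in the record itself so analyses do not depend on paths.
    record = {**payload, "run_id": run_id, "execution_id": execution_id}
    # Write to a temporary file, then rename: readers never see a partial file.
    with tempfile.NamedTemporaryFile(
        "w", dir=target_dir, suffix=".tmp", delete=False, encoding="utf-8"
    ) as tmp:
        json.dump(record, tmp)
    target = target_dir / f"{name}.json"
    os.replace(tmp.name, target)
    return target


def load_artifacts(root, name):
    """Collect every artifact called `name`, one record per (run_id, execution_id)."""
    paths = sorted(Path(root).glob(f"run_id=*/execution_id=*/{name}.json"))
    return [json.loads(p.read_text(encoding="utf-8")) for p in paths]


# Demonstration: two hypothetical workers write under the same run.
with tempfile.TemporaryDirectory() as root:
    write_artifact(root, "run-1", "exec-1", "forecast", {"uplift": 0.04})
    write_artifact(root, "run-1", "exec-2", "forecast", {"uplift": 0.01})
    write_artifact(root, "run-1", "exec-1", "forecast", {"uplift": 0.04})  # retry
    for record in load_artifacts(root, "forecast"):
        print(record)
```

Because each worker writes only to its own execution directory, no locking is needed, and a retried execution replaces its earlier artifact instead of duplicating it.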

Complete implementation details and end-to-end examples:


Query Snippets

DuckDB

SELECT scenario, execution_id, policy, AVG(fitness) AS avg_napfd
FROM read_parquet('./runs/**/*.parquet', hive_partitioning=1)
GROUP BY scenario, execution_id, policy
ORDER BY avg_napfd DESC;

ClickHouse

SELECT scenario, execution_id, policy, AVG(fitness) AS avg_napfd
FROM coleman_results
GROUP BY scenario, execution_id, policy
ORDER BY avg_napfd DESC;

Result Persistence Semantics

  • Parquet appends new files under ./runs/
  • ClickHouse appends new rows to coleman_results
  • execution_id is the safest way to isolate one run analytically
  • Checkpoints update the latest durable state for the same run and experiment
  • If you want a fresh analytical space, clean ./runs/ and optionally ./checkpoints/
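
Because both sinks append, starting from a clean analytical space means removing the old directories explicitly. The helper below is a cautious sketch: `reset_workspace` is a hypothetical function, it defaults to a dry run that only reports what would be deleted, and it touches checkpoints only when a directory is passed in.

```python
import shutil
import tempfile
from pathlib import Path


def reset_workspace(runs_dir="./runs", checkpoints_dir=None, dry_run=True):
    """Remove the results (and optionally checkpoint) directories.

    Returns the existing directories that were, or would be, removed.
    """
    targets = [Path(runs_dir)]
    if checkpoints_dir is not None:
        targets.append(Path(checkpoints_dir))
    affected = [p for p in targets if p.is_dir()]
    if not dry_run:
        for path in affected:
            shutil.rmtree(path)
    return affected


# Demonstration inside a temporary sandbox, so no real results are touched.
with tempfile.TemporaryDirectory() as sandbox:
    runs = Path(sandbox) / "runs"
    runs.mkdir()
    print(reset_workspace(runs, dry_run=True))  # reports the directory, deletes nothing
    print(runs.exists())
    reset_workspace(runs, dry_run=False)
    print(runs.exists())
```

Removing checkpoints as well discards resume state, so leave `checkpoints_dir` unset if an interrupted run should still be recoverable.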

Suggested Next Steps

  • Run coleman run --config run.yaml to generate fresh experiment data
  • Run make cost-structural to evaluate structural cost before and after changes
  • Run make cost-energy to compare energy impact of different implementations
  • Open Grafana to inspect live execution behavior while the run is active
  • Use the Parquet summary above for final comparisons and report export
  • Inspect ./checkpoints/ to verify resume and recovery progress
  • Switch to ClickHouse when you want a persistent analytical store instead of Parquet files

Interactive version

For an executable version of this workflow, run the marimo notebook:

marimo edit docs/workflow.py