Experiments & Reproducibility

This page lists the currently supported reproduction entry points for v0.5.1. It separates auditable benchmark workflows from archived paper-era scripts and exploratory examples.

Quick Reproduction Path

The paper-facing replay contract lives in artifacts/paper_2026/. It records commands, configs, expected outputs, tolerance bands, checksums, and pinned container recipes.
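
The two checks that such a contract implies, checksum comparison and tolerance-band comparison, can be sketched with the standard library alone. This is a minimal illustration, not the project's own verification code: the helper names `sha256_of_file` and `within_tolerance` are hypothetical, and the layout of the files in artifacts/paper_2026/ is not assumed here.

```python
import hashlib
import math
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def within_tolerance(observed, expected, rel_tol=0.0, abs_tol=0.0):
    """True when observed lies inside the band around expected.

    Hypothetical helper: the band is the larger of the relative and
    absolute tolerance, which is the rule math.isclose applies.
    """
    return math.isclose(observed, expected, rel_tol=rel_tol, abs_tol=abs_tol)
```

A rerun would then pass when every recorded file digest matches and every reported metric falls inside its recorded band.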

From a clean checkout:

python -m pip install -r requirements-paper.txt
python -m pip install -e .
make reproduce-paper

Run one domain:

make reproduce-dose
make reproduce-rf
make reproduce-ik
make reproduce-reference-robotics

Override output root or device:

make reproduce-dose REPRO_ROOT=results/tmp_repro REPRO_DEVICE=cpu

For paper-grade reruns, retain:

  • the git commit
  • artifacts/paper_2026/
  • the produced run directory
  • run-local manifest.json and provenance.json

Scientific Claim Benchmarks

Run smoke mode locally:

python -m zeroproofml.benchmarks dose --mode smoke --device cpu --seeds 1
python -m zeroproofml.benchmarks rf --mode smoke --device cpu --seeds 1
python -m zeroproofml.benchmarks ik --mode smoke --device cpu --seeds 1

Resume a paper-mode run:

python -m zeroproofml.benchmarks dose --mode paper --device cpu \
  --out-root results/benchmarks/dose/<run_dir> --resume

Regenerate a report:

python -m zeroproofml.report benchmark results/benchmarks/dose/<run_dir> --html-report

Expected output root:

results/benchmarks/<domain>/run_<timestamp>_<sha>/
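
When scripting over many runs, it can help to split a run directory name back into its parts. The sketch below assumes the `<sha>` suffix is a lowercase hexadecimal git hash without underscores and leaves the timestamp unparsed, since its exact format is not stated here. `parse_run_dir_name` is a hypothetical helper, not part of the package.

```python
import re

# Assumption: <sha> is lowercase hex with no underscores. The timestamp is
# captured as an opaque string because its format is not specified here.
RUN_DIR_PATTERN = re.compile(r"run_(?P<timestamp>.+)_(?P<sha>[0-9a-f]{4,40})")


def parse_run_dir_name(name):
    """Split 'run_<timestamp>_<sha>' into (timestamp, sha), or return None."""
    match = RUN_DIR_PATTERN.fullmatch(name)
    if match is None:
        return None
    return match.group("timestamp"), match.group("sha")
```

Because the timestamp group is greedy, the hash is always taken from the text after the last underscore, even if the timestamp itself contains underscores.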

Core artifacts:

  • seed_*/per_seed_result.json
  • manifest.json
  • provenance.json
  • resume_state.json
  • RUN_REPORT.md
  • optional RUN_REPORT.html
  • aggregated/summary.json
  • aggregated/paired_stats.json
  • aggregated/benchmark_metrics.jsonl
  • CLAIM_AUDIT.md
  • regenerated figures under figures/
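
A quick completeness check over this list can be sketched as follows. It is an illustration under assumptions: `missing_core_artifacts` is a hypothetical helper, the optional HTML report and the figures/ directory are deliberately not checked, and every other listed file is treated as required.

```python
from pathlib import Path

# Non-optional files from the list above, relative to the run directory.
REQUIRED_FILES = (
    "manifest.json",
    "provenance.json",
    "resume_state.json",
    "RUN_REPORT.md",
    "aggregated/summary.json",
    "aggregated/paired_stats.json",
    "aggregated/benchmark_metrics.jsonl",
    "CLAIM_AUDIT.md",
)


def missing_core_artifacts(run_dir):
    """Return the relative paths of expected artifacts that are absent."""
    run_dir = Path(run_dir)
    missing = [rel for rel in REQUIRED_FILES if not (run_dir / rel).is_file()]
    # At least one seed directory must contain a per-seed result file.
    if not any(run_dir.glob("seed_*/per_seed_result.json")):
        missing.append("seed_*/per_seed_result.json")
    return missing
```

An empty return value means every checked artifact is present; it says nothing about whether the contents are valid.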

Domain Notes

Domain  Focus                                                 Extra artifacts
DOSE    Censoring, operating points, direction-aware bottoms  dose_operating_points, dose_pareto_front, diagnostics, direction-head summaries
RF      Resonator response, peak retention, extrapolation     signal traces, frequency-response SVGs, qualitative failure packs
IK      Robotics inverse kinematics near singularities        workspace heatmaps, determinant-stratified metrics, fallback route plots

Use generated artifacts for papers and reviews. Avoid hand-curated notebook summaries when a benchmark report can be regenerated from stored JSON.
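
As an illustration of regenerating a number from stored JSON rather than copying it from a notebook, the sketch below averages one metric across the per-seed result files. The schema of per_seed_result.json is not described on this page, so the assumption that a metric sits under a top-level key is hypothetical, as is the helper name `summarize_metric`.

```python
import json
import statistics
from pathlib import Path


def summarize_metric(run_dir, key):
    """Mean and sample standard deviation of one per-seed metric.

    Assumes (hypothetically) that each per_seed_result.json stores the
    metric as a number under the top-level key `key`.
    """
    values = []
    for path in sorted(Path(run_dir).glob("seed_*/per_seed_result.json")):
        with path.open(encoding="utf-8") as handle:
            values.append(float(json.load(handle)[key]))
    if not values:
        raise FileNotFoundError(f"no per-seed results under {run_dir}")
    # A single seed has no sample spread, so report 0.0 in that case.
    spread = statistics.stdev(values) if len(values) > 1 else 0.0
    return {"n_seeds": len(values), "mean": statistics.fmean(values), "stdev": spread}
```

For reported claims, the harness-generated aggregated/summary.json remains the reference; a helper like this is only useful as an independent cross-check.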

Reference Robotics Deployment

End-to-end reference path:

python scripts/reference_robotics_deployment.py --device cpu --epochs 1 --n-samples 2000

Expected artifacts under results/reference_deploy_robotics/:

  • output_contract.json
  • inference_summary.json
  • strict_inference_audit.json
  • bundle/model.onnx
  • bundle/metadata.json
  • bundle/VALIDATION_REPORT.md
  • bundle/VALIDATION_REPORT.summary.json

Importable path:

from zeroproofml.reference_robotics_deployment import (
    ReferenceRoboticsDeploymentConfig,
    load_reference_robotics_deployment_artifacts,
    run_reference_robotics_deployment,
)

artifacts = run_reference_robotics_deployment(
    ReferenceRoboticsDeploymentConfig(device="cpu", epochs=1, n_samples=2000)
)
same_run = load_reference_robotics_deployment_artifacts(artifacts.out_root)

Trajectory Evaluation

Generate a stratified RR IK trajectory dataset:

python scripts/generate_reference_robotics_trajectory_data.py \
  --n-trajectories 48 --steps-per-trajectory 16

Evaluate a policy:

from zeroproofml.reference_robotics_trajectory_eval import (
    evaluate_reference_robotics_trajectory_policy,
    make_reference_robotics_dls_policy,
)

summary = evaluate_reference_robotics_trajectory_policy(
    "results/reference_robotics_trajectory_eval/rr_trajectory_eval_dataset.json",
    make_reference_robotics_dls_policy(damping=0.05),
)
print(summary["aggregate"]["mean_tracking_error"])

The returned summaries include tracking error, fallback rates, joint-limit violations, chattering events, and latency-budget violations. Provenance-aware fallback splits are included when policies tag route kinds.
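
For readers unfamiliar with the damped least-squares (DLS) baseline named in the snippet above, the following is a textbook DLS update for a planar two-revolute (RR) arm, written in plain Python. It is a generic sketch and not the implementation behind make_reference_robotics_dls_policy: the unit link lengths, the function names, and the convention that `damping` enters squared are all assumptions.

```python
import math


def rr_jacobian(q1, q2, l1=1.0, l2=1.0):
    """Position Jacobian of a planar RR arm; its determinant is l1*l2*sin(q2)."""
    s1, c1 = math.sin(q1), math.cos(q1)
    s12, c12 = math.sin(q1 + q2), math.cos(q1 + q2)
    return ((-l1 * s1 - l2 * s12, -l2 * s12),
            (l1 * c1 + l2 * c12, l2 * c12))


def dls_step(jacobian, dx, damping=0.05):
    """Textbook DLS update: dq = J^T (J J^T + damping^2 I)^-1 dx."""
    (a, b), (c, d) = jacobian
    lam2 = damping * damping
    # M = J J^T + damping^2 I is symmetric and positive definite for damping > 0,
    # so it stays invertible even when J itself is singular.
    m00 = a * a + b * b + lam2
    m01 = a * c + b * d
    m11 = c * c + d * d + lam2
    det = m00 * m11 - m01 * m01
    # Solve M y = dx with the closed-form 2x2 inverse.
    y0 = (m11 * dx[0] - m01 * dx[1]) / det
    y1 = (-m01 * dx[0] + m00 * dx[1]) / det
    # Map back to joint space: dq = J^T y.
    return (a * y0 + c * y1, b * y0 + d * y1)
```

With the arm fully stretched (q2 = 0) the Jacobian is singular, yet the damped update stays finite. Away from singularities and with small damping, the step approaches the exact inverse-Jacobian solution.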

Downstream Pipeline Simulator

Use the downstream simulator to test whether reject flags, provenance labels, and direction labels survive multi-step handoffs:

from zeroproofml.downstream_pipeline import (
    DownstreamPipelineReferenceSample,
    build_downstream_pipeline_simulator,
    compare_downstream_pipeline_strategies,
    write_downstream_pipeline_report,
)

simulator = build_downstream_pipeline_simulator(
    "5-step",
    drop_reject_flag_probability=0.05,
    bad_downstream_behaviors=("json_roundtrip", "aggregate_mean"),
)

comparison = compare_downstream_pipeline_strategies(
    [
        DownstreamPipelineReferenceSample(
            decoded=(-3.0,),
            should_reject=True,
            provenance="semantic",
            direction_label="below",
            sample_id="censored_low",
        ),
    ],
    simulator,
)

write_downstream_pipeline_report("artifacts/composability", result=comparison)

This is an experimental harness for composability evidence, not a stable core SCM API.
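
The failure mode this harness probes can be illustrated without the library. In the sketch below, a sample passes through a chain of handoff steps, and a final check reports which labels arrived intact. All function names are hypothetical stand-ins, and the step behaviors are simplified assumptions rather than the simulator's actual logic.

```python
import json


def json_roundtrip(sample):
    """Serialize and parse a sample, as a service boundary might."""
    return json.loads(json.dumps(sample))


def strip_unknown_fields(sample, known=("decoded",)):
    """A lossy consumer that forwards only the fields it knows about."""
    return {key: value for key, value in sample.items() if key in known}


def run_handoffs(sample, steps):
    """Apply each handoff step in order and return the final sample."""
    for step in steps:
        sample = step(sample)
    return sample


def surviving_fields(original, final,
                     tracked=("should_reject", "provenance", "direction_label")):
    """Map each tracked field to True if it reached the end unchanged."""
    return {key: key in final and final[key] == original[key] for key in tracked}


sample = {
    "decoded": [-3.0],
    "should_reject": True,
    "provenance": "semantic",
    "direction_label": "below",
}
lossless = surviving_fields(sample, run_handoffs(sample, [json_roundtrip]))
lossy = surviving_fields(
    sample, run_handoffs(sample, [json_roundtrip, strip_unknown_fields])
)
```

Here `lossless` reports every label as preserved, while `lossy` reports all three as dropped, because the second step forwards only the decoded value.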

Examples Inventory

Recommended tutorial sequence:

  • examples/01_quickstart.py
  • examples/02_rational_layer.py
  • examples/03_projective_mode.py
  • examples/05_coverage_control.py
  • examples/06_export_bundle.py
  • examples/fru_strict_check_demo.py

Supported reference examples include bridge, autodiff, optimization, C++ bundle consumer, deployment workflows, and the 2R arm example.

Benchmark-helper directories include domain-specific example data and compatibility wrappers used by the scientific harness.

Archived or experimental paths include legacy Transreal-era scripts and non-promoted robotics side paths such as older 3R/6R examples.

Comparing Runs

Compare a new run with one or more baselines:

from zeroproofml.benchmarks import compare_benchmark_runs, load_benchmark_run

run = load_benchmark_run("results/benchmarks/dose/<new_run>")
comparison = compare_benchmark_runs(
    run,
    ["results/benchmarks/dose/<baseline_run>"],
)
print(comparison.to_dict())

Each run records git commit, package versions, arguments, dataset fingerprints, hardware metadata, checkpoint hashes, discovered bundle directories, and dirty-worktree state. Resumed runs preserve attempt history.
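
Before comparing metrics, it is often worth checking whether two runs were produced under the same conditions. The sketch below diffs two provenance records that have already been loaded as dictionaries, for example with json.load on each run's provenance.json. `diff_provenance` is a hypothetical helper, and the key names in the usage example are illustrative assumptions, not the documented schema.

```python
def diff_provenance(left, right):
    """Return {key: (left_value, right_value)} for every key that differs.

    A key present on only one side is reported with None on the other.
    """
    missing = object()  # sentinel so that a stored None is not mistaken for absence
    changed = {}
    for key in sorted(set(left) | set(right)):
        a, b = left.get(key, missing), right.get(key, missing)
        if a != b:
            changed[key] = (None if a is missing else a,
                            None if b is missing else b)
    return changed


# Illustrative records; the real provenance.json keys may differ.
baseline = {"git_commit": "abc1234", "dirty_worktree": False, "device": "cpu"}
new_run = {"git_commit": "def5678", "dirty_worktree": False}
differences = diff_provenance(baseline, new_run)
```

In this example `differences` flags the changed commit and the device entry that exists only in the baseline record.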

Archived Workflows

Older paper-era scripts remain in the repository for historical reference, but current docs and reproducibility claims should use the benchmark harness, reference deployment, and paper bundle above.