Performance & Benchmarks
ZeroProofML performance work is split into two different concerns:
| Concern | Command family | Output |
|---|---|---|
| Scientific claim benchmarks | `python -m zeroproofml.benchmarks ...` | Auditable run directories under `results/benchmarks/<domain>/...` |
| Performance microbenchmarks | `python perf/run_benchmarks.py` and `python perf/parity_runner.py` | Throughput/parity artifacts under `benchmark_results/` |
Do not mix these two when reporting results. The scientific harness supports claims and reproducibility. The microbenchmark suite supports implementation tuning.
Performance Principles
SCM performance comes from keeping singularity information explicit and vectorized:
- Use payload-plus-mask arrays instead of Python-side guard branches.
- Prefer backend vectorized helpers such as `scm_*_numpy`, `scm_*_torch`, and `scm_*_jax`.
- Track coverage and denominator statistics as metrics, not control-flow exceptions.
- Train rational heads in projective form and decode once at the boundary.
- Keep IEEE conversion at system boundaries.
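As an illustration of the payload-plus-mask principle, the following is a minimal NumPy sketch, not the library's implementation. The helper `masked_reciprocal` and its `eps` parameter are hypothetical names introduced here, and the convention that `True` marks a singular slot is an assumption.

```python
import numpy as np

def masked_reciprocal(payload, mask, eps=1e-12):
    """Hypothetical helper: reciprocal over (payload, mask) pairs.

    Assumes mask is True where a slot is already singular.
    """
    payload = np.asarray(payload, dtype=np.float64)
    mask = np.asarray(mask, dtype=bool)
    # A slot is singular if it already was, or if its payload is ~0.
    out_mask = mask | (np.abs(payload) <= eps)
    # Substitute a safe denominator in singular slots so no warning or
    # exception fires; the payload there is a placeholder, not a result.
    safe = np.where(out_mask, 1.0, payload)
    out = np.where(out_mask, 0.0, 1.0 / safe)
    return out, out_mask

values = np.array([2.0, 0.0, -4.0, 5.0])
mask = np.array([False, False, False, True])
recip, recip_mask = masked_reciprocal(values, mask)
```

The whole batch is processed with array operations only: the zero payload and the already-masked slot are both recorded in `recip_mask` without a per-element `if` or a raised exception.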
Vectorized SCM Ops
import numpy as np
from zeroproofml.scm.ops import scm_add_numpy, scm_div_numpy
values = np.array([1.0, 0.0, -2.0])
mask = np.array([False, True, False])
num, num_mask = scm_add_numpy(values, values, mask, mask)
den, den_mask = scm_div_numpy(
num,
np.ones_like(num),
num_mask,
np.zeros_like(num_mask),
)
Boolean masks stay aligned with numeric payloads, which keeps downstream coverage and strict decode logic simple.
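To show why aligned masks keep downstream logic simple, here is a small sketch of a coverage metric and a boundary decode. Both `coverage` and `decode_to_ieee` are hypothetical helpers, and mapping masked slots to NaN is an assumption about what the IEEE boundary conversion might look like.

```python
import numpy as np

def coverage(mask):
    """Fraction of slots that carry a usable payload (mask is False)."""
    mask = np.asarray(mask, dtype=bool)
    return 1.0 - float(mask.mean())

def decode_to_ieee(payload, mask):
    """Convert a (payload, mask) pair to plain floats, NaN where masked."""
    payload = np.asarray(payload, dtype=np.float64)
    return np.where(np.asarray(mask, dtype=bool), np.nan, payload)

payload = np.array([2.0, 0.0, -4.0, 0.0])
mask = np.array([False, True, False, True])

cov = coverage(mask)                     # reported as a metric
decoded = decode_to_ieee(payload, mask)  # done once, at the boundary
```

Because the mask has the same shape as the payload, coverage is a single reduction and the decode is a single `np.where`.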
Dtype And Threshold Tuning
When denominators are close to zero, numeric precision affects mask stability. Recommended defaults:
- Use `float64` for near-singular rational heads unless you have measured that lower precision is stable.
- Tune `SCMRationalLayer.singular_epsilon` for forward denominator detection.
- Tune `tau_train` for training margins and `tau_infer` for deployment separately.
- Re-run threshold sweeps after changing preprocessing, normalization, architecture, or dtype.
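The recommendations above can be illustrated with a rough NumPy sketch. It assumes that a threshold acts as a simple cutoff on the denominator magnitude |Q|. The function `sweep_coverage` and the sample values are hypothetical and are not part of the library.

```python
import numpy as np

def sweep_coverage(denominators, taus):
    """Return the fraction of samples with |Q| > tau for each tau."""
    abs_q = np.abs(np.asarray(denominators))
    return {float(tau): float((abs_q > tau).mean()) for tau in taus}

# Hypothetical denominator samples spanning several orders of magnitude.
q = np.array([1.0, 1e-3, 1e-6, 1e-9, 0.0], dtype=np.float64)
table = sweep_coverage(q, taus=[1e-8, 1e-5, 1e-2])

# Dtype effect: the same small denominator survives in float64 but
# cancels to exactly zero in float32.
x = 1e-9
q_f64 = (np.float64(1.0) + np.float64(x)) - np.float64(1.0)
q_f32 = (np.float32(1.0) + np.float32(x)) - np.float32(1.0)

eps = 1e-12  # illustrative cutoff, not a recommended value
flag_f64 = abs(q_f64) <= eps  # not flagged as singular
flag_f32 = abs(q_f32) <= eps  # flagged as singular
```

The second half shows why a threshold tuned under one dtype may not transfer to another: the mask for the same input flips when precision drops.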
Scientific Benchmarks
Run smoke-mode checks:
python -m zeroproofml.benchmarks dose --mode smoke --device cpu
python -m zeroproofml.benchmarks rf --mode smoke --device cpu
python -m zeroproofml.benchmarks ik --mode smoke --device cpu
Resume paper-mode runs:
python -m zeroproofml.benchmarks dose --mode paper --device cpu \
--out-root results/benchmarks/dose/<run_dir> --resume
Useful flags:
| Flag | Use |
|---|---|
| `--skip-complete-seeds` | Reuse an output directory and fill missing canonical seed results |
| `--force-rerun` | Rerun requested seeds inside an existing output root |
| `--html-report` | Add `RUN_REPORT.html` beside the default Markdown report |
| `--baseline-run-dir` | Recompute paired baseline comparisons for report regeneration |
Validate and load run directories from Python:
from zeroproofml.benchmarks import (
compare_benchmark_runs,
load_benchmark_run,
validate_run_dir,
)
manifest = validate_run_dir("results/benchmarks/dose/run_20260407_150800_abcd123")
run = load_benchmark_run("results/benchmarks/dose/run_20260407_150800_abcd123")
comparison = compare_benchmark_runs(run, ["results/benchmarks/dose/baseline"])
Run Directory Contract
Every current scientific benchmark run is self-contained:
| Path | Purpose |
|---|---|
| `manifest.json` | Artifact index, config hash, fingerprints, checkpoint hashes |
| `provenance.json` | Git, system, package versions, invocation args, dataset fingerprints |
| `resume_state.json` | Resume attempt history for paper-mode runs |
| `seed_*/per_seed_result.json` | Canonical per-seed raw result |
| `aggregated/summary.json` | Cross-seed metrics |
| `aggregated/paired_stats.json` | Paired baseline deltas when present |
| `aggregated/benchmark_metrics.jsonl` | Versioned benchmark metric log |
| `RUN_REPORT.md` | Human-readable report |
| `RUN_REPORT.html` | Optional browser report |
| `CLAIM_AUDIT.md` | Claim-gate evidence |
| `figures/*.svg` | Regenerated report figures |
Domain-specific additions:
- DOSE: operating points, Pareto fronts, confusion matrices, threshold sweeps, direction-head diagnostics.
- RF: frequency-response traces, Bode-style SVGs, qualitative failure figure packs.
- IK: workspace heatmaps, determinant-stratified metrics, fallback-route plots when diagnostics are present.
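For a quick filesystem-only sanity check against the run directory contract, a sketch along the following lines may help. It is not a substitute for `validate_run_dir`. The choice of which files count as required is an assumption made here, and `missing_artifacts` and the `seed_0` directory name are hypothetical.

```python
import tempfile
from pathlib import Path

# Assumption: these paths are treated as always required; optional or
# conditional artifacts (HTML report, paired stats, resume state) are not.
REQUIRED_FILES = [
    "manifest.json",
    "provenance.json",
    "aggregated/summary.json",
    "RUN_REPORT.md",
    "CLAIM_AUDIT.md",
]

def missing_artifacts(run_dir):
    """List contract paths that are absent from run_dir."""
    run_dir = Path(run_dir)
    missing = [rel for rel in REQUIRED_FILES if not (run_dir / rel).is_file()]
    # At least one per-seed result is expected under a seed_* directory.
    if not any(run_dir.glob("seed_*/per_seed_result.json")):
        missing.append("seed_*/per_seed_result.json")
    return missing

# Demo on a throwaway directory containing only two of the artifacts.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "manifest.json").write_text("{}")
    (root / "seed_0").mkdir()
    (root / "seed_0" / "per_seed_result.json").write_text("{}")
    demo_missing = missing_artifacts(root)
```

A check like this only confirms that files exist. Schema, hash, and fingerprint validation remain the job of the library's own validator.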
Regenerating Reports
python -m zeroproofml.report benchmark results/benchmarks/dose/<run_dir> --html-report
python -m zeroproofml.report bundle path/to/bundle_dir
python -m zeroproofml.report training-log runs/scm_train_metrics.jsonl
Benchmark report regeneration refreshes Markdown summaries and standard SVG figures from stored artifacts. It should not require rerunning training.
Operating Points
DOSE benchmark artifacts define named operating points:
| Preset | Goal |
|---|---|
| `safety_first` | Reduce false in-range predictions on censored samples |
| `accuracy_first` | Reduce false censored predictions on in-range samples |
| `direction_aware` | Use strict gate plus direction head for actionable bottom outputs |
Use the generated aggregated/dose_operating_points.{json,md} files for published tables instead of hand-selected notebook values.
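To make the trade-off between `safety_first` and `accuracy_first` concrete, here is a toy sketch of how two such operating points could be picked from a threshold sweep. It assumes a scalar gate score where values below a threshold are predicted censored. The scores, thresholds, and selection rule are hypothetical and do not describe how the benchmark derives its presets.

```python
import numpy as np

def error_rates(score, is_censored, tau):
    """Return (false in-range rate on censored, false censored rate on in-range)."""
    pred_censored = score < tau
    false_in_range = float(np.mean(~pred_censored[is_censored]))
    false_censored = float(np.mean(pred_censored[~is_censored]))
    return false_in_range, false_censored

# Hypothetical gate scores and ground-truth censoring labels.
score = np.array([0.05, 0.20, 0.40, 0.30, 0.60, 0.90])
is_censored = np.array([True, True, True, False, False, False])

taus = [0.1, 0.35, 0.5]
rates = {tau: error_rates(score, is_censored, tau) for tau in taus}

# safety_first: minimize false in-range predictions on censored samples.
safety_first = min(taus, key=lambda t: (rates[t][0], rates[t][1]))
# accuracy_first: minimize false censored predictions on in-range samples.
accuracy_first = min(taus, key=lambda t: (rates[t][1], rates[t][0]))
```

In this toy data the two goals select different thresholds, which is why a published table should name the operating point it reports.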
Performance Microbenchmarks
Use microbenchmarks when changing SCM kernels, vectorized mask logic, or backend parity behavior:
python perf/run_benchmarks.py
python perf/parity_runner.py
Track:
- wall-clock throughput
- bottom-mask throughput
- backend parity
- coverage and denominator statistics
- regressions across NumPy, Torch, and JAX where relevant
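The tracked quantities can be sketched with a self-contained timing loop. This is only an illustration of the measurement pattern, not the contents of `perf/run_benchmarks.py` or `perf/parity_runner.py`. The kernel `masked_mul` and the loop-based `reference_mul` are hypothetical stand-ins.

```python
import time
import numpy as np

def masked_mul(a, b, mask_a, mask_b):
    """Hypothetical vectorized kernel: product with mask union."""
    out_mask = mask_a | mask_b
    return np.where(out_mask, 0.0, a * b), out_mask

def reference_mul(a, b, mask_a, mask_b):
    """Slow Python-loop reference used only for the parity check."""
    out = np.zeros_like(a)
    out_mask = np.zeros_like(mask_a)
    for i in range(a.size):
        out_mask[i] = mask_a[i] or mask_b[i]
        if not out_mask[i]:
            out[i] = a[i] * b[i]
    return out, out_mask

rng = np.random.default_rng(0)
n = 10_000
a, b = rng.normal(size=n), rng.normal(size=n)
mask_a, mask_b = rng.random(n) < 0.1, rng.random(n) < 0.1

# Wall-clock throughput in elements per second over repeated calls.
repeats = 20
start = time.perf_counter()
for _ in range(repeats):
    fast, fast_mask = masked_mul(a, b, mask_a, mask_b)
elapsed = time.perf_counter() - start
throughput = repeats * n / elapsed

# Parity against the reference, plus the masked fraction as a statistic.
ref, ref_mask = reference_mul(a, b, mask_a, mask_b)
parity_ok = np.array_equal(fast_mask, ref_mask) and np.allclose(fast, ref)
bottom_fraction = float(fast_mask.mean())
```

Recording throughput, parity, and the masked fraction together makes it easier to spot a kernel change that speeds up by silently altering mask behavior.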
Reporting Checklist
- State whether a result is a scientific benchmark or a microbenchmark.
- Keep `manifest.json` and `provenance.json` with every benchmark result.
- Report the exact commit, package versions, seeds, device, and mode.
- Use regenerated reports from stored artifacts for comparisons.
- Do not compare current claim results against legacy artifacts that fail current schema validation.