Performance & Benchmarks

ZeroProofML performance work is split into two different concerns:

Concern | Command family | Output
Scientific claim benchmarks | python -m zeroproofml.benchmarks ... | Auditable run directories under results/benchmarks/<domain>/...
Performance microbenchmarks | python perf/run_benchmarks.py and python perf/parity_runner.py | Throughput/parity artifacts under benchmark_results/

Do not mix these two when reporting results. The scientific harness supports claims and reproducibility. The microbenchmark suite supports implementation tuning.

Performance Principles

SCM performance comes from keeping singularity information explicit and vectorized:

  • Use payload-plus-mask arrays instead of Python-side guard branches.
  • Prefer backend vectorized helpers such as scm_*_numpy, scm_*_torch, and scm_*_jax.
  • Track coverage and denominator statistics as metrics, not control-flow exceptions.
  • Train rational heads in projective form and decode once at the boundary.
  • Keep IEEE conversion at system boundaries.
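
The last two principles can be illustrated without the library. The sketch below is not ZeroProofML's implementation: `ProjectivePair`, `project_div`, and `decode_at_boundary` are hypothetical names, and `tau` is an assumed decode threshold. It only shows the idea of carrying a numerator/denominator pair through the arithmetic and converting to IEEE values once, at the end.

```python
import math
from typing import List, NamedTuple, Tuple


class ProjectivePair(NamedTuple):
    """Hypothetical (numerator, denominator) pair; not a ZeroProofML type."""
    num: float
    den: float


def project_div(a: ProjectivePair, b: ProjectivePair) -> ProjectivePair:
    # Cross-multiply instead of dividing, so no singularity is hit here.
    return ProjectivePair(a.num * b.den, a.den * b.num)


def decode_at_boundary(
    pairs: List[ProjectivePair], tau: float = 1e-9
) -> Tuple[List[float], List[bool], float]:
    """Decode once: return IEEE payloads, a bottom mask, and coverage."""
    payload, bottom = [], []
    for p in pairs:
        is_bottom = abs(p.den) <= tau  # assumed threshold rule
        bottom.append(is_bottom)
        payload.append(math.nan if is_bottom else p.num / p.den)
    # Coverage is reported as a metric rather than raised as an exception.
    coverage = 1.0 - sum(bottom) / len(bottom) if bottom else 1.0
    return payload, bottom, coverage


x = ProjectivePair(3.0, 1.0)
pairs = [
    project_div(x, ProjectivePair(2.0, 1.0)),  # 3 / 2
    project_div(x, ProjectivePair(0.0, 1.0)),  # 3 / 0, stays representable
]
payload, bottom, coverage = decode_at_boundary(pairs)
print(payload, bottom, coverage)
```

The division by zero never raises: it shows up as a `True` entry in the mask and as a lower coverage value.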

Vectorized SCM Ops

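import numpy as np
from zeroproofml.scm.ops import scm_add_numpy, scm_div_numpy

# Payloads and their masks travel together; the 0.0 entry is masked.
values = np.array([1.0, 0.0, -2.0])
mask = np.array([False, True, False])

# Each call passes both payloads, then both masks, and returns (payload, mask).
total, total_mask = scm_add_numpy(values, values, mask, mask)

# Divide by an all-ones payload that carries an all-False mask.
quot, quot_mask = scm_div_numpy(
    total,
    np.ones_like(total),
    total_mask,
    np.zeros_like(total_mask),
)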

Boolean masks stay aligned with numeric payloads, which keeps downstream coverage and strict decode logic simple.
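
As a rough illustration of why alignment helps, coverage and a strict decode each reduce to one vectorized expression. This sketch uses plain NumPy only; `coverage` and `strict_decode` are hypothetical helpers, not ZeroProofML functions, and it assumes that `True` in the mask marks a bottom entry.

```python
import numpy as np


def coverage(mask: np.ndarray) -> float:
    """Fraction of entries that are not bottom (assumes True means bottom)."""
    return float(1.0 - mask.mean()) if mask.size else 1.0


def strict_decode(payload: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Convert to IEEE at the boundary: masked entries become NaN."""
    return np.where(mask, np.nan, payload)


payload = np.array([2.0, 0.0, -4.0])
mask = np.array([False, True, False])

cov = coverage(mask)                    # two of three entries are usable
decoded = strict_decode(payload, mask)  # masked slot is replaced by NaN
print(cov, decoded)
```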

Dtype And Threshold Tuning

When denominators are close to zero, numeric precision affects mask stability. Recommended defaults:

  • Use float64 for near-singular rational heads unless you have measured that lower precision is stable.
  • Tune SCMRationalLayer.singular_epsilon for forward denominator detection.
  • Tune tau_train for training margins and tau_infer for deployment separately.
  • Re-run threshold sweeps after changing preprocessing, normalization, architecture, or dtype.
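
The dtype sensitivity can be reproduced with the standard library alone. The sketch below is an illustration, not the library's sweep tooling: `to_float32` and `bottom_mask` are hypothetical helpers, the epsilon grid is arbitrary, and the detection rule `abs(d) <= epsilon` is an assumption. Denominators of the form (1 + delta) - 1 survive in float64 but collapse to exactly zero in float32 once delta falls below float32 resolution.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def bottom_mask(denominators, epsilon):
    # Assumed rule: flag a denominator whose magnitude is within epsilon of zero.
    return [abs(d) <= epsilon for d in denominators]


# Denominators of the form (1 + delta) - 1, which should equal delta.
deltas = [1e-3, 1e-6, 1e-8, 1e-10]
den64 = [(1.0 + d) - 1.0 for d in deltas]
den32 = [to_float32(to_float32(1.0 + d) - 1.0) for d in deltas]

# Sweep an illustrative epsilon grid and record coverage per dtype.
results = {}
for epsilon in (1e-12, 1e-9, 1e-7):
    cov64 = 1.0 - sum(bottom_mask(den64, epsilon)) / len(deltas)
    cov32 = 1.0 - sum(bottom_mask(den32, epsilon)) / len(deltas)
    results[epsilon] = (cov64, cov32)
    print(f"epsilon={epsilon:g} coverage64={cov64:.2f} coverage32={cov32:.2f}")
```

With the smallest epsilon, float64 keeps every sample while float32 already masks the two smallest denominators, which is why a threshold tuned under one dtype should be re-swept under another.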

Scientific Benchmarks

Run smoke-mode checks:

python -m zeroproofml.benchmarks dose --mode smoke --device cpu
python -m zeroproofml.benchmarks rf --mode smoke --device cpu
python -m zeroproofml.benchmarks ik --mode smoke --device cpu

Resume paper-mode runs:

python -m zeroproofml.benchmarks dose --mode paper --device cpu \
  --out-root results/benchmarks/dose/<run_dir> --resume
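
When these commands are launched from a script, building the argument list in one place keeps smoke and paper invocations consistent. This is a hypothetical wrapper (`benchmark_argv` is not part of ZeroProofML); it uses only the flags shown above and leaves the run directory as a placeholder.

```python
import shlex
import sys
from typing import List, Optional

DOMAINS = ("dose", "rf", "ik")


def benchmark_argv(
    domain: str,
    mode: str = "smoke",
    device: str = "cpu",
    out_root: Optional[str] = None,
    resume: bool = False,
) -> List[str]:
    """Build the argv for one benchmark invocation (hypothetical helper)."""
    if domain not in DOMAINS:
        raise ValueError(f"unknown domain: {domain!r}")
    argv = [
        sys.executable, "-m", "zeroproofml.benchmarks", domain,
        "--mode", mode, "--device", device,
    ]
    if out_root is not None:
        argv += ["--out-root", out_root]
    if resume:
        argv.append("--resume")
    return argv


# Print the smoke commands; pass each list to subprocess.run(..., check=True)
# to actually execute them.
smoke_commands = [benchmark_argv(domain) for domain in DOMAINS]
for argv in smoke_commands:
    print(shlex.join(argv))

# Resume a paper-mode run; "<run_dir>" stays a placeholder here.
resume_argv = benchmark_argv(
    "dose", mode="paper",
    out_root="results/benchmarks/dose/<run_dir>", resume=True,
)
```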

Useful flags:

Flag | Use
--skip-complete-seeds | Reuse an output directory and fill missing canonical seed results
--force-rerun | Rerun requested seeds inside an existing output root
--html-report | Add RUN_REPORT.html beside the default Markdown report
--baseline-run-dir | Recompute paired baseline comparisons for report regeneration

Validate and load run directories from Python:

from zeroproofml.benchmarks import (
    compare_benchmark_runs,
    load_benchmark_run,
    validate_run_dir,
)

manifest = validate_run_dir("results/benchmarks/dose/run_20260407_150800_abcd123")
run = load_benchmark_run("results/benchmarks/dose/run_20260407_150800_abcd123")
comparison = compare_benchmark_runs(run, ["results/benchmarks/dose/baseline"])

Run Directory Contract

Every current scientific benchmark run is self-contained:

Path | Purpose
manifest.json | Artifact index, config hash, fingerprints, checkpoint hashes
provenance.json | Git, system, package versions, invocation args, dataset fingerprints
resume_state.json | Resume attempt history for paper-mode runs
seed_*/per_seed_result.json | Canonical per-seed raw result
aggregated/summary.json | Cross-seed metrics
aggregated/paired_stats.json | Paired baseline deltas when present
aggregated/benchmark_metrics.jsonl | Versioned benchmark metric log
RUN_REPORT.md | Human-readable report
RUN_REPORT.html | Optional browser report
CLAIM_AUDIT.md | Claim-gate evidence
figures/*.svg | Regenerated report figures
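
A quick presence check against this layout can be written with pathlib before archiving or sharing a run. The sketch below is not a substitute for validate_run_dir: `missing_artifacts` is a hypothetical helper, and which paths count as mandatory is an assumption. Entries the table describes as optional, paper-mode only, or "when present" are left out.

```python
import tempfile
from pathlib import Path
from typing import List

# Assumption: these paths are treated as mandatory for this sketch only.
ASSUMED_REQUIRED = (
    "manifest.json",
    "provenance.json",
    "aggregated/summary.json",
    "aggregated/benchmark_metrics.jsonl",
    "RUN_REPORT.md",
    "CLAIM_AUDIT.md",
)


def missing_artifacts(run_dir) -> List[str]:
    """List assumed-required artifacts that are absent from run_dir."""
    root = Path(run_dir)
    missing = [rel for rel in ASSUMED_REQUIRED if not (root / rel).is_file()]
    # At least one per-seed result is expected under seed_*/.
    if not any(root.glob("seed_*/per_seed_result.json")):
        missing.append("seed_*/per_seed_result.json")
    return missing


# Demonstration on a throwaway directory with only two artifacts present.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "manifest.json").write_text("{}")
    (root / "seed_0").mkdir()
    (root / "seed_0" / "per_seed_result.json").write_text("{}")
    missing = missing_artifacts(root)

print(missing)
```

This only checks that files exist; schema and hash validation remain the job of the benchmark tooling.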

Domain-specific additions:

  • DOSE: operating points, Pareto fronts, confusion matrices, threshold sweeps, direction-head diagnostics.
  • RF: frequency-response traces, Bode-style SVGs, qualitative failure figure packs.
  • IK: workspace heatmaps, determinant-stratified metrics, fallback-route plots when diagnostics are present.

Regenerating Reports

python -m zeroproofml.report benchmark results/benchmarks/dose/<run_dir> --html-report
python -m zeroproofml.report bundle path/to/bundle_dir
python -m zeroproofml.report training-log runs/scm_train_metrics.jsonl

Benchmark report regeneration refreshes Markdown summaries and standard SVG figures from stored artifacts. It should not require rerunning training.

Operating Points

DOSE benchmark artifacts define named operating points:

Preset | Goal
safety_first | Reduce false in-range predictions on censored samples
accuracy_first | Reduce false censored predictions on in-range samples
direction_aware | Use strict gate plus direction head for actionable bottom outputs

Use the generated aggregated/dose_operating_points.{json,md} files for published tables instead of hand-selected notebook values.
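
For intuition only, the first two presets can be read as two different priorities when choosing a gate threshold. The sketch below is a toy model, not how the DOSE artifacts are computed: the scores, labels, threshold grid, and the rule "predict censored when score >= threshold" are all assumptions made for illustration.

```python
def error_rates(scores, censored, threshold):
    """Return (false in-range rate on censored, false censored rate on in-range)."""
    pred_censored = [s >= threshold for s in scores]  # assumed gate rule
    n_cens = sum(censored)
    n_in = len(censored) - n_cens
    false_in_range = sum(
        1 for p, c in zip(pred_censored, censored) if c and not p
    )
    false_censored = sum(
        1 for p, c in zip(pred_censored, censored) if p and not c
    )
    return (
        false_in_range / n_cens if n_cens else 0.0,
        false_censored / n_in if n_in else 0.0,
    )


# Toy gate scores and ground-truth censoring labels (made up for this sketch).
scores = [0.1, 0.3, 0.45, 0.6, 0.8, 0.9]
censored = [False, False, True, False, True, True]
thresholds = [0.2, 0.4, 0.5, 0.7]

rates = {t: error_rates(scores, censored, t) for t in thresholds}

# safety_first: minimize false in-range first, break ties on false censored.
safety_first = min(thresholds, key=lambda t: (rates[t][0], rates[t][1]))
# accuracy_first: minimize false censored first, break ties on false in-range.
accuracy_first = min(thresholds, key=lambda t: (rates[t][1], rates[t][0]))
print(safety_first, accuracy_first)
```

The two priorities pick different thresholds on the same data, which is why published tables should name the preset they were generated under.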

Performance Microbenchmarks

Use microbenchmarks when changing SCM kernels, vectorized mask logic, or backend parity behavior:

python perf/run_benchmarks.py
python perf/parity_runner.py

Track:

  • wall-clock throughput
  • bottom-mask throughput
  • backend parity
  • coverage and denominator statistics
  • regressions across NumPy, Torch, and JAX where relevant
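
The pairing of throughput and parity can be sketched in a few lines. This is not what perf/run_benchmarks.py or perf/parity_runner.py do internally; `reference_bottom_mask`, `fast_bottom_mask`, and `throughput` are hypothetical, and the epsilon rule is an assumption. The point is that a faster kernel is only worth reporting once it matches a reference implementation element for element.

```python
import time


def reference_bottom_mask(denominators, epsilon):
    # Straightforward loop used as the parity reference.
    mask = []
    for d in denominators:
        mask.append(abs(d) <= epsilon)
    return mask


def fast_bottom_mask(denominators, epsilon):
    # Candidate implementation under test.
    return [-epsilon <= d <= epsilon for d in denominators]


def throughput(fn, denominators, epsilon, repeats=5):
    """Best-of-N elements per second, measured with a monotonic clock."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(denominators, epsilon)
        best = min(best, time.perf_counter() - start)
    return len(denominators) / best if best > 0 else float("inf")


denominators = [((i % 7) - 3) * 1e-7 for i in range(10_000)]
epsilon = 1.5e-7

# Check parity before looking at any timing numbers.
parity = reference_bottom_mask(denominators, epsilon) == fast_bottom_mask(
    denominators, epsilon
)
ref_rate = throughput(reference_bottom_mask, denominators, epsilon)
fast_rate = throughput(fast_bottom_mask, denominators, epsilon)
print(f"parity={parity} ref={ref_rate:.0f}/s fast={fast_rate:.0f}/s")
```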

Reporting Checklist

  • State whether a result is a scientific benchmark or a microbenchmark.
  • Keep manifest.json and provenance.json with every benchmark result.
  • Report the exact commit, package versions, seeds, device, and mode.
  • Use regenerated reports from stored artifacts for comparisons.
  • Do not compare current claim results against legacy artifacts that fail current schema validation.