Experiments & Benchmarks

Benchmarking protocols, results from the Physics Trinity, and reproduction guidelines.

Physics Trinity Benchmarks

The v0.4 migration to Signed Common Meadows was validated on three physics domains from scm/paper_2601.tex:

1. Asymptotic Domain: Lennard-Jones Potential

Hard-wall physics with 1/r¹² repulsion

V(r) = 4ε[(σ/r)¹² - (σ/r)⁶]
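To make the steepness of the repulsive wall concrete, the potential can be evaluated directly. This is a minimal sketch in reduced units (ε = σ = 1 are assumptions chosen for illustration); the function name `lennard_jones` is hypothetical and not part of the ZeroProofML API.

```python
def lennard_jones(r, epsilon=1.0, sigma=1.0):
    """Lennard-Jones potential V(r) = 4*eps*[(sigma/r)**12 - (sigma/r)**6]."""
    if r <= 0:
        raise ValueError("r must be positive")
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


# V crosses zero at r = sigma and has its minimum (depth -epsilon)
# at r = 2**(1/6) * sigma; below sigma the 1/r**12 term dominates.
r_min = 2.0 ** (1.0 / 6.0)
for r in (0.5, 0.9, 1.0, r_min, 2.0):
    print(f"r = {r:.3f} sigma   V = {lennard_jones(r):.4f} epsilon")
```

Halving r from σ to 0.5σ takes V from 0 to roughly 1.6 × 10⁴ ε, which is why extrapolation into r < 0.9σ is a demanding test.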

Results (extrapolation to r < 0.9σ):

ZeroProofML-SCM: 3,240× lower error vs MLP baseline
Correct asymptotic behavior preserved
Coverage: 98.7%

2. Spectral Domain: RF Bandpass Filter

Resonance peaks near ω₀

H(ω) = (jωRC) / (1 - ω²LC + jωRC)
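The transfer function above describes a series RLC circuit with the output taken across R. The sketch below evaluates it around resonance, using the standard relations ω₀ = 1/√(LC) and Q = ω₀L/R. The component values and the function name `bandpass_response` are illustrative assumptions, not values from the benchmark.

```python
import math


def bandpass_response(omega, R, L, C):
    """H(w) = jwRC / (1 - w^2 LC + jwRC), returned as a complex number."""
    jwrc = 1j * omega * R * C
    return jwrc / (1.0 - omega ** 2 * L * C + jwrc)


# Illustrative component values (assumptions): 10 ohm, 1 mH, 1 nF.
R, L, C = 10.0, 1e-3, 1e-9
omega_0 = 1.0 / math.sqrt(L * C)   # resonance frequency in rad/s
q_factor = omega_0 * L / R         # quality factor of the series circuit

for scale in (0.5, 0.99, 1.0, 1.01, 2.0):
    h = bandpass_response(scale * omega_0, R, L, C)
    print(f"w = {scale:.2f} w0   |H| = {abs(h):.4f}")
print(f"Q = {q_factor:.1f}")
```

At ω₀ the real part of the denominator vanishes and |H| reaches 1; for a high-Q circuit the gain collapses within a percent or so of ω₀, which is the narrow peak a model has to retain.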

Results (peak retention):

ZeroProofML-SCM: 1.75× higher spectral yield
Accurate Q-factor preservation
Coverage: 96.2%

3. Geometric Domain: 2-Link Inverse Kinematics

Kinematic singularities at arm extensions

θ₂ = ± acos((x² + y² - L₁² - L₂²) / (2L₁L₂))
θ₁ = atan2(y, x) - atan2(L₂ sin θ₂, L₁ + L₂ cos θ₂)
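A minimal sketch of the closed-form solution for a planar 2-link arm follows, with the elbow angle taken from the acos term and the shoulder angle from atan2. The function names are hypothetical, and returning `None` for unreachable targets is only a stand-in for ⊥, not the library's actual representation.

```python
import math


def two_link_ik(x, y, l1, l2):
    """Return both (theta1, theta2) solutions, or None if (x, y) is unreachable."""
    cos_t2 = (x * x + y * y - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)
    if abs(cos_t2) > 1.0:
        return None  # target lies outside the reachable annulus
    solutions = []
    for sign in (1.0, -1.0):  # the two elbow configurations
        t2 = sign * math.acos(cos_t2)
        t1 = math.atan2(y, x) - math.atan2(l2 * math.sin(t2),
                                           l1 + l2 * math.cos(t2))
        solutions.append((t1, t2))
    return solutions


def forward(t1, t2, l1, l2):
    """Forward kinematics, used here to check the IK solutions."""
    return (l1 * math.cos(t1) + l2 * math.cos(t1 + t2),
            l1 * math.sin(t1) + l2 * math.sin(t1 + t2))


print(two_link_ik(1.0, 1.0, 1.0, 1.0))  # two distinct elbow solutions
print(two_link_ik(2.0, 0.0, 1.0, 1.0))  # full extension: solutions coincide
print(two_link_ik(3.0, 0.0, 1.0, 1.0))  # unreachable target
```

At full extension the acos argument reaches 1 and the two branches merge, so small perturbations of the target flip between "two solutions" and "no solution". That boundary is the singular region the benchmark probes.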

Results (variance at singularities):

ZeroProofML-SCM: 31.8× lower variance
Deterministic ⊥ at unreachable targets
Coverage: 94.1%

Experiment Protocol v1

Core Principles

Ceteris paribus: Same dataset, split, capacity per method
Compute budget: Measured in optimizer steps, not wall-clock time
Invalid outputs: Report both MSE on valid outputs AND success rate
Hyperparameter search: Fixed budget per method family

Required Metrics

Primary:

MSE on valid predictions
Success rate (coverage)
Near-singularity bucket MSE

Secondary:

Parameter count
Training time
Inference latency
Peak memory
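The near-singularity bucket MSE in the primary metrics can be computed by grouping test points by their distance to the nearest singularity. The sketch below is one way to do that; the bucket edges and the function name `bucket_mse` are illustrative assumptions, not the protocol's official values.

```python
def bucket_mse(distances, preds, targets, edges=(0.0, 0.01, 0.1, 1.0)):
    """MSE per distance-to-singularity bucket [edges[i], edges[i+1]).

    Points whose distance falls outside every bucket are ignored.
    Empty buckets are reported as None rather than 0.
    """
    n_buckets = len(edges) - 1
    sums = [0.0] * n_buckets
    counts = [0] * n_buckets
    for d, p, t in zip(distances, preds, targets):
        for i in range(n_buckets):
            if edges[i] <= d < edges[i + 1]:
                sums[i] += (p - t) ** 2
                counts[i] += 1
                break
    return {
        (edges[i], edges[i + 1]): (sums[i] / counts[i] if counts[i] else None)
        for i in range(n_buckets)
    }


# Toy inputs, made up purely to exercise the function.
report = bucket_mse(
    distances=[0.005, 0.005, 0.05, 0.5, 2.0],
    preds=[1.0, 3.0, 2.0, 0.0, 9.0],
    targets=[0.0, 0.0, 0.0, 0.0, 0.0],
)
for bucket, mse in report.items():
    print(bucket, mse)
```

Reporting the buckets separately keeps a large far-field test set from hiding poor accuracy in the few samples closest to the singularity.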

Baseline Taxonomy

Analytic References:

DLS (Damped Least Squares)
DLS-Adaptive

Learned Non-Projective:

MLP
MLP+PoleHead
Rational+ε
Smooth
LearnableEps
EpsEnsemble (key baseline to beat)

Learned SCM/Projective:

ZeroProofML-SCM-Basic
ZeroProofML-SCM-Full

Invalid Output Policy

Report both:

mse_valid_only: MSE on finite predictions only
success_rate: Fraction of valid outputs

Optionally:

mse_with_penalty: MSE over all outputs, with each invalid output assigned a fixed penalty
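The three quantities can be computed together from raw predictions. This is a minimal sketch that treats NaN and ±inf as invalid; the function name `invalid_output_report` and the penalty value in the example are assumptions for illustration.

```python
import math


def invalid_output_report(preds, targets, penalty=None):
    """Compute success_rate, mse_valid_only and (optionally) mse_with_penalty."""
    valid_sq_errors = [
        (p - t) ** 2
        for p, t in zip(preds, targets)
        if p is not None and math.isfinite(p)
    ]
    n_total = len(targets)
    n_valid = len(valid_sq_errors)
    report = {
        "success_rate": n_valid / n_total,
        # Undefined (NaN) when no prediction is valid.
        "mse_valid_only": sum(valid_sq_errors) / n_valid if n_valid else float("nan"),
    }
    if penalty is not None:
        # Each invalid output contributes the fixed penalty instead of an error.
        total = sum(valid_sq_errors) + penalty * (n_total - n_valid)
        report["mse_with_penalty"] = total / n_total
    return report


# Toy inputs, made up purely to exercise the function.
report = invalid_output_report(
    preds=[1.0, float("nan"), 3.0, float("inf")],
    targets=[1.0, 2.0, 2.0, 4.0],
    penalty=10.0,
)
print(report)
```

Reporting `mse_valid_only` alone rewards a method for refusing hard inputs, so it should always be read next to `success_rate`.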

Running Benchmarks

Built-in Benchmark Suite

# Run all benchmarks
python -m zeroproof.bench --suite all --out results/

# Specific suite
python -m zeroproof.bench --suite arithmetic --out results/

# Custom iterations
python -m zeroproof.bench --suite all --iterations 1000 --samples 5

Available Suites:

arithmetic: Core SCM operations
autodiff: Gradient computation
layers: Forward passes
overhead: SCM vs IEEE comparison
torch: PyTorch integration (if installed)
jax: JAX integration (if installed)

Overhead Analysis

Compare SCM vs baseline performance:

python -m zeroproof.overhead_cli --out runs/overhead.json

Reports:

Average step time (ms)
Slowdown factor
Memory usage

Regression Testing

Compare two benchmark runs:

python -m zeroproof.bench_compare \
  --baseline results/baseline.json \
  --candidate results/current.json \
  --max-slowdown 1.20

Exits with a non-zero status if the slowdown exceeds the threshold.
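The threshold check itself is a ratio test. The sketch below shows the idea on plain dictionaries mapping a benchmark name to a time in seconds; that mapping is a hypothetical format, not the actual schema of the JSON files written by `zeroproof.bench`, and `find_regressions` is not a library function.

```python
def find_regressions(baseline, candidate, max_slowdown=1.20):
    """Return {name: ratio} for benchmarks slower than max_slowdown x baseline."""
    regressions = {}
    for name, base_time in baseline.items():
        if name not in candidate or base_time <= 0:
            continue  # nothing to compare against
        ratio = candidate[name] / base_time
        if ratio > max_slowdown:
            regressions[name] = ratio
    return regressions


# Made-up timings, purely to exercise the function.
baseline = {"add": 1.0, "div": 2.0}
candidate = {"add": 1.1, "div": 2.6}
regressions = find_regressions(baseline, candidate)
print(regressions)
```

In a CI script, calling `sys.exit(1 if regressions else 0)` on the result would give the same pass/fail behavior as a non-zero exit status.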

Reproducing Paper Results

Setup

# Clone repository
git clone https://github.com/domezsolt/zeroproofml.git
cd zeroproofml

# Create environment
python -m venv .venv
source .venv/bin/activate

# Install with all dependencies
pip install -e ".[torch,dev]"

Run Physics Trinity

# Lennard-Jones (Asymptotic)
python benchmarks/physics_trinity/lennard_jones.py --seed 42

# RF Filter (Spectral)
python benchmarks/physics_trinity/rf_filter.py --seed 42

# Inverse Kinematics (Geometric)
python benchmarks/physics_trinity/inverse_kin.py --seed 42

Expected Outputs

Each script generates:

results/[domain]_metrics.json: Quantitative results
results/[domain]_plots.pdf: Visualization
results/[domain]_coverage.txt: Coverage analysis

Verification

python scripts/verify_paper_results.py \
  --results results/ \
  --tolerance 0.05  # 5% tolerance for numerical variance
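How the script applies the tolerance is not specified here. One plausible reading is a relative comparison against the published figure, sketched below. The function name `within_tolerance` and the measured values are assumptions; only the reference value 31.8 comes from the results above.

```python
def within_tolerance(measured, reference, tolerance=0.05):
    """True if measured deviates from reference by at most `tolerance` (relative)."""
    if reference == 0:
        # Fall back to an absolute check when the reference is zero.
        return abs(measured) <= tolerance
    return abs(measured - reference) <= tolerance * abs(reference)


# Reference: the 31.8x variance reduction reported for inverse kinematics.
# The measured values are made up to show a pass and a fail.
print(within_tolerance(30.5, 31.8))  # deviation 1.3 <= 1.59
print(within_tolerance(29.0, 31.8))  # deviation 2.8 >  1.59
```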

Custom Experiments

Experiment Template

import torch
from zeroproof.training import SCMTrainer, TrainingConfig
from zeroproof.losses import SCMTrainingLoss

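# 1. Define problem
# your_target_function is a placeholder for the function being fitted.
def generate_data(n_samples=1000):
    x = torch.linspace(-2, 2, n_samples).unsqueeze(-1)
    y = your_target_function(x)
    return x, y

# train_loader, test_x, and test_y used below are assumed to be built
# from generate_data() with a train/test split of your choice.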

# 2. Create model
from zeroproof.layers.projective_rational import (
    RRProjectiveRationalModel,
    ProjectiveRRModelConfig
)

config = ProjectiveRRModelConfig(
    input_dim=1,
    output_dim=1,
    numerator_degree=4,
    denominator_degree=3
)
model = RRProjectiveRationalModel(config)

# 3. Train
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = SCMTrainingLoss()

trainer = SCMTrainer(
    model=model,
    optimizer=optimizer,
    loss_fn=loss_fn,
    train_loader=train_loader,
    config=TrainingConfig(max_epochs=100)
)

history = trainer.fit()

# 4. Evaluate
from zeroproof.inference import strict_inference, InferenceConfig

model.eval()
with torch.no_grad():
    N, D = model(test_x)
    decoded, bottom, gap = strict_inference(
        N, D,
        config=InferenceConfig(tau_infer=1e-6)
    )

# 5. Report metrics
# Use one mask for both metrics so coverage and MSE describe the same points.
valid = ~bottom & ~gap
coverage = valid.float().mean()
mse_valid = ((decoded[valid] - test_y[valid])**2).mean()

print(f"Coverage: {coverage:.3f}")
print(f"MSE (valid): {mse_valid:.6f}")

Logging Results

import json
from pathlib import Path

results = {
    "experiment": "custom_singularity",
    "coverage": float(coverage),
    "mse_valid": float(mse_valid),
    "success_rate": float((~bottom).float().mean()),
    "gap_rate": float(gap.float().mean()),
    "config": {
        "tau_infer": 1e-6,
        "tau_train": 1e-4,
        "epochs": 100
    }
}

Path("results/custom.json").write_text(json.dumps(results, indent=2))

CI Integration

Automated Benchmarks

Add to .github/workflows/benchmark.yml:

- name: Run benchmarks
  run: |
    python -m zeroproof.bench --suite all --out results/ --iterations 300

- name: Compare to baseline
  run: |
    python -m zeroproof.bench_compare \
      --baseline benchmarks/baseline.json \
      --candidate results/bench.json \
      --max-slowdown 1.20

Update Baseline

After verification:

python scripts/update_benchmark_baseline.py --src results/

Visualization

Plot Coverage

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(history['coverage'], label='Coverage')
ax.axhline(y=0.95, color='r', linestyle='--', label='Target')
ax.set_xlabel('Epoch')
ax.set_ylabel('Coverage')
ax.legend()
plt.savefig('coverage.pdf')

Plot Predictions

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Valid predictions
valid_mask = ~bottom & ~gap
ax1.scatter(test_x[valid_mask], test_y[valid_mask], alpha=0.3, label='True')
ax1.scatter(test_x[valid_mask], decoded[valid_mask], alpha=0.3, label='Predicted')
ax1.set_title('Valid Predictions')
ax1.legend()

# Bottom distribution
ax2.scatter(test_x[bottom], test_y[bottom], c='red', label='Bottom (⊥)')
ax2.scatter(test_x[gap], test_y[gap], c='orange', label='Gap')
ax2.set_title('Singular Detections')
ax2.legend()

plt.tight_layout()
plt.savefig('predictions.pdf')

Performance Benchmarks

Throughput

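import time

model.eval()
n_iters = 1000
batch_size = batch_x.shape[0]  # batch_x: a representative input batch

# perf_counter is monotonic and has higher resolution than time.time.
# On GPU, call torch.cuda.synchronize() before reading the clock.
start = time.perf_counter()
with torch.no_grad():
    for _ in range(n_iters):
        _ = model(batch_x)
elapsed = time.perf_counter() - start

throughput = (n_iters * batch_size) / elapsed
print(f"Throughput: {throughput:.1f} samples/sec")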

Latency

import time
import numpy as np

latencies = []
for _ in range(100):
    start = time.perf_counter()
    with torch.no_grad():
        _ = model(single_x)
    latencies.append(time.perf_counter() - start)

print(f"p50: {np.percentile(latencies, 50)*1000:.2f}ms")
print(f"p95: {np.percentile(latencies, 95)*1000:.2f}ms")
print(f"p99: {np.percentile(latencies, 99)*1000:.2f}ms")

Next Steps

Development Guide - Debug and verify experiments
Performance Guide - Optimize benchmark runs
Training Guide - Configure experiments
