Experiments & Benchmarks
Benchmarking protocols, results from the Physics Trinity, and reproduction guidelines.
Physics Trinity Benchmarks
The v0.4 migration to Signed Common Meadows was validated on three physics domains from scm/paper_2601.tex:
1. Asymptotic Domain: Lennard-Jones Potential
Hard-wall physics with 1/r¹² repulsion
V(r) = 4ε[(σ/r)¹² - (σ/r)⁶]
Results (extrapolation to r < 0.9σ):
- ZeroProofML-SCM: 3,240× lower error vs MLP baseline
- Correct asymptotic behavior preserved
- Coverage: 98.7%
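For orientation, the target function itself is simple to evaluate; a minimal illustrative sketch in plain PyTorch (ε and σ set to 1 here, not the benchmark's actual parameters):
import torch

def lennard_jones(r, epsilon=1.0, sigma=1.0):
    # V(r) = 4ε[(σ/r)^12 - (σ/r)^6]; the (σ/r)^12 term dominates as r → 0
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6**2 - sr6)

r = torch.linspace(0.8, 3.0, 200)
V = lennard_jones(r)  # steep repulsive wall in the r < 0.9σ extrapolation region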
2. Spectral Domain: RF Bandpass Filter
Resonance peaks near ω₀
H(ω) = (jωRC) / (1 - ω²LC + jωRC)
Results (peak retention):
- ZeroProofML-SCM: 1.75× higher spectral yield
- Accurate Q-factor preservation
- Coverage: 96.2%
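For context, the transfer function and its resonance peak can be checked numerically; an illustrative NumPy sketch (the R, L, C values below are arbitrary placeholders, not the benchmark's):
import numpy as np

R, L, C = 50.0, 1e-6, 1e-9              # placeholder component values
omega0 = 1.0 / np.sqrt(L * C)           # resonance frequency ω₀ = 1/√(LC)
omega = np.linspace(0.5 * omega0, 1.5 * omega0, 501)
H = (1j * omega * R * C) / (1.0 - omega**2 * L * C + 1j * omega * R * C)
print(f"peak |H| = {np.abs(H).max():.3f} at ω ≈ {omega[np.abs(H).argmax()]:.3e} rad/s")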
3. Geometric Domain: 2-Link Inverse Kinematics
Kinematic singularities at full arm extension/fold
θ = atan2(y, x) ± acos((x² + y² - L₁² - L₂²) / (2L₁L₂))
Results (variance at singularities):
- ZeroProofML-SCM: 31.8× lower variance
- Deterministic ⊥ at unreachable targets
- Coverage: 94.1%
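For context, a hedged sketch of the classical closed-form 2-link IK (not the learned model) shows where the singular set lies and why unreachable targets map to ⊥:
import math

def two_link_ik(x, y, L1=1.0, L2=1.0):
    # Elbow angle from the law of cosines; |c2| = 1 exactly at full
    # extension/fold, where the Jacobian becomes singular.
    c2 = (x * x + y * y - L1 * L1 - L2 * L2) / (2 * L1 * L2)
    if abs(c2) > 1.0:
        return None  # unreachable target → deterministic bottom (⊥)
    theta2 = math.acos(c2)  # elbow-up branch; use -theta2 for elbow-down
    theta1 = math.atan2(y, x) - math.atan2(L2 * math.sin(theta2),
                                           L1 + L2 * math.cos(theta2))
    return theta1, theta2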
Experiment Protocol v1
Core Principles
- Ceteris paribus: Same dataset, split, capacity per method
- Compute budget: Measured in optimizer steps, not wall-clock
- Invalid outputs: Report both MSE on valid outputs AND success rate
- Hyperparameter search: Fixed budget per method family
Required Metrics
Primary:
- MSE on valid predictions
- Success rate (coverage)
- Near-singularity bucket MSE (see the sketch after this list)
Secondary:
- Parameter count
- Training time
- Inference latency
- Peak memory
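The near-singularity bucket MSE listed above can be computed by binning test points by their distance to the nearest singular configuration; a hypothetical sketch (the helper name, bucket edges, and dist_to_singularity are illustrative, not fixed by the protocol):
import torch

def bucketed_mse(pred, target, dist_to_singularity,
                 edges=(0.0, 0.1, 0.5, float("inf"))):
    # MSE per bucket of distance to the nearest singularity
    report = {}
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (dist_to_singularity >= lo) & (dist_to_singularity < hi)
        if mask.any():
            report[f"[{lo}, {hi})"] = float(((pred[mask] - target[mask]) ** 2).mean())
    return report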
Baseline Taxonomy
Analytic References:
- DLS (Direct Least Squares)
- DLS-Adaptive
Learned Non-Projective:
- MLP
- MLP+PoleHead
- Rational+ε
- Smooth
- LearnableEps
- EpsEnsemble (key baseline to beat)
Learned SCM/Projective:
- ZeroProofML-SCM-Basic
- ZeroProofML-SCM-Full
Invalid Output Policy
Report both:
- mse_valid_only: MSE on finite predictions only
- success_rate: Fraction of valid outputs
Optionally:
- mse_with_penalty: Treat invalid outputs as a fixed penalty
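A minimal sketch of all three quantities (the helper name and penalty value are illustrative placeholders):
import torch

def invalid_output_report(pred, target, valid_mask, penalty=1e3):
    # valid_mask: True where the model produced a finite (non-⊥) output
    mse_valid_only = float(((pred[valid_mask] - target[valid_mask]) ** 2).mean())
    success_rate = float(valid_mask.float().mean())
    # Optional: count every invalid output as a fixed penalty term
    sq_err = torch.where(valid_mask, (pred - target) ** 2,
                         torch.full_like(target, penalty))
    mse_with_penalty = float(sq_err.mean())
    return {"mse_valid_only": mse_valid_only,
            "success_rate": success_rate,
            "mse_with_penalty": mse_with_penalty}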
Running Benchmarks
Built-in Benchmark Suite
# Run all benchmarks
python -m zeroproof.bench --suite all --out results/
# Specific suite
python -m zeroproof.bench --suite arithmetic --out results/
# Custom iterations
python -m zeroproof.bench --suite all --iterations 1000 --samples 5
Available Suites:
- arithmetic: Core SCM operations
- autodiff: Gradient computation
- layers: Forward passes
- overhead: SCM vs IEEE comparison
- torch: PyTorch integration (if installed)
- jax: JAX integration (if installed)
Overhead Analysis
Compare SCM vs baseline performance:
python -m zeroproof.overhead_cli --out runs/overhead.json
Reports:
- Average step time (ms)
- Slowdown factor
- Memory usage
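The report is written as JSON; a quick way to inspect it without assuming specific key names (these depend on the CLI version):
import json
from pathlib import Path

report = json.loads(Path("runs/overhead.json").read_text())
for key, value in sorted(report.items()):   # key names vary by version
    print(f"{key}: {value}")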
Regression Testing
Compare two benchmark runs:
python -m zeroproof.bench_compare \
    --baseline results/baseline.json \
    --candidate results/current.json \
    --max-slowdown 1.20
The command exits with a non-zero status if the slowdown exceeds the threshold.
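Because failure is signalled via the exit code, the comparison drops into any script; a small illustrative Python wrapper:
import subprocess

result = subprocess.run([
    "python", "-m", "zeroproof.bench_compare",
    "--baseline", "results/baseline.json",
    "--candidate", "results/current.json",
    "--max-slowdown", "1.20",
])
if result.returncode != 0:
    raise SystemExit("Benchmark regression: slowdown exceeded 1.20x")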
Reproducing Paper Results
Setup
# Clone repository
git clone https://github.com/domezsolt/zeroproofml.git
cd zeroproofml
# Create environment
python -m venv .venv
source .venv/bin/activate
# Install with all dependencies
pip install -e ".[torch,dev]"
Run Physics Trinity
# Lennard-Jones (Asymptotic)
python benchmarks/physics_trinity/lennard_jones.py --seed 42
# RF Filter (Spectral)
python benchmarks/physics_trinity/rf_filter.py --seed 42
# Inverse Kinematics (Geometric)
python benchmarks/physics_trinity/inverse_kin.py --seed 42
Expected Outputs
Each script generates:
- results/[domain]_metrics.json: Quantitative results
- results/[domain]_plots.pdf: Visualization
- results/[domain]_coverage.txt: Coverage analysis
Verification
python scripts/verify_paper_results.py \
    --results results/ \
    --tolerance 0.05  # 5% tolerance for numerical variance
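For a quick manual spot-check of individual numbers, something like the following works; the file name and the "coverage" key are hypothetical and depend on what the scripts actually write:
import json
from pathlib import Path

TOLERANCE = 0.05                         # same 5% tolerance as above
expected = {"coverage": 0.941}           # values you want to pin
actual = json.loads(Path("results/inverse_kin_metrics.json").read_text())  # hypothetical file name
for key, ref in expected.items():
    assert abs(actual[key] - ref) <= TOLERANCE * abs(ref), f"{key} drifted beyond tolerance"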
Custom Experiments
Experiment Template
import torch
from torch.utils.data import DataLoader, TensorDataset
from zeroproof.training import SCMTrainer, TrainingConfig
from zeroproof.losses import SCMTrainingLoss

# 1. Define problem (replace your_target_function with your own)
def generate_data(n_samples=1000):
    x = torch.linspace(-2, 2, n_samples).unsqueeze(-1)
    y = your_target_function(x)
    return x, y

train_x, train_y = generate_data(1000)
test_x, test_y = generate_data(200)
# One way to build the loader expected by the trainer
train_loader = DataLoader(TensorDataset(train_x, train_y), batch_size=64, shuffle=True)

# 2. Create model
from zeroproof.layers.projective_rational import (
    RRProjectiveRationalModel,
    ProjectiveRRModelConfig,
)

config = ProjectiveRRModelConfig(
    input_dim=1,
    output_dim=1,
    numerator_degree=4,
    denominator_degree=3,
)
model = RRProjectiveRationalModel(config)

# 3. Train
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = SCMTrainingLoss()
trainer = SCMTrainer(
    model=model,
    optimizer=optimizer,
    loss_fn=loss_fn,
    train_loader=train_loader,
    config=TrainingConfig(max_epochs=100),
)
history = trainer.fit()

# 4. Evaluate: decode (N, D) pairs, flagging bottom (⊥) and gap outputs
from zeroproof.inference import strict_inference, InferenceConfig

model.eval()
with torch.no_grad():
    N, D = model(test_x)
decoded, bottom, gap = strict_inference(
    N, D,
    config=InferenceConfig(tau_infer=1e-6),
)

# 5. Report metrics
coverage = (~bottom & ~gap).float().mean()
mse_valid = ((decoded[~bottom] - test_y[~bottom]) ** 2).mean()
print(f"Coverage: {coverage:.3f}")
print(f"MSE (valid): {mse_valid:.6f}")
Logging Results
import json
from pathlib import Path

results = {
    "experiment": "custom_singularity",
    "coverage": float(coverage),
    "mse_valid": float(mse_valid),
    "success_rate": float((~bottom).float().mean()),
    "gap_rate": float(gap.float().mean()),
    "config": {
        "tau_infer": 1e-6,
        "tau_train": 1e-4,
        "epochs": 100,
    },
}

Path("results").mkdir(exist_ok=True)  # ensure the output directory exists
Path("results/custom.json").write_text(json.dumps(results, indent=2))
CI Integration
Automated Benchmarks
Add to .github/workflows/benchmark.yml:
- name: Run benchmarks
  run: |
    python -m zeroproof.bench --suite all --out results/ --iterations 300
- name: Compare to baseline
  run: |
    python -m zeroproof.bench_compare \
      --baseline benchmarks/baseline.json \
      --candidate results/bench.json \
      --max-slowdown 1.20
Update Baseline
After verification:
python scripts/update_benchmark_baseline.py --src results/
Visualization
Plot Coverage
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.plot(history['coverage'], label='Coverage')
ax.axhline(y=0.95, color='r', linestyle='--', label='Target')
ax.set_xlabel('Epoch')
ax.set_ylabel('Coverage')
ax.legend()
plt.savefig('coverage.pdf')
Plot Predictions
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
# Valid predictions
valid_mask = ~bottom & ~gap
ax1.scatter(test_x[valid_mask], test_y[valid_mask], alpha=0.3, label='True')
ax1.scatter(test_x[valid_mask], decoded[valid_mask], alpha=0.3, label='Predicted')
ax1.set_title('Valid Predictions')
ax1.legend()
# Bottom distribution
ax2.scatter(test_x[bottom], test_y[bottom], c='red', label='Bottom (⊥)')
ax2.scatter(test_x[gap], test_y[gap], c='orange', label='Gap')
ax2.set_title('Singular Detections')
ax2.legend()
plt.tight_layout()
plt.savefig('predictions.pdf')
Performance Benchmarks
Throughput
import time

model.eval()
n_iters = 1000
start = time.time()
with torch.no_grad():
    for _ in range(n_iters):
        _ = model(batch_x)  # batch_x: a representative input batch
end = time.time()

batch_size = batch_x.shape[0]
throughput = (n_iters * batch_size) / (end - start)
print(f"Throughput: {throughput:.1f} samples/sec")
Latency
import numpy as np

latencies = []
for _ in range(100):
    start = time.perf_counter()
    with torch.no_grad():
        _ = model(single_x)
    latencies.append(time.perf_counter() - start)

print(f"p50: {np.percentile(latencies, 50)*1000:.2f}ms")
print(f"p95: {np.percentile(latencies, 95)*1000:.2f}ms")
print(f"p99: {np.percentile(latencies, 99)*1000:.2f}ms")
Next Steps
- Development Guide - Debug and verify experiments
- Performance Guide - Optimize benchmark runs
- Training Guide - Configure experiments