# Performance Guide
Optimization strategies for efficient SCM training and inference.
## Core Performance Principles
ZeroProofML v0.4 achieves performance through:
- Vectorized operations with explicit masks (no branching)
- JIT-friendly code paths (TorchScript, XLA compatible)
- Gradient policies that avoid NaN propagation
- Projective mode for smooth optimization landscapes
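The first principle is easiest to see in code. The sketch below illustrates the `(payload, mask)` pattern in plain PyTorch; it is not the library's implementation, just a picture of how ⊥ can be carried as a boolean mask so the division never branches per element.

```python
import torch

# Illustration of the (payload, mask) pattern in plain PyTorch -- not the library's code.
def masked_div_sketch(x, y, x_mask, y_mask):
    bottom_mask = x_mask | y_mask | (y == 0)                  # result is ⊥ if an input is ⊥ or the divisor is 0
    safe_y = torch.where(bottom_mask, torch.ones_like(y), y)  # placeholder divisor where the result is ⊥
    payload = torch.where(bottom_mask, torch.zeros_like(x), x / safe_y)
    return payload, bottom_mask

x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([2.0, 0.0, 4.0])
no_bottom = torch.zeros(3, dtype=torch.bool)
payload, bottom = masked_div_sketch(x, y, no_bottom, no_bottom)  # payload=[0.5, 0.0, 0.75], bottom=[F, T, F]
```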
## Vectorized SCM Operations

### Use Backend-Specific Helpers
Always prefer vectorized ops over Python loops:
```python
# ❌ Slow: Python loop
results = []
for x, y in zip(xs, ys):
    results.append(scm_div(x, y))

# ✅ Fast: Vectorized
from zeroproof.scm.ops import scm_div_torch

result, bottom_mask = scm_div_torch(xs_tensor, ys_tensor, xs_mask, ys_mask)
```
### Available Backends
- NumPy: `scm_*_numpy` in `zeroproof.scm.ops`
- PyTorch: `scm_*_torch` in `zeroproof.scm.ops`
- JAX: `scm_*_jax` in `zeroproof.scm.ops`
All return (payload, mask) tuples for efficient processing.
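For example, the NumPy variant can be used as follows. This is a minimal sketch: `scm_div_numpy` is assumed from the `scm_*_numpy` naming pattern above, so check the actual exports in `zeroproof.scm.ops`.

```python
import numpy as np
from zeroproof.scm.ops import scm_div_numpy  # assumed export, following the scm_*_numpy pattern

xs = np.array([1.0, 2.0, 3.0])
ys = np.array([0.5, 0.0, 4.0])
xs_mask = np.zeros(xs.shape, dtype=bool)  # no ⊥ in the inputs
ys_mask = np.zeros(ys.shape, dtype=bool)

payload, bottom_mask = scm_div_numpy(xs, ys, xs_mask, ys_mask)
coverage = (~bottom_mask).mean()          # fraction of finite results
```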
### Coverage Tracking
Track ⊥ rate instead of branching on individual values:
```python
# Efficient coverage computation
coverage = (~bottom_mask).float().mean().item()

# Log per batch
if coverage < 0.9:
    logger.warning(f"Low coverage: {coverage:.3f}")

# Early stopping: break after coverage_patience consecutive low-coverage epochs
if coverage < threshold:
    low_coverage_epochs += 1
else:
    low_coverage_epochs = 0
if low_coverage_epochs >= coverage_patience:
    break
```
## Gradient Policies
Choose policies based on your architecture:
| Policy | Overhead | Use Case |
|---|---|---|
| CLAMP | Low | Default for most models |
| PROJECT | Medium | Projective rational heads |
| REJECT | Lowest | Coverage-based learning only |
| PASSTHROUGH | None | Debugging (unsafe) |
```python
# Per-layer configuration
from zeroproof.autodiff.policies import GradientPolicy, register_policy

register_policy(rational_head, GradientPolicy.PROJECT)
register_policy(backbone, GradientPolicy.CLAMP)
```
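For intuition, the CLAMP idea can be sketched in plain PyTorch with a backward hook that bounds the gradient arriving at a parameter. This is an illustration of the concept, not ZeroProofML's implementation.

```python
import torch

# Sketch of the CLAMP idea -- a backward hook bounds the gradient arriving at a
# parameter so a nearby pole cannot inject an enormous update.
def clamp_policy(param, bound=1e3):
    param.register_hook(lambda g: g.clamp(-bound, bound))
    return param

w = torch.tensor([1e-6, 1.0], requires_grad=True)
clamp_policy(w)
loss = (1.0 / w).sum()   # raw gradient at w[0] would be about -1e12
loss.backward()
print(w.grad)            # tensor([-1000., -1.])
```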
## Projective Mode Optimization

### When to Use
Projective mode adds overhead but improves convergence:
Use when:
- Training deep rational networks
- Poles appear frequently in training data
- Need stable gradients near singularities
Skip when:
- Simple SCM operations only
- Singularities are rare
- Inference speed is critical
### Efficient Renormalization
Detached renormalization prevents gradient leakage:
```python
from zeroproof.autodiff.projective import renormalize

# Efficient: detached norm
N, D = renormalize(N, D, gamma=1e-9)  # Auto-detects backend

# Inefficient: manual implementation
norm = torch.sqrt(N**2 + D**2)  # Gradients leak through norm
```
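The pattern behind `renormalize` can be sketched in plain PyTorch, assuming the detached-norm behavior described above: the scale factor is computed under `no_grad`, so backward treats it as a constant.

```python
import torch

# Assumed behavior of renormalize, sketched in plain PyTorch: the scale is
# computed under no_grad, so it acts as a constant during backward.
def detached_renorm(N, D, gamma=1e-9):
    with torch.no_grad():
        scale = torch.rsqrt(N**2 + D**2 + gamma)  # no gradient flows through the norm
    return N * scale, D * scale

N = torch.tensor([1e8, 2.0], requires_grad=True)
D = torch.tensor([1e8, 3.0], requires_grad=True)
N_r, D_r = detached_renorm(N, D)
(N_r.sum() + D_r.sum()).backward()  # d(N_r)/d(N) is just the detached scale
```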
## Mixed Precision Training
Enable AMP for faster training:
```python
from zeroproof.training import TrainingConfig, SCMTrainer

config = TrainingConfig(
    use_amp=True,   # Automatic Mixed Precision
    max_epochs=100,
)
trainer = SCMTrainer(model, optimizer, loss_fn, train_loader, config=config)
```
Benefits:
- ~2x faster on modern GPUs
- Lower memory usage
- SCM masks stay in full precision
Limitations:
- Requires CUDA-capable GPU
- Some ops may not support float16
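For reference, an AMP training step in plain PyTorch looks roughly like the sketch below. It reuses `model`, `optimizer`, `loss_fn`, and `train_loader` from the config example above; the trainer's internals may differ, and the exact `loss_fn` call shape is an assumption.

```python
import torch

scaler = torch.cuda.amp.GradScaler()

for batch, target in train_loader:
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():
        N, D = model(batch)              # floating-point payloads run in float16 where safe
        loss = loss_fn(N, D, target)     # assumed loss_fn call shape
    # Boolean ⊥ masks are untouched by autocast; only floating-point payloads are cast.
    scaler.scale(loss).backward()        # scale to avoid float16 gradient underflow
    scaler.step(optimizer)               # unscales, skips the step if inf/NaN gradients appear
    scaler.update()
```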
## Memory Optimization

### Gradient Accumulation
Train with larger effective batch sizes:
```python
config = TrainingConfig(
    batch_size=256,
    grad_accumulation_steps=4,  # Effective batch: 1024
    use_amp=True,
)
```
### Gradient Checkpointing
For very deep models:
```python
import torch.nn as nn
import torch.utils.checkpoint as checkpoint

class DeepRationalModel(nn.Module):
    def forward(self, x):
        # Checkpoint expensive blocks: activations are recomputed in backward to save memory
        x = checkpoint.checkpoint(self.block1, x, use_reentrant=False)
        x = checkpoint.checkpoint(self.block2, x, use_reentrant=False)
        return x
```
## Inference Optimization

### Batch Processing
Process multiple inputs efficiently:
```python
# Batch inference
model.eval()
with torch.no_grad():
    for batch in dataloader:
        N, D = model(batch)
        decoded, bottom, gap = strict_inference(N, D, config)
        # Process batch...
```
### TorchScript Compilation
JIT-compile for production:
```python
model.eval()
scripted = torch.jit.script(model)
torch.jit.save(scripted, "model_jit.pt")

# Use compiled model
loaded = torch.jit.load("model_jit.pt")
with torch.no_grad():
    output = loaded(x)
```
### Quantization (Experimental)
For CPU deployment:
```python
import torch.quantization

# Dynamic quantization
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```
Note: Verify SCM semantics are preserved after quantization.
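One way to spot-check this, assuming `strict_inference` and `config` from the batch-processing example above and any held-out dataloader (`val_loader` here is illustrative):

```python
import torch

# Spot-check: ⊥ decisions should agree between the original and quantized model,
# and payloads should match within quantization error.
with torch.no_grad():
    for batch in val_loader:
        N_ref, D_ref = model(batch)
        N_q, D_q = quantized_model(batch)
        _, bottom_ref, _ = strict_inference(N_ref, D_ref, config)
        _, bottom_q, _ = strict_inference(N_q, D_q, config)
        assert torch.equal(bottom_ref, bottom_q), "quantization changed ⊥ decisions"
```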
## Profiling

### Built-in Profiling
```python
import torch.profiler

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    record_shapes=True,
) as prof:
    model(x)
    loss.backward()

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```
### Coverage Profiling
Track where ⊥ occurs:
```python
class CoverageProfiler:
    def __init__(self):
        self.layer_coverage = {}

    def log_layer(self, name, bottom_mask):
        coverage = (~bottom_mask).float().mean().item()
        self.layer_coverage[name] = coverage

    def report(self):
        for name, cov in sorted(self.layer_coverage.items()):
            print(f"{name}: {cov:.3f}")
```
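One way to wire the profiler in is with forward hooks. The sketch below assumes the hooked layer returns a `(payload, mask)` pair, and the attribute name `rational_head` is illustrative.

```python
profiler = CoverageProfiler()

def make_hook(name):
    def hook(module, inputs, output):
        payload, bottom_mask = output      # assumes the layer returns a (payload, mask) pair
        profiler.log_layer(name, bottom_mask)
    return hook

# Attach to the layers you care about; the attribute name is illustrative.
model.rational_head.register_forward_hook(make_hook("rational_head"))

model(x)
profiler.report()
```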
## Benchmarking

### Run Benchmarks
```bash
python benchmarks/run_benchmarks.py --output results --suite all
```
### Benchmark Suite
Located in `benchmarks/`:

- `scm_ops_bench.py` - Vectorized operation throughput
- `projective_bench.py` - Projective mode overhead
- `training_bench.py` - End-to-end training speed
- `inference_bench.py` - Deployment performance
### Metrics to Track
- Throughput: samples/second
- Coverage: fraction of finite predictions
- Memory: peak GPU memory
- Latency: p50, p95, p99 inference time (see the measurement sketch below)
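A minimal way to collect latency percentiles for a CPU TorchScript model, reusing `loaded` and `x` from the TorchScript example above (wrap the timed call with `torch.cuda.synchronize()` if you measure on GPU):

```python
import time
import numpy as np
import torch

# Measure single-input latency for the compiled CPU model.
latencies_ms = []
with torch.no_grad():
    for _ in range(200):
        start = time.perf_counter()
        loaded(x)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)

p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
print(f"p50={p50:.2f} ms  p95={p95:.2f} ms  p99={p99:.2f} ms")
```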
## Best Practices
- Profile first: Measure before optimizing
- Vectorize: Use backend-specific ops
- Batch intelligently: Balance memory and throughput
- Monitor coverage: Low coverage indicates training issues
- Use AMP: Enable on compatible hardware
- Benchmark regularly: Track performance across changes
## Common Bottlenecks

### Slow Training
Symptoms: Low GPU utilization, slow epochs
Solutions:
- Increase batch size
- Enable AMP
- Use gradient accumulation
- Check dataloader `num_workers` (see the sketch after this list)
- Profile to find hotspots
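A typical dataloader configuration for the `num_workers` point above; `train_dataset` is whatever dataset you already use.

```python
from torch.utils.data import DataLoader

# If GPU utilization is low, the input pipeline is often the bottleneck.
train_loader = DataLoader(
    train_dataset,            # your existing dataset
    batch_size=256,
    shuffle=True,
    num_workers=4,            # parallel loading/preprocessing workers
    pin_memory=True,          # faster host-to-GPU copies
    persistent_workers=True,  # keep workers alive across epochs (requires num_workers > 0)
)
```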
### High Memory Usage
Symptoms: OOM errors, can't increase batch size
Solutions:
- Gradient accumulation (smaller batches)
- Gradient checkpointing
- Reduce model size
- Use AMP (float16)
- Clear cache between batches (sketched below)
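A small sketch for the memory points above: track peak GPU memory and release cached blocks between batches. Note that `empty_cache` only returns cached, unused blocks to the driver; it does not free live tensors.

```python
import torch

torch.cuda.reset_peak_memory_stats()

with torch.no_grad():
    for batch in dataloader:
        model(batch)
        torch.cuda.empty_cache()   # returns cached, unused blocks to the driver

peak_mb = torch.cuda.max_memory_allocated() / 1e6
print(f"Peak GPU memory: {peak_mb:.1f} MB")
```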
### Low Coverage
Symptoms: Many ⊥ predictions, unstable training
Solutions:
- Increase rejection loss weight
- Adjust margin loss threshold
- Add more diverse training data
- Check gradient policies
- Verify target lifting
## Hardware Recommendations

### Training
- GPU: NVIDIA with Tensor Cores (V100, A100, RTX 30/40 series)
- Memory: 16GB+ for typical models
- CPU: Multi-core for dataloader parallelism
### Inference
- GPU: Optional; CPU inference is practical for small batches
- Memory: Depends on batch size and model
- Latency: typically <10 ms with TorchScript on CPU
## Next Steps
- Development Guide - Debug and verify performance
- Experiments - Benchmark results and protocols
- API Reference - Low-level optimization APIs