# Performance Guide
Optimization strategies for efficient SCM training and inference.
## Core Performance Principles
ZeroProofML v0.4 achieves performance through:
- Vectorized operations with explicit masks (no branching)
- JIT-friendly code paths (TorchScript, XLA compatible)
- Gradient policies that avoid NaN propagation
- Projective mode for smooth optimization landscapes
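The first principle is easiest to see in code. The sketch below illustrates the `(payload, mask)` pattern in plain PyTorch; it is not the library's implementation, just a picture of how ⊥ can be carried as a boolean mask so the division never branches per element.

```python
import torch

# Illustration of the (payload, mask) pattern in plain PyTorch -- not the library's code.
def masked_div_sketch(x, y, x_mask, y_mask):
    bottom_mask = x_mask | y_mask | (y == 0)                  # result is ⊥ if an input is ⊥ or the divisor is 0
    safe_y = torch.where(bottom_mask, torch.ones_like(y), y)  # placeholder divisor where the result is ⊥
    payload = torch.where(bottom_mask, torch.zeros_like(x), x / safe_y)
    return payload, bottom_mask

x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([2.0, 0.0, 4.0])
no_bottom = torch.zeros(3, dtype=torch.bool)
payload, bottom = masked_div_sketch(x, y, no_bottom, no_bottom)  # payload=[0.5, 0.0, 0.75], bottom=[F, T, F]
```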
## Vectorized SCM Operations

### Use Backend-Specific Helpers
Always prefer vectorized ops over Python loops:
```python
# ❌ Slow: Python loop
results = []
for x, y in zip(xs, ys):
    results.append(scm_div(x, y))

# ✅ Fast: Vectorized
from zeroproof.scm.ops import scm_div_torch

result, bottom_mask = scm_div_torch(xs_tensor, ys_tensor, xs_mask, ys_mask)
```
### Available Backends
- NumPy: `scm_*_numpy` in `zeroproof.scm.ops`
- PyTorch: `scm_*_torch` in `zeroproof.scm.ops`
- JAX: `scm_*_jax` in `zeroproof.scm.ops`
All return (payload, mask) tuples for efficient processing.
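For example, the NumPy variant can be used as follows. This is a minimal sketch: `scm_div_numpy` is assumed from the `scm_*_numpy` naming pattern above, so check the actual exports in `zeroproof.scm.ops`.

```python
import numpy as np
from zeroproof.scm.ops import scm_div_numpy  # assumed export, following the scm_*_numpy pattern

xs = np.array([1.0, 2.0, 3.0])
ys = np.array([0.5, 0.0, 4.0])
xs_mask = np.zeros(xs.shape, dtype=bool)  # no ⊥ in the inputs
ys_mask = np.zeros(ys.shape, dtype=bool)

payload, bottom_mask = scm_div_numpy(xs, ys, xs_mask, ys_mask)
coverage = (~bottom_mask).mean()          # fraction of finite results
```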
### Coverage Tracking
Track ⊥ rate instead of branching on individual values:
```python
# Efficient coverage computation
coverage = (~bottom_mask).float().mean().item()

# Log per batch
if coverage < 0.9:
    logger.warning(f"Low coverage: {coverage:.3f}")

# Early stopping: break after coverage_patience consecutive low-coverage epochs
if coverage < threshold:
    low_coverage_epochs += 1
else:
    low_coverage_epochs = 0
if low_coverage_epochs >= coverage_patience:
    break
```
## Gradient Policies
Choose policies based on your architecture:
| Policy | Overhead | Use Case |
|---|---|---|
| CLAMP | Low | Default for most models |
| PROJECT | Medium | Projective rational heads |
| REJECT | Lowest | Coverage-based learning only |
| PASSTHROUGH | None | Debugging (unsafe) |
```python
# Per-layer configuration
from zeroproof.autodiff.policies import GradientPolicy, register_policy

register_policy(rational_head, GradientPolicy.PROJECT)
register_policy(backbone, GradientPolicy.CLAMP)
```
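For intuition, the CLAMP idea can be sketched in plain PyTorch with a backward hook that bounds the gradient arriving at a parameter. This is an illustration of the concept, not ZeroProofML's implementation.

```python
import torch

# Sketch of the CLAMP idea -- a backward hook bounds the gradient arriving at a
# parameter so a nearby pole cannot inject an enormous update.
def clamp_policy(param, bound=1e3):
    param.register_hook(lambda g: g.clamp(-bound, bound))
    return param

w = torch.tensor([1e-6, 1.0], requires_grad=True)
clamp_policy(w)
loss = (1.0 / w).sum()   # raw gradient at w[0] would be about -1e12
loss.backward()
print(w.grad)            # tensor([-1000., -1.])
```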
## Projective Mode Optimization

### When to Use
Projective mode adds overhead but improves convergence:
Use when:
- Training deep rational networks
- Poles appear frequently in training data
- Need stable gradients near singularities
Skip when:
- Simple SCM operations only
- Singularities are rare
- Inference speed is critical
### Efficient Renormalization
Detached renormalization prevents gradient leakage:
```python
from zeroproof.autodiff.projective import renormalize

# Efficient: detached norm
N, D = renormalize(N, D, gamma=1e-9)  # Auto-detects backend

# Inefficient: manual implementation
norm = torch.sqrt(N**2 + D**2)  # Gradients leak through norm
```
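The pattern behind `renormalize` can be sketched in plain PyTorch, assuming the detached-norm behavior described above: the scale factor is computed under `no_grad`, so backward treats it as a constant.

```python
import torch

# Assumed behavior of renormalize, sketched in plain PyTorch: the scale is
# computed under no_grad, so it acts as a constant during backward.
def detached_renorm(N, D, gamma=1e-9):
    with torch.no_grad():
        scale = torch.rsqrt(N**2 + D**2 + gamma)  # no gradient flows through the norm
    return N * scale, D * scale

N = torch.tensor([1e8, 2.0], requires_grad=True)
D = torch.tensor([1e8, 3.0], requires_grad=True)
N_r, D_r = detached_renorm(N, D)
(N_r.sum() + D_r.sum()).backward()  # d(N_r)/d(N) is just the detached scale
```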
## Mixed Precision Training
Enable AMP for faster training:
```python
from zeroproof.training import TrainingConfig, SCMTrainer

config = TrainingConfig(
    use_amp=True,   # Automatic Mixed Precision
    max_epochs=100,
)
trainer = SCMTrainer(model, optimizer, loss_fn, train_loader, config=config)
```
Benefits:
- ~2x faster on modern GPUs
- Lower memory usage
- SCM masks stay in full precision
Limitations:
- Requires CUDA-capable GPU
- Some ops may not support float16
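For reference, an AMP training step in plain PyTorch looks roughly like the sketch below. It reuses `model`, `optimizer`, `loss_fn`, and `train_loader` from the config example above; the trainer's internals may differ, and the exact `loss_fn` call shape is an assumption.

```python
import torch

scaler = torch.cuda.amp.GradScaler()

for batch, target in train_loader:
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():
        N, D = model(batch)              # floating-point payloads run in float16 where safe
        loss = loss_fn(N, D, target)     # assumed loss_fn call shape
    # Boolean ⊥ masks are untouched by autocast; only floating-point payloads are cast.
    scaler.scale(loss).backward()        # scale to avoid float16 gradient underflow
    scaler.step(optimizer)               # unscales, skips the step if inf/NaN gradients appear
    scaler.update()
```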
## Memory Optimization

### Gradient Accumulation
Train with larger effective batch sizes:
```python
config = TrainingConfig(
    batch_size=256,
    grad_accumulation_steps=4,  # Effective batch: 1024
    use_amp=True,
)
```
### Gradient Checkpointing
For very deep models:
```python
import torch.nn as nn
import torch.utils.checkpoint as checkpoint

class DeepRationalModel(nn.Module):
    def forward(self, x):
        # Checkpoint expensive blocks: activations are recomputed in backward to save memory
        x = checkpoint.checkpoint(self.block1, x, use_reentrant=False)
        x = checkpoint.checkpoint(self.block2, x, use_reentrant=False)
        return x
```
## Inference Optimization

### Batch Processing
Process multiple inputs efficiently:
```python
# Batch inference
model.eval()
with torch.no_grad():
    for batch in dataloader:
        N, D = model(batch)
        decoded, bottom, gap = strict_inference(N, D, config)
        # Process batch...
```
### TorchScript Compilation
JIT-compile for production:
```python
model.eval()
scripted = torch.jit.script(model)
torch.jit.save(scripted, "model_jit.pt")

# Use compiled model
loaded = torch.jit.load("model_jit.pt")
with torch.no_grad():
    output = loaded(x)
```
### Quantization (Experimental)
For CPU deployment:
```python
import torch.quantization

# Dynamic quantization
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```
Note: Verify SCM semantics are preserved after quantization.
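One way to spot-check this, assuming `strict_inference` and `config` from the batch-processing example above and any held-out dataloader (`val_loader` here is illustrative):

```python
import torch

# Spot-check: ⊥ decisions should agree between the original and quantized model,
# and payloads should match within quantization error.
with torch.no_grad():
    for batch in val_loader:
        N_ref, D_ref = model(batch)
        N_q, D_q = quantized_model(batch)
        _, bottom_ref, _ = strict_inference(N_ref, D_ref, config)
        _, bottom_q, _ = strict_inference(N_q, D_q, config)
        assert torch.equal(bottom_ref, bottom_q), "quantization changed ⊥ decisions"
```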
## Profiling

### Built-in Profiling
```python
import torch.profiler

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    record_shapes=True,
) as prof:
    model(x)
    loss.backward()

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```
### Coverage Profiling
Track where ⊥ occurs:
```python
class CoverageProfiler:
    def __init__(self):
        self.layer_coverage = {}

    def log_layer(self, name, bottom_mask):
        coverage = (~bottom_mask).float().mean().item()
        self.layer_coverage[name] = coverage

    def report(self):
        for name, cov in sorted(self.layer_coverage.items()):
            print(f"{name}: {cov:.3f}")
```
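One way to wire the profiler in is with forward hooks. The sketch below assumes the hooked layer returns a `(payload, mask)` pair, and the attribute name `rational_head` is illustrative.

```python
profiler = CoverageProfiler()

def make_hook(name):
    def hook(module, inputs, output):
        payload, bottom_mask = output      # assumes the layer returns a (payload, mask) pair
        profiler.log_layer(name, bottom_mask)
    return hook

# Attach to the layers you care about; the attribute name is illustrative.
model.rational_head.register_forward_hook(make_hook("rational_head"))

model(x)
profiler.report()
```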
## Benchmarking

### Run Benchmarks
```bash
python benchmarks/run_benchmarks.py --output results --suite all
```
### Benchmark Suite
Located in `benchmarks/`:

- `scm_ops_bench.py` - Vectorized operation throughput
- `projective_bench.py` - Projective mode overhead
- `training_bench.py` - End-to-end training speed
- `inference_bench.py` - Deployment performance
### Metrics to Track
- Throughput: samples/second
- Coverage: fraction of finite predictions
- Memory: peak GPU memory
- Latency: p50, p95, p99 inference time (see the measurement sketch below)
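A minimal way to collect latency percentiles for a CPU TorchScript model, reusing `loaded` and `x` from the TorchScript example above (wrap the timed call with `torch.cuda.synchronize()` if you measure on GPU):

```python
import time
import numpy as np
import torch

# Measure single-input latency for the compiled CPU model.
latencies_ms = []
with torch.no_grad():
    for _ in range(200):
        start = time.perf_counter()
        loaded(x)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)

p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
print(f"p50={p50:.2f} ms  p95={p95:.2f} ms  p99={p99:.2f} ms")
```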
## Best Practices
- Profile first: Measure before optimizing
- Vectorize: Use backend-specific ops
- Batch intelligently: Balance memory and throughput
- Monitor coverage: Low coverage indicates training issues
- Use AMP: Enable on compatible hardware
- Benchmark regularly: Track performance across changes
## Common Bottlenecks

### Slow Training
Symptoms: Low GPU utilization, slow epochs
Solutions:
- Increase batch size
- Enable AMP
- Use gradient accumulation
- Check dataloader `num_workers` (see the sketch after this list)
- Profile to find hotspots
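A typical dataloader configuration for the `num_workers` point above; `train_dataset` is whatever dataset you already use.

```python
from torch.utils.data import DataLoader

# If GPU utilization is low, the input pipeline is often the bottleneck.
train_loader = DataLoader(
    train_dataset,            # your existing dataset
    batch_size=256,
    shuffle=True,
    num_workers=4,            # parallel loading/preprocessing workers
    pin_memory=True,          # faster host-to-GPU copies
    persistent_workers=True,  # keep workers alive across epochs (requires num_workers > 0)
)
```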
### High Memory Usage
Symptoms: OOM errors, can't increase batch size
Solutions:
- Gradient accumulation (smaller batches)
- Gradient checkpointing
- Reduce model size
- Use AMP (float16)
- Clear cache between batches (sketched below)
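A small sketch for the memory points above: track peak GPU memory and release cached blocks between batches. Note that `empty_cache` only returns cached, unused blocks to the driver; it does not free live tensors.

```python
import torch

torch.cuda.reset_peak_memory_stats()

with torch.no_grad():
    for batch in dataloader:
        model(batch)
        torch.cuda.empty_cache()   # returns cached, unused blocks to the driver

peak_mb = torch.cuda.max_memory_allocated() / 1e6
print(f"Peak GPU memory: {peak_mb:.1f} MB")
```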
### Low Coverage
Symptoms: Many ⊥ predictions, unstable training
Solutions:
- Increase rejection loss weight
- Adjust margin loss threshold
- Add more diverse training data
- Check gradient policies
- Verify target lifting
## Hardware Recommendations

### Training
- GPU: NVIDIA with Tensor Cores (V100, A100, RTX 30/40 series)
- Memory: 16GB+ for typical models
- CPU: Multi-core for dataloader parallelism
### Inference
- GPU: Optional; CPU inference is practical for small batches
- Memory: Depends on batch size and model
- Latency: typically <10 ms with TorchScript on CPU
## Next Steps
- Development Guide - Debug and verify performance
- Experiments - Benchmark results and protocols
- API Reference - Low-level optimization APIs