Performance Guide

Optimization strategies for efficient SCM training and inference.

Core Performance Principles

ZeroProofML v0.4 achieves performance through:

  1. Vectorized operations with explicit masks (no branching; see the sketch after this list)
  2. JIT-friendly code paths (TorchScript, XLA compatible)
  3. Gradient policies that avoid NaN propagation
  4. Projective mode for smooth optimization landscapes
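
For intuition, here is a minimal framework-level sketch of principle 1, independent of ZeroProofML's helpers (the tensors are hypothetical): invalid lanes are recorded in a mask instead of being handled with per-element branches.

import torch

x = torch.tensor([1.0, 2.0, 3.0])
d = torch.tensor([2.0, 0.0, 4.0])

bottom_mask = d == 0                                      # lanes that would divide by zero
safe_d = torch.where(bottom_mask, torch.ones_like(d), d)  # placeholder divisor on masked lanes
payload = x / safe_d                                      # computed everywhere; masked lanes carry junk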

Vectorized SCM Operations

Use Backend-Specific Helpers

Always prefer vectorized ops over Python loops:

# ❌ Slow: Python loop
results = []
for x, y in zip(xs, ys):
    results.append(scm_div(x, y))

# ✅ Fast: Vectorized
from zeroproof.scm.ops import scm_div_torch
result, bottom_mask = scm_div_torch(xs_tensor, ys_tensor, xs_mask, ys_mask)

Available Backends

  • NumPy: scm_*_numpy in zeroproof.scm.ops
  • PyTorch: scm_*_torch in zeroproof.scm.ops
  • JAX: scm_*_jax in zeroproof.scm.ops

All return (payload, mask) tuples for efficient processing.
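
As a sketch, the NumPy variant can be consumed the same way as the PyTorch example above (assuming scm_div_numpy mirrors that signature):

import numpy as np
from zeroproof.scm.ops import scm_div_numpy

xs = np.array([1.0, 2.0, 3.0])
ys = np.array([2.0, 0.0, 4.0])           # zero denominator yields bottom
xs_mask = np.zeros_like(xs, dtype=bool)  # no bottom inputs yet
ys_mask = np.zeros_like(ys, dtype=bool)

payload, bottom_mask = scm_div_numpy(xs, ys, xs_mask, ys_mask)
finite_results = payload[~bottom_mask]   # keep only non-bottom lanes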

Coverage Tracking

Track ⊥ rate instead of branching on individual values:

# Efficient coverage computation
coverage = (~bottom_mask).float().mean().item()

# Log per batch
if coverage < 0.9:
    logger.warning(f"Low coverage: {coverage:.3f}")

# Early stopping: break after coverage stays below threshold
# for coverage_patience consecutive epochs
# (epochs_below is initialized to 0 before the training loop)
epochs_below = epochs_below + 1 if coverage < threshold else 0
if epochs_below >= coverage_patience:
    break

Gradient Policies

Choose policies based on your architecture:

Policy       Overhead  Use Case
CLAMP        Low       Default for most models
PROJECT      Medium    Projective rational heads
REJECT       Lowest    Coverage-based learning only
PASSTHROUGH  None      Debugging (unsafe)

# Per-layer configuration
from zeroproof.autodiff.policies import GradientPolicy, register_policy

register_policy(rational_head, GradientPolicy.PROJECT)
register_policy(backbone, GradientPolicy.CLAMP)

Projective Mode Optimization

When to Use

Projective mode adds overhead but improves convergence:

Use when:

  • Training deep rational networks
  • Poles appear frequently in training data
  • Need stable gradients near singularities

Skip when:

  • Simple SCM operations only
  • Singularities are rare
  • Inference speed is critical

Efficient Renormalization

Detached renormalization prevents gradient leakage:

from zeroproof.autodiff.projective import renormalize

# Efficient: detached norm
N, D = renormalize(N, D, gamma=1e-9)  # Auto-detects backend

# Inefficient: manual implementation
norm = torch.sqrt(N**2 + D**2)  # Gradients leak through norm
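
For intuition, a manual sketch of what detached renormalization amounts to (an assumption about renormalize's internals; the actual implementation may differ):

# Rescale the pair by a stop-gradient norm; .detach() blocks gradients
# from flowing through the normalization factor
norm = torch.sqrt(N**2 + D**2 + 1e-9).detach()
N, D = N / norm, D / norm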

Mixed Precision Training

Enable AMP for faster training:

from zeroproof.training import TrainingConfig, SCMTrainer

config = TrainingConfig(
    use_amp=True,  # Automatic Mixed Precision
    max_epochs=100
)

trainer = SCMTrainer(model, optimizer, loss_fn, train_loader, config=config)

Benefits:

  • ~2x faster on modern GPUs
  • Lower memory usage
  • SCM masks stay in full precision

Limitations:

  • Requires CUDA-capable GPU
  • Some ops may not support float16
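
For reference, a minimal sketch of the standard PyTorch AMP loop that use_amp=True presumably wraps (train_loader and loss_fn follow the examples above). Boolean SCM masks are untouched because autocast only affects floating-point ops:

scaler = torch.cuda.amp.GradScaler()

for batch, target in train_loader:
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda"):
        output = model(batch)            # runs in float16/bfloat16 where safe
        loss = loss_fn(output, target)
    scaler.scale(loss).backward()        # scale loss to avoid float16 underflow
    scaler.step(optimizer)
    scaler.update()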

Memory Optimization

Gradient Accumulation

Train with larger effective batch sizes:

config = TrainingConfig(
    batch_size=256,
    grad_accumulation_steps=4,  # Effective batch: 1024
    use_amp=True
)
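
Under the hood, gradient accumulation amounts to the following standard loop (a sketch of what grad_accumulation_steps presumably does, not SCMTrainer's exact code):

accum_steps = 4
optimizer.zero_grad()
for i, (batch, target) in enumerate(train_loader):
    loss = loss_fn(model(batch), target) / accum_steps  # scale so gradients average
    loss.backward()                                     # gradients accumulate in .grad
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()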

Checkpoint Gradients

For very deep models:

import torch.nn as nn
import torch.utils.checkpoint as checkpoint

class DeepRationalModel(nn.Module):
    def forward(self, x):
        # Checkpoint expensive blocks: activations are recomputed during
        # backward instead of stored, trading compute for memory
        x = checkpoint.checkpoint(self.block1, x, use_reentrant=False)
        x = checkpoint.checkpoint(self.block2, x, use_reentrant=False)
        return x

Inference Optimization

Batch Processing

Process multiple inputs efficiently:

# Batch inference
model.eval()
with torch.no_grad():
    for batch in dataloader:
        N, D = model(batch)
        decoded, bottom, gap = strict_inference(N, D, config)
        # Process batch

TorchScript Compilation

JIT-compile for production:

model.eval()
scripted = torch.jit.script(model)
torch.jit.save(scripted, "model_jit.pt")

# Use compiled model
loaded = torch.jit.load("model_jit.pt")
with torch.no_grad():
    output = loaded(x)
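
If scripting fails on unsupported constructs, tracing is a fallback (example_input is a hypothetical representative batch). Note that tracing bakes in the control-flow path taken for that input, so it is only safe for branch-free, mask-based models:

traced = torch.jit.trace(model, example_input)
torch.jit.save(traced, "model_traced.pt")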

Quantization (Experimental)

For CPU deployment:

import torch.quantization

# Dynamic quantization
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

Note: Verify SCM semantics are preserved after quantization.
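
One way to check this (a sketch reusing strict_inference from the inference example above; x and config are placeholders): bottom masks should agree exactly, and finite payloads should stay close.

with torch.no_grad():
    N_f, D_f = model(x)
    N_q, D_q = quantized_model(x)

decoded_f, bottom_f, _ = strict_inference(N_f, D_f, config)
decoded_q, bottom_q, _ = strict_inference(N_q, D_q, config)

assert torch.equal(bottom_f, bottom_q), "quantization changed bottom predictions"
max_err = (decoded_f[~bottom_f] - decoded_q[~bottom_q]).abs().max().item()
print(f"Max finite-payload error: {max_err:.2e}")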

Profiling

Built-in Profiling

import torch.profiler

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    record_shapes=True
) as prof:
    model(x)
    loss.backward()

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

Coverage Profiling

Track where ⊥ occurs:

class CoverageProfiler:
    def __init__(self):
        self.layer_coverage = {}

    def log_layer(self, name, bottom_mask):
        coverage = (~bottom_mask).float().mean().item()
        self.layer_coverage[name] = coverage

    def report(self):
        for name, cov in sorted(self.layer_coverage.items()):
            print(f"{name}: {cov:.3f}")

Benchmarking

Run Benchmarks

python benchmarks/run_benchmarks.py --output results --suite all

Benchmark Suite

Located in benchmarks/:

  • scm_ops_bench.py - Vectorized operation throughput
  • projective_bench.py - Projective mode overhead
  • training_bench.py - End-to-end training speed
  • inference_bench.py - Deployment performance

Metrics to Track

  1. Throughput: samples/second
  2. Coverage: fraction of finite predictions
  3. Memory: peak GPU memory
  4. Latency: p50, p95, p99 inference time (see the measurement sketch after this list)
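
For latency percentiles, a minimal measurement loop (x is a hypothetical prepared batch; on GPU, call torch.cuda.synchronize() before each timestamp so the timing is accurate):

import time
import numpy as np

latencies_ms = []
model.eval()
with torch.no_grad():
    for _ in range(200):
        start = time.perf_counter()
        model(x)
        latencies_ms.append((time.perf_counter() - start) * 1000)

p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
print(f"p50={p50:.2f}ms  p95={p95:.2f}ms  p99={p99:.2f}ms")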

Best Practices

  1. Profile first: Measure before optimizing
  2. Vectorize: Use backend-specific ops
  3. Batch intelligently: Balance memory and throughput
  4. Monitor coverage: Low coverage indicates training issues
  5. Use AMP: Enable on compatible hardware
  6. Benchmark regularly: Track performance across changes

Common Bottlenecks

Slow Training

Symptoms: Low GPU utilization, slow epochs

Solutions:

  1. Increase batch size
  2. Enable AMP
  3. Use gradient accumulation
  4. Check dataloader num_workers (example after this list)
  5. Profile to find hotspots
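
For item 4, a typical DataLoader configuration (dataset is a placeholder):

from torch.utils.data import DataLoader

train_loader = DataLoader(
    dataset,
    batch_size=256,
    num_workers=4,            # parallel CPU workers for loading/preprocessing
    pin_memory=True,          # faster host-to-GPU transfers
    persistent_workers=True,  # avoid respawning workers every epoch
)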

High Memory Usage

Symptoms: OOM errors, can't increase batch size

Solutions:

  1. Gradient accumulation (smaller batches)
  2. Gradient checkpointing
  3. Reduce model size
  4. Use AMP (float16)
  5. Clear cache between batches (see below)
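
For item 5 (use sparingly; releasing cached blocks slows subsequent allocations):

import torch

torch.cuda.empty_cache()  # return cached GPU memory to the driver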

Low Coverage

Symptoms: Many ⊥ predictions, unstable training

Solutions:

  1. Increase rejection loss weight
  2. Adjust margin loss threshold
  3. Add more diverse training data
  4. Check gradient policies
  5. Verify target lifting

Hardware Recommendations

Training

  • GPU: NVIDIA with Tensor Cores (V100, A100, RTX 30/40 series)
  • Memory: 16GB+ for typical models
  • CPU: Multi-core for dataloader parallelism

Inference

  • GPU: Optional; CPU inference is practical for small batches
  • Memory: Depends on batch size and model
  • Latency: typically <10 ms with TorchScript on CPU

Next Steps