Training Guide

This guide covers training neural networks with SCM semantics, including projective learning, gradient policies, and specialized loss functions.

Table of Contents

  • Projective Learning
  • Gradient Policies
  • Loss Functions
  • Training Loop
  • Adaptive Loss Strategies
  • Best Practices
  • Hyperparameter Defaults
  • Next Steps

Projective Learning

When to Use Projective Mode

Projective learning lifts rational subgraphs to homogeneous tuples ⟨N,D⟩, allowing training on a smooth manifold while preserving strict SCM semantics at inference.

Use projective mode when:

  • You are training rational heads that should avoid instantiating ⊥ during optimization
  • Gradient dead zones around Q ≈ 0 hurt convergence
  • Outputs are safety-critical and distinguishing +∞ vs −∞ matters
  • You need smooth gradients through potential singularities

Skip projective mode when:

  • You are working with simple SCM operations
  • Singularities are rare in the training data
  • The model architecture has no rational bottlenecks

How It Works

Encoding:

φ(x) = ⟨x, 1⟩     for finite values
φ(⊥) = ⟨1, 0⟩     for bottom

Decoding:

φ⁻¹(N, D) = N/D   when |D| ≥ τ_infer
φ⁻¹(N, D) = ⊥     when |D| < τ_infer
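
A minimal sketch of the decode step, assuming PyTorch tensors and that ⊥ is reported as a boolean mask alongside the decoded values (decode_projective is an illustrative name, not part of the package):

import torch

def decode_projective(N: torch.Tensor, D: torch.Tensor, tau_infer: float = 1e-6):
    """Map homogeneous tuples back to values; flag |D| < tau_infer as ⊥."""
    bottom_mask = D.abs() < tau_infer
    # Divide only where the denominator is safe; masked entries are placeholders.
    safe_D = torch.where(bottom_mask, torch.ones_like(D), D)
    return N / safe_D, bottom_mask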

Detached Renormalization:

To prevent overflow without altering the represented value:

S = sg(√(N² + D²) + γ)
(N', D') ← (N/S, D/S)

The stop_gradient (sg) operator ensures the optimizer learns the direction of the tuple, not its magnitude. This creates "ghost gradients" that flow smoothly even when D → 0.
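
A minimal sketch of this renormalization, assuming PyTorch and using detach() as the stop-gradient:

import torch

def renormalize(N: torch.Tensor, D: torch.Tensor, gamma: float = 1e-9):
    """Rescale ⟨N, D⟩ by a detached norm so only the direction receives gradients."""
    S = (torch.sqrt(N * N + D * D) + gamma).detach()  # S = sg(√(N² + D²) + γ)
    return N / S, D / S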

Integration Steps

  1. Lift targets to projective tuples:

from zeroproof.training.targets import lift_targets

# Finite targets: y_finite → ⟨y_finite, 1⟩
# Infinite targets: ±inf → ⟨±1, 0⟩
targets_n, targets_d = lift_targets(y_true)

  2. Use the PROJECT gradient policy in projective regions:

from zeroproof.autodiff.policies import GradientPolicy, gradient_policy

with gradient_policy(GradientPolicy.PROJECT):
    loss.backward()

  3. Combine specialized losses (see the Loss Functions section).

  4. Decode at boundaries and monitor coverage.

Gap Region

Training uses stochastic thresholds (τ_train_min, τ_train_max) to avoid learning a brittle boundary. Inference uses a fixed τ_infer.

When τ_train > τ_infer, the interval [τ_infer, τ_train) is the gap region where inference returns a finite value but the denominator is numerically risky.

from zeroproof.inference import strict_inference, InferenceConfig

decoded, bottom_mask, gap_mask = strict_inference(
    N, D,
    config=InferenceConfig(tau_infer=1e-6, tau_train=1e-4)
)

Monitor gap_mask.sum() to track how often predictions fall in this uncertain zone.
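
For example, a hypothetical monitoring snippet (the 5% alert threshold is an arbitrary choice):

# Fraction of predictions that decode to a finite value but sit in the gap region.
gap_rate = gap_mask.float().mean().item()
if gap_rate > 0.05:
    print(f"Warning: {gap_rate:.1%} of predictions fall in the gap region")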

Gradient Policies

Gradient policies control how backpropagation interacts with ⊥. Available in zeroproof.autodiff.policies.

Policy Options

  • CLAMP: zeroes gradients on ⊥ paths and clamps finite gradients to [-1, 1]. Default for SCM-only graphs.
  • PROJECT: masks gradients when the forward value is ⊥. Use for projective heads and points at infinity.
  • REJECT: always zero gradient. Use when learning only through coverage/rejection losses.
  • PASSTHROUGH: gradients propagate through ⊥. Debugging only.

Usage

from zeroproof.autodiff.policies import GradientPolicy, gradient_policy

# Global policy
with gradient_policy(GradientPolicy.PROJECT):
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
    loss.backward()

# Per-layer policy (advanced)
from zeroproof.autodiff.policies import register_policy
register_policy(my_rational_layer, GradientPolicy.PROJECT)

Design Notes

  • Policies are deterministic and XLA/TorchScript compatible
  • No Python-side branching on tensors
  • Projective mode pairs PROJECT with detached renormalization
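
As an illustration only (not the library's implementation), a PROJECT-style mask can be expressed with tensor ops alone, so the forward value is preserved while gradients on ⊥ entries are dropped; mask_bottom_gradients is a hypothetical helper:

import torch

def mask_bottom_gradients(x: torch.Tensor, bottom_mask: torch.Tensor) -> torch.Tensor:
    """Forward: identity. Backward: zero gradient wherever bottom_mask is True."""
    keep = (~bottom_mask).to(x.dtype)
    # The first term carries gradients only on kept entries; the detached second
    # term restores the forward value on masked entries without contributing gradients.
    return x * keep + (x * (1.0 - keep)).detach()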

Loss Functions

ZeroProofML combines multiple losses to stabilize training and preserve orientation information.

1. Implicit Loss

Cross-product form that avoids direct division:

from zeroproof.losses.implicit import implicit_loss

# For projective outputs (N, D) and targets (Y_n, Y_d)
loss_fit = implicit_loss(N, D, Y_n, Y_d)

Formula:

E = (N · Y_d - D · Y_n)²
L_fit = mean(E / (sg(D² Y_d² + N² Y_n²) + γ))
  • Scale-invariant
  • Numerically stable when D → 0
  • Default γ = 1e-9
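
A direct transcription of the formula above as a sketch (the packaged implicit_loss may differ in details such as masking or reduction):

import torch

def implicit_loss_sketch(N, D, Y_n, Y_d, gamma: float = 1e-9):
    """Cross-product fit: compares N/D to Y_n/Y_d without dividing."""
    E = (N * Y_d - D * Y_n) ** 2
    scale = (D ** 2 * Y_d ** 2 + N ** 2 * Y_n ** 2).detach() + gamma  # sg(...) + γ
    return (E / scale).mean()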

2. Margin Loss

Encourages denominators to stay away from zero:

from zeroproof.losses.margin import margin_loss

loss_margin = margin_loss(D, tau_train=1e-4)

Formula:

L_margin = mean(max(0, τ_train - |D|)²)
  • Penalizes denominators approaching τ_train
  • Can be masked to finite paths only
  • Default λ_margin = 0.1
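
A sketch of this hinge in PyTorch (illustrative only; the packaged margin_loss can additionally be masked to finite paths):

import torch

def margin_loss_sketch(D: torch.Tensor, tau_train: float = 1e-4):
    """Quadratic hinge that activates when |D| drops below tau_train."""
    return torch.clamp(tau_train - D.abs(), min=0.0).pow(2).mean()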

3. Sign Consistency Loss

Disambiguates +∞ vs -∞ using projective cosine similarity:

from zeroproof.losses.sign import sign_consistency_loss

loss_sign = sign_consistency_loss(N, D, Y_n, Y_d, tau_sing=1e-3)

Formula:

L_sign = 𝟙(|Y_d| < τ_sing) · (1 - (N·Y_n + D·Y_d) / (‖(N,D)‖ ‖(Y_n,Y_d)‖))
  • Only applied to singular targets (|Y_d| < τ_sing)
  • Aligns orientation in projective space
  • Default λ_sign = 1.0
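
A sketch of this term, averaged over the batch (the eps guard and the mean reduction are assumptions, not necessarily what sign_consistency_loss does):

import torch

def sign_consistency_sketch(N, D, Y_n, Y_d, tau_sing: float = 1e-3, eps: float = 1e-12):
    """Cosine alignment of ⟨N, D⟩ with the target tuple, applied to singular targets."""
    singular = (Y_d.abs() < tau_sing).to(N.dtype)          # 𝟙(|Y_d| < τ_sing)
    dot = N * Y_n + D * Y_d
    norms = torch.sqrt(N * N + D * D) * torch.sqrt(Y_n * Y_n + Y_d * Y_d) + eps
    return (singular * (1.0 - dot / norms)).mean()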

4. Coverage & Rejection Loss

Monitor and penalize low coverage (fraction of finite predictions):

from zeroproof.losses.coverage import rejection_loss

# Compute coverage
coverage = (bottom_mask.logical_not()).float().mean()

# Penalize if below threshold
loss_rej = rejection_loss(coverage, target_coverage=0.95)
  • Adaptive sampling can increase coverage over time
  • Early stopping when coverage stagnates
  • Default target: 95%
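
One simple shape for such a penalty is a one-sided quadratic; this is an assumption for illustration, not necessarily what rejection_loss computes:

import torch

def rejection_penalty_sketch(coverage: torch.Tensor, target_coverage: float = 0.95):
    """Penalize coverage only when it falls below the target."""
    return torch.clamp(target_coverage - coverage, min=0.0) ** 2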

Combined Objective

from zeroproof.training.loss import SCMTrainingLoss

loss_fn = SCMTrainingLoss(
    lambda_margin=0.1,
    lambda_sign=1.0,
    lambda_rejection=0.01,
    tau_train=1e-4,
    tau_sing=1e-3,
    gamma=1e-9
)

total_loss = loss_fn(outputs=(N, D), targets=(Y_n, Y_d))

Training Loop

Using SCMTrainer

The reference trainer handles target lifting, gradient policies, and coverage monitoring:

from zeroproof.training import SCMTrainer, TrainingConfig

trainer = SCMTrainer(
    model=model,
    optimizer=optimizer,
    loss_fn=loss_fn,
    train_loader=train_loader,
    val_loader=val_loader,  # optional
    config=TrainingConfig(
        max_epochs=100,
        gradient_policy=GradientPolicy.PROJECT,
        coverage_threshold=0.90,
        coverage_patience=10,
        use_amp=True,  # mixed precision
        grad_accumulation_steps=1,
        tau_train_min=1e-4,
        tau_train_max=1e-4
    )
)

history = trainer.fit()

Manual Training Loop

For custom workflows:

from zeroproof.autodiff.policies import gradient_policy, GradientPolicy
from zeroproof.training.targets import lift_targets

model.train()
for epoch in range(num_epochs):
    for batch_x, batch_y in train_loader:
        # Lift targets to projective tuples
        Y_n, Y_d = lift_targets(batch_y)

        # Forward pass
        N, D = model(batch_x)  # projective outputs

        # Compute losses
        loss_fit = implicit_loss(N, D, Y_n, Y_d)
        loss_margin = margin_loss(D, tau_train=1e-4)
        loss_sign = sign_consistency_loss(N, D, Y_n, Y_d)

        total_loss = loss_fit + 0.1*loss_margin + 1.0*loss_sign

        # Backward with gradient policy
        optimizer.zero_grad()
        with gradient_policy(GradientPolicy.PROJECT):
            total_loss.backward()
        optimizer.step()

        # Monitor coverage
        _, bottom_mask, _ = strict_inference(N, D)
        coverage = (~bottom_mask).float().mean()
        print(f"Coverage: {coverage:.3f}")

Adaptive Loss Strategies

Coverage Control

Gradually increase target coverage as training progresses:

from zeroproof.training.adaptive import AdaptiveCoverageScheduler

scheduler = AdaptiveCoverageScheduler(
    initial_coverage=0.80,
    target_coverage=0.95,
    warmup_epochs=20
)

for epoch in range(num_epochs):
    target_cov = scheduler.step(epoch)
    loss_rej = rejection_loss(current_coverage, target_coverage=target_cov)
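
Such a schedule can be as simple as linear interpolation over the warmup period; the sketch below assumes that behavior and is not necessarily how AdaptiveCoverageScheduler is implemented:

def linear_coverage_schedule(epoch: int, initial: float = 0.80,
                             target: float = 0.95, warmup_epochs: int = 20) -> float:
    """Ramp the coverage target linearly from `initial` to `target` during warmup."""
    if epoch >= warmup_epochs:
        return target
    return initial + (target - initial) * epoch / warmup_epochs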

Threshold Perturbation

Perturb thresholds per batch to avoid brittle boundaries:

from zeroproof.training.thresholds import perturbed_threshold

for batch in train_loader:
    tau = perturbed_threshold(
        tau_train_min=1e-4,
        tau_train_max=2e-4,
        mode='uniform'  # or 'log_uniform'
    )
    loss_margin = margin_loss(D, tau_train=tau)
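
For reference, a log-uniform draw between the two bounds could look like the following sketch (an illustration of the 'log_uniform' mode, not the library's exact sampler):

import math
import random

def sample_log_uniform(tau_min: float = 1e-4, tau_max: float = 2e-4) -> float:
    """Sample a threshold uniformly in log-space between tau_min and tau_max."""
    return math.exp(random.uniform(math.log(tau_min), math.log(tau_max)))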

Best Practices

  1. Start with SCM mode before adding projective complexity
  2. Monitor coverage throughout training; early stop if stagnant
  3. Use sign consistency for all singular targets (±∞)
  4. Keep τ_train_min and τ_train_max close unless you need strong perturbations
  5. Log threshold distributions to understand near-singular exposure
  6. Validate on strict inference mode with τ_infer threshold
  7. Check gap_mask in production; reject predictions in the gap if needed

Hyperparameter Defaults

Based on Physics Trinity benchmarks (see scm/paper_2601.tex):

  • γ (implicit loss stability): default 1e-9, range [1e-12, 1e-6]
  • τ_train (margin threshold): default 1e-4, range [1e-6, 1e-3]
  • τ_infer (strict inference): default 1e-6, range [1e-8, 1e-4]
  • τ_sing (sign label tolerance): default 1e-3, range [1e-4, 1e-2]
  • λ_margin: default 0.1, range [0.01, 1.0]
  • λ_sign: default 1.0, range [0.1, 10.0]
  • λ_rejection: default 0.01, range [0.001, 0.1]

Next Steps