
Topic 7: Evaluation & Metrics (Poles, Coverage, Stability)

How to measure pole-learning quality, stability, and coverage, and how to log and visualize the results.

What to Measure

  • Pole Localization Error (PLE): Distance between predicted and true poles (see the sketch after this list).
  • Sign Consistency: Correctness of ±∞ transitions around poles.
  • Asymptotic Behavior: Fit of log|y| vs −log|Q| near poles.
  • Residual Consistency: The residual R(x) = Q(x)·y(x) − P(x) should stay small near poles.
  • Coverage Breakdown: REAL ratios near/mid/far from poles; tag distribution.
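
As a concrete reference point, PLE can be computed as the mean distance from each true pole to its nearest predicted pole. A minimal NumPy sketch, independent of the library's own implementation (which may use a different matching scheme):

import numpy as np

def pole_localization_error(true_poles, predicted_poles):
    # Mean distance from each true pole to its nearest predicted pole.
    # Illustrative only; PoleEvaluator may match poles differently.
    true_poles = np.asarray(true_poles, dtype=float)
    predicted_poles = np.asarray(predicted_poles, dtype=float)
    if predicted_poles.size == 0:
        return float("inf")  # no predicted poles: worst case
    # Pairwise |true - predicted| distances, then nearest match per true pole.
    dists = np.abs(true_poles[:, None] - predicted_poles[None, :])
    return float(dists.min(axis=1).mean())

print(pole_localization_error([0.0, 1.5], [0.02, 1.48]))  # ~0.02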

Pole Metrics API

  • High‑level evaluator and metrics container: zeroproof/utils/pole_metrics.py:1.
  • Use when you have true poles (synthetic/analytic) or to compare predicted pole sets.

Example (sketch)

from zeroproof.utils.pole_metrics import PoleEvaluator

# x_values: sample inputs; y_values: model outputs at those inputs;
# Q_values: learned denominator values; maybe_predicted: predicted pole set, if any.
true_poles = [0.0, 1.5]
peval = PoleEvaluator(true_poles=true_poles)
metrics = peval.evaluate(x_values, y_values, Q_values, P_values=None,
                         predicted_poles=maybe_predicted)
print(metrics.ple, metrics.sign_consistency)

Integrated Evaluation

  • One‑stop evaluation with plotting and JSON logging: zeroproof/utils/evaluation_api.py:1 (IntegratedEvaluator).
  • Configurable near/mid distance thresholds, choice of metrics to compute, and visualization cadence.

Usage

from zeroproof.utils.evaluation_api import IntegratedEvaluator, EvaluationConfig

# Visualize on every fifth evaluation; true poles are supplied for PLE.
config = EvaluationConfig(enable_visualization=True, plot_frequency=5)
evaluator = IntegratedEvaluator(config=config, true_poles=[0.0])
metrics = evaluator.evaluate_model(model, x_values)
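
In a training loop, evaluate_model is typically called once per epoch; with plot_frequency=5, plots are then emitted on every fifth call. A hedged sketch (train_one_epoch and num_epochs are placeholders for your own training code, not zeroproof APIs):

num_epochs = 50  # illustrative
for epoch in range(num_epochs):
    train_one_epoch(model)  # placeholder: your training step
    metrics = evaluator.evaluate_model(model, x_values)
    print(f"epoch {epoch}: PLE={metrics.ple}")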

Outputs

  • Metrics dataclass with PLE, sign consistency, asymptotic slope error/correlation, residual stats, coverage breakdown.
  • Optional plots (saved to plot_dir), emitted every plot_frequency evaluations.
  • Log file evaluation_log.json with timestamped entries (a loading sketch follows below).
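
For post-hoc analysis, the log can be loaded directly. A minimal sketch, assuming each entry is a JSON object carrying a timestamp plus the metric fields above (check the actual schema written by IntegratedEvaluator):

import json

with open("evaluation_log.json") as f:
    log = json.load(f)

# The keys below are assumptions about the entry schema; adjust as needed.
for entry in log:
    print(entry.get("timestamp"), entry.get("ple"))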

Robotics IK (2D) Metrics & Buckets

For the RR-arm (θ2 singularities at {0, π}), use the 2D helpers and bucket test MSE by |det(J)| (∝ |sin θ2| for this arm) to quantify near-pole behavior.

  • 2D metrics helper: zeroproof/metrics/pole_2d.py
    • compute_ple_to_lines(test_inputs, predictions): PLE vs θ2∈{0, π}
    • compute_pole_metrics_2d(test_inputs, predictions): bundle of PLE, sign consistency, slope error, residual consistency
  • Bucketed test MSE by |det(J)| bins (B0–B4): report both MSE and sample counts per bucket (see the sketch after this list)
    • Edges (default): [0, 1e-5, 1e-4, 1e-3, 1e-2, inf]
    • Recommended: ensure B0–B3 have non-zero counts for robust near-pole evaluation
  • The comparator driver outputs both bucketed MSE and 2D pole metrics in JSON
    • The quick profile stratifies the test subset by |det(J)| and aligns DLS to the same subset for parity
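
A minimal NumPy sketch of the bucketing, where abs_detJ holds |det(J)| per test sample and sq_errors holds per-sample squared errors (names are illustrative, not the comparator driver's API):

import numpy as np

# Default edges from above; B0 collects the most singular samples.
EDGES = np.array([0.0, 1e-5, 1e-4, 1e-3, 1e-2, np.inf])

def bucketed_mse(abs_detJ, sq_errors):
    abs_detJ = np.asarray(abs_detJ, dtype=float)
    sq_errors = np.asarray(sq_errors, dtype=float)
    # The interior edges map nonnegative |det(J)| values to buckets 0..4.
    bucket = np.digitize(abs_detJ, EDGES[1:-1])
    out = {}
    for b in range(5):
        mask = bucket == b
        mse = float(sq_errors[mask].mean()) if mask.any() else float("nan")
        out[f"B{b}"] = {"mse": mse, "count": int(mask.sum())}
    return out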

Coverage and Tag Distribution

  • Use coverage trackers during training to monitor REAL vs non-REAL tags; break down near-pole coverage using Hybrid stats (see the sketch after this list).
  • Code: zeroproof/training/coverage.py:1 and Hybrid context in autodiff/hybrid_gradient.py:1.
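
As an illustration of the breakdown itself, coverage can be computed directly from per-sample tags and pole distances, without going through the library API (tags, dist_to_pole, and the thresholds are illustrative; match near/mid to the Hybrid delta used in training):

import numpy as np

def coverage_breakdown(tags, dist_to_pole, near=0.1, mid=0.5):
    # tags: per-sample tag strings ("REAL" vs non-REAL).
    # dist_to_pole: per-sample distance to the nearest true pole.
    tags = np.asarray(tags)
    d = np.asarray(dist_to_pole, dtype=float)
    is_real = tags == "REAL"
    out = {"overall": float(is_real.mean())}
    for name, mask in [("near", d < near),
                       ("mid", (d >= near) & (d < mid)),
                       ("far", d >= mid)]:
        out[name] = float(is_real[mask].mean()) if mask.any() else float("nan")
    return out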

Diagnostics & Reports

  • DiagnosticMonitor: batch/epoch histories, q_min/q_mean, tag counts, gradient summaries, exports.
    • Code: zeroproof/training/sampling_diagnostics.py:480.
  • Verification report: see docs/verification_report.md:1 for a consolidated spec/behavior check.

Tests and Reproducibility

  • Property tests exercise totality, reduction modes, and AD invariants.
    • See tests/unit/ and tests/property/.
  • Repro checklist: seeds and deterministic flags (per spec complete_v2.md:1); a seeding recipe is sketched below.
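
A typical seeding recipe (the torch lines are an assumption and apply only if your backend is PyTorch):

import random
import numpy as np

SEED = 42
random.seed(SEED)
np.random.seed(SEED)

# PyTorch backend only (assumption; adjust to your setup):
# import torch
# torch.manual_seed(SEED)
# torch.use_deterministic_algorithms(True)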

Practical Guidance

  • Align evaluator thresholds (near/mid) with Hybrid δ for consistent analysis.
  • Visualize both y(x) and 1/|Q(x)| overlays to spot missed poles (see the sketch below).
  • Track PLE over time; expect O(1/√n) decay with sufficient near‑pole samples.
  • Persist metrics and plots per checkpoint to compare schedules and policies.
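
As suggested above, overlaying y(x) with 1/|Q(x)| makes missed poles visible as spikes in 1/|Q| with no matching feature in y. A minimal matplotlib sketch (x_values, y_values, Q_values are illustrative arrays from an evaluation run):

import numpy as np
import matplotlib.pyplot as plt

eps = 1e-12  # guard against division by exactly zero
fig, ax1 = plt.subplots()
ax1.plot(x_values, y_values, label="y(x)")
ax2 = ax1.twinx()  # second axis, since 1/|Q| can dwarf y near poles
ax2.plot(x_values, 1.0 / (np.abs(Q_values) + eps), color="tab:red",
         alpha=0.5, label="1/|Q(x)|")
ax1.set_xlabel("x")
ax1.set_ylabel("y(x)")
ax2.set_ylabel("1/|Q(x)|")
fig.legend(loc="upper right")
plt.show()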