Topic 7: Evaluation & Metrics (Poles, Coverage, Stability)
How to measure pole learning quality, stability, and coverage; and how to log/visualize results.
What to Measure
- Pole Localization Error (PLE): Distance between predicted and true poles.
- Sign Consistency: Correctness of ±∞ transitions around poles.
- Asymptotic Behavior: Fit of log|y| vs −log|Q| near poles.
- Residual Consistency: the residual R(x) = Q(x)·y(x) − P(x) should be small near poles.
- Coverage Breakdown: REAL ratios near/mid/far from poles; tag distribution.
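For orientation, here is a minimal sketch of how the first two metrics can be computed on a sorted 1D grid. The function names and signatures are illustrative assumptions, not the packaged API (that lives in `zeroproof/utils/pole_metrics.py`, described below).

```python
# Illustrative sketch only: pole_localization_error and sign_consistency
# are assumed names, not the zeroproof API.
import numpy as np

def pole_localization_error(predicted_poles, true_poles):
    """Mean distance from each true pole to its nearest predicted pole."""
    if len(predicted_poles) == 0:
        return float("inf")
    pred = np.asarray(predicted_poles)
    return float(np.mean([np.min(np.abs(pred - t)) for t in true_poles]))

def sign_consistency(x, y, true_poles, window=1e-2):
    """Fraction of poles where y flips sign across the pole.

    Assumes x is sorted ascending; odd-order poles should transition
    between +inf and -inf, i.e. show opposite signs on either side.
    """
    flips = 0
    for p in true_poles:
        left = y[(x > p - window) & (x < p)]
        right = y[(x > p) & (x < p + window)]
        if len(left) and len(right) and np.sign(left[-1]) != np.sign(right[0]):
            flips += 1
    return flips / max(len(true_poles), 1)
```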
Pole Metrics API
- High‑level evaluator and metrics container: `zeroproof/utils/pole_metrics.py:1`.
- Use when you have true poles (synthetic/analytic) or to compare predicted pole sets.
Example (sketch)
```python
from zeroproof.utils.pole_metrics import PoleEvaluator

true_poles = [0.0, 1.5]
peval = PoleEvaluator(true_poles=true_poles)
metrics = peval.evaluate(
    x_values, y_values, Q_values,
    P_values=None,
    predicted_poles=maybe_predicted,
)
print(metrics.ple, metrics.sign_consistency)
```
Integrated Evaluation
- One‑stop evaluation with plotting and JSON logging: `zeroproof/utils/evaluation_api.py:1` (`IntegratedEvaluator`).
- Configurable thresholds (near/mid distances), which metrics to compute, and visualization cadence.
Usage
```python
from zeroproof.utils.evaluation_api import IntegratedEvaluator, EvaluationConfig

config = EvaluationConfig(enable_visualization=True, plot_frequency=5)
evaluator = IntegratedEvaluator(config=config, true_poles=[0.0])
metrics = evaluator.evaluate_model(model, x_values)
```
Outputs
- Metrics dataclass with PLE, sign consistency, asymptotic slope error/correlation, residual stats, coverage breakdown.
- Optional plots (saved to `plot_dir`), generated periodically per `plot_frequency`.
- Log file `evaluation_log.json` with timestamped entries.
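Because entries are timestamped, the log supports post-hoc trend analysis. A hedged sketch, assuming one JSON object per line with `timestamp` and `ple` keys; adjust the key names to the actual schema of your `evaluation_log.json`:

```python
# Assumption: evaluation_log.json is JSON-lines with "timestamp"/"ple"
# keys; adapt the key names to the actual log schema.
import json

entries = []
with open("evaluation_log.json") as f:
    for line in f:
        if line.strip():
            entries.append(json.loads(line))

# Print the PLE trajectory across evaluation runs.
for e in entries:
    if "ple" in e:
        print(f'{e["timestamp"]}: PLE={e["ple"]:.4f}')
```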
Robotics IK (2D) Metrics & Buckets
For the RR‑arm (θ2 singularities at {0, π}), use the 2D helpers and bucketed MSE by |det(J)|≈|sin θ2| to quantify near‑pole behavior.
- 2D metrics helper: `zeroproof/metrics/pole_2d.py`
  - `compute_ple_to_lines(test_inputs, predictions)`: PLE vs θ2 ∈ {0, π}
  - `compute_pole_metrics_2d(test_inputs, predictions)`: bundle of PLE, sign consistency, slope error, residual consistency
- Bucketed test MSE by |det(J)| bins (B0–B4): report both MSE and counts per bucket (see the sketch after this list)
  - Edges (default): [0, 1e‑5, 1e‑4, 1e‑3, 1e‑2, inf]
  - Recommended: ensure B0–B3 have non‑zero counts for robust near‑pole evaluation
- Comparator driver outputs both bucketed MSE and 2D pole metrics in JSON
- Quick profile stratifies the test subset by |det(J)| and aligns DLS to the same subset for parity
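A minimal sketch of the bucketing, assuming |det(J)| is proxied by |sin θ2| for the RR arm and that targets/predictions are arrays of joint solutions; `bucketed_mse` is an illustrative name, not the comparator's API.

```python
# Illustrative only: bucketed_mse is an assumed helper name; the packaged
# 2D metrics live in zeroproof/metrics/pole_2d.py.
import numpy as np

EDGES = [0.0, 1e-5, 1e-4, 1e-3, 1e-2, np.inf]  # buckets B0..B4

def bucketed_mse(theta2, y_true, y_pred, edges=EDGES):
    """Per-bucket (mse, count) keyed B0..B4 by |sin(theta2)| ~ |det(J)|."""
    detj = np.abs(np.sin(theta2))
    # Per-sample squared error, averaged over output dimensions.
    sq_err = np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2, axis=-1)
    out = {}
    for i in range(len(edges) - 1):
        mask = (detj >= edges[i]) & (detj < edges[i + 1])
        mse = float(sq_err[mask].mean()) if mask.any() else float("nan")
        out[f"B{i}"] = (mse, int(mask.sum()))
    return out
```

Reporting counts alongside MSE matters: an empty near-pole bucket (B0) silently turns a hard benchmark into an easy one.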
Coverage and Tag Distribution
- Use coverage trackers during training to monitor REAL vs non‑REAL; break down near‑pole coverage using Hybrid stats.
- Code: `zeroproof/training/coverage.py:1` and Hybrid context in `autodiff/hybrid_gradient.py:1`.
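As a sketch of the idea (the class and tag names are illustrative assumptions, not the `coverage.py` interface):

```python
# Illustrative only: CoverageTracker and the tag strings are assumptions;
# the real tracker lives in zeroproof/training/coverage.py.
from collections import Counter

class CoverageTracker:
    def __init__(self):
        self.tag_counts = Counter()

    def update(self, tags):
        """Accumulate output tags from one batch."""
        self.tag_counts.update(tags)

    @property
    def real_ratio(self):
        total = sum(self.tag_counts.values())
        return self.tag_counts["REAL"] / total if total else 0.0

tracker = CoverageTracker()
tracker.update(["REAL", "REAL", "PINF", "REAL"])
print(tracker.real_ratio)  # 0.75
```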
Diagnostics & Reports
- `DiagnosticMonitor`: batch/epoch histories, q_min/q_mean, tag counts, gradient summaries, exports.
- Code: `zeroproof/training/sampling_diagnostics.py:480`.
- Verification report: see `docs/verification_report.md:1` for a consolidated spec/behavior check.
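A toy monitor in the same spirit, assuming per-batch access to q values and output tags; the class, field, and method names here are guesses for illustration only (see `sampling_diagnostics.py` for the real interface):

```python
# Toy stand-in for DiagnosticMonitor; all names are illustrative.
from collections import Counter, defaultdict

class MiniMonitor:
    def __init__(self):
        self.epochs = defaultdict(
            lambda: {"q_min": float("inf"), "q_sum": 0.0, "n": 0,
                     "tags": Counter()}
        )

    def log_batch(self, epoch, q_values, tags):
        """Fold one batch's q statistics and tag counts into the epoch."""
        e = self.epochs[epoch]
        e["q_min"] = min(e["q_min"], min(q_values))
        e["q_sum"] += sum(q_values)
        e["n"] += len(q_values)
        e["tags"].update(tags)

    def summary(self, epoch):
        e = self.epochs[epoch]
        return {"q_min": e["q_min"],
                "q_mean": e["q_sum"] / max(e["n"], 1),
                "tag_counts": dict(e["tags"])}
```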
Tests and Reproducibility
- Property tests exercise totality, reduction modes, and AD invariants.
- See `tests/unit/` and `tests/property/`.
- Repro checklist: seeds and deterministic flags (per spec `complete_v2.md:1`).
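A common seeding preamble consistent with such a checklist might look as follows (the exact flags are an assumption; defer to `complete_v2.md` for the authoritative list):

```python
# Assumed reproducibility preamble; check complete_v2.md for the
# project's checklist. The torch lines apply only if PyTorch is used.
import os
import random
import numpy as np

SEED = 0
os.environ["PYTHONHASHSEED"] = str(SEED)
random.seed(SEED)
np.random.seed(SEED)

try:
    import torch
    torch.manual_seed(SEED)
    # May require CUBLAS_WORKSPACE_CONFIG=:4096:8 on CUDA builds.
    torch.use_deterministic_algorithms(True)
except ImportError:
    pass
```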
Practical Guidance
- Align evaluator thresholds (near/mid) with Hybrid δ for consistent analysis.
- Visualize both y(x) and 1/|Q(x)| overlays to spot missed poles (see the sketch after this list).
- Track PLE over time; expect O(1/√n) decay with sufficient near‑pole samples.
- Persist metrics and plots per checkpoint to compare schedules and policies.
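A sketch of the overlay from the second bullet, with toy stand-ins for the model's y and Q (replace `model_y`/`model_Q` with your trained model's evaluations; they are not zeroproof API):

```python
# Overlay sketch: spikes in 1/|Q| that lack matching structure in y
# flag missed poles. model_y/model_Q are toy stand-ins.
import numpy as np
import matplotlib.pyplot as plt

def model_Q(x):
    return x * (x - 1.5)          # toy denominator with zeros at 0 and 1.5

def model_y(x):
    with np.errstate(divide="ignore"):
        return 1.0 / model_Q(x)   # toy rational output

x = np.linspace(-2.0, 2.0, 2001)
y = model_y(x)
inv_q = 1.0 / np.maximum(np.abs(model_Q(x)), 1e-12)

fig, ax1 = plt.subplots()
ax1.plot(x, np.clip(y, -50, 50), color="tab:blue", label="y(x)")
ax1.set_xlabel("x")
ax1.set_ylabel("y(x)")
ax2 = ax1.twinx()                  # second axis so the scales do not fight
ax2.plot(x, inv_q, color="tab:red", alpha=0.6, label="1/|Q(x)|")
ax2.set_yscale("log")              # pole spikes span orders of magnitude
ax2.set_ylabel("1/|Q(x)|")
fig.legend(loc="upper right")
plt.show()
```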