Causal Judge Evaluation: Calibrated Surrogate Metrics for LLM Systems
- URL: http://arxiv.org/abs/2512.11150v1
- Date: Thu, 11 Dec 2025 22:16:24 GMT
- Title: Causal Judge Evaluation: Calibrated Surrogate Metrics for LLM Systems
- Authors: Eddie Landesberg,
- Abstract summary: Uncalibrated scores can invert preferences, naive confidence intervals on uncalibrated scores achieve near-0% coverage, and importance-weighted estimators collapse under limited overlap. We introduce Causal Judge Evaluation, a framework that fixes all three failures.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: LLM-as-judge evaluation has become the de facto standard for scaling model assessment, but the practice is statistically unsound: uncalibrated scores can invert preferences, naive confidence intervals on uncalibrated scores achieve near-0% coverage, and importance-weighted estimators collapse under limited overlap despite high effective sample size (ESS). We introduce Causal Judge Evaluation (CJE), a framework that fixes all three failures. On n=4,961 Chatbot Arena prompts (after filtering from 5k), CJE achieves 99% pairwise ranking accuracy at full sample size (94% averaged across configurations), matching oracle quality, at 14x lower cost (for ranking 5 policies) by calibrating a 16x cheaper judge on just 5% oracle labels (~250 labels). CJE combines three components: (i) AutoCal-R, reward calibration via mean-preserving isotonic regression; (ii) SIMCal-W, weight stabilization via stacking of S-monotone candidates; and (iii) Oracle-Uncertainty Aware (OUA) inference that propagates calibration uncertainty into confidence intervals. We formalize the Coverage-Limited Efficiency (CLE) diagnostic, which explains why IPS-style estimators fail even when ESS exceeds 90%: the logger rarely visits regions where target policies concentrate. Key findings: SNIPS inverts rankings even with reward calibration (38% pairwise, negative Kendall's tau) due to weight instability; calibrated IPS remains near-random (47%) despite weight stabilization, consistent with CLE; OUA improves coverage from near-0% to ~86% (Direct) and ~96% (stacked-DR), where naive intervals severely under-cover.
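The calibration idea at the heart of the abstract (AutoCal-R's "reward calibration via mean-preserving isotonic regression") can be illustrated with a toy sketch. This is not the paper's implementation: the data is synthetic, the judge/oracle models are invented for illustration, and only the general technique (fitting an isotonic map from cheap-judge scores to oracle labels on a small labeled slice) follows the abstract.

```python
# Illustrative sketch, not the paper's code: calibrate a cheap judge's
# scores against a ~5% slice of expensive oracle labels via isotonic
# regression. All variable names and distributions here are hypothetical.
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)

n = 5000
true_quality = rng.uniform(0, 1, n)
# Cheap judge: monotone in true quality but systematically distorted.
judge = np.clip(true_quality**2 + rng.normal(0, 0.05, n), 0, 1)
# Expensive oracle: nearly unbiased labels.
oracle = np.clip(true_quality + rng.normal(0, 0.02, n), 0, 1)

# Only ~5% of prompts receive oracle labels.
labeled = rng.choice(n, size=int(0.05 * n), replace=False)

# Fit a monotone map judge score -> oracle label on the labeled slice,
# then apply it to every judge score.
iso = IsotonicRegression(out_of_bounds="clip")
iso.fit(judge[labeled], oracle[labeled])
calibrated = iso.predict(judge)

# Isotonic regression is fitted by least squares, so on the labeled slice
# the calibrated scores match the mean of the oracle labels.
print(abs(calibrated[labeled].mean() - oracle[labeled].mean()))

# Over the full set, calibration shrinks the gap to the oracle mean.
print(abs(judge.mean() - oracle.mean()), abs(calibrated.mean() - oracle.mean()))
```

Because the raw judge is monotone in quality, it already ranks responses well; what calibration fixes is the scale, which is what matters once judge scores are averaged into policy-level value estimates and confidence intervals, as the abstract's near-0% coverage result illustrates.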
Related papers
- Black-Box Reliability Certification for AI Agents via Self-Consistency Sampling and Conformal Calibration [0.0]
We answer with a reliability level: a single number per system-task pair. Self-consistency sampling reduces uncertainty exponentially. Conformal calibration guarantees correctness within 1/(n+1) of the target level.
arXiv Detail & Related papers (2026-02-24T21:03:50Z)
- Correctness-Optimized Residual Activation Lens (CORAL): Transferrable and Calibration-Aware Inference-Time Steering [3.7758197704962835]
We introduce CORAL, an inference-time steering method that captures correctness signals from model internal activations using weight-decay-regularized probes. CORAL consistently improves accuracy by 10% and expected calibration error (ECE) by 50% on average. Our results support the hypothesis that distributed information in model internals can be extracted using regularized probes when individual neurons are insufficient.
arXiv Detail & Related papers (2026-02-05T18:55:56Z)
- NAACL: Noise-AwAre Verbal Confidence Calibration for LLMs in RAG Systems [53.52419750390942]
Large language models (LLMs) are used in mission-critical factual domains. LLMs exhibit poor calibration performance due to noisy retrieved contexts. We propose NAACL Rules (Noise-AwAre Confidence CaLibration Rules) to provide a principled foundation for resolving overconfidence under noise.
arXiv Detail & Related papers (2026-01-16T05:38:25Z)
- Learning Robust Representations for Malicious Content Detection via Contrastive Sampling and Uncertainty Estimation [0.0]
The Uncertainty Contrastive Framework (UCF) integrates uncertainty-aware contrastive loss, adaptive temperature scaling, and a self-attention-guided LSTM encoder to improve classification under noisy and imbalanced conditions. UCF dynamically adjusts contrastive weighting based on sample confidence, stabilizes training using positive anchors, and adapts temperature parameters to batch-level variability.
arXiv Detail & Related papers (2025-12-01T22:06:06Z)
- Learnable Conformal Prediction with Context-Aware Nonconformity Functions for Robotic Planning and Perception [4.694504497452662]
Learnable Conformal Prediction (LCP) replaces fixed scores with a lightweight neural function to produce context-aware uncertainty sets. It maintains CP's theoretical guarantees while reducing prediction set sizes by 18% in classification, tightening detection intervals by 52%, and improving path planning safety from 72% to 91% success with minimal overhead. Hardware evaluation shows LCP adds less than 1% memory and 15.9% inference overhead, yet sustains 39 FPS on detection tasks while being 7.4 times more energy-efficient than ensembles.
arXiv Detail & Related papers (2025-09-26T06:44:58Z)
- TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them [58.04324690859212]
The use of Large Language Models (LLMs) as automated evaluators (LLM-as-a-judge) has revealed critical inconsistencies in current evaluation frameworks. We identify two fundamental types of inconsistency: Score-Comparison Inconsistency and Pairwise Transitivity Inconsistency. We propose TrustJudge, a probabilistic framework that addresses these limitations through two key innovations.
arXiv Detail & Related papers (2025-09-25T13:04:29Z)
- Uncertainty-aware Long-tailed Weights Model the Utility of Pseudo-labels for Semi-supervised Learning [50.868594148443215]
We propose an Uncertainty-aware Ensemble Structure (UES) to assess the utility of pseudo-labels for unlabeled samples. UES is lightweight and architecture-agnostic, easily extending to various computer vision tasks, including classification and regression.
arXiv Detail & Related papers (2025-03-13T02:21:04Z)
- Equal Opportunity of Coverage in Fair Regression [50.76908018786335]
We study fair machine learning (ML) under predictive uncertainty to enable reliable and trustworthy decision-making.
We propose Equal Opportunity of Coverage (EOC) that aims to achieve two properties: (1) coverage rates for different groups with similar outcomes are close, and (2) the coverage rate for the entire population remains at a predetermined level.
arXiv Detail & Related papers (2023-11-03T21:19:59Z)
- BEA: Revisiting anchor-based object detection DNN using Budding Ensemble Architecture [8.736601342033431]
Budding Ensemble Architecture (BEA) is a novel reduced ensemble architecture for anchor-based object detection models.
The proposed loss functions in BEA improve the confidence score calibration and lower the uncertainty error.
arXiv Detail & Related papers (2023-09-14T21:54:23Z)
- Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback [91.22679548111127]
A trustworthy real-world prediction system should produce well-calibrated confidence scores.
We show that verbalized confidences emitted as output tokens are typically better-calibrated than the model's conditional probabilities.
arXiv Detail & Related papers (2023-05-24T10:12:33Z)
- Beyond calibration: estimating the grouping loss of modern neural networks [68.8204255655161]
Proper scoring rule theory shows that given the calibration loss, the missing piece to characterize individual errors is the grouping loss.
We show that modern neural network architectures in vision and NLP exhibit grouping loss, notably in distribution-shift settings.
arXiv Detail & Related papers (2022-10-28T07:04:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.