E-Scores for (In)Correctness Assessment of Generative Model Outputs
- URL: http://arxiv.org/abs/2510.25770v1
- Date: Wed, 29 Oct 2025 17:59:16 GMT
- Title: E-Scores for (In)Correctness Assessment of Generative Model Outputs
- Authors: Guneet S. Dhillon, Javier González, Teodora Pandeva, Alicia Curth,
- Abstract summary: We use e-values to complement generative model outputs with e-scores as a measure of incorrectness. We experimentally demonstrate their efficacy in assessing LLM outputs for different correctness types.
- Score: 14.303918797970601
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While generative models, especially large language models (LLMs), are ubiquitous in today's world, principled mechanisms to assess their (in)correctness are limited. Using the conformal prediction framework, previous works construct sets of LLM responses where the probability of including an incorrect response, or error, is capped at a desired user-defined tolerance level. However, since these methods are based on p-values, they are susceptible to p-hacking, i.e., choosing the tolerance level post-hoc can invalidate the guarantees. We therefore leverage e-values to complement generative model outputs with e-scores as a measure of incorrectness. In addition to achieving the same statistical guarantees as before, e-scores provide users flexibility in adaptively choosing tolerance levels after observing the e-scores themselves, by upper bounding a post-hoc notion of error called size distortion. We experimentally demonstrate their efficacy in assessing LLM outputs for different correctness types: mathematical factuality and property constraints satisfaction.
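As a concrete illustration of the e-value mechanism described above, here is a minimal Python sketch. It uses a standard conformal e-value construction (the sum-normalized nonconformity score); the paper's actual e-score construction may differ, and the function names and synthetic calibration scores are assumptions for illustration only.

```python
import numpy as np

def conformal_e_score(cal_scores: np.ndarray, test_score: float) -> float:
    """A standard conformal e-value: if the n + 1 nonnegative
    nonconformity scores are exchangeable, this quantity has
    expectation at most 1, so large values are evidence of
    incorrectness."""
    n = len(cal_scores)
    total = cal_scores.sum() + test_score
    if total == 0.0:
        return 1.0  # degenerate case: no evidence either way
    return (n + 1) * test_score / total

def in_prediction_set(e_score: float, alpha: float) -> bool:
    """By Markov's inequality, P(E >= 1/alpha) <= alpha, so excluding
    responses whose e-score reaches 1/alpha caps the error probability
    at alpha; per the abstract, alpha can even be chosen after seeing
    the e-scores, at the cost of a bounded size distortion."""
    return e_score < 1.0 / alpha

# Illustrative usage with synthetic nonconformity scores.
rng = np.random.default_rng(0)
cal_scores = rng.exponential(size=200)  # calibration responses' scores
e = conformal_e_score(cal_scores, test_score=9.0)
print(e, in_prediction_set(e, alpha=0.1))
```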
Related papers
- Efficient Inference for Noisy LLM-as-a-Judge Evaluation [8.2511120576505]
Large language models (LLMs) are increasingly used as automatic evaluators of generative AI outputs.
In practice, LLM judges provide imperfect predictions of the underlying truth and can exhibit systematic, non-random errors.
arXiv Detail & Related papers (2026-01-08T22:46:26Z)
- LEC: Linear Expectation Constraints for False-Discovery Control in Selective Prediction and Routing Systems [95.35293543918762]
Large language models (LLMs) often generate unreliable answers, while uncertainty methods fail to fully distinguish correct from incorrect predictions.
We address this issue through the lens of false discovery rate (FDR) control, ensuring that among all accepted predictions, the proportion of errors does not exceed a target risk level.
We propose LEC, which reinterprets selective prediction as a constrained decision problem by enforcing a Linear Expectation Constraint.
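LEC's own algorithm is not spelled out in this summary; for orientation, the classical Benjamini-Hochberg procedure below is the textbook way to cap the FDR when each prediction carries a p-value. It is a generic sketch, not LEC itself.

```python
import numpy as np

def benjamini_hochberg(p_values: np.ndarray, target_fdr: float) -> np.ndarray:
    """Classical BH procedure: find the largest k such that the k-th
    smallest p-value is at most k * q / m, and accept those k
    predictions; this controls the FDR at level q for independent
    p-values."""
    m = len(p_values)
    order = np.argsort(p_values)
    thresholds = target_fdr * np.arange(1, m + 1) / m
    below = p_values[order] <= thresholds
    k = int(np.max(np.nonzero(below)[0])) + 1 if below.any() else 0
    accept = np.zeros(m, dtype=bool)
    accept[order[:k]] = True
    return accept

# Example: accept answers whose error p-values pass BH at 10% FDR.
print(benjamini_hochberg(np.array([0.002, 0.03, 0.2, 0.8]), 0.1))
```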
arXiv Detail & Related papers (2025-12-01T11:27:09Z)
- Unsupervised Conformal Inference: Bootstrapping and Alignment to Control LLM Uncertainty [49.19257648205146]
We propose an unsupervised conformal inference framework for generation.
Our gates achieve close-to-nominal coverage and provide tighter, more stable thresholds than split UCP.
The result is a label-free, API-compatible gate for test-time filtering.
arXiv Detail & Related papers (2025-09-26T23:40:47Z)
- COIN: Uncertainty-Guarding Selective Question Answering for Foundation Models with Provable Risk Guarantees [51.5976496056012]
COIN is an uncertainty-guarding selection framework that calibrates statistically valid thresholds to filter a single generated answer per question.
COIN estimates the empirical error rate on a calibration set and applies confidence interval methods to establish a high-probability upper bound on the true error rate.
We demonstrate COIN's robustness in risk control, strong test-time power in retaining admissible answers, and predictive efficiency under limited calibration data.
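The summary describes a familiar recipe: bound the binomial error rate on a calibration set, then pick the most permissive acceptance threshold that keeps the bound under the target risk. A simplified sketch along those lines follows; the Clopper-Pearson bound and the naive threshold scan are assumptions, not COIN's exact procedure.

```python
import numpy as np
from scipy.stats import beta

def error_upper_bound(k_errors: int, n: int, delta: float) -> float:
    """One-sided Clopper-Pearson upper confidence bound on a binomial
    error rate; it holds with probability at least 1 - delta."""
    if k_errors >= n:
        return 1.0
    return float(beta.ppf(1.0 - delta, k_errors + 1, n - k_errors))

def calibrate_threshold(conf: np.ndarray, correct: np.ndarray,
                        risk: float, delta: float) -> float:
    """Return the lowest (most permissive) confidence threshold whose
    upper-bounded error rate among accepted answers stays below the
    target risk. NB: scanning many thresholds without a multiplicity
    correction is only a sketch, not a rigorous selection rule."""
    best = np.inf  # np.inf means "accept nothing" is the only safe choice
    for t in np.unique(conf):
        accepted = conf >= t
        n = int(accepted.sum())
        k = int((~correct[accepted]).sum())
        if n > 0 and error_upper_bound(k, n, delta) <= risk:
            best = min(best, float(t))
    return best
```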
arXiv Detail & Related papers (2025-06-25T07:04:49Z)
- Principled Input-Output-Conditioned Post-Hoc Uncertainty Estimation for Regression Networks [1.4671424999873808]
Uncertainty is critical in safety-sensitive applications but is often omitted from off-the-shelf neural networks due to adverse effects on predictive performance.
We propose a theoretically grounded framework for post-hoc uncertainty estimation in regression tasks by fitting an auxiliary model to both original inputs and frozen model outputs.
arXiv Detail & Related papers (2025-06-01T09:13:27Z)
- Suitability Filter: A Statistical Framework for Classifier Evaluation in Real-World Deployment Settings [33.080398349395686]
We propose a novel framework designed to detect performance deterioration by utilizing suitability signals.
We aggregate suitability signals for both test and user data and compare these empirical distributions.
This enables proactive mitigation of potential failures in high-stakes applications.
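Comparing the empirical distributions of a suitability signal on held-out test data versus incoming user data can be done with a classical two-sample test; the sketch below uses a Kolmogorov-Smirnov test, and the choice of signal (max-softmax confidence) is an assumption for illustration.

```python
import numpy as np
from scipy.stats import ks_2samp

# Suitability signals (e.g., max-softmax confidence) on held-out test
# data versus incoming user data; both samples here are synthetic.
rng = np.random.default_rng(1)
test_signal = rng.beta(8, 2, size=1000)
user_signal = rng.beta(6, 3, size=500)

# A significant distribution shift flags possible performance
# deterioration before failures occur.
result = ks_2samp(test_signal, user_signal)
print(f"KS statistic={result.statistic:.3f}, p-value={result.pvalue:.4f}")
```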
arXiv Detail & Related papers (2025-05-28T13:37:04Z)
- COPU: Conformal Prediction for Uncertainty Quantification in Natural Language Generation [14.461333001997449]
Uncertainty Quantification (UQ) for Natural Language Generation (NLG) is crucial for assessing the performance of Large Language Models (LLMs).
We propose COPU, a method that explicitly adds the ground truth to the candidate outputs and uses logit scores to measure nonconformity.
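The summary points at the standard split-conformal recipe: record the ground-truth answer's nonconformity on calibration questions, take a corrected quantile, and keep test candidates under it. A minimal sketch, with negative log-probability as an assumed stand-in for the paper's logit score:

```python
import numpy as np

def conformal_quantile(cal_scores: np.ndarray, alpha: float) -> float:
    """Split-conformal threshold: the ceil((n + 1)(1 - alpha)) / n
    empirical quantile of the calibration nonconformity scores."""
    n = len(cal_scores)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return float(np.quantile(cal_scores, level, method="higher"))

# Calibration: for each question, add the ground-truth answer to the
# candidate pool and record its nonconformity (values illustrative).
cal_scores = np.array([2.1, 0.7, 3.4, 1.2, 2.8])
threshold = conformal_quantile(cal_scores, alpha=0.2)

# Test time: keep every candidate whose score is within the threshold.
candidate_scores = np.array([0.9, 2.5, 4.0])
print(threshold, candidate_scores <= threshold)
```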
arXiv Detail & Related papers (2025-02-18T07:25:12Z)
- Conformal Generative Modeling with Improved Sample Efficiency through Sequential Greedy Filtering [55.15192437680943]
Generative models lack rigorous statistical guarantees for their outputs.
We propose a sequential conformal prediction method producing prediction sets that satisfy a rigorous statistical guarantee.
This guarantee states that with high probability, the prediction sets contain at least one admissible (or valid) example.
arXiv Detail & Related papers (2024-10-02T15:26:52Z)
- Identifying and Mitigating Social Bias Knowledge in Language Models [52.52955281662332]
We propose a novel debiasing approach, Fairness Stamp (FAST), which enables fine-grained calibration of individual social biases.
FAST surpasses state-of-the-art baselines with superior debiasing performance.
This highlights the potential of fine-grained debiasing strategies to achieve fairness in large language models.
arXiv Detail & Related papers (2024-08-07T17:14:58Z)
- Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks.
We instruct an LLM to self-evaluate its answers.
We benchmark a range of scoring methods based on self-evaluation.
arXiv Detail & Related papers (2023-12-14T19:09:22Z)
- Improving Adaptive Conformal Prediction Using Self-Supervised Learning [72.2614468437919]
We train an auxiliary model with a self-supervised pretext task on top of an existing predictive model and use the self-supervised error as an additional feature to estimate nonconformity scores.
We empirically demonstrate the benefit of the additional information using both synthetic and real data on the efficiency (width), deficit, and excess of conformal prediction intervals.
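One common way to fold such an auxiliary signal into conformal prediction is a normalized residual score, where an error-scale model takes the self-supervised error as an input feature; the functional form below is an assumption, not necessarily the authors' exact estimator.

```python
import numpy as np

def normalized_score(y, y_hat, scale_hat, eps=1e-8):
    """Normalized residual: an auxiliary model predicts the main
    model's error scale from features including the self-supervised
    error, so intervals widen where that signal indicates difficulty."""
    return np.abs(y - y_hat) / (scale_hat + eps)

# Split-conformal quantile of calibration scores, then an interval
# y_hat +/- q * scale_hat at a test point (all numbers illustrative).
cal = normalized_score(np.array([1.0, 2.0, 0.5]),
                       np.array([1.2, 1.5, 0.9]),
                       np.array([0.3, 0.6, 0.4]))
q = float(np.quantile(cal, 0.9, method="higher"))
y_hat_test, scale_test = 2.0, 0.5
print(y_hat_test - q * scale_test, y_hat_test + q * scale_test)
```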
arXiv Detail & Related papers (2023-02-23T18:57:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.