Is That Your Final Answer? Test-Time Scaling Improves Selective Question Answering
- URL: http://arxiv.org/abs/2502.13962v2
- Date: Fri, 18 Jul 2025 01:01:54 GMT
- Title: Is That Your Final Answer? Test-Time Scaling Improves Selective Question Answering
- Authors: William Jurayj, Jeffrey Cheng, Benjamin Van Durme
- Abstract summary: We show that increasing compute budget at inference time helps models answer more questions correctly. We then extend the current paradigm of zero-risk responses during evaluation by considering settings with non-zero levels of response risk.
- Score: 33.2921120857455
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scaling the test-time compute of large language models has demonstrated impressive performance on reasoning benchmarks. However, existing evaluations of test-time scaling make the strong assumption that a reasoning system should always give an answer to any question provided. This overlooks concerns about whether a model is confident in its answer, and whether it is appropriate to always provide a response. To address these concerns, we extract confidence scores during reasoning for thresholding model responses. We find that increasing compute budget at inference time not only helps models answer more questions correctly, but also increases confidence in correct responses. We then extend the current paradigm of zero-risk responses during evaluation by considering settings with non-zero levels of response risk, and suggest a recipe for reporting evaluations under these settings.
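As a rough illustration of the selective-answering setup described in the abstract, the sketch below thresholds per-question confidence scores and scores responses under a non-zero response risk. The function name, the `wrong_penalty` parameter, and the exact scoring rule are illustrative assumptions, not the paper's reported recipe.

```python
import numpy as np

def selective_qa_metrics(confidences, correct, threshold, wrong_penalty=1.0):
    """Answer only when confidence clears the threshold; abstain otherwise.

    confidences:   per-question confidence scores in [0, 1]
    correct:       boolean array, whether the model's answer was correct
    wrong_penalty: cost of an incorrect answer under non-zero response risk
                   (an illustrative parameter, not the paper's exact setting)
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    answered = confidences >= threshold

    coverage = answered.mean()  # fraction of questions the model chooses to answer
    accuracy = correct[answered].mean() if answered.any() else float("nan")
    # Risk-adjusted score: +1 for correct, -wrong_penalty for incorrect, 0 for abstaining.
    reward = np.where(correct, 1.0, -wrong_penalty)
    score = (reward * answered).mean()
    return {"coverage": coverage, "accuracy": accuracy, "risk_adjusted_score": score}

# Sweeping the threshold traces out the coverage/accuracy trade-off.
conf = np.array([0.95, 0.40, 0.80, 0.65, 0.99])
corr = np.array([True, False, True, False, True])
print(selective_qa_metrics(conf, corr, threshold=0.7))
```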
Related papers
- Benchmarking Uncertainty Calibration in Large Language Model Long-Form Question Answering [7.1559850008795385]
Large Language Models (LLMs) are commonly used in Question Answering (QA) settings. Existing uncertainty quantification (UQ) approaches remain weakly validated in scientific QA. We introduce the first large-scale benchmark for evaluating UQ metrics in reasoning-demanding QA.
arXiv Detail & Related papers (2026-01-30T20:02:34Z) - Gaming the Answer Matcher: Examining the Impact of Text Manipulation on Automated Judgment [6.104512852467398]
Automated answer matching shows substantial promise as a scalable and aligned alternative to human evaluation. We investigate whether such tactics deceive answer matching models by prompting examinee models to generate verbose responses. Our results show that these manipulations do not increase scores and often reduce them.
arXiv Detail & Related papers (2025-12-22T17:39:13Z) - Finding Answers in Thought Matters: Revisiting Evaluation on Large Language Models with Reasoning [23.867629719024325]
We propose a basic framework: Answer Regeneration. The method uses an additional model inference, providing the prior input and output prefaced by the prompt "Answer:". We show that this extraction-rule-agnostic approach exhibits improved performance and enhanced robustness.
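A minimal sketch of the answer-regeneration idea summarized above, assuming a generic text-completion interface (`model.generate`, the prompt layout, and the token budget are all assumptions, not the paper's exact setup):

```python
def regenerate_answer(model, question, reasoning_output, max_new_tokens=16):
    """Run one extra inference that re-presents the question and the model's own
    reasoning, then asks for just the final answer after an "Answer:" prompt.
    This sidesteps brittle string-matching rules for locating the answer inside
    a long chain of thought. `model.generate` is a stand-in completion API."""
    prompt = f"{question}\n\n{reasoning_output}\n\nAnswer:"
    return model.generate(prompt, max_new_tokens=max_new_tokens).strip()
```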
arXiv Detail & Related papers (2025-10-16T15:09:22Z) - Reliable and Efficient Amortized Model-based Evaluation [57.6469531082784]
The average score across a wide range of benchmarks provides a signal that helps guide the use of language models in practice.
A popular attempt to lower the cost is to compute the average score on a subset of the benchmark.
This approach often yields an unreliable measure of LM performance because the average score is confounded with the difficulty of the questions in the benchmark subset.
We train a model that predicts question difficulty from its content, enabling a reliable measurement at a fraction of the cost.
arXiv Detail & Related papers (2025-03-17T16:15:02Z) - Variability Need Not Imply Error: The Case of Adequate but Semantically Distinct Responses [7.581259361859477]
Uncertainty quantification tools can be used to reject a response when the model is 'uncertain'. We estimate the Probability the model assigns to Adequate Responses (PROBAR). We find PROBAR to outperform semantic entropy across prompts with varying degrees of ambiguity/open-endedness.
arXiv Detail & Related papers (2024-12-20T09:02:26Z) - DiverseAgentEntropy: Quantifying Black-Box LLM Uncertainty through Diverse Perspectives and Multi-Agent Interaction [53.803276766404494]
Existing methods, which gauge a model's uncertainty through evaluating self-consistency in responses to the original query, do not always capture true uncertainty. We propose a novel method, DiverseAgentEntropy, for evaluating a model's uncertainty using multi-agent interaction. Our method offers a more accurate prediction of the model's reliability and further detects hallucinations, outperforming other self-consistency-based methods.
arXiv Detail & Related papers (2024-12-12T18:52:40Z) - Contextualized Evaluations: Taking the Guesswork Out of Language Model Evaluations [85.81295563405433]
Language model users often issue queries that lack specification, leaving the context in which a query was issued implicit.
We present contextualized evaluations, a protocol that synthetically constructs context surrounding an under-specified query and provides it during evaluation.
We find that the presence of context can 1) alter conclusions drawn from evaluation, even flipping win rates between model pairs, 2) nudge evaluators to make fewer judgments based on surface-level criteria, like style, and 3) provide new insights about model behavior across diverse contexts.
arXiv Detail & Related papers (2024-11-11T18:58:38Z) - Controlling Risk of Retrieval-augmented Generation: A Counterfactual Prompting Framework [77.45983464131977]
We focus on how likely it is that a RAG model's prediction is incorrect, resulting in uncontrollable risks in real-world applications. Our research identifies two critical latent factors affecting RAG's confidence in its predictions. We develop a counterfactual prompting framework that induces the models to alter these factors and analyzes the effect on their answers.
arXiv Detail & Related papers (2024-09-24T14:52:14Z) - "My Answer is C": First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models [40.867655189493924]
The open-ended nature of language generation makes evaluation of large language models (LLMs) challenging.
One common evaluation approach uses multiple-choice questions (MCQ) to limit the response space.
We evaluate how aligned first-token evaluation is with the text output along several dimensions.
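The sketch below shows the first-token evaluation style this paper examines, assuming a Hugging Face-style causal LM and single-token option letters (both assumptions; some tokenizers split option letters differently):

```python
import torch

def first_token_choice(model, tokenizer, prompt, options=("A", "B", "C", "D")):
    """Pick the MCQ option whose letter receives the highest next-token probability.
    This choice can disagree with the answer the model would actually write out as
    text, which is the mismatch the paper measures."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # logits for the next (first generated) token
    option_ids = [tokenizer(o, add_special_tokens=False).input_ids[0] for o in options]
    probs = torch.softmax(logits, dim=-1)[option_ids]
    return options[int(probs.argmax())]
```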
arXiv Detail & Related papers (2024-02-22T12:47:33Z) - Uncertainty-aware Language Modeling for Selective Question Answering [107.47864420630923]
We present an automatic large language model (LLM) conversion approach that produces uncertainty-aware LLMs.
Our approach is model- and data-agnostic, is computationally efficient, and does not rely on external models or systems.
arXiv Detail & Related papers (2023-11-26T22:47:54Z) - Realistic Conversational Question Answering with Answer Selection based on Calibrated Confidence and Uncertainty Measurement [54.55643652781891]
Conversational Question Answering (ConvQA) models aim to answer a question using its relevant paragraph together with the question-answer pairs from earlier turns of the conversation.
We propose to filter out inaccurate answers in the conversation history based on their estimated confidences and uncertainties from the ConvQA model.
We validate our model, Answer Selection-based realistic Conversational Question Answering, on two standard ConvQA datasets.
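A toy version of the history-filtering step described above (the tuple layout and the threshold value are illustrative assumptions, not the paper's calibration procedure):

```python
def filter_history(history, conf_threshold=0.8):
    """Keep only past (question, answer, confidence) turns whose confidence clears
    the threshold, so unreliable earlier answers are not fed back as context for
    answering the next question."""
    return [(q, a) for q, a, conf in history if conf >= conf_threshold]
```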
arXiv Detail & Related papers (2023-02-10T09:42:07Z) - Generative Context Pair Selection for Multi-hop Question Answering [60.74354009152721]
We propose a generative context selection model for multi-hop question answering.
Our proposed generative passage selection model achieves better performance (4.9% higher than the baseline) on an adversarial held-out set.
arXiv Detail & Related papers (2021-04-18T07:00:48Z)