Related papers: When and What to Ask: AskBench and Rubric-Guided RLVR for LLM Clarification

When and What to Ask: AskBench and Rubric-Guided RLVR for LLM Clarification

URL: http://arxiv.org/abs/2602.11199v1
Date: Wed, 04 Feb 2026 02:21:01 GMT
Title: When and What to Ask: AskBench and Rubric-Guided RLVR for LLM Clarification
Authors: Jiale Zhao, Ke Fang, Lu Cheng,
Abstract summary: Large language models (LLMs) often respond even when prompts omit critical details or include misleading information.<n>We study how to evaluate and improve LLMs' ability to decide when and what to ask for clarification without sacrificing task performance.<n>We introduce AskBench, an interactive benchmark that converts standard QA pairs into multi-turn interactions with explicit checkpoints.
Score: 8.391356566325054
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) often respond even when prompts omit critical details or include misleading information, leading to hallucinations or reinforced misconceptions. We study how to evaluate and improve LLMs' ability to decide when and what to ask for clarification without sacrificing task performance. We introduce AskBench, an interactive benchmark that converts standard QA pairs into multi-turn interactions with explicit checkpoints. A unified judge loop evaluates final answers and simulates user responses as needed. AskBench covers two settings: AskMind, with intent-deficient queries requiring clarification, and AskOverconfidence, with queries containing false premises that must be identified and corrected. We further propose rubric-guided reinforcement learning with verifier-based rewards (RLVR), which uses structured rubrics to encourage targeted clarification. Experiments show consistent improvements in accuracy, rubric adherence, and interaction efficiency, with strong generalization to unseen domains.

Related papers

Inferential Question Answering [67.54465021408724]
We introduce Inferential QA -- a new task that challenges models to infer answers from answer-supporting passages which provide only clues.<n>To study this problem, we construct QUIT (QUestions requiring Inference from Texts) dataset, comprising 7,401 questions and 2.4M passages.<n>We show that methods effective on traditional QA tasks struggle in inferential QA: retrievers underperform, rerankers offer limited gains, and fine-tuning provides inconsistent improvements.
arXiv Detail & Related papers (2026-02-01T14:02:43Z)
Pardon? Evaluating Conversational Repair in Large Audio-Language Models [15.682992943165994]
We introduce a repair-aware evaluation setting that distinguishes between answerable and unanswerable audio inputs.<n>We propose the Evaluability Awareness and Repair (EAR) score, a non-compensatory metric that jointly evaluates task competence under answerable conditions and repair behavior under unanswerable conditions.
arXiv Detail & Related papers (2026-01-19T11:36:27Z)
CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward [50.97588334916863]
We develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward.<n>It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types.<n>We introduce VerifierBench benchmark comprising model outputs collected from multiple data sources, augmented through manual analysis of metaerror patterns to enhance CompassVerifier.
arXiv Detail & Related papers (2025-08-05T17:55:24Z)
Clarifying Ambiguities: on the Role of Ambiguity Types in Prompting Methods for Clarification Generation [5.259846811078731]
We focus on the concept of ambiguity for clarification, seeking to model and integrate ambiguities in the clarification process.<n>We name this new prompting scheme Ambiguity Type-Chain of Thought (AT-CoT)
arXiv Detail & Related papers (2025-04-16T14:21:02Z)
CLEAR-KGQA: Clarification-Enhanced Ambiguity Resolution for Knowledge Graph Question Answering [13.624962763072899]
KGQA systems typically assume user queries are unambiguous, which is an assumption that rarely holds in real-world applications.<n>We propose a novel framework that dynamically handles both entity ambiguity (e.g., distinguishing between entities with similar names) and intent ambiguity (e.g., clarifying different interpretations of user queries) through interactive clarification.
arXiv Detail & Related papers (2025-04-13T17:34:35Z)
Fostering Appropriate Reliance on Large Language Models: The Role of Explanations, Sources, and Inconsistencies [66.30619782227173]
Large language models (LLMs) can produce erroneous responses that sound fluent and convincing.<n>We identify several features of LLM responses that shape users' reliance.<n>We find that explanations increase reliance on both correct and incorrect responses.<n>We observe less reliance on incorrect responses when sources are provided or when explanations exhibit inconsistencies.
arXiv Detail & Related papers (2025-02-12T16:35:41Z)
Trust but Verify: Programmatic VLM Evaluation in the Wild [62.14071929143684]
Programmatic VLM Evaluation (PROVE) is a new benchmarking paradigm for evaluating VLM responses to open-ended queries. We benchmark the helpfulness-truthfulness trade-offs of a range ofVLMs on PROVE, finding that very few are in-fact able to achieve a good balance between the two.
arXiv Detail & Related papers (2024-10-17T01:19:18Z)
Claim Check-Worthiness Detection: How Well do LLMs Grasp Annotation Guidelines? [0.0]
We use zero- and few-shot LLM prompting to identify text segments requiring fact-checking. We evaluate the LLMs' predictive and calibration accuracy on five CD/CW datasets from diverse domains. Our results show that optimal prompt verbosity is domain-dependent, adding context does not improve performance.
arXiv Detail & Related papers (2024-04-18T13:31:05Z)
FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation [92.43001160060376]
We study the factuality of large language models (LLMs) in the context of answering questions that test current world knowledge. We introduce FreshQA, a novel dynamic QA benchmark encompassing a diverse range of question and answer types. We benchmark a diverse array of both closed and open-source LLMs under a two-mode evaluation procedure that allows us to measure both correctness and hallucination. Motivated by these results, we present FreshPrompt, a simple few-shot prompting method that substantially boosts the performance of an LLM on FreshQA.
arXiv Detail & Related papers (2023-10-05T00:04:12Z)
Retrospective Reader for Machine Reading Comprehension [90.6069071495214]
Machine reading comprehension (MRC) is an AI challenge that requires machine to determine the correct answers to questions based on a given passage. When unanswerable questions are involved in the MRC task, an essential verification module called verifier is especially required in addition to the encoder. This paper devotes itself to exploring better verifier design for the MRC task with unanswerable questions.
arXiv Detail & Related papers (2020-01-27T11:14:34Z)

This list is automatically generated from the titles and abstracts of the papers in this site.