CondAmbigQA: A Benchmark and Dataset for Conditional Ambiguous Question Answering
- URL: http://arxiv.org/abs/2502.01523v2
- Date: Wed, 10 Sep 2025 18:27:02 GMT
- Title: CondAmbigQA: A Benchmark and Dataset for Conditional Ambiguous Question Answering
- Authors: Zongxi Li, Yang Li, Haoran Xie, S. Joe Qin
- Abstract summary: Conditional Ambiguous Question-Answering (CondAmbigQA) is a benchmark comprising 2,000 ambiguous queries and condition-aware evaluation metrics. Experiments demonstrate that models considering conditions before answering improve answer accuracy by 11.75%, with an additional 7.15% gain when conditions are explicitly provided.
- Score: 9.50840225852638
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Users often assume that large language models (LLMs) share their cognitive alignment of context and intent, leading them to omit critical information in question-answering (QA) and produce ambiguous queries. Responses based on misaligned assumptions may be perceived as hallucinations. Therefore, identifying possible implicit assumptions is crucial in QA. To address this fundamental challenge, we propose Conditional Ambiguous Question-Answering (CondAmbigQA), a benchmark comprising 2,000 ambiguous queries and condition-aware evaluation metrics. Our study pioneers "conditions" as explicit contextual constraints that resolve ambiguities in QA tasks through retrieval-based annotation, where retrieved Wikipedia fragments help identify possible interpretations for a given query and annotate answers accordingly. Experiments demonstrate that models considering conditions before answering improve answer accuracy by 11.75%, with an additional 7.15% gain when conditions are explicitly provided. These results highlight that apparent hallucinations may stem from inherent query ambiguity rather than model failure, and demonstrate the effectiveness of condition reasoning in QA, providing researchers with tools for rigorous evaluation.
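The abstract describes condition-aware evaluation: each gold interpretation of an ambiguous query is a (condition, answer) pair, and a model should recover both. The sketch below is a hypothetical illustration of such a metric, not the paper's actual implementation; the token-F1 matching rule and the 0.5 threshold are assumptions.

```python
def token_f1(pred: str, gold: str) -> float:
    """Bag-of-words F1 between two strings (a common QA matching heuristic)."""
    p, g = pred.lower().split(), gold.lower().split()
    common, remaining = 0, list(g)
    for tok in p:
        if tok in remaining:
            remaining.remove(tok)
            common += 1
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)

def condition_aware_accuracy(predictions, golds, threshold=0.5):
    """Fraction of gold (condition, answer) pairs recovered by some prediction.

    predictions, golds: lists of (condition, answer) string tuples.
    A gold pair counts as matched only when one prediction clears the
    F1 threshold on BOTH its condition and its answer.
    """
    matched = 0
    for g_cond, g_ans in golds:
        if any(token_f1(p_cond, g_cond) >= threshold and
               token_f1(p_ans, g_ans) >= threshold
               for p_cond, p_ans in predictions):
            matched += 1
    return matched / len(golds) if golds else 0.0
```

Under this rule, a model that resolves only one of two interpretations of "Who directed the Batman movie?" scores 0.5, making the cost of ignoring ambiguity explicit.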
Related papers
- Inferential Question Answering [67.54465021408724]
We introduce Inferential QA -- a new task that challenges models to infer answers from answer-supporting passages which provide only clues. To study this problem, we construct the QUIT (QUestions requiring Inference from Texts) dataset, comprising 7,401 questions and 2.4M passages. We show that methods effective on traditional QA tasks struggle in inferential QA: retrievers underperform, rerankers offer limited gains, and fine-tuning provides inconsistent improvements.
arXiv Detail & Related papers (2026-02-01T14:02:43Z) - Benchmarking Uncertainty Calibration in Large Language Model Long-Form Question Answering [7.1559850008795385]
Large Language Models (LLMs) are commonly used in Question Answering (QA) settings. Existing uncertainty quantification (UQ) approaches remain weakly validated in scientific QA. We introduce the first large-scale benchmark for evaluating UQ metrics in reasoning-demanding QA.
arXiv Detail & Related papers (2026-01-30T20:02:34Z) - MinosEval: Distinguishing Factoid and Non-Factoid for Tailored Open-Ended QA Evaluation with LLMs [15.278241998033822]
Open-ended question answering (QA) is a key task for evaluating the capabilities of large language models (LLMs). We propose MinosEval, a novel evaluation method that first distinguishes factoid from non-factoid open-ended questions and then ranks candidate answers.
arXiv Detail & Related papers (2025-06-18T07:49:13Z) - ConSens: Assessing context grounding in open-book question answering [0.0]
Large Language Models (LLMs) have demonstrated considerable success in open-book question answering (QA). A critical challenge in open-book QA is to ensure that model responses are based on the provided context rather than the model's parametric knowledge. We propose a novel metric that contrasts the perplexity of the model response under two conditions. The resulting score quantifies the extent to which the model's answer relies on the provided context.
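The perplexity-contrast idea can be sketched concretely. The exact ConSens formula may differ; the log-ratio below, and the function names, are illustrative assumptions that only capture the intuition of comparing the same response with and without the context.

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities of a response."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def context_grounding_score(logprobs_with_context, logprobs_without_context):
    """Positive when the context makes the response more likely (lower
    perplexity), suggesting the answer is grounded in the context rather
    than in parametric knowledge; near zero when context makes no difference."""
    ppl_with = perplexity(logprobs_with_context)
    ppl_without = perplexity(logprobs_without_context)
    return math.log(ppl_without / ppl_with)
```

For example, a response whose tokens average log-probability -0.1 with the context but -1.0 without it gets a score of 0.9, indicating strong reliance on the context.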
arXiv Detail & Related papers (2025-04-30T16:23:15Z) - CLEAR-KGQA: Clarification-Enhanced Ambiguity Resolution for Knowledge Graph Question Answering [13.624962763072899]
KGQA systems typically assume user queries are unambiguous, which is an assumption that rarely holds in real-world applications.
We propose a novel framework that dynamically handles both entity ambiguity (e.g., distinguishing between entities with similar names) and intent ambiguity (e.g., clarifying different interpretations of user queries) through interactive clarification.
arXiv Detail & Related papers (2025-04-13T17:34:35Z) - Wrong Answers Can Also Be Useful: PlausibleQA -- A Large-Scale QA Dataset with Answer Plausibility Scores [16.434748534272014]
PlausibleQA is a dataset of 10,000 questions and 100,000 candidate answers annotated with plausibility scores and justifications. We show that plausibility-aware approaches are effective for Multiple-Choice Question Answering (MCQA) and QA Robustness Assessment (QARA).
arXiv Detail & Related papers (2025-02-22T21:14:18Z) - Contextualized Evaluations: Taking the Guesswork Out of Language Model Evaluations [85.81295563405433]
Language model users often issue queries that lack specification, where the context under which a query was issued is not explicit.
We present contextualized evaluations, a protocol that synthetically constructs context surrounding an under-specified query and provides it during evaluation.
We find that the presence of context can 1) alter conclusions drawn from evaluation, even flipping win rates between model pairs, 2) nudge evaluators to make fewer judgments based on surface-level criteria, like style, and 3) provide new insights about model behavior across diverse contexts.
arXiv Detail & Related papers (2024-11-11T18:58:38Z) - QUDSELECT: Selective Decoding for Questions Under Discussion Parsing [90.92351108691014]
Question Under Discussion (QUD) is a discourse framework that uses implicit questions to reveal discourse relationships between sentences.
We introduce QUDSELECT, a joint-training framework that selectively decodes the QUD dependency structures considering the QUD criteria.
Our method outperforms the state-of-the-art baseline models by 9% in human evaluation and 4% in automatic evaluation.
arXiv Detail & Related papers (2024-08-02T06:46:08Z) - Detecting Multimodal Situations with Insufficient Context and Abstaining from Baseless Predictions [75.45274978665684]
Vision-Language Understanding (VLU) benchmarks contain samples where answers rely on assumptions unsupported by the provided context. We collect contextual data for each sample whenever available and train a context selection module to facilitate evidence-based model predictions. We develop a general-purpose Context-AwaRe Abstention detector to identify samples lacking sufficient context and enhance model accuracy.
arXiv Detail & Related papers (2024-05-18T02:21:32Z) - S-EQA: Tackling Situational Queries in Embodied Question Answering [48.43453390717167]
We present and tackle the problem of Embodied Question Answering with Situational Queries (S-EQA) in a household environment.
We first introduce a novel Prompt-Generate-Evaluate scheme that wraps around an LLM's output to create a dataset of unique situational queries and corresponding consensus object information.
We report a 15.31% accuracy improvement when using queries framed from the generated object consensus for Visual Question Answering (VQA), compared to directly answering situational queries.
arXiv Detail & Related papers (2024-05-08T00:45:20Z) - Aligning Language Models to Explicitly Handle Ambiguity [22.078095273053506]
We propose Alignment with Perceived Ambiguity (APA), a novel pipeline that aligns language models to deal with ambiguous queries.
Experimental results on question-answering datasets demonstrate that APA empowers LLMs to explicitly detect and manage ambiguous queries.
Our findings show that APA outperforms training with gold-standard labels, especially in out-of-distribution scenarios.
arXiv Detail & Related papers (2024-04-18T07:59:53Z) - Clarify When Necessary: Resolving Ambiguity Through Interaction with LMs [58.620269228776294]
We propose a task-agnostic framework for resolving ambiguity by asking users clarifying questions.
We evaluate systems across three NLP applications: question answering, machine translation and natural language inference.
We find that intent-sim is robust, demonstrating improvements across a wide range of NLP tasks and LMs.
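One simple way to operationalize "clarify when necessary" is to sample several intent interpretations of the user's input and ask a clarifying question only when the samples disagree. This sketch is a hedged illustration of that decision rule; the sampling source, the entropy estimator, and the threshold are assumptions, and the paper's intent-sim method may differ.

```python
import math
from collections import Counter

def intent_entropy(sampled_intents):
    """Shannon entropy (in nats) of the empirical intent distribution."""
    counts = Counter(sampled_intents)
    n = len(sampled_intents)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def should_clarify(sampled_intents, threshold=0.5):
    """Ask a clarifying question only when intent uncertainty is high."""
    return intent_entropy(sampled_intents) > threshold
```

When all sampled interpretations agree, entropy is zero and the system answers directly; an even split between two intents yields entropy of about 0.69 nats and triggers a clarifying question.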
arXiv Detail & Related papers (2023-11-16T00:18:50Z) - SQUARE: Automatic Question Answering Evaluation using Multiple Positive and Negative References [73.67707138779245]
We propose a new evaluation metric: SQuArE (Sentence-level QUestion AnsweRing Evaluation)
We evaluate SQuArE on both sentence-level extractive (Answer Selection) and generative (GenQA) QA systems.
arXiv Detail & Related papers (2023-09-21T16:51:30Z) - Answering Ambiguous Questions via Iterative Prompting [84.3426020642704]
In open-domain question answering, due to the ambiguity of questions, multiple plausible answers may exist.
One approach is to directly predict all valid answers, but this can struggle with balancing relevance and diversity.
We present AmbigPrompt to address the imperfections of existing approaches to answering ambiguous questions.
arXiv Detail & Related papers (2023-07-08T04:32:17Z) - Mastering the ABCDs of Complex Questions: Answer-Based Claim Decomposition for Fine-grained Self-Evaluation [9.776667356119352]
We propose answer-based claim decomposition (ABCD), a prompting strategy that decomposes questions into true/false claims.
Using the decomposed ABCD claims, we perform fine-grained self-evaluation.
We find that GPT-3.5 has some ability to determine to what extent its answer satisfies the criteria of the input question.
arXiv Detail & Related papers (2023-05-24T05:53:11Z) - Realistic Conversational Question Answering with Answer Selection based on Calibrated Confidence and Uncertainty Measurement [54.55643652781891]
Conversational Question Answering (ConvQA) models aim at answering a question with its relevant paragraph and the previous question-answer pairs accumulated over multiple turns of the conversation.
We propose to filter out inaccurate answers in the conversation history based on their estimated confidences and uncertainties from the ConvQA model.
We validate our models, Answer Selection-based realistic Conversation Question Answering, on two standard ConvQA datasets.
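The filtering idea can be sketched as a simple gate over the conversation history: drop turns whose answers the model itself is unsure about, so later turns do not condition on likely-wrong answers. The field names and the thresholding rule below are illustrative assumptions, not the paper's exact selection mechanism.

```python
def filter_history(history, min_confidence=0.7, max_uncertainty=0.3):
    """Keep only conversation-history turns whose answers the ConvQA model
    rated with high confidence and low uncertainty.

    history: list of dicts with at least 'answer', 'confidence',
    and 'uncertainty' fields (hypothetical schema).
    """
    return [
        turn for turn in history
        if turn["confidence"] >= min_confidence
        and turn["uncertainty"] <= max_uncertainty
    ]
```

A turn with confidence 0.5 or uncertainty 0.6 would be excluded from the context of subsequent questions, while a confident, low-uncertainty turn is retained.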
arXiv Detail & Related papers (2023-02-10T09:42:07Z) - ASQA: Factoid Questions Meet Long-Form Answers [35.11889930792675]
This work focuses on factoid questions that are ambiguous, that is, have different correct answers depending on interpretation.
Answers to ambiguous questions should synthesize factual information from multiple sources into a long-form summary.
We use this notion of correctness to define an automated metric of performance for ASQA.
arXiv Detail & Related papers (2022-04-12T21:58:44Z) - Asking and Answering Questions to Evaluate the Factual Consistency of Summaries [80.65186293015135]
We propose an automatic evaluation protocol called QAGS (pronounced "kags") to identify factual inconsistencies in a generated summary.
QAGS is based on the intuition that if we ask questions about a summary and its source, we will receive similar answers if the summary is factually consistent with the source.
We believe QAGS is a promising tool in automatically generating usable and factually consistent text.
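The QAGS intuition can be sketched as comparing answers to the same questions asked of the summary and of the source. The real QAGS pipeline uses learned question-generation and QA models; here the answers are taken as given and compared with token-level F1, and the function names and aggregation are assumptions.

```python
def token_f1(pred: str, gold: str) -> float:
    """Bag-of-words F1 between two answer strings."""
    p, g = pred.lower().split(), gold.lower().split()
    common, remaining = 0, list(g)
    for tok in p:
        if tok in remaining:
            remaining.remove(tok)
            common += 1
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)

def qags_style_consistency(summary_answers, source_answers):
    """Mean token-F1 over aligned answer pairs; higher means the summary's
    answers agree with the source's, i.e. the summary is more factually
    consistent with the source."""
    pairs = list(zip(summary_answers, source_answers))
    if not pairs:
        return 0.0
    return sum(token_f1(a, b) for a, b in pairs) / len(pairs)
```

If the summary and source give identical answers to one question but contradictory answers to another, the score lands at 0.5, flagging a partial inconsistency.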
arXiv Detail & Related papers (2020-04-08T20:01:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.