Related papers: CondAmbigQA: A Benchmark and Dataset for Conditional Ambiguous Question Answering

CondAmbigQA: A Benchmark and Dataset for Conditional Ambiguous Question Answering

URL: http://arxiv.org/abs/2502.01523v1
Date: Mon, 03 Feb 2025 17:01:51 GMT
Title: CondAmbigQA: A Benchmark and Dataset for Conditional Ambiguous Question Answering
Authors: Zongxi Li, Yang Li, Haoran Xie, S. Joe Qin,
Abstract summary: Large language models (LLMs) are prone to hallucinations in question-answering (QA) tasks when faced with ambiguous questions.<n>We introduce Conditional Ambiguous Question-Answering (CondAmbigQA), a benchmark with 200 ambiguous queries.<n>Our study pioneers the concept of conditions'' in ambiguous QA tasks, where conditions stand for contextual constraints or assumptions that resolve ambiguities.
Score: 6.297950040983263
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) are prone to hallucinations in question-answering (QA) tasks when faced with ambiguous questions. Users often assume that LLMs share their cognitive alignment, a mutual understanding of context, intent, and implicit details, leading them to omit critical information in the queries. However, LLMs generate responses based on assumptions that can misalign with user intent, which may be perceived as hallucinations if they misalign with the user's intent. Therefore, identifying those implicit assumptions is crucial to resolve ambiguities in QA. Prior work, such as AmbigQA, reduces ambiguity in queries via human-annotated clarifications, which is not feasible in real application. Meanwhile, ASQA compiles AmbigQA's short answers into long-form responses but inherits human biases and fails capture explicit logical distinctions that differentiates the answers. We introduce Conditional Ambiguous Question-Answering (CondAmbigQA), a benchmark with 200 ambiguous queries and condition-aware evaluation metrics. Our study pioneers the concept of ``conditions'' in ambiguous QA tasks, where conditions stand for contextual constraints or assumptions that resolve ambiguities. The retrieval-based annotation strategy uses retrieved Wikipedia fragments to identify possible interpretations for a given query as its conditions and annotate the answers through those conditions. Such a strategy minimizes human bias introduced by different knowledge levels among annotators. By fixing retrieval results, CondAmbigQA evaluates how RAG systems leverage conditions to resolve ambiguities. Experiments show that models considering conditions before answering improve performance by $20\%$, with an additional $5\%$ gain when conditions are explicitly provided. These results underscore the value of conditional reasoning in QA, offering researchers tools to rigorously evaluate ambiguity resolution.

Related papers

MinosEval: Distinguishing Factoid and Non-Factoid for Tailored Open-Ended QA Evaluation with LLMs [15.278241998033822]
Open-ended question answering (QA) is a key task for evaluating the capabilities of large language models (LLMs)<n>We propose textbfMinosEval, a novel evaluation method that first distinguishes open-ended questions and then ranks candidate answers.
arXiv Detail & Related papers (2025-06-18T07:49:13Z)
CLEAR-KGQA: Clarification-Enhanced Ambiguity Resolution for Knowledge Graph Question Answering [13.624962763072899]
KGQA systems typically assume user queries are unambiguous, which is an assumption that rarely holds in real-world applications. We propose a novel framework that dynamically handles both entity ambiguity (e.g., distinguishing between entities with similar names) and intent ambiguity (e.g., clarifying different interpretations of user queries) through interactive clarification.
arXiv Detail & Related papers (2025-04-13T17:34:35Z)
Contextualized Evaluations: Taking the Guesswork Out of Language Model Evaluations [85.81295563405433]
Language model users often issue queries that lack specification, where the context under which a query was issued is not explicit. We present contextualized evaluations, a protocol that synthetically constructs context surrounding an under-specified query and provides it during evaluation. We find that the presence of context can 1) alter conclusions drawn from evaluation, even flipping win rates between model pairs, 2) nudge evaluators to make fewer judgments based on surface-level criteria, like style, and 3) provide new insights about model behavior across diverse contexts.
arXiv Detail & Related papers (2024-11-11T18:58:38Z)
QUDSELECT: Selective Decoding for Questions Under Discussion Parsing [90.92351108691014]
Question Under Discussion (QUD) is a discourse framework that uses implicit questions to reveal discourse relationships between sentences. We introduce QUDSELECT, a joint-training framework that selectively decodes the QUD dependency structures considering the QUD criteria. Our method outperforms the state-of-the-art baseline models by 9% in human evaluation and 4% in automatic evaluation.
arXiv Detail & Related papers (2024-08-02T06:46:08Z)
S-EQA: Tackling Situational Queries in Embodied Question Answering [48.43453390717167]
We present and tackle the problem of Embodied Question Answering with Situational Queries (S-EQA) in a household environment. We first introduce a novel Prompt-Generate-Evaluate scheme that wraps around an LLM's output to create a dataset of unique situational queries and corresponding consensus object information. We report an improved accuracy of 15.31% while using queries framed from the generated object consensus for Visual Question Answering (VQA) over directly answering situational ones.
arXiv Detail & Related papers (2024-05-08T00:45:20Z)
Aligning Language Models to Explicitly Handle Ambiguity [22.078095273053506]
We propose Alignment with Perceived Ambiguity (APA), a novel pipeline that aligns language models to deal with ambiguous queries. Experimental results on question-answering datasets demonstrate that APA empowers LLMs to explicitly detect and manage ambiguous queries. Our finding proves that APA excels beyond training with gold-standard labels, especially in out-of-distribution scenarios.
arXiv Detail & Related papers (2024-04-18T07:59:53Z)
Clarify When Necessary: Resolving Ambiguity Through Interaction with LMs [58.620269228776294]
We propose a task-agnostic framework for resolving ambiguity by asking users clarifying questions. We evaluate systems across three NLP applications: question answering, machine translation and natural language inference. We find that intent-sim is robust, demonstrating improvements across a wide range of NLP tasks and LMs.
arXiv Detail & Related papers (2023-11-16T00:18:50Z)
SQUARE: Automatic Question Answering Evaluation using Multiple Positive and Negative References [73.67707138779245]
We propose a new evaluation metric: SQuArE (Sentence-level QUestion AnsweRing Evaluation) We evaluate SQuArE on both sentence-level extractive (Answer Selection) and generative (GenQA) QA systems.
arXiv Detail & Related papers (2023-09-21T16:51:30Z)
Answering Ambiguous Questions via Iterative Prompting [84.3426020642704]
In open-domain question answering, due to the ambiguity of questions, multiple plausible answers may exist. One approach is to directly predict all valid answers, but this can struggle with balancing relevance and diversity. We present AmbigPrompt to address the imperfections of existing approaches to answering ambiguous questions.
arXiv Detail & Related papers (2023-07-08T04:32:17Z)
ASQA: Factoid Questions Meet Long-Form Answers [35.11889930792675]
This work focuses on factoid questions that are ambiguous, that is, have different correct answers depending on interpretation. Answers to ambiguous questions should synthesize factual information from multiple sources into a long-form summary. We use this notion of correctness to define an automated metric of performance for ASQA.
arXiv Detail & Related papers (2022-04-12T21:58:44Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.