ConSens: Assessing context grounding in open-book question answering
- URL: http://arxiv.org/abs/2505.00065v1
- Date: Wed, 30 Apr 2025 16:23:15 GMT
- Title: ConSens: Assessing context grounding in open-book question answering
- Authors: Ivan Vankov, Matyo Ivanov, Adriana Correia, Victor Botev
- Abstract summary: Large Language Models (LLMs) have demonstrated considerable success in open-book question answering (QA). A critical challenge in open-book QA is to ensure that model responses are based on the provided context rather than on the model's parametric knowledge. We propose a novel metric that contrasts the perplexity of the model response under two conditions. The resulting score quantifies the extent to which the model's answer relies on the provided context.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have demonstrated considerable success in open-book question answering (QA), where the task requires generating answers grounded in a provided external context. A critical challenge in open-book QA is to ensure that model responses are based on the provided context rather than on the model's parametric knowledge, which can be outdated, incomplete, or incorrect. Existing evaluation methods, primarily based on the LLM-as-a-judge approach, face significant limitations, including biases, scalability issues, and dependence on costly external systems. To address these challenges, we propose a novel metric that contrasts the perplexity of the model response under two conditions: when the context is provided and when it is not. The resulting score quantifies the extent to which the model's answer relies on the provided context. The validity of this metric is demonstrated through a series of experiments that show its effectiveness in identifying whether a given answer is grounded in the provided context. Unlike existing approaches, this metric is computationally efficient, interpretable, and adaptable to various use cases, offering a scalable and practical solution to assess context utilization in open-book QA systems.
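The perplexity contrast described in the abstract can be illustrated with a minimal sketch. This is not the paper's exact formula, which the abstract does not state. It assumes the per-token natural-log probabilities of the same answer have already been obtained from a language model twice, once with the context in the prompt and once without it. The function names `perplexity` and `grounding_score` and the log-ratio form of the score are hypothetical choices made for illustration.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity: exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def grounding_score(
    logprobs_with_context: Sequence[float],
    logprobs_without_context: Sequence[float],
) -> float:
    """Hypothetical contrast score (an assumption, not the paper's definition):
    log(PPL without context) - log(PPL with context).

    Positive values mean the answer became more predictable once the context
    was supplied, which suggests the answer relies on the context.
    """
    ppl_with = perplexity(logprobs_with_context)
    ppl_without = perplexity(logprobs_without_context)
    return math.log(ppl_without) - math.log(ppl_with)


# Made-up log-probabilities for a three-token answer (illustration only).
with_ctx = [-0.1, -0.2, -0.3]     # answer is likely when the context is given
without_ctx = [-1.0, -2.0, -3.0]  # answer is unlikely without the context
score = grounding_score(with_ctx, without_ctx)
print(round(score, 6))
```

A score near zero would indicate that the context barely changes how predictable the answer is, while a negative score would indicate that the context makes the answer less likely. Any decision threshold on such a score would have to be chosen empirically.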
Related papers
- Inferential Question Answering [67.54465021408724]
We introduce Inferential QA -- a new task that challenges models to infer answers from answer-supporting passages which provide only clues. To study this problem, we construct the QUIT (QUestions requiring Inference from Texts) dataset, comprising 7,401 questions and 2.4M passages. We show that methods effective on traditional QA tasks struggle in inferential QA: retrievers underperform, rerankers offer limited gains, and fine-tuning provides inconsistent improvements.
arXiv Detail & Related papers (2026-02-01T14:02:43Z)
- ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering [54.72902502486611]
ReAG is a Reasoning-Augmented Multimodal RAG approach that combines coarse- and fine-grained retrieval with a critic model that filters irrelevant passages. ReAG significantly outperforms prior methods, improving answer accuracy and providing interpretable reasoning grounded in retrieved evidence.
arXiv Detail & Related papers (2025-11-27T19:01:02Z)
- BookAsSumQA: An Evaluation Framework for Aspect-Based Book Summarization via Question Answering [2.703301365475554]
BookAsSumQA is a QA-based evaluation framework for aspect-based book summarization. Our experiments show that while LLM-based approaches achieve higher accuracy on shorter texts, RAG-based methods become more effective as document length increases.
arXiv Detail & Related papers (2025-11-09T01:54:53Z)
- Uncertainty as Feature Gaps: Epistemic Uncertainty Quantification of LLMs in Contextual Question-Answering [29.4458902836278]
We introduce a task-agnostic, token-level uncertainty measure defined as the cross-entropy between the predictive distribution of the given model and the unknown true distribution. We derive an upper bound for uncertainty and show that it can be interpreted as semantic feature gaps in the given model's hidden representations. We apply this generic framework to the contextual QA task and hypothesize that three features approximate this gap: context-reliance, context comprehension, and honesty.
arXiv Detail & Related papers (2025-10-03T02:09:25Z)
- Influence Guided Context Selection for Effective Retrieval-Augmented Generation [23.188397777606095]
Retrieval-Augmented Generation (RAG) addresses large language model (LLM) hallucinations by grounding responses in external knowledge. Existing approaches attempt to improve performance through context selection based on predefined context quality assessment metrics. We reconceptualize context quality assessment as an inference-time data valuation problem and introduce the Contextual Influence Value (CI value). This novel metric quantifies context quality by measuring the performance degradation when removing each context from the list.
arXiv Detail & Related papers (2025-09-21T07:19:09Z)
- Federated In-Context Learning: Iterative Refinement for Improved Answer Quality [62.72381208029899]
In-context learning (ICL) enables language models to generate responses without modifying their parameters by leveraging examples provided in the input. We propose Federated In-Context Learning (Fed-ICL), a general framework that enhances ICL through an iterative, collaborative process. Fed-ICL progressively refines responses by leveraging multi-round interactions between clients and a central server, improving answer quality without the need to transmit model parameters.
arXiv Detail & Related papers (2025-06-09T05:33:28Z)
- Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering [78.89231943329885]
One of the most widely used tasks to evaluate Large Language Models (LLMs) is Multiple-Choice Question Answering (MCQA). In this work, we shed light on the inconsistencies of MCQA evaluation strategies, which can lead to inaccurate and misleading model comparisons.
arXiv Detail & Related papers (2025-03-19T08:45:03Z)
- Uncertainty Quantification in Retrieval Augmented Question Answering [57.05827081638329]
We propose to quantify the uncertainty of a QA model via estimating the utility of the passages it is provided with. We train a lightweight neural model to predict passage utility for a target QA model and show that while simple information theoretic metrics can predict answer correctness up to a certain extent, our approach efficiently approximates or outperforms more expensive sampling-based methods.
arXiv Detail & Related papers (2025-02-25T11:24:52Z)
- Context Filtering with Reward Modeling in Question Answering [7.668954669688971]
We introduce a context filtering approach that removes non-essential details, summarizing crucial content through Reward Modeling. We show that our approach can significantly outperform the baseline, as evidenced by a 6.8-fold increase in the EM Per Token (EPT) metric.
arXiv Detail & Related papers (2024-12-16T12:29:24Z)
- Enhancing Robustness of Retrieval-Augmented Language Models with In-Context Learning [5.053086684547045]
This study introduces an in-context learning-based approach to enhance the reasoning capabilities of RALMs.
Our approach increases accuracy in identifying unanswerable and conflicting scenarios without requiring additional fine-tuning.
arXiv Detail & Related papers (2024-08-08T12:42:43Z)
- Detecting Multimodal Situations with Insufficient Context and Abstaining from Baseless Predictions [75.45274978665684]
Vision-Language Understanding (VLU) benchmarks contain samples where answers rely on assumptions unsupported by the provided context. We collect contextual data for each sample whenever available and train a context selection module to facilitate evidence-based model predictions. We develop a general-purpose Context-AwaRe Abstention detector to identify samples lacking sufficient context and enhance model accuracy.
arXiv Detail & Related papers (2024-05-18T02:21:32Z)
- Towards Better Question Generation in QA-based Event Extraction [3.699715556687871]
Event Extraction (EE) aims to extract event-related information from unstructured texts.
The quality of the questions dramatically affects the extraction accuracy.
We propose a reinforcement learning method, RLQG, for QA-based EE.
arXiv Detail & Related papers (2024-05-17T03:52:01Z)
- Clarify When Necessary: Resolving Ambiguity Through Interaction with LMs [58.620269228776294]
We propose a task-agnostic framework for resolving ambiguity by asking users clarifying questions.
We evaluate systems across three NLP applications: question answering, machine translation and natural language inference.
We find that intent-sim is robust, demonstrating improvements across a wide range of NLP tasks and LMs.
arXiv Detail & Related papers (2023-11-16T00:18:50Z)
- Learning to Filter Context for Retrieval-Augmented Generation [75.18946584853316]
Generation models are required to generate outputs given partially or entirely irrelevant passages.
FILCO identifies useful context based on lexical and information-theoretic approaches.
It trains context filtering models that can filter retrieved contexts at test time.
arXiv Detail & Related papers (2023-11-14T18:41:54Z)
- Knowledge-Based Counterfactual Queries for Visual Question Answering [0.0]
We propose a systematic method for explaining the behavior and investigating the robustness of VQA models through counterfactual perturbations.
For this reason, we exploit structured knowledge bases to perform deterministic, optimal and controllable word-level replacements targeting the linguistic modality.
We then evaluate the model's response against such counterfactual inputs.
arXiv Detail & Related papers (2023-03-05T08:00:30Z)
- Context Modeling with Evidence Filter for Multiple Choice Question Answering [18.154792554957595]
Multiple-Choice Question Answering (MCQA) is a challenging task in machine reading comprehension.
The main challenge is to extract "evidence" from the given context that supports the correct answer.
Existing work tackles this problem with annotated evidence or rule-based distant supervision, both of which rely heavily on human effort.
We propose a simple yet effective approach termed evidence filtering to model the relationships between the encoded contexts.
arXiv Detail & Related papers (2020-10-06T11:53:23Z)
- A Revised Generative Evaluation of Visual Dialogue [80.17353102854405]
We propose a revised evaluation scheme for the VisDial dataset.
We measure consensus between answers generated by the model and a set of relevant answers.
We release these sets and code for the revised evaluation scheme as DenseVisDial.
arXiv Detail & Related papers (2020-04-20T13:26:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.