Characterizing LLM Abstention Behavior in Science QA with Context Perturbations
- URL: http://arxiv.org/abs/2404.12452v2
- Date: Sun, 06 Oct 2024 05:55:28 GMT
- Title: Characterizing LLM Abstention Behavior in Science QA with Context Perturbations
- Authors: Bingbing Wen, Bill Howe, Lucy Lu Wang
- Abstract summary: We study the ability of LLMs to abstain from answering science questions when provided insufficient or incorrect context.
We show that performance varies greatly across models, across the type of context provided, and also by question type.
Our results imply that changes are needed in QA dataset design and evaluation to more effectively assess the correctness and downstream impacts of model abstention.
- Score: 13.897212714309548
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The correct model response in the face of uncertainty is to abstain from answering a question so as not to mislead the user. In this work, we study the ability of LLMs to abstain from answering context-dependent science questions when provided insufficient or incorrect context. We probe model sensitivity in several settings: removing gold context, replacing gold context with irrelevant context, and providing additional context beyond what is given. In experiments on four QA datasets with six LLMs, we show that performance varies greatly across models, across the type of context provided, and also by question type; in particular, many LLMs seem unable to abstain from answering boolean questions using standard QA prompts. Our analysis also highlights the unexpected impact of abstention performance on QA task accuracy. Counter-intuitively, in some settings, replacing gold context with irrelevant context or adding irrelevant context to gold context can improve abstention performance in a way that results in improvements in task performance. Our results imply that changes are needed in QA dataset design and evaluation to more effectively assess the correctness and downstream impacts of model abstention.
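The perturbation settings described in the abstract can be sketched concretely. Below is a minimal illustration in Python; the function names, the distractor pool, and the string-match abstention check are my own placeholders, not the paper's actual implementation:

```python
import random

# Hypothetical pool of irrelevant passages used as distractors.
IRRELEVANT_POOL = [
    "The Eiffel Tower was completed in 1889.",
    "Photosynthesis converts light energy into chemical energy.",
]

def perturb(gold_context: str, setting: str) -> str:
    """Build the context string for one perturbation setting."""
    if setting == "remove_gold":
        return ""  # no supporting context at all
    if setting == "replace_irrelevant":
        return random.choice(IRRELEVANT_POOL)  # distractor only
    if setting == "add_irrelevant":
        # gold context plus extra, potentially distracting, material
        return gold_context + " " + random.choice(IRRELEVANT_POOL)
    return gold_context  # unperturbed baseline

def should_abstain(model_answer: str) -> bool:
    """Crude string-match abstention check; the paper's criteria are task-specific."""
    lowered = model_answer.lower()
    return "unanswerable" in lowered or "cannot" in lowered
```

Under the first two settings an ideal model should abstain; comparing abstention rates across settings and question types (e.g., boolean vs. extractive) is the core of the analysis.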
Related papers
- Query Disambiguation via Answer-Free Context: Doubling Performance on Humanity's Last Exam [6.1512837277903785]
This work investigates how the quality of background grounding information in a model's context window affects accuracy. We find that combining well-grounded dynamic context construction (i.e., RAG) with query rewriting reduces question ambiguity, resulting in significant accuracy gains.
arXiv Detail & Related papers (2026-02-27T19:05:25Z) - Who is the richest club in the championship? Detecting and Rewriting Underspecified Questions Improve QA Performance [6.511402661783843]
Large language models (LLMs) perform well on well-posed questions, yet standard question-answering (QA) benchmarks remain far from solved. We argue that this gap is partly due to underspecified questions - queries whose interpretation cannot be uniquely determined without additional context.
arXiv Detail & Related papers (2026-02-12T13:36:23Z) - Inferential Question Answering [67.54465021408724]
We introduce Inferential QA -- a new task that challenges models to infer answers from answer-supporting passages which provide only clues. To study this problem, we construct the QUIT (QUestions requiring Inference from Texts) dataset, comprising 7,401 questions and 2.4M passages. We show that methods effective on traditional QA tasks struggle in inferential QA: retrievers underperform, rerankers offer limited gains, and fine-tuning provides inconsistent improvements.
arXiv Detail & Related papers (2026-02-01T14:02:43Z) - LLM Microscope: What Model Internals Reveal About Answer Correctness and Context Utilization [9.410181019585822]
We operationalize interpretability methods to ascertain whether we can predict the correctness of model outputs. We consider correct, incorrect, and irrelevant context and introduce metrics to distinguish amongst them. Our model-internals-based metric significantly outperforms prompting baselines at distinguishing between correct and incorrect context.
arXiv Detail & Related papers (2025-10-05T03:14:05Z) - Context Engineering for Trustworthiness: Rescorla Wagner Steering Under Mixed and Inappropriate Contexts [55.70338710797578]
We introduce the Poisoned Context Testbed, pairing queries with real-world contexts containing relevant and inappropriate content. Inspired by associative learning in animals, we adapt the Rescorla-Wagner (RW) model from neuroscience to quantify how competing contextual signals influence LLM outputs. We introduce RW-Steering, a two-stage finetuning-based approach that enables the model to internally identify and ignore inappropriate signals.
arXiv Detail & Related papers (2025-09-02T00:40:34Z) - Revisiting LLM Value Probing Strategies: Are They Robust and Expressive? [81.49470136653665]
We evaluate the robustness and expressiveness of value representations across three widely used probing strategies. We show that the demographic context has little effect on the free-text generation, and the models' values only weakly correlate with their preference for value-based actions.
arXiv Detail & Related papers (2025-07-17T18:56:41Z) - Attributing Response to Context: A Jensen-Shannon Divergence Driven Mechanistic Study of Context Attribution in Retrieval-Augmented Generation [52.3707788779464]
We introduce a novel Jensen-Shannon Divergence driven method to Attribute Response to Context (ARC-JSD). ARC-JSD enables efficient and accurate identification of essential context sentences without additional fine-tuning, gradient calculation, or surrogate modelling. Evaluations on a wide range of RAG benchmarks, such as TyDi QA, Hotpot QA, and Musique, using instruction-tuned LLMs at different scales demonstrate superior accuracy and significant computational efficiency improvements.
arXiv Detail & Related papers (2025-05-22T09:04:03Z) - DeCAP: Context-Adaptive Prompt Generation for Debiasing Zero-shot Question Answering in Large Language Models [14.739041141948036]
Large Language Models (LLMs) excel in zero-shot Question Answering (QA).
LLMs tend to expose biases in their internal knowledge when faced with socially sensitive questions.
We propose DeCAP, a method for debiasing LLMs using Context-Adaptive Prompt Generation.
arXiv Detail & Related papers (2025-03-25T08:16:35Z) - Video SimpleQA: Towards Factuality Evaluation in Large Video Language Models [69.68265487134686]
Video SimpleQA is the first comprehensive benchmark tailored for factuality evaluation of LVLMs.
Our work distinguishes from existing video benchmarks through the following key features.
Answers are crafted as unambiguous and definitively correct in a short format.
arXiv Detail & Related papers (2025-03-24T17:46:09Z) - CondAmbigQA: A Benchmark and Dataset for Conditional Ambiguous Question Answering [6.297950040983263]
Large language models (LLMs) are prone to hallucinations in question-answering (QA) tasks when faced with ambiguous questions.
We introduce Conditional Ambiguous Question-Answering (CondAmbigQA), a benchmark with 200 ambiguous queries.
Our study pioneers the concept of "conditions" in ambiguous QA tasks, where conditions stand for contextual constraints or assumptions that resolve ambiguities.
arXiv Detail & Related papers (2025-02-03T17:01:51Z) - Sufficient Context: A New Lens on Retrieval Augmented Generation Systems [19.238772793096473]
Augmenting LLMs with context leads to improved performance across many applications.
We develop a new notion of sufficient context, along with a way to classify instances that have enough information to answer the query.
We find that proprietary LLMs excel at answering queries when the context is sufficient, but often output incorrect answers instead of abstaining when the context is not.
arXiv Detail & Related papers (2024-11-09T02:13:14Z) - AHP-Powered LLM Reasoning for Multi-Criteria Evaluation of Open-Ended Responses [26.850344968677582]
We propose a method that leverages large language models to evaluate answers to open-ended questions.
We conducted experiments on four datasets using both ChatGPT-3.5-turbo and GPT-4.
Our results indicate that our approach more closely aligns with human judgment compared to the four baselines.
arXiv Detail & Related papers (2024-10-02T05:22:07Z) - Detecting Multimodal Situations with Insufficient Context and Abstaining from Baseless Predictions [75.45274978665684]
Vision-Language Understanding (VLU) benchmarks contain samples where answers rely on assumptions unsupported by the provided context.
We collect contextual data for each sample whenever available and train a context selection module to facilitate evidence-based model predictions.
We develop a general-purpose Context-AwaRe Abstention detector to identify samples lacking sufficient context and enhance model accuracy.
arXiv Detail & Related papers (2024-05-18T02:21:32Z) - RAGGED: Towards Informed Design of Retrieval Augmented Generation Systems [51.171355532527365]
Retrieval-augmented generation (RAG) can significantly improve the performance of language models (LMs).
RAGGED is a framework for analyzing RAG configurations across various document-based question answering tasks.
arXiv Detail & Related papers (2024-03-14T02:26:31Z) - FusionMind -- Improving question and answering with external context fusion [0.0]
We studied the impact of contextual knowledge on the Question-answering (QA) objective using pre-trained language models (LMs) and knowledge graphs (KGs).
We found that incorporating knowledge facts context led to a significant improvement in performance.
This suggests that the integration of contextual knowledge facts may be more impactful for enhancing question answering performance.
arXiv Detail & Related papers (2023-12-31T03:51:31Z) - Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks.
We instruct an LLM to self-evaluate its answers.
We benchmark a range of scoring methods based on self-evaluation.
arXiv Detail & Related papers (2023-12-14T19:09:22Z) - Towards leveraging LLMs for Conditional QA [1.9649272351760063]
This study delves into the capabilities and limitations of Large Language Models (LLMs) in the challenging domain of conditional question-answering.
Our findings reveal that fine-tuned LLMs can surpass the state-of-the-art (SOTA) performance in some cases, even without fully encoding all input context.
These models encounter challenges in extractive question answering, where they lag behind the SOTA by over 10 points, and in mitigating the risk of injecting false information.
arXiv Detail & Related papers (2023-12-02T14:02:52Z) - Learning to Filter Context for Retrieval-Augmented Generation [75.18946584853316]
Generation models are required to generate outputs given partially or entirely irrelevant passages.
FILCO identifies useful context based on lexical and information-theoretic approaches.
It trains context filtering models that can filter retrieved contexts at test time.
arXiv Detail & Related papers (2023-11-14T18:41:54Z) - Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models [59.05769810380928]
Rephrase, Augment and Reason (RepARe) is a gradient-free framework that extracts salient details about the image using the underlying vision-language model.
We show that RepARe can result in a 3.85% (absolute) increase in zero-shot accuracy on VQAv2, and 6.41% and 7.94% point increases on A-OKVQA and VizWiz, respectively.
arXiv Detail & Related papers (2023-10-09T16:57:57Z) - Making Retrieval-Augmented Language Models Robust to Irrelevant Context [55.564789967211844]
An important desideratum of RALMs is that retrieved information helps model performance when it is relevant.
Recent work has shown that retrieval augmentation can sometimes have a negative effect on performance.
arXiv Detail & Related papers (2023-10-02T18:52:35Z)
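One entry above ("Context Engineering for Trustworthiness") adapts the Rescorla-Wagner model from associative learning. For reference, here is a minimal sketch of the classic RW update rule; the variable names are mine, and the paper's actual adaptation to competing LLM context signals is more involved:

```python
def rw_update(v: float, salience: float, lam: float, alpha: float = 0.3) -> float:
    """One Rescorla-Wagner step: v <- v + alpha * salience * (lam - v).

    v        -- current associative strength of a cue
    salience -- cue-specific learning-rate factor
    lam      -- outcome magnitude (e.g., 1.0 when reinforced, 0.0 when absent)
    alpha    -- global learning rate
    """
    return v + alpha * salience * (lam - v)

# Repeated reinforcement drives the strength toward the outcome magnitude.
v = 0.0
for _ in range(50):
    v = rw_update(v, salience=1.0, lam=1.0)
```

The error-driven form of the update (strength moves toward the outcome in proportion to the prediction error) is what makes it a natural lens for quantifying how competing context signals pull a model's output.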
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.