Who's Who: Large Language Models Meet Knowledge Conflicts in Practice
- URL: http://arxiv.org/abs/2410.15737v1
- Date: Mon, 21 Oct 2024 07:56:45 GMT
- Title: Who's Who: Large Language Models Meet Knowledge Conflicts in Practice
- Authors: Quang Hieu Pham, Hoang Ngo, Anh Tuan Luu, Dat Quoc Nguyen
- Abstract summary: We introduce WhoQA, a benchmark dataset to examine models' behavior in knowledge conflict situations.
We induce conflicts by asking about a common property among entities having the same name, resulting in questions with up to 8 distinctive answers.
Our experiments show that despite the simplicity of WhoQA questions, knowledge conflicts significantly degrade LLMs' performance in RAG settings.
- Score: 28.48156432356721
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Retrieval-augmented generation (RAG) methods are viable solutions for addressing the static memory limits of pre-trained language models. Nevertheless, encountering conflicting sources of information within the retrieval context is an inevitable practical challenge. In such situations, the language models are recommended to transparently inform users about the conflicts rather than autonomously deciding what to present based on their inherent biases. To analyze how current large language models (LLMs) align with our recommendation, we introduce WhoQA, a public benchmark dataset to examine models' behavior in knowledge conflict situations. We induce conflicts by asking about a common property among entities having the same name, resulting in questions with up to 8 distinctive answers. The WhoQA evaluation set includes 5K questions across 13 Wikidata property types and 150K Wikipedia entities. Our experiments show that despite the simplicity of WhoQA questions, knowledge conflicts significantly degrade LLMs' performance in RAG settings.
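The conflict-inducing idea in the abstract (asking about one property shared by several entities with the same name) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `build_conflict_questions`, the record fields, the question template, and the toy entities below are all assumptions made for illustration only.

```python
from collections import defaultdict


def build_conflict_questions(entities, prop, min_answers=2):
    """Group entity records by name and keep names whose records
    disagree on `prop` (hypothetical helper, not from the paper)."""
    answers_by_name = defaultdict(dict)
    for ent in entities:
        value = ent.get(prop)
        if value is not None:
            # Key by value so identical answers are counted once;
            # remember one supporting entity id per distinct answer.
            answers_by_name[ent["name"]].setdefault(value, ent["id"])

    questions = []
    for name, answers in answers_by_name.items():
        if len(answers) >= min_answers:
            ordered = sorted(answers)
            questions.append({
                "question": f"What is the {prop} of {name}?",
                "answers": ordered,
                "evidence": {value: answers[value] for value in ordered},
            })
    return questions


# Toy records (invented for this example, not taken from WhoQA).
entities = [
    {"id": "E1", "name": "Alex Doe", "occupation": "explorer"},
    {"id": "E2", "name": "Alex Doe", "occupation": "economist"},
    {"id": "E3", "name": "Sam Roe", "occupation": "painter"},
]
qs = build_conflict_questions(entities, "occupation")
```

Here only "Alex Doe" yields a question, because two same-named records give different occupations; a name with a single distinct answer produces no conflict and is skipped.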
Related papers
- CC-VQA: Conflict- and Correlation-Aware Method for Mitigating Knowledge Conflict in Knowledge-Based Visual Question Answering [53.7094431951084]
Knowledge-based visual question answering (KB-VQA) demonstrates significant potential for handling knowledge-intensive tasks.
Conflicts arise between static parametric knowledge in vision language models and dynamically retrieved information.
We propose CC-VQA as a training-free, conflict- and correlation-aware method for KB-VQA.
arXiv Detail & Related papers (2026-02-27T11:56:26Z) - That's Deprecated! Understanding, Detecting, and Steering Knowledge Conflicts in Language Models for Code Generation [55.78914774437411]
Large language models (LLMs) behave when faced with discrepancies between their parametric knowledge and conflicting information contained in a prompt.
We propose a domain-agnostic framework for constructing and interpreting such conflicts.
We show that activation-level steering can achieve up to a 12.6% improvement in steering success over a random baseline.
arXiv Detail & Related papers (2025-10-21T22:27:56Z) - Consensus or Conflict? Fine-Grained Evaluation of Conflicting Answers in Question-Answering [22.447638522275092]
Multi-Answer Question Answering (MAQA), where a question may have several valid answers, remains challenging.
We introduce a novel cost-effective methodology for leveraging fact-checking datasets to construct NATCONFQA.
We evaluate eight high-end LLMs on NATCONFQA, revealing their fragility in handling various types of conflicts.
arXiv Detail & Related papers (2025-08-17T12:58:48Z) - FaithfulRAG: Fact-Level Conflict Modeling for Context-Faithful Retrieval-Augmented Generation [37.28571879699906]
Large language models (LLMs) augmented with retrieval systems have demonstrated significant potential in handling knowledge-intensive tasks.
This paper proposes FaithfulRAG, a novel framework that resolves knowledge conflicts by explicitly modeling discrepancies between the model's parametric knowledge and retrieved context.
arXiv Detail & Related papers (2025-06-10T16:02:54Z) - DRAGged into Conflicts: Detecting and Addressing Conflicting Sources in Search-Augmented LLMs [36.47787866482107]
Retrieval Augmented Generation (RAG) is a commonly used approach for enhancing large language models.
We propose a novel taxonomy of knowledge conflict types in RAG, along with the desired model behavior for each type.
We then introduce CONFLICTS, a high-quality benchmark with expert annotations of conflict types in a realistic RAG setting.
arXiv Detail & Related papers (2025-06-10T06:52:57Z) - What Is Seen Cannot Be Unseen: The Disruptive Effect of Knowledge Conflict on Large Language Models [16.41477610681199]
Large language models frequently rely on both contextual input and parametric knowledge to perform tasks.
These sources can come into conflict, especially when retrieved documents contradict the model's parametric beliefs.
We propose a diagnostic framework to systematically evaluate LLM behavior under context-memory conflict.
arXiv Detail & Related papers (2025-06-06T19:20:23Z) - Retrieval-Augmented Generation with Conflicting Evidence [57.66282463340297]
Large language model (LLM) agents are increasingly employing retrieval-augmented generation (RAG) to improve the factuality of their responses.
In practice, these systems often need to handle ambiguous user queries and potentially conflicting information from multiple sources.
We propose RAMDocs (Retrieval with Ambiguity and Misinformation in Documents), a new dataset that simulates complex and realistic scenarios for conflicting evidence for a user query.
arXiv Detail & Related papers (2025-04-17T16:46:11Z) - Does Context Matter? ContextualJudgeBench for Evaluating LLM-based Judges in Contextual Settings [36.449658676568234]
The large language model (LLM)-as-judge paradigm has been used to meet the demand for a cheap, reliable, and fast evaluation of model outputs.
We propose ContextualJudgeBench, a judge benchmark with 2,000 challenging response pairs across eight splits inspired by real-world contextual evaluation scenarios.
Our comprehensive study reveals that the contextual information and its assessment criteria present a significant challenge to even state-of-the-art models.
arXiv Detail & Related papers (2025-03-19T18:09:19Z) - Open Domain Question Answering with Conflicting Contexts [55.739842087655774]
We find that as much as 25% of unambiguous, open domain questions can lead to conflicting contexts when retrieved using Google Search.
We ask our annotators to provide explanations for their selections of correct answers.
arXiv Detail & Related papers (2024-10-16T07:24:28Z) - Unraveling Cross-Modality Knowledge Conflicts in Large Vision-Language Models [33.76903352835436]
Large Vision-Language Models (LVLMs) have demonstrated impressive capabilities for capturing and reasoning over multimodal inputs.
These models are prone to parametric knowledge conflicts, which arise from inconsistencies of represented knowledge between their vision and language components.
We present a systematic approach to detect, interpret, and mitigate them.
arXiv Detail & Related papers (2024-10-04T17:59:28Z) - AdaCAD: Adaptively Decoding to Balance Conflicts between Contextual and Parametric Knowledge [57.66282463340297]
Knowledge conflict arises from discrepancies between information in the context of a large language model (LLM) and the knowledge stored in its parameters.
We propose a fine-grained, instance-level approach called AdaCAD, which dynamically infers the weight of adjustment based on the degree of conflict.
arXiv Detail & Related papers (2024-09-11T16:35:18Z) - WikiContradict: A Benchmark for Evaluating LLMs on Real-World Knowledge Conflicts from Wikipedia [59.96425443250666]
Retrieval-augmented generation (RAG) has emerged as a promising solution to mitigate the limitations of large language models (LLMs).
In this work, we conduct a comprehensive evaluation of LLM-generated answers to questions based on contradictory passages from Wikipedia.
We benchmark a diverse range of both closed and open-source LLMs under different QA scenarios, including RAG with a single passage, and RAG with 2 contradictory passages.
arXiv Detail & Related papers (2024-06-19T20:13:42Z) - DomainRAG: A Chinese Benchmark for Evaluating Domain-specific Retrieval-Augmented Generation [19.907074685082]
Retrieval-Augmented Generation offers a promising solution to address various limitations of Large Language Models.
Current studies often rely on general knowledge sources like Wikipedia to assess the models' abilities in solving common-sense problems.
We identified six required abilities for RAG models, including the ability in conversational RAG.
arXiv Detail & Related papers (2024-06-09T05:33:51Z) - LLMs' Reading Comprehension Is Affected by Parametric Knowledge and Struggles with Hypothetical Statements [59.71218039095155]
The task of reading comprehension (RC) provides a primary means to assess language models' natural language understanding (NLU) capabilities.
If the context aligns with the models' internal knowledge, it is hard to discern whether the models' answers stem from context comprehension or from internal information.
To address this issue, we suggest using RC on imaginary data, based on fictitious facts and entities.
arXiv Detail & Related papers (2024-04-09T13:08:56Z) - Look Before You Leap: A Universal Emergent Decomposition of Retrieval Tasks in Language Models [58.57279229066477]
We study how language models (LMs) solve retrieval tasks in diverse situations.
We introduce ORION, a collection of structured retrieval tasks spanning six domains.
We find that LMs internally decompose retrieval tasks in a modular way.
arXiv Detail & Related papers (2023-12-13T18:36:43Z) - Generative Judge for Evaluating Alignment [84.09815387884753]
We propose a generative judge with 13B parameters, Auto-J, designed to address these challenges.
Our model is trained on user queries and LLM-generated responses under massive real-world scenarios.
Experimentally, Auto-J outperforms a series of strong competitors, including both open-source and closed-source models.
arXiv Detail & Related papers (2023-10-09T07:27:15Z) - Knowledge-Based Counterfactual Queries for Visual Question Answering [0.0]
We propose a systematic method for explaining the behavior and investigating the robustness of VQA models through counterfactual perturbations.
For this reason, we exploit structured knowledge bases to perform deterministic, optimal and controllable word-level replacements targeting the linguistic modality.
We then evaluate the model's response against such counterfactual inputs.
arXiv Detail & Related papers (2023-03-05T08:00:30Z) - Rich Knowledge Sources Bring Complex Knowledge Conflicts: Recalibrating Models to Reflect Conflicting Evidence [37.18100697469402]
We simulate knowledge conflicts where parametric knowledge suggests one answer and different passages suggest different answers.
We find retrieval performance heavily impacts which sources models rely on, and current models mostly rely on non-performing knowledge.
We present a new calibration study, where models are discouraged from presenting any single answer when presented with multiple conflicting answer candidates.
arXiv Detail & Related papers (2022-10-25T01:46:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.