RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models
- URL: http://arxiv.org/abs/2401.00396v2
- Date: Fri, 17 May 2024 06:29:31 GMT
- Title: RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models
- Authors: Cheng Niu, Yuanhao Wu, Juno Zhu, Siliang Xu, Kashun Shum, Randy Zhong, Juntong Song, Tong Zhang
- Abstract summary: Retrieval-augmented generation (RAG) has become a main technique for alleviating hallucinations in large language models (LLMs).
This paper presents RAGTruth, a corpus tailored for analyzing word-level hallucinations in various domains.
- Score: 9.465753274663061
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Retrieval-augmented generation (RAG) has become a main technique for alleviating hallucinations in large language models (LLMs). Despite the integration of RAG, LLMs may still present unsupported or contradictory claims to the retrieved contents. In order to develop effective hallucination prevention strategies under RAG, it is important to create benchmark datasets that can measure the extent of hallucination. This paper presents RAGTruth, a corpus tailored for analyzing word-level hallucinations in various domains and tasks within the standard RAG frameworks for LLM applications. RAGTruth comprises nearly 18,000 naturally generated responses from diverse LLMs using RAG. These responses have undergone meticulous manual annotations at both the individual cases and word levels, incorporating evaluations of hallucination intensity. We not only benchmark hallucination frequencies across different LLMs, but also critically assess the effectiveness of several existing hallucination detection methodologies. Furthermore, we show that using a high-quality dataset such as RAGTruth, it is possible to finetune a relatively small LLM and achieve a competitive level of performance in hallucination detection when compared to the existing prompt-based approaches using state-of-the-art large language models such as GPT-4.
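To make the word-level annotation scheme and its evaluation concrete, the sketch below shows what a RAGTruth-style annotated record and a simple character-span scoring function might look like. The field names (`source_info`, `response`, `labels`, `label_type`) and the hallucination-type string are illustrative assumptions based on the abstract, not the corpus's exact schema.

```python
# A hypothetical RAGTruth-style record; field names are assumptions for illustration.
response = "The Eiffel Tower was completed in 1889 and is 450 meters tall."
halluc = "is 450 meters tall"            # claim not supported by the retrieved passage
start = response.index(halluc)
record = {
    "source_info": "Retrieved passage: The Eiffel Tower was completed in 1889.",
    "response": response,
    "labels": [{
        "start": start,
        "end": start + len(halluc),
        "label_type": "unsupported claim",   # intensity/type tag (assumed name)
    }],
}

def span_overlap_score(predicted, annotated):
    """Character-level precision/recall of predicted hallucination spans
    against annotated spans: one simple way to score word-level detection."""
    pred_chars = {c for s, e in predicted for c in range(s, e)}
    gold_chars = {c for s, e in annotated for c in range(s, e)}
    if not pred_chars or not gold_chars:
        return 0.0, 0.0
    overlap = len(pred_chars & gold_chars)
    return overlap / len(pred_chars), overlap / len(gold_chars)

gold = [(lbl["start"], lbl["end"]) for lbl in record["labels"]]
pred = [(start - 4, start + len(halluc))]   # spans a detector might flag
print(span_overlap_score(pred, gold))       # (precision, recall)
```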
Related papers
- Investigating the Role of Prompting and External Tools in Hallucination Rates of Large Language Models [0.0]
Large Language Models (LLMs) are powerful computational models trained on extensive corpora of human-readable text, enabling them to perform general-purpose language understanding and generation.
Despite these successes, LLMs often produce inaccuracies, commonly referred to as hallucinations.
This paper provides an empirical evaluation of different prompting strategies and frameworks aimed at reducing hallucinations in LLMs.
arXiv Detail & Related papers (2024-10-25T08:34:53Z) - ODE: Open-Set Evaluation of Hallucinations in Multimodal Large Language Models [15.156359255401812]
This paper introduces ODE, an open-set, dynamic protocol for evaluating object-existence hallucinations in multimodal large language models (MLLMs).
Our framework employs graph structures to model associations between real-world concepts and generates novel samples for both general and domain-specific scenarios.
Experimental results show that MLLMs exhibit higher hallucination rates with ODE-generated samples, effectively avoiding data contamination.
arXiv Detail & Related papers (2024-09-14T05:31:29Z) - LRP4RAG: Detecting Hallucinations in Retrieval-Augmented Generation via Layer-wise Relevance Propagation [3.3762582927663063]
In this paper, we propose LRP4RAG, a method based on Layer-wise Relevance Propagation (LRP) for detecting hallucinations in retrieval-augmented generation with large language models (LLMs).
To the best of our knowledge, this is the first time that LRP has been used for detecting RAG hallucinations.
arXiv Detail & Related papers (2024-08-28T04:44:43Z) - ANAH-v2: Scaling Analytical Hallucination Annotation of Large Language Models [65.12177400764506]
Large language models (LLMs) exhibit hallucinations in long-form question-answering tasks across various domains and wide applications.
Current hallucination detection and mitigation datasets are limited in domains and sizes.
This paper introduces an iterative self-training framework that simultaneously and progressively scales up the hallucination annotation dataset.
arXiv Detail & Related papers (2024-07-05T17:56:38Z) - Hallucination Detection: Robustly Discerning Reliable Answers in Large Language Models [70.19081534515371]
Large Language Models (LLMs) have gained widespread adoption in various natural language processing tasks.
However, they sometimes generate unfaithful or inconsistent content that deviates from the input source, which can lead to severe consequences.
We propose a robust discriminator named RelD to effectively detect hallucination in LLMs' generated answers.
arXiv Detail & Related papers (2024-07-04T18:47:42Z) - HaluEval-Wild: Evaluating Hallucinations of Language Models in the Wild [41.86776426516293]
Hallucinations pose a significant challenge to the reliability of large language models (LLMs) in critical domains.
We introduce HaluEval-Wild, the first benchmark specifically designed to evaluate LLM hallucinations in the wild.
arXiv Detail & Related papers (2024-03-07T08:25:46Z) - Alleviating Hallucinations of Large Language Models through Induced Hallucinations [67.35512483340837]
Large language models (LLMs) have been observed to generate responses that include inaccurate or fabricated information.
We propose a simple Induce-then-Contrast Decoding (ICD) strategy to alleviate hallucinations; a minimal sketch of the contrastive step appears after this list.
arXiv Detail & Related papers (2023-12-25T12:32:49Z) - AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation [58.19101663976327]
Multi-modal Large Language Models (MLLMs) encounter the significant challenge of hallucinations.
Evaluating MLLMs' hallucinations is becoming increasingly important in model improvement and practical application deployment.
We propose an LLM-free, multi-dimensional benchmark, AMBER, which can be used to evaluate both generative and discriminative tasks.
arXiv Detail & Related papers (2023-11-13T15:25:42Z) - AutoHall: Automated Hallucination Dataset Generation for Large Language Models [56.92068213969036]
This paper introduces AutoHall, a method for automatically constructing model-specific hallucination datasets from existing fact-checking datasets.
We also propose a zero-resource, black-box hallucination detection method based on self-contradiction; a minimal sketch of this sampling-and-agreement idea appears after this list.
arXiv Detail & Related papers (2023-09-30T05:20:02Z) - Benchmarking Large Language Models in Retrieval-Augmented Generation [53.504471079548]
We systematically investigate the impact of Retrieval-Augmented Generation on large language models.
We analyze the performance of different large language models in 4 fundamental abilities required for RAG.
We establish Retrieval-Augmented Generation Benchmark (RGB), a new corpus for RAG evaluation in both English and Chinese.
arXiv Detail & Related papers (2023-09-04T08:28:44Z)
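For the Induce-then-Contrast Decoding entry referenced above, here is a minimal sketch of what a contrastive-decoding step in that spirit could look like: the next-token logits of a hallucination-induced model are subtracted from those of the original model. The logit values, the weighting factor `alpha`, and the exact combination formula are illustrative assumptions; the paper's precise formulation may differ.

```python
import numpy as np

def contrast_logits(base_logits: np.ndarray,
                    induced_logits: np.ndarray,
                    alpha: float = 0.5) -> np.ndarray:
    """Boost the original model and penalize tokens favored by the
    hallucination-induced model (one common contrastive-decoding form)."""
    return (1 + alpha) * base_logits - alpha * induced_logits

# Toy vocabulary of four next-token candidates.
base = np.array([1.8, 2.0, 0.5, 0.1])      # original model's next-token logits
induced = np.array([0.2, 2.5, 0.4, 0.1])   # induced model strongly prefers token 1
adjusted = contrast_logits(base, induced, alpha=0.5)
print(adjusted.argmax())  # token 1 (the likely hallucination) is demoted; token 0 wins
```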
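For the AutoHall entry referenced above, the sketch below shows the general shape of zero-resource, black-box detection via self-contradiction: sample several answers to the same prompt and flag likely hallucination when the samples disagree. The `generate` stub and the token-overlap consistency check are toy stand-ins (assumptions) for a real LLM call and an NLI-style contradiction judge.

```python
import random

def generate(prompt: str) -> str:
    """Placeholder for a black-box LLM call; here it just picks a canned answer."""
    return random.choice([
        "The bridge opened in 1932.",
        "The bridge opened in 1932.",
        "The bridge opened in 1957.",
    ])

def consistency(a: str, b: str) -> float:
    """Toy stand-in for a contradiction judge: token-overlap (Jaccard) similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def flag_hallucination(prompt: str, n_samples: int = 5, threshold: float = 0.7) -> bool:
    """Sample the model several times; low mutual agreement suggests hallucination."""
    samples = [generate(prompt) for _ in range(n_samples)]
    pairs = [(i, j) for i in range(n_samples) for j in range(i + 1, n_samples)]
    avg = sum(consistency(samples[i], samples[j]) for i, j in pairs) / len(pairs)
    return avg < threshold

print(flag_hallucination("When did the bridge open?"))
```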
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.