Related papers: DiaHalu: A Dialogue-level Hallucination Evaluation Benchmark for Large Language Models

DiaHalu: A Dialogue-level Hallucination Evaluation Benchmark for Large Language Models

URL: http://arxiv.org/abs/2403.00896v2
Date: Mon, 17 Jun 2024 15:11:20 GMT
Title: DiaHalu: A Dialogue-level Hallucination Evaluation Benchmark for Large Language Models
Authors: Kedi Chen, Qin Chen, Jie Zhou, Yishen He, Liang He,
Abstract summary: We propose DiaHalu, the first dialogue-level hallucination evaluation benchmark to our knowledge. We integrate the collected topics into system prompts and facilitate a dialogue between two ChatGPT3.5. We manually modify the contents that do not adhere to human language conventions and then have LLMs re-generate, simulating authentic human-machine interaction scenarios.
Score: 26.289847386286446
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Since large language models (LLMs) achieve significant success in recent years, the hallucination issue remains a challenge, numerous benchmarks are proposed to detect the hallucination. Nevertheless, some of these benchmarks are not naturally generated by LLMs but are intentionally induced. Also, many merely focus on the factuality hallucination while ignoring the faithfulness hallucination. Additionally, although dialogue pattern is more widely utilized in the era of LLMs, current benchmarks only concentrate on sentence-level and passage-level hallucination. In this study, we propose DiaHalu, the first dialogue-level hallucination evaluation benchmark to our knowledge. Initially, we integrate the collected topics into system prompts and facilitate a dialogue between two ChatGPT3.5. Subsequently, we manually modify the contents that do not adhere to human language conventions and then have LLMs re-generate, simulating authentic human-machine interaction scenarios. Finally, professional scholars annotate all the samples in the dataset. DiaHalu covers four common multi-turn dialogue domains and five hallucination subtypes, extended from factuality and faithfulness hallucination. Experiments through some well-known LLMs and detection methods on the dataset show that DiaHalu is a challenging benchmark, holding significant value for further research.

Related papers

Beyond Facts: Evaluating Intent Hallucination in Large Language Models [13.315302240710164]
FAITHQA is a novel benchmark for intent hallucination that contains 20,068 problems.<n>We find that intent hallucination is a common issue even for state-of-the-art models.<n>We introduce an automatic LLM generation evaluation metric, CONSTRAINT SCORE, for detecting intent hallucination.
arXiv Detail & Related papers (2025-06-06T21:10:55Z)
HalluLens: LLM Hallucination Benchmark [49.170128733508335]
Large language models (LLMs) often generate responses that deviate from user input or training data, a phenomenon known as "hallucination" This paper introduces a comprehensive hallucination benchmark, incorporating both new extrinsic and existing intrinsic evaluation tasks.
arXiv Detail & Related papers (2025-04-24T13:40:27Z)
HalluVerse25: Fine-grained Multilingual Benchmark Dataset for LLM Hallucinations [2.3732122943029164]
We introduce HalluVerse25, a multilingual dataset that categorizes fine-grained hallucinations in English, Arabic, and Turkish. Our dataset construction pipeline uses an LLM to inject hallucinations into factual biographical sentences, followed by a rigorous human annotation process to ensure data quality.
arXiv Detail & Related papers (2025-03-10T20:24:07Z)
Combating Multimodal LLM Hallucination via Bottom-Up Holistic Reasoning [151.4060202671114]
multimodal large language models (MLLMs) have shown unprecedented capabilities in advancing vision-language tasks. This paper introduces a novel bottom-up reasoning framework to address hallucinations in MLLMs. Our framework systematically addresses potential issues in both visual and textual inputs by verifying and integrating perception-level information with cognition-level commonsense knowledge.
arXiv Detail & Related papers (2024-12-15T09:10:46Z)
Unified Triplet-Level Hallucination Evaluation for Large Vision-Language Models [22.996176483599868]
We design a unified framework to measure object and relation hallucination in Large Vision-Language Models (LVLMs) simultaneously. Based on our framework, we introduce Tri-HE, a novel Triplet-level Hallucination Evaluation benchmark.
arXiv Detail & Related papers (2024-10-30T15:25:06Z)
FaithBench: A Diverse Hallucination Benchmark for Summarization by Modern LLMs [2.871226288151562]
This paper introduces FaithBench, a summarization hallucination benchmark comprising challenging hallucinations made by 10 modern LLMs. Our results show GPT-4o and GPT-3.5-Turbo produce the least hallucinations. Even the best hallucination detection models have near 50% accuracies on FaithBench, indicating lots of room for future improvement.
arXiv Detail & Related papers (2024-10-17T04:30:46Z)
LongHalQA: Long-Context Hallucination Evaluation for MultiModal Large Language Models [96.64960606650115]
LongHalQA is an LLM-free hallucination benchmark that comprises 6K long and complex hallucination text. LongHalQA is featured by GPT4V-generated hallucinatory data that are well aligned with real-world scenarios.
arXiv Detail & Related papers (2024-10-13T18:59:58Z)
ANAH-v2: Scaling Analytical Hallucination Annotation of Large Language Models [65.12177400764506]
Large language models (LLMs) exhibit hallucinations in long-form question-answering tasks across various domains and wide applications. Current hallucination detection and mitigation datasets are limited in domains and sizes. This paper introduces an iterative self-training framework that simultaneously and progressively scales up the hallucination annotation dataset.
arXiv Detail & Related papers (2024-07-05T17:56:38Z)
HalluDial: A Large-Scale Benchmark for Automatic Dialogue-Level Hallucination Evaluation [19.318217051269382]
Large Language Models (LLMs) have significantly advanced the field of Natural Language Processing (NLP) HalluDial is the first comprehensive large-scale benchmark for automatic dialogue-level hallucination evaluation. The benchmark includes 4,094 dialogues with a total of 146,856 samples.
arXiv Detail & Related papers (2024-06-11T08:56:18Z)
ANAH: Analytical Annotation of Hallucinations in Large Language Models [65.12177400764506]
We present $textbfANAH$, a dataset that offers $textbfAN$alytical $textbfA$nnotation of hallucinations in Large Language Models. ANAH consists of 12k sentence-level annotations for 4.3k LLM responses covering over 700 topics, constructed by a human-in-the-loop pipeline. Thanks to the fine granularity of the hallucination annotations, we can quantitatively confirm that the hallucinations of LLMs accumulate in the answer and use ANAH to train and evaluate hallucination annotators.
arXiv Detail & Related papers (2024-05-30T17:54:40Z)
The Dawn After the Dark: An Empirical Study on Factuality Hallucination in Large Language Models [134.6697160940223]
hallucination poses great challenge to trustworthy and reliable deployment of large language models. Three key questions should be well studied: how to detect hallucinations (detection), why do LLMs hallucinate (source), and what can be done to mitigate them. This work presents a systematic empirical study on LLM hallucination, focused on the the three aspects of hallucination detection, source and mitigation.
arXiv Detail & Related papers (2024-01-06T12:40:45Z)
AutoHall: Automated Hallucination Dataset Generation for Large Language Models [56.92068213969036]
This paper introduces a method for automatically constructing model-specific hallucination datasets based on existing fact-checking datasets called AutoHall. We also propose a zero-resource and black-box hallucination detection method based on self-contradiction.
arXiv Detail & Related papers (2023-09-30T05:20:02Z)
HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models [146.87696738011712]
Large language models (LLMs) are prone to generate hallucinations, i.e., content that conflicts with the source or cannot be verified by the factual knowledge. To understand what types of content and to which extent LLMs are apt to hallucinate, we introduce the Hallucination Evaluation benchmark for Large Language Models (HaluEval)
arXiv Detail & Related papers (2023-05-19T15:36:27Z)

This list is automatically generated from the titles and abstracts of the papers in this site.