FaithBench: A Diverse Hallucination Benchmark for Summarization by Modern LLMs
- URL: http://arxiv.org/abs/2410.13210v1
- Date: Thu, 17 Oct 2024 04:30:46 GMT
- Title: FaithBench: A Diverse Hallucination Benchmark for Summarization by Modern LLMs
- Authors: Forrest Sheng Bao, Miaoran Li, Renyi Qu, Ge Luo, Erana Wan, Yujia Tang, Weisi Fan, Manveer Singh Tamber, Suleman Kazi, Vivek Sourabh, Mike Qi, Ruixuan Tu, Chenyu Xu, Matthew Gonzales, Ofer Mendelevitch, Amin Ahmad
- Abstract summary: This paper introduces FaithBench, a summarization hallucination benchmark comprising challenging hallucinations made by 10 modern LLMs.
Our results show that GPT-4o and GPT-3.5-Turbo produce the fewest hallucinations.
Even the best hallucination detection models achieve only about 50% accuracy on FaithBench, indicating substantial room for future improvement.
- Score: 2.871226288151562
- License:
- Abstract: Summarization is one of the most common tasks performed by large language models (LLMs), especially in applications like Retrieval-Augmented Generation (RAG). However, existing evaluations of hallucinations in LLM-generated summaries, as well as evaluations of hallucination detection models, suffer from a lack of diversity and recency in the LLMs and LLM families considered. This paper introduces FaithBench, a summarization hallucination benchmark comprising challenging hallucinations made by 10 modern LLMs from 8 different families, with ground-truth annotations by human experts. "Challenging" here means summaries on which popular, state-of-the-art hallucination detection models, including GPT-4o-as-a-judge, disagreed. Our results show that GPT-4o and GPT-3.5-Turbo produce the fewest hallucinations. However, even the best hallucination detection models achieve only about 50% accuracy on FaithBench, indicating substantial room for future improvement. The repo is https://github.com/vectara/FaithBench
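The headline result (detection accuracy near 50%) comes down to comparing a detector's binary faithful/hallucinated verdicts against the human ground-truth annotations. Below is a minimal sketch of that evaluation loop in Python; the file name, field names, and label convention are illustrative assumptions rather than the repo's actual schema, so consult https://github.com/vectara/FaithBench for the real data format and evaluation protocol.
```python
# Minimal sketch of scoring a hallucination detector against FaithBench-style
# binary labels. The file name and the fields "source", "summary", "consistent"
# are assumptions for illustration, not the repo's actual schema.
import json
from typing import Callable, Dict, List


def accuracy(labels: List[int], preds: List[int]) -> float:
    """Fraction of examples where the detector's label matches the human label."""
    correct = sum(1 for y, p in zip(labels, preds) if y == p)
    return correct / len(labels)


def evaluate_detector(examples: List[Dict], detector: Callable[[str, str], int]) -> float:
    """Run a (source, summary) -> {0, 1} detector over annotated examples.
    Convention assumed here: 1 = faithful/consistent, 0 = hallucinated."""
    labels = [ex["consistent"] for ex in examples]
    preds = [detector(ex["source"], ex["summary"]) for ex in examples]
    return accuracy(labels, preds)


if __name__ == "__main__":
    with open("faithbench_annotations.json") as f:  # hypothetical file name
        examples = json.load(f)
    # Trivial baseline that always predicts "consistent"; the paper reports that
    # even the best real detectors land near 50% on FaithBench's hard examples.
    always_consistent = lambda source, summary: 1
    print(f"Accuracy: {evaluate_detector(examples, always_consistent):.3f}")
```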
Related papers
- From Single to Multi: How LLMs Hallucinate in Multi-Document Summarization [6.37435726278524]
We investigate how hallucinations manifest in large language models (LLMs) when summarizing topic-specific information from multiple documents.
On average, up to 75% of the content in LLM-generated summaries is hallucinated, with hallucinations more likely to occur towards the end of the summaries.
To understand the characteristics of these hallucinations, we manually evaluate 700+ insights and find that most errors stem from either failing to follow instructions or producing overly generic insights.
arXiv Detail & Related papers (2024-10-17T18:38:53Z) - MedHalu: Hallucinations in Responses to Healthcare Queries by Large Language Models [26.464489158584463]
We conduct a pioneering study of hallucinations in LLM-generated responses to real-world healthcare queries from patients.
We propose MedHalu, a carefully crafted first-of-its-kind medical hallucination dataset with a diverse range of health-related topics.
We also introduce the MedHaluDetect framework to evaluate the capabilities of various LLMs in detecting hallucinations.
arXiv Detail & Related papers (2024-09-29T00:09:01Z) - WildHallucinations: Evaluating Long-form Factuality in LLMs with Real-World Entity Queries [64.239202960816]
We introduce WildHallucinations, a benchmark that evaluates factuality.
It does so by prompting large language models to generate information about entities mined from user-chatbot conversations in the wild.
We evaluate 118,785 generations from 15 LLMs on 7,919 entities.
arXiv Detail & Related papers (2024-07-24T17:59:05Z) - ANAH-v2: Scaling Analytical Hallucination Annotation of Large Language Models [65.12177400764506]
Large language models (LLMs) exhibit hallucinations in long-form question-answering tasks across various domains and wide applications.
Current hallucination detection and mitigation datasets are limited in domains and sizes.
This paper introduces an iterative self-training framework that simultaneously and progressively scales up the hallucination annotation dataset.
arXiv Detail & Related papers (2024-07-05T17:56:38Z) - ANAH: Analytical Annotation of Hallucinations in Large Language Models [65.12177400764506]
We present ANAH, a dataset that offers ANalytical Annotation of Hallucinations in Large Language Models.
ANAH consists of 12k sentence-level annotations for 4.3k LLM responses covering over 700 topics, constructed by a human-in-the-loop pipeline.
Thanks to the fine granularity of the hallucination annotations, we can quantitatively confirm that LLM hallucinations accumulate over the course of an answer, and we use ANAH to train and evaluate hallucination annotators.
arXiv Detail & Related papers (2024-05-30T17:54:40Z) - DiaHalu: A Dialogue-level Hallucination Evaluation Benchmark for Large Language Models [26.289847386286446]
We propose DiaHalu, the first dialogue-level hallucination evaluation benchmark to our knowledge.
We integrate the collected topics into system prompts and facilitate a dialogue between two ChatGPT-3.5 instances.
We manually modify the contents that do not adhere to human language conventions and then have LLMs re-generate, simulating authentic human-machine interaction scenarios.
arXiv Detail & Related papers (2024-03-01T15:38:55Z) - Fine-grained Hallucination Detection and Editing for Language Models [109.56911670376932]
Large language models (LMs) are prone to generate factual errors, which are often called hallucinations.
We introduce a comprehensive taxonomy of hallucinations and argue that hallucinations manifest in diverse forms.
We propose a novel task of automatic fine-grained hallucination detection and construct a new evaluation benchmark, FavaBench.
arXiv Detail & Related papers (2024-01-12T19:02:48Z) - The Dawn After the Dark: An Empirical Study on Factuality Hallucination in Large Language Models [134.6697160940223]
Hallucination poses a great challenge to the trustworthy and reliable deployment of large language models.
Three key questions should be well studied: how to detect hallucinations (detection), why LLMs hallucinate (source), and what can be done to mitigate them (mitigation).
This work presents a systematic empirical study on LLM hallucination, focused on the three aspects of hallucination detection, source, and mitigation.
arXiv Detail & Related papers (2024-01-06T12:40:45Z) - HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models [146.87696738011712]
Large language models (LLMs) are prone to generate hallucinations, i.e., content that conflicts with the source or cannot be verified by factual knowledge.
To understand what types of content, and to what extent, LLMs are apt to hallucinate, we introduce the Hallucination Evaluation benchmark for Large Language Models (HaluEval).
arXiv Detail & Related papers (2023-05-19T15:36:27Z)