WildHallucinations: Evaluating Long-form Factuality in LLMs with Real-World Entity Queries
- URL: http://arxiv.org/abs/2407.17468v1
- Date: Wed, 24 Jul 2024 17:59:05 GMT
- Title: WildHallucinations: Evaluating Long-form Factuality in LLMs with Real-World Entity Queries
- Authors: Wenting Zhao, Tanya Goyal, Yu Ying Chiu, Liwei Jiang, Benjamin Newman, Abhilasha Ravichander, Khyathi Chandu, Ronan Le Bras, Claire Cardie, Yuntian Deng, Yejin Choi
- Abstract summary: We introduce WildHallucinations, a benchmark that evaluates the factuality of LLM outputs.
It does so by prompting large language models to generate information about entities mined from user-chatbot conversations in the wild.
We evaluate 118,785 generations from 15 LLMs on 7,919 entities.
- Score: 64.239202960816
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While hallucinations of large language models (LLMs) prevail as a major challenge, existing evaluation benchmarks on factuality do not cover the diverse domains of knowledge that real-world users of LLMs seek information about. To bridge this gap, we introduce WildHallucinations, a benchmark that evaluates factuality. It does so by prompting LLMs to generate information about entities mined from user-chatbot conversations in the wild. These generations are then automatically fact-checked against a systematically curated knowledge source collected from web search. Notably, half of these real-world entities do not have associated Wikipedia pages. We evaluate 118,785 generations from 15 LLMs on 7,919 entities. We find that LLMs consistently hallucinate more on entities without Wikipedia pages and exhibit varying hallucination rates across different domains. Finally, given the same base models, adding a retrieval component only slightly reduces hallucinations but does not eliminate them.
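The evaluation loop the abstract describes (prompt an LLM about an entity, then automatically fact-check the generation against a web-curated knowledge source) can be pictured with the minimal Python sketch below. This is an illustrative sketch only, not the authors' released code: the `generate`, `extract_claims`, and `verify` components and the `Entity`/`hallucination_rate` names are hypothetical placeholders, and the toy stand-ins at the bottom exist only to make the example runnable.

```python
# Hypothetical sketch of a WildHallucinations-style evaluation loop:
# generate text about an entity, split it into atomic claims, verify each
# claim against a curated knowledge source, and report a hallucination rate.
# All component implementations below are placeholders, not the paper's code.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Entity:
    name: str
    knowledge: list[str]        # curated reference passages from web search
    has_wikipedia: bool = False

def hallucination_rate(
    entity: Entity,
    generate: Callable[[str], str],                # LLM under evaluation
    extract_claims: Callable[[str], list[str]],    # claim decomposition step
    verify: Callable[[str, list[str]], bool],      # claim-vs-knowledge checker
) -> float:
    """Fraction of generated claims not supported by the knowledge source."""
    generation = generate(f"Tell me about {entity.name}.")
    claims = extract_claims(generation)
    if not claims:
        return 0.0
    unsupported = sum(1 for c in claims if not verify(c, entity.knowledge))
    return unsupported / len(claims)

# Toy usage with trivial stand-ins for the three components.
if __name__ == "__main__":
    entity = Entity(
        name="Example Corp",
        knowledge=["Example Corp was founded in 2010.", "It is based in Oslo."],
    )
    rate = hallucination_rate(
        entity,
        generate=lambda prompt: "Example Corp was founded in 2010. It is based in Paris.",
        extract_claims=lambda text: [s.strip() for s in text.split(".") if s.strip()],
        verify=lambda claim, docs: any(claim in d for d in docs),
    )
    print(f"hallucination rate: {rate:.2f}")  # 0.50 with this toy checker
```

In the benchmark itself, the fact-checking step runs against knowledge systematically collected from web search rather than the substring match used above, which matters because roughly half of the evaluated entities have no Wikipedia page.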
Related papers
- FaithBench: A Diverse Hallucination Benchmark for Summarization by Modern LLMs [2.871226288151562]
This paper introduces FaithBench, a summarization hallucination benchmark comprising challenging hallucinations made by 10 modern LLMs.
Our results show GPT-4o and GPT-3.5-Turbo produce the least hallucinations.
Even the best hallucination detection models achieve accuracies near 50% on FaithBench, indicating substantial room for future improvement.
arXiv Detail & Related papers (2024-10-17T04:30:46Z) - LongHalQA: Long-Context Hallucination Evaluation for MultiModal Large Language Models [96.64960606650115]
LongHalQA is an LLM-free hallucination benchmark that comprises 6K long and complex hallucination texts.
LongHalQA features GPT4V-generated hallucinatory data that are well aligned with real-world scenarios.
arXiv Detail & Related papers (2024-10-13T18:59:58Z) - Exploring and Evaluating Hallucinations in LLM-Powered Code Generation [14.438161741833687]
Large Language Models (LLMs) can produce outputs that deviate from users' intent, exhibit internal inconsistencies, or misalign with factual knowledge.
Existing work mainly focuses on investigating hallucination in the domain of natural language generation.
We conduct a thematic analysis of the LLM-generated code to summarize and categorize the hallucinations present in it.
We propose HalluCode, a benchmark for evaluating the performance of code LLMs in recognizing hallucinations.
arXiv Detail & Related papers (2024-04-01T07:31:45Z) - Hallucination is Inevitable: An Innate Limitation of Large Language
Models [3.8711997449980844]
We show that it is impossible to eliminate hallucination in large language models.
Since the formal world is a part of the real world, which is much more complicated, hallucinations are also inevitable for real-world LLMs.
arXiv Detail & Related papers (2024-01-22T10:26:14Z) - Fine-grained Hallucination Detection and Editing for Language Models [109.56911670376932]
Large language models (LMs) are prone to generate factual errors, which are often called hallucinations.
We introduce a comprehensive taxonomy of hallucinations and argue that hallucinations manifest in diverse forms.
We propose a novel task of automatic fine-grained hallucination detection and construct a new evaluation benchmark, FavaBench.
arXiv Detail & Related papers (2024-01-12T19:02:48Z) - The Dawn After the Dark: An Empirical Study on Factuality Hallucination in Large Language Models [134.6697160940223]
Hallucination poses a great challenge to the trustworthy and reliable deployment of large language models.
Three key questions should be well studied: how to detect hallucinations (detection), why LLMs hallucinate (source), and what can be done to mitigate them (mitigation).
This work presents a systematic empirical study of LLM hallucination, focused on the three aspects of hallucination detection, source, and mitigation.
arXiv Detail & Related papers (2024-01-06T12:40:45Z) - Analyzing and Mitigating Object Hallucination in Large Vision-Language Models [110.12460299261531]
Large vision-language models (LVLMs) have shown remarkable abilities in understanding visual information with human languages.
However, LVLMs still suffer from object hallucination, i.e., generating descriptions that include objects not actually present in the images.
We propose a powerful algorithm, LVLM Hallucination Revisor (LURE), to rectify object hallucination in LVLMs by reconstructing less hallucinatory descriptions.
arXiv Detail & Related papers (2023-10-01T18:10:53Z) - HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models [146.87696738011712]
Large language models (LLMs) are prone to generate hallucinations, i.e., content that conflicts with the source or cannot be verified against factual knowledge.
To understand what types of content, and to what extent, LLMs are prone to hallucinate, we introduce the Hallucination Evaluation benchmark for Large Language Models (HaluEval).
arXiv Detail & Related papers (2023-05-19T15:36:27Z)