HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large
Language Models
- URL: http://arxiv.org/abs/2305.11747v3
- Date: Mon, 23 Oct 2023 01:49:32 GMT
- Title: HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large
Language Models
- Authors: Junyi Li, Xiaoxue Cheng, Wayne Xin Zhao, Jian-Yun Nie and Ji-Rong Wen
- Abstract summary: Large language models (LLMs) are prone to generating hallucinations, i.e., content that conflicts with the source or cannot be verified against factual knowledge.
To understand what types of content LLMs tend to hallucinate, and to what extent, we introduce the Hallucination Evaluation benchmark for Large Language Models (HaluEval).
- Score: 146.87696738011712
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs), such as ChatGPT, are prone to
generating hallucinations, i.e., content that conflicts with the source or
cannot be verified against factual knowledge. To understand what types of
content LLMs tend to hallucinate, and to what extent, we introduce the
Hallucination Evaluation benchmark for Large Language Models (HaluEval), a
large collection of generated and human-annotated hallucinated samples for
evaluating the performance of LLMs in recognizing hallucinations. To generate
these samples, we propose a ChatGPT-based two-step framework, i.e.,
sampling-then-filtering. In addition, we hire human labelers to annotate the
hallucinations in ChatGPT responses. The empirical results suggest that
ChatGPT is likely to generate hallucinated content on specific topics by
fabricating unverifiable information (in about $19.5\%$ of responses).
Moreover, existing LLMs face great challenges in recognizing hallucinations in
text. However, our experiments also show that providing external knowledge or
adding reasoning steps can help LLMs recognize hallucinations. Our benchmark
can be accessed at https://github.com/RUCAIBox/HaluEval.
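
To make the sampling-then-filtering idea above concrete, here is a minimal Python sketch of what such a two-step generation pipeline could look like, assuming the OpenAI chat completions API. The prompts, model name, and helper functions are illustrative assumptions, not the actual HaluEval prompts or code.

```python
# Minimal sketch of a "sampling-then-filtering" pipeline (assumptions noted in comments).
# Step 1 samples several candidate hallucinated answers; step 2 asks the model to
# pick the most plausible one. Prompts and the model name are illustrative only.
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set


def sample_hallucinated_answers(question: str, right_answer: str, n: int = 3) -> list[str]:
    """Sampling step: generate n fluent but factually wrong or unverifiable answers."""
    prompt = (
        "You are given a question and its correct answer. Write an answer that "
        "sounds plausible but is factually wrong or unverifiable.\n"
        f"Question: {question}\nCorrect answer: {right_answer}\nHallucinated answer:"
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        n=n,
        temperature=1.0,  # high temperature to diversify the samples
    )
    return [c.message.content.strip() for c in resp.choices]


def filter_most_plausible(question: str, right_answer: str, candidates: list[str]) -> str:
    """Filtering step: keep the candidate that is hardest to tell apart from the truth."""
    listing = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(candidates))
    prompt = (
        "Pick the hallucinated answer that is most plausible and hardest to "
        "distinguish from the correct answer. Reply with its number only.\n"
        f"Question: {question}\nCorrect answer: {right_answer}\nCandidates:\n{listing}"
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,  # deterministic selection
    )
    reply = resp.choices[0].message.content.strip()
    idx = int(reply.split()[0].rstrip(".")) - 1  # assumes the reply starts with a number
    return candidates[idx]


if __name__ == "__main__":
    question = "In which year was the Eiffel Tower completed?"
    candidates = sample_hallucinated_answers(question, right_answer="1889")
    print(filter_most_plausible(question, "1889", candidates))
```

This sketch only illustrates the generation side; in the benchmark described above, the selected hallucinated samples are paired with ground-truth responses so that LLMs can be tested on recognizing which response is hallucinated.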
Related papers
- ANAH: Analytical Annotation of Hallucinations in Large Language Models [65.12177400764506]
We present ANAH, a dataset that offers ANalytical Annotation of Hallucinations in Large Language Models.
ANAH consists of 12k sentence-level annotations for 4.3k LLM responses covering over 700 topics, constructed by a human-in-the-loop pipeline.
Thanks to the fine granularity of the hallucination annotations, we can quantitatively confirm that hallucinations of LLMs accumulate over the course of an answer, and we use ANAH to train and evaluate hallucination annotators.
arXiv Detail & Related papers (2024-05-30T17:54:40Z)
- Hallucination Diversity-Aware Active Learning for Text Summarization [46.00645048690819]
Large Language Models (LLMs) have shown propensity to generate hallucinated outputs, i.e., texts that are factually incorrect or unsupported.
Existing methods for alleviating hallucinations typically require costly human annotations to identify and correct hallucinations in LLM outputs.
We propose the first active learning framework to alleviate LLM hallucinations, reducing costly human annotations of hallucination needed.
arXiv Detail & Related papers (2024-04-02T02:30:27Z)
- Hal-Eval: A Universal and Fine-grained Hallucination Evaluation Framework for Large Vision Language Models [35.45859414670449]
We introduce a refined taxonomy of hallucinations, featuring a new category: Event Hallucination.
We then utilize advanced LLMs to generate and filter fine-grained hallucinatory data consisting of various types of hallucinations.
The proposed benchmark distinctively assesses LVLMs' ability to tackle a broad spectrum of hallucinations.
arXiv Detail & Related papers (2024-02-24T05:14:52Z)
- The Dawn After the Dark: An Empirical Study on Factuality Hallucination in Large Language Models [134.6697160940223]
Hallucination poses a great challenge to the trustworthy and reliable deployment of large language models.
Three key questions should be well studied: how to detect hallucinations (detection), why LLMs hallucinate (source), and what can be done to mitigate them (mitigation).
This work presents a systematic empirical study on LLM hallucination, focused on the three aspects of hallucination detection, source, and mitigation.
arXiv Detail & Related papers (2024-01-06T12:40:45Z)
- Alleviating Hallucinations of Large Language Models through Induced Hallucinations [67.35512483340837]
Large language models (LLMs) have been observed to generate responses that include inaccurate or fabricated information.
We propose a simple Induce-then-Contrast Decoding (ICD) strategy to alleviate hallucinations.
arXiv Detail & Related papers (2023-12-25T12:32:49Z)
- Evaluation and Analysis of Hallucination in Large Vision-Language Models [49.19829480199372]
Large Vision-Language Models (LVLMs) have recently achieved remarkable success.
However, LVLMs are still plagued by the hallucination problem.
Hallucination refers to information in LVLMs' responses that does not exist in the visual input.
arXiv Detail & Related papers (2023-08-29T08:51:24Z)
- Evaluating Object Hallucination in Large Vision-Language Models [122.40337582958453]
This work presents the first systematic study on object hallucination of large vision-language models (LVLMs).
We find that LVLMs tend to generate objects that are inconsistent with the target images in the descriptions.
We propose a polling-based query method called POPE to evaluate object hallucination.
arXiv Detail & Related papers (2023-05-17T16:34:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented here and is not responsible for any consequences arising from its use.