Evaluating Hallucinations in Chinese Large Language Models
- URL: http://arxiv.org/abs/2310.03368v4
- Date: Wed, 25 Oct 2023 07:49:37 GMT
- Title: Evaluating Hallucinations in Chinese Large Language Models
- Authors: Qinyuan Cheng, Tianxiang Sun, Wenwei Zhang, Siyin Wang, Xiangyang Liu,
Mozhi Zhang, Junliang He, Mianqiu Huang, Zhangyue Yin, Kai Chen, Xipeng Qiu
- Abstract summary: We establish a benchmark named HalluQA (Chinese Hallucination Question-Answering) to measure the hallucination phenomenon in Chinese large language models.
We consider two types of hallucinations: imitative falsehoods and factual errors, and we construct adversarial samples based on GLM-130B and ChatGPT.
For evaluation, we design an automated evaluation method using GPT-4 to judge whether a model output is hallucinated.
- Score: 65.4771562909392
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we establish a benchmark named HalluQA (Chinese Hallucination
Question-Answering) to measure the hallucination phenomenon in Chinese large
language models. HalluQA contains 450 meticulously designed adversarial
questions, spanning multiple domains, and takes into account Chinese historical
culture, customs, and social phenomena. During the construction of HalluQA, we
consider two types of hallucinations: imitative falsehoods and factual errors,
and we construct adversarial samples based on GLM-130B and ChatGPT. For
evaluation, we design an automated evaluation method using GPT-4 to judge
whether a model output is hallucinated. We conduct extensive experiments on 24
large language models, including ERNIE-Bot, Baichuan2, ChatGLM, Qwen, and SparkDesk,
among others. Out of the 24 models, 18 achieved non-hallucination rates lower than
50%. This indicates that HalluQA is highly challenging. We analyze the primary
types of hallucinations in different types of models and their causes.
Additionally, we discuss which types of hallucinations should be prioritized
for different types of models.
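As a rough illustration of the automated GPT-4-as-judge evaluation the abstract describes, here is a minimal sketch assuming the OpenAI Python client; the judge prompt, the JSONL field names, and the halluqa.jsonl path are illustrative placeholders, not the authors' actual setup.
```python
# Sketch of a GPT-4-as-judge hallucination check plus a non-hallucination-rate metric.
# Assumptions (not from the paper): OpenAI Python client, a JSONL file of
# {"question": ..., "reference": ..., "model_answer": ...} records, and this prompt.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = (
    "You are given a question, reference facts, and a model's answer.\n"
    "Reply with exactly 'HALLUCINATED' if the answer contradicts the reference facts "
    "or invents unsupported claims, otherwise reply with exactly 'OK'.\n\n"
    "Question: {question}\nReference: {reference}\nAnswer: {answer}"
)

def is_hallucinated(question: str, reference: str, answer: str) -> bool:
    """Ask GPT-4 to judge a single answer; True means judged hallucinated."""
    resp = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, answer=answer)}],
    )
    return resp.choices[0].message.content.strip().upper().startswith("HALLUCINATED")

def non_hallucination_rate(path: str = "halluqa.jsonl") -> float:
    """Fraction of answers judged non-hallucinated (the per-model metric reported)."""
    records = [json.loads(line) for line in open(path, encoding="utf-8")]
    ok = sum(
        not is_hallucinated(r["question"], r["reference"], r["model_answer"])
        for r in records
    )
    return ok / len(records)

if __name__ == "__main__":
    print(f"non-hallucination rate: {non_hallucination_rate():.1%}")
```
Under this metric, a model answering HalluQA's 450 questions needs at least 225 answers judged non-hallucinated to clear the 50% bar that 18 of the 24 evaluated models missed.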
Related papers
- Hallu-PI: Evaluating Hallucination in Multi-modal Large Language Models within Perturbed Inputs [54.50483041708911]
Hallu-PI is the first benchmark designed to evaluate hallucination in MLLMs within Perturbed Inputs.
Hallu-PI consists of seven perturbed scenarios, containing 1,260 perturbed images from 11 object types.
Our research reveals a severe bias in MLLMs' ability to handle different types of hallucinations.
arXiv Detail & Related papers (2024-08-02T16:07:15Z) - Knowledge Overshadowing Causes Amalgamated Hallucination in Large Language Models [65.32990889402927]
We coin this phenomenon "knowledge overshadowing".
We show that the hallucination rate grows with both the imbalance ratio and the length of dominant condition description.
We propose to utilize overshadowing conditions as a signal to catch hallucination before it is produced.
arXiv Detail & Related papers (2024-07-10T20:37:42Z) - VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models [59.05674402770661]
This work introduces VideoHallucer, the first comprehensive benchmark for hallucination detection in large video-language models (LVLMs).
VideoHallucer categorizes hallucinations into two main types: intrinsic and extrinsic, offering further subcategories for detailed analysis.
arXiv Detail & Related papers (2024-06-24T06:21:59Z) - On Large Language Models' Hallucination with Regard to Known Facts [74.96789694959894]
Large language models are successful in answering factoid questions but are also prone to hallucination.
We investigate the phenomenon of LLMs possessing correct answer knowledge yet still hallucinating from the perspective of inference dynamics.
Our study sheds light on the reasons for LLMs' hallucinations about facts they know and, more importantly, on accurately predicting when they are hallucinating.
arXiv Detail & Related papers (2024-03-29T06:48:30Z) - Fine-grained Hallucination Detection and Editing for Language Models [109.56911670376932]
Large language models (LMs) are prone to generating factual errors, which are often called hallucinations.
We introduce a comprehensive taxonomy of hallucinations and argue that hallucinations manifest in diverse forms.
We propose a novel task of automatic fine-grained hallucination detection and construct a new evaluation benchmark, FavaBench.
arXiv Detail & Related papers (2024-01-12T19:02:48Z) - Understanding and Detecting Hallucinations in Neural Machine Translation
via Model Introspection [28.445196622710164]
We first identify internal model symptoms of hallucinations by analyzing the relative token contributions to the generation in contrastive hallucinated vs. non-hallucinated outputs generated via source perturbations.
We then show that these symptoms are reliable indicators of natural hallucinations, by using them to design a lightweight hallucination detector.
arXiv Detail & Related papers (2023-01-18T20:43:13Z) - On the Origin of Hallucinations in Conversational Models: Is it the
- On the Origin of Hallucinations in Conversational Models: Is it the Datasets or the Models? [32.41234580068662]
We conduct a study on existing knowledge-grounded conversational benchmarks and several state-of-the-art models.
Standard benchmarks consist of >60% hallucinated responses, leading to models that not only hallucinate but even amplify hallucinations.
Our findings raise important questions on the quality of existing datasets and models trained using them.
arXiv Detail & Related papers (2022-04-17T05:15:24Z)