HalluVerse25: Fine-grained Multilingual Benchmark Dataset for LLM Hallucinations
- URL: http://arxiv.org/abs/2503.07833v1
- Date: Mon, 10 Mar 2025 20:24:07 GMT
- Title: HalluVerse25: Fine-grained Multilingual Benchmark Dataset for LLM Hallucinations
- Authors: Samir Abdaljalil, Hasan Kurban, Erchin Serpedin,
- Abstract summary: We introduce HalluVerse25, a multilingual dataset that categorizes fine-grained hallucinations in English, Arabic, and Turkish.<n>Our dataset construction pipeline uses an LLM to inject hallucinations into factual biographical sentences, followed by a rigorous human annotation process to ensure data quality.
- Score: 2.3732122943029164
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) are increasingly used in various contexts, yet remain prone to generating non-factual content, commonly referred to as "hallucinations". The literature categorizes hallucinations into several types, including entity-level, relation-level, and sentence-level hallucinations. However, existing hallucination datasets often fail to capture fine-grained hallucinations in multilingual settings. In this work, we introduce HalluVerse25, a multilingual LLM hallucination dataset that categorizes fine-grained hallucinations in English, Arabic, and Turkish. Our dataset construction pipeline uses an LLM to inject hallucinations into factual biographical sentences, followed by a rigorous human annotation process to ensure data quality. We evaluate several LLMs on HalluVerse25, providing valuable insights into how proprietary models perform in detecting LLM-generated hallucinations across different contexts.
Related papers
- How Much Do LLMs Hallucinate across Languages? On Multilingual Estimation of LLM Hallucination in the Wild [11.82100047858478]
hallucination is the tendency of Large Language Models to generate non-factual or unfaithful responses.<n>We train a multilingual hallucination detection model and conduct a large-scale study across 30 languages.<n>We find that while LLMs generate longer responses with more hallucinated tokens for higher-resource languages, there is no correlation between length-normalized hallucination rates of languages and their digital representation.
arXiv Detail & Related papers (2025-02-18T11:32:43Z) - Multilingual Hallucination Gaps in Large Language Models [5.505634045241288]
We study the phenomenon of hallucinations across multiple languages in freeform text generation.
These gaps reflect differences in the frequency of hallucinated answers depending on the prompt and language used.
Our results reveal variations in hallucination rates, especially between high and low resource languages.
arXiv Detail & Related papers (2024-10-23T20:41:51Z) - LongHalQA: Long-Context Hallucination Evaluation for MultiModal Large Language Models [96.64960606650115]
LongHalQA is an LLM-free hallucination benchmark that comprises 6K long and complex hallucination text.
LongHalQA is featured by GPT4V-generated hallucinatory data that are well aligned with real-world scenarios.
arXiv Detail & Related papers (2024-10-13T18:59:58Z) - ANAH-v2: Scaling Analytical Hallucination Annotation of Large Language Models [65.12177400764506]
Large language models (LLMs) exhibit hallucinations in long-form question-answering tasks across various domains and wide applications.<n>Current hallucination detection and mitigation datasets are limited in domains and sizes.<n>This paper introduces an iterative self-training framework that simultaneously and progressively scales up the hallucination annotation dataset.
arXiv Detail & Related papers (2024-07-05T17:56:38Z) - ANAH: Analytical Annotation of Hallucinations in Large Language Models [65.12177400764506]
We present $textbfANAH$, a dataset that offers $textbfAN$alytical $textbfA$nnotation of hallucinations in Large Language Models.
ANAH consists of 12k sentence-level annotations for 4.3k LLM responses covering over 700 topics, constructed by a human-in-the-loop pipeline.
Thanks to the fine granularity of the hallucination annotations, we can quantitatively confirm that the hallucinations of LLMs accumulate in the answer and use ANAH to train and evaluate hallucination annotators.
arXiv Detail & Related papers (2024-05-30T17:54:40Z) - Mitigating Object Hallucination in MLLMs via Data-augmented Phrase-level Alignment [52.43197107069751]
Multimodal Large Language Models (MLLMs) often generate factually inaccurate information, referred to as hallucination.
We introduce Data-augmented Phrase-level Alignment (DPA), a novel loss which can be applied to instruction-tuned off-the-shelf MLLMs to mitigate hallucinations.
arXiv Detail & Related papers (2024-05-28T23:36:00Z) - Hallucination Diversity-Aware Active Learning for Text Summarization [46.00645048690819]
Large Language Models (LLMs) have shown propensity to generate hallucinated outputs, i.e., texts that are factually incorrect or unsupported.
Existing methods for alleviating hallucinations typically require costly human annotations to identify and correct hallucinations in LLM outputs.
We propose the first active learning framework to alleviate LLM hallucinations, reducing costly human annotations of hallucination needed.
arXiv Detail & Related papers (2024-04-02T02:30:27Z) - DiaHalu: A Dialogue-level Hallucination Evaluation Benchmark for Large Language Models [26.289847386286446]
We propose DiaHalu, the first dialogue-level hallucination evaluation benchmark to our knowledge.
We integrate the collected topics into system prompts and facilitate a dialogue between two ChatGPT3.5.
We manually modify the contents that do not adhere to human language conventions and then have LLMs re-generate, simulating authentic human-machine interaction scenarios.
arXiv Detail & Related papers (2024-03-01T15:38:55Z) - Hal-Eval: A Universal and Fine-grained Hallucination Evaluation Framework for Large Vision Language Models [35.45859414670449]
We introduce a refined taxonomy of hallucinations, featuring a new category: Event Hallucination.
We then utilize advanced LLMs to generate and filter fine grained hallucinatory data consisting of various types of hallucinations.
The proposed benchmark distinctively assesses LVLMs ability to tackle a broad spectrum of hallucinations.
arXiv Detail & Related papers (2024-02-24T05:14:52Z) - The Dawn After the Dark: An Empirical Study on Factuality Hallucination
in Large Language Models [134.6697160940223]
hallucination poses great challenge to trustworthy and reliable deployment of large language models.
Three key questions should be well studied: how to detect hallucinations (detection), why do LLMs hallucinate (source), and what can be done to mitigate them.
This work presents a systematic empirical study on LLM hallucination, focused on the the three aspects of hallucination detection, source and mitigation.
arXiv Detail & Related papers (2024-01-06T12:40:45Z) - HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large
Language Models [146.87696738011712]
Large language models (LLMs) are prone to generate hallucinations, i.e., content that conflicts with the source or cannot be verified by the factual knowledge.
To understand what types of content and to which extent LLMs are apt to hallucinate, we introduce the Hallucination Evaluation benchmark for Large Language Models (HaluEval)
arXiv Detail & Related papers (2023-05-19T15:36:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.