Halluverse-M^3: A multitask multilingual benchmark for hallucination in LLMs
- URL: http://arxiv.org/abs/2602.06920v1
- Date: Fri, 06 Feb 2026 18:16:09 GMT
- Title: Halluverse-M^3: A multitask multilingual benchmark for hallucination in LLMs
- Authors: Samir Abdaljalil, Parichit Sharma, Erchin Serpedin, Hasan Kurban
- Abstract summary: Halluverse-M3 is a dataset designed to enable systematic analysis of hallucinations across multiple languages. The dataset explicitly distinguishes between entity-level, relation-level, and sentence-level hallucinations. Halluverse-M3 provides a realistic and challenging benchmark for studying hallucinations in multilingual, multi-task settings.
- Score: 2.453830698820308
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Hallucinations in large language models remain a persistent challenge, particularly in multilingual and generative settings where factual consistency is difficult to maintain. While recent models show strong performance on English-centric benchmarks, their behavior across languages, tasks, and hallucination types is not yet well understood. In this work, we introduce Halluverse-M^3, a dataset designed to enable systematic analysis of hallucinations across multiple languages, multiple generation tasks, and multiple hallucination categories. Halluverse-M^3 covers four languages, English, Arabic, Hindi, and Turkish, and supports two generation tasks: question answering and dialogue summarization. The dataset explicitly distinguishes between entity-level, relation-level, and sentence-level hallucinations. Hallucinated outputs are constructed through a controlled editing process and validated by human annotators, ensuring clear alignment between original content and hallucinated generations. Using this dataset, we evaluate a diverse set of contemporary open-source and proprietary language models on fine-grained hallucination detection. Our results show that question answering is consistently easier than dialogue summarization, while sentence-level hallucinations remain challenging even for the strongest models. Performance is highest in English and degrades in lower-resource languages, with Hindi exhibiting the lowest detection accuracy. Overall, Halluverse-M^3 provides a realistic and challenging benchmark for studying hallucinations in multilingual, multi-task settings. We release the dataset to support future research on hallucination detection and mitigation (https://huggingface.co/datasets/sabdalja/HalluVerse-M3).
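The abstract reports detection performance broken down by language, task, and hallucination type. A minimal sketch of that kind of breakdown is given here, assuming each evaluated example carries a gold hallucination label and a model-predicted label. The record fields (`language`, `task`, `gold`, `pred`), the sample values, and the helper `accuracy_by` are hypothetical and are not taken from the released dataset or the paper.

```python
from collections import defaultdict

# Hypothetical records: field names and values are illustrative only,
# not the schema of the released dataset.
RECORDS = [
    {"language": "English", "task": "qa", "gold": "entity", "pred": "entity"},
    {"language": "English", "task": "summarization", "gold": "sentence", "pred": "relation"},
    {"language": "Hindi", "task": "qa", "gold": "relation", "pred": "relation"},
    {"language": "Hindi", "task": "summarization", "gold": "sentence", "pred": "entity"},
]


def accuracy_by(records, key):
    """Return {group: fraction of records whose predicted label equals the gold label}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r[key]] += 1
        correct[r[key]] += int(r["pred"] == r["gold"])
    return {group: correct[group] / total[group] for group in total}


if __name__ == "__main__":
    # Grouping by "gold" gives accuracy per hallucination type.
    for key in ("language", "task", "gold"):
        print(key, accuracy_by(RECORDS, key))
```

Grouping on one field at a time keeps the comparison simple; a joint breakdown (for example language by task) would need a tuple key instead.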
Related papers
- Investigating Hallucination in Conversations for Low Resource Languages [6.439114994667614]
Large Language Models (LLMs) have demonstrated remarkable proficiency in generating text that closely resembles human writing. They often generate factually incorrect statements, a problem typically referred to as 'hallucination'. This study extends this investigation to conversational data in three languages: Hindi, Farsi, and Mandarin. We found that LLMs produce very few hallucinated responses in Mandarin but generate a significantly higher number of hallucinations in Hindi and Farsi.
(2025-07-30T14:39:51Z)
- HalluVerse25: Fine-grained Multilingual Benchmark Dataset for LLM Hallucinations [2.3732122943029164]
We introduce HalluVerse25, a multilingual dataset that categorizes fine-grained hallucinations in English, Arabic, and Turkish. Our dataset construction pipeline uses an LLM to inject hallucinations into factual biographical sentences, followed by a rigorous human annotation process to ensure data quality.
(2025-03-10T20:24:07Z)
- Multilingual Hallucination Gaps in Large Language Models [5.505634045241288]
We study the phenomenon of hallucinations across multiple languages in freeform text generation.
These gaps reflect differences in the frequency of hallucinated answers depending on the prompt and language used.
Our results reveal variations in hallucination rates, especially between high and low resource languages.
(2024-10-23T20:41:51Z)
- LongHalQA: Long-Context Hallucination Evaluation for MultiModal Large Language Models [96.64960606650115]
LongHalQA is an LLM-free hallucination benchmark that comprises 6K long and complex hallucination texts.
LongHalQA features GPT4V-generated hallucinatory data that are well aligned with real-world scenarios.
(2024-10-13T18:59:58Z)
- Mitigating Multilingual Hallucination in Large Vision-Language Models [35.75851356840673]
We propose a two-stage Multilingual Hallucination Removal (MHR) framework for Large Vision-Language Models (LVLMs).
Instead of relying on the intricate manual annotations of multilingual resources, we propose a novel cross-lingual alignment method.
Our framework delivers an average increase of 19.0% in accuracy across 13 different languages.
(2024-08-01T13:34:35Z)
- ANAH-v2: Scaling Analytical Hallucination Annotation of Large Language Models [65.12177400764506]
Large language models (LLMs) exhibit hallucinations in long-form question-answering tasks across various domains and wide applications. Current hallucination detection and mitigation datasets are limited in domains and sizes. This paper introduces an iterative self-training framework that simultaneously and progressively scales up the hallucination annotation dataset.
(2024-07-05T17:56:38Z)
- VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models [59.05674402770661]
This work introduces VideoHallucer, the first comprehensive benchmark for hallucination detection in large video-language models (LVLMs).
VideoHallucer categorizes hallucinations into two main types: intrinsic and extrinsic, offering further subcategories for detailed analysis.
(2024-06-24T06:21:59Z)
- DiaHalu: A Dialogue-level Hallucination Evaluation Benchmark for Large Language Models [26.289847386286446]
We propose DiaHalu, the first dialogue-level hallucination evaluation benchmark to our knowledge.
We integrate the collected topics into system prompts and facilitate a dialogue between two ChatGPT3.5 instances.
We manually modify the contents that do not adhere to human language conventions and then have LLMs re-generate, simulating authentic human-machine interaction scenarios.
(2024-03-01T15:38:55Z)
- Fine-grained Hallucination Detection and Editing for Language Models [109.56911670376932]
Large language models (LMs) are prone to generate factual errors, which are often called hallucinations.
We introduce a comprehensive taxonomy of hallucinations and argue that hallucinations manifest in diverse forms.
We propose a novel task of automatic fine-grained hallucination detection and construct a new evaluation benchmark, FavaBench.
(2024-01-12T19:02:48Z)
- Hallucination Augmented Contrastive Learning for Multimodal Large Language Model [53.65682783591723]
Multi-modal large language models (MLLMs) have been shown to efficiently integrate natural language with visual information to handle multi-modal tasks.
However, MLLMs still face a fundamental limitation of hallucinations, where they tend to generate erroneous or fabricated information.
In this paper, we address hallucinations in MLLMs from a novel perspective of representation learning.
(2023-12-12T04:05:15Z)
- Hallucinations in Large Multilingual Translation Models [70.10455226752015]
Large-scale multilingual machine translation systems have demonstrated remarkable ability to translate directly between numerous languages.
When deployed in the wild, these models may generate hallucinated translations which have the potential to severely undermine user trust and raise safety concerns.
Existing research on hallucinations has primarily focused on small bilingual models trained on high-resource languages.
(2023-03-28T16:17:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.