Poly-FEVER: A Multilingual Fact Verification Benchmark for Hallucination Detection in Large Language Models
- URL: http://arxiv.org/abs/2503.16541v2
- Date: Wed, 26 Mar 2025 23:53:56 GMT
- Title: Poly-FEVER: A Multilingual Fact Verification Benchmark for Hallucination Detection in Large Language Models
- Authors: Hanzhi Zhang, Sumera Anjum, Heng Fan, Weijian Zheng, Yan Huang, Yunhe Feng
- Abstract summary: Hallucinations in generative AI, particularly in Large Language Models (LLMs), pose a significant challenge to the reliability of multilingual applications. Existing benchmarks for hallucination detection focus primarily on English and a few widely spoken languages. We introduce Poly-FEVER, a large-scale multilingual fact verification benchmark.
- Score: 10.663446796160567
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Hallucinations in generative AI, particularly in Large Language Models (LLMs), pose a significant challenge to the reliability of multilingual applications. Existing benchmarks for hallucination detection focus primarily on English and a few widely spoken languages, lacking the breadth to assess inconsistencies in model performance across diverse linguistic contexts. To address this gap, we introduce Poly-FEVER, a large-scale multilingual fact verification benchmark specifically designed for evaluating hallucination detection in LLMs. Poly-FEVER comprises 77,973 labeled factual claims spanning 11 languages, sourced from FEVER, Climate-FEVER, and SciFact. It provides the first large-scale dataset tailored for analyzing hallucination patterns across languages, enabling systematic evaluation of LLMs such as ChatGPT and the LLaMA series. Our analysis reveals how topic distribution and web resource availability influence hallucination frequency, uncovering language-specific biases that impact model accuracy. By offering a multilingual benchmark for fact verification, Poly-FEVER facilitates cross-linguistic comparisons of hallucination detection and contributes to the development of more reliable, language-inclusive AI systems. The dataset is publicly available to advance research in responsible AI, fact-checking methodologies, and multilingual NLP, promoting greater transparency and robustness in LLM performance. The proposed Poly-FEVER is available at: https://huggingface.co/datasets/HanzhiZhang/Poly-FEVER.
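Because the benchmark is hosted on Hugging Face, its claims can be pulled with the `datasets` library and framed as FEVER-style verification prompts for any LLM. The sketch below is illustrative only: the split name and the column names (`claim`, `label`, `language`) are assumptions, so verify them against the dataset card before use.

```python
# A minimal sketch, assuming a "train" split with "claim", "label", and
# "language" columns -- check the Poly-FEVER dataset card to confirm.
from datasets import load_dataset

ds = load_dataset("HanzhiZhang/Poly-FEVER", split="train")

def build_prompt(example: dict) -> str:
    # Frame each claim as a FEVER-style classification task for an LLM.
    return (
        "Classify the claim as SUPPORTS, REFUTES, or NOT ENOUGH INFO. "
        "Answer with the label only.\n"
        f"Claim: {example['claim']}"
    )

sample = ds[0]
print(sample.get("language"), sample.get("label"))
print(build_prompt(sample))
```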
Related papers
- Large Language Models for Multilingual Previously Fact-Checked Claim Detection [3.694429692322632]
This paper presents the first comprehensive evaluation of large language models (LLMs) for multilingual previously fact-checked claim detection. We assess seven LLMs across 20 languages in both monolingual and cross-lingual settings. Our results show that while LLMs perform well for high-resource languages, they struggle with low-resource languages.
arXiv Detail & Related papers (2025-03-04T15:56:43Z)
- Beyond Translation: LLM-Based Data Generation for Multilingual Fact-Checking [2.321323878201932]
MultiSynFact is the first large-scale multilingual fact-checking dataset, containing 2.2M claim-source pairs. Our dataset generation pipeline leverages Large Language Models (LLMs), integrating external knowledge from Wikipedia. We open-source a user-friendly framework to facilitate further research in multilingual fact-checking and dataset generation.
arXiv Detail & Related papers (2025-02-21T12:38:26Z)
- P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs [84.24644520272835]
Large language models (LLMs) showcase varied multilingual capabilities across tasks like translation, code generation, and reasoning.
Previous assessments often limited their scope to fundamental natural language processing (NLP) or isolated capability-specific tasks.
We present a pipeline for selecting available and reasonable benchmarks from a massive pool, addressing the oversight in previous work regarding the utility of these benchmarks.
We introduce P-MMEval, a large-scale benchmark covering effective fundamental and capability-specialized datasets.
arXiv Detail & Related papers (2024-11-14T01:29:36Z)
- LargePiG: Your Large Language Model is Secretly a Pointer Generator [15.248956952849259]
We introduce relevance hallucination and factuality hallucination as a new typology for the hallucination problems that arise in query generation with Large Language Models (LLMs).
We propose an effective way to separate content from form in LLM-generated queries, which preserves the factual knowledge extracted and integrated from the inputs and compiles the syntactic structure, including function words, using the powerful linguistic capabilities of the LLM.
arXiv Detail & Related papers (2024-10-15T07:41:40Z)
- Multilingual Needle in a Haystack: Investigating Long-Context Behavior of Multilingual Large Language Models [22.859955360764275]
We introduce the MultiLingual Needle-in-a-Haystack (MLNeedle) test to assess a model's ability to retrieve relevant information.
We evaluate four state-of-the-art large language models on MLNeedle.
arXiv Detail & Related papers (2024-08-19T17:02:06Z)
- Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models [62.91524967852552]
Large language models (LLMs) are typically multilingual due to pretraining on diverse multilingual corpora. But can these models relate corresponding concepts across languages, i.e., be crosslingual? This study evaluates state-of-the-art LLMs on inherently crosslingual tasks.
arXiv Detail & Related papers (2024-06-23T15:15:17Z)
- Comparing Hallucination Detection Metrics for Multilingual Generation [62.97224994631494]
This paper assesses how well various factual hallucination detection metrics identify hallucinations in generated biographical summaries across languages.
We compare how well automatic metrics correlate to each other and whether they agree with human judgments of factuality.
Our analysis reveals that while the lexical metrics are ineffective, NLI-based metrics perform well, correlating with human annotations in many settings and often outperforming supervised models (a minimal NLI-scoring sketch appears after this list).
arXiv Detail & Related papers (2024-02-16T08:10:34Z)
- Do We Need Language-Specific Fact-Checking Models? The Case of Chinese [15.619421104102516]
This paper investigates the potential benefits of language-specific fact-checking models, focusing on the case of Chinese.
We first demonstrate the limitations of translation-based methods and multilingual large language models, highlighting the need for language-specific systems.
We propose a Chinese fact-checking system that can better retrieve evidence from a document by incorporating context information.
arXiv Detail & Related papers (2024-01-27T20:26:03Z)
- AutoHall: Automated Hallucination Dataset Generation for Large Language Models [56.92068213969036]
This paper introduces AutoHall, a method for automatically constructing model-specific hallucination datasets from existing fact-checking datasets.
We also propose a zero-resource and black-box hallucination detection method based on self-contradiction.
arXiv Detail & Related papers (2023-09-30T05:20:02Z)
- mFACE: Multilingual Summarization with Factual Consistency Evaluation [79.60172087719356]
Abstractive summarization has enjoyed renewed interest in recent years, thanks to pre-trained language models and the availability of large-scale datasets.
Despite promising results, current models still suffer from generating factually inconsistent summaries.
We leverage factual consistency evaluation models to improve multilingual summarization.
arXiv Detail & Related papers (2022-12-20T19:52:41Z)
- Analyzing the Mono- and Cross-Lingual Pretraining Dynamics of Multilingual Language Models [73.11488464916668]
This study investigates the dynamics of the multilingual pretraining process.
We probe checkpoints taken from throughout XLM-R pretraining, using a suite of linguistic tasks.
Our analysis shows that the model achieves high in-language performance early on, with lower-level linguistic skills acquired before more complex ones.
arXiv Detail & Related papers (2022-05-24T03:35:00Z)
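As a companion to the metric comparison referenced above, the following is a minimal sketch of an NLI-based factuality check: a generated claim is scored by the entailment probability a multilingual NLI model assigns to it given reference evidence. The checkpoint name is an assumption, not the metric from any specific paper in this list; any multilingual NLI model with entailment/neutral/contradiction labels would serve.

```python
# Minimal sketch of an NLI-based factuality score. The checkpoint is an
# assumption; substitute any multilingual NLI classifier you trust.
from transformers import pipeline

nli = pipeline(
    "text-classification",
    model="MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7",
)

def entailment_score(evidence: str, claim: str) -> float:
    # Probability that the evidence entails the claim; higher = better supported.
    scores = nli({"text": evidence, "text_pair": claim}, top_k=None)
    return next(s["score"] for s in scores if s["label"].lower() == "entailment")

print(entailment_score(
    "Paris has been the capital of France for centuries.",
    "The capital of France is Paris.",
))
```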