Related papers: Comparing Hallucination Detection Metrics for Multilingual Generation

URL: http://arxiv.org/abs/2402.10496v2
Date: Sun, 16 Jun 2024 00:44:28 GMT
Title: Comparing Hallucination Detection Metrics for Multilingual Generation
Authors: Haoqiang Kang, Terra Blevins, Luke Zettlemoyer
Abstract summary: This paper assesses how well various factual hallucination detection metrics identify hallucinations in generated biographical summaries across languages. We compare how well automatic metrics correlate with one another and whether they agree with human judgments of factuality. Our analysis reveals that while the lexical metrics are ineffective, NLI-based metrics perform well, correlating with human annotations in many settings and often outperforming supervised models.
Score: 62.97224994631494
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While many hallucination detection techniques have been evaluated on English text, their effectiveness in multilingual contexts remains unknown. This paper assesses how well various factual hallucination detection metrics (lexical metrics like ROUGE and Named Entity Overlap, and Natural Language Inference (NLI)-based metrics) identify hallucinations in generated biographical summaries across languages. We compare how well automatic metrics correlate with one another and whether they agree with human judgments of factuality. Our analysis reveals that while the lexical metrics are ineffective, NLI-based metrics perform well, correlating with human annotations in many settings and often outperforming supervised models. However, NLI metrics are still limited, as they do not detect single-fact hallucinations well and fail for lower-resource languages. Therefore, our findings highlight the gaps in existing hallucination detection methods for non-English languages and motivate future research to develop more robust multilingual detection methods for LLM hallucinations.
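
To make the idea of a lexical metric concrete, here is a minimal sketch of a ROUGE-1-style unigram overlap score. It is an illustration only and not the paper's implementation: the function names `tokenize` and `rouge1`, the regex tokenizer, and the example sentences are all assumptions made for this sketch. The tokenizer also assumes whitespace-delimited text, which does not hold for every language.

```python
import re
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase, then keep runs of Unicode word characters.
    # This is a simplification; real ROUGE toolkits apply their own preprocessing.
    return re.findall(r"\w+", text.lower())


def rouge1(candidate, reference):
    # Count unigrams in each text.
    cand = Counter(tokenize(candidate))
    ref = Counter(tokenize(reference))
    # Clipped overlap: each token counts at most as often as it appears in both texts.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Made-up example: the candidate differs from the reference in one fact.
    scores = rouge1(
        "Marie Curie was born in Paris",
        "Marie Curie was born in Warsaw",
    )
    print(scores)
```

In this made-up example, five of the six candidate tokens match the reference, so the score stays high even though the one differing token changes the stated fact. Pure token overlap is blind to that kind of error, which gives some intuition for why a lexical score can be a weak signal of factuality.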

Related papers

Evaluating Evaluation Metrics -- The Mirage of Hallucination Detection [26.521892016176036]
Hallucinations pose a significant obstacle to the reliability and widespread adoption of language models. We conduct a large-scale empirical evaluation of hallucination detection metrics across 4 datasets, 37 language models from 5 families, and 5 decoding methods.
arXiv Detail & Related papers (2025-04-25T06:37:29Z)
Poly-FEVER: A Multilingual Fact Verification Benchmark for Hallucination Detection in Large Language Models [10.663446796160567]
Hallucinations in generative AI, particularly in Large Language Models (LLMs), pose a significant challenge to the reliability of multilingual applications. Existing benchmarks for hallucination detection focus primarily on English and a few widely spoken languages. We introduce Poly-FEVER, a large-scale multilingual fact verification benchmark.
arXiv Detail & Related papers (2025-03-19T01:46:09Z)
How Much Do LLMs Hallucinate across Languages? On Multilingual Estimation of LLM Hallucination in the Wild [11.82100047858478]
Hallucination is the tendency of Large Language Models to generate non-factual or unfaithful responses. We train a multilingual hallucination detection model and conduct a large-scale study across 30 languages. We find that while LLMs generate longer responses with more hallucinated tokens for higher-resource languages, there is no correlation between length-normalized hallucination rates of languages and their digital representation.
arXiv Detail & Related papers (2025-02-18T11:32:43Z)
Dialectal Toxicity Detection: Evaluating LLM-as-a-Judge Consistency Across Language Varieties [23.777874316083984]
There has been little systematic study on how dialectal differences affect toxicity detection by modern LLMs. We create a multi-dialect dataset through synthetic transformations and human-assisted translations, covering 10 language clusters and 60 varieties. We then evaluated three LLMs on their ability to assess toxicity across multilingual, dialectal, and LLM-human consistency.
arXiv Detail & Related papers (2024-11-17T03:53:24Z)
Multilingual Hallucination Gaps in Large Language Models [5.505634045241288]
We study the phenomenon of hallucinations across multiple languages in freeform text generation, focusing on multilingual hallucination gaps. These gaps reflect differences in the frequency of hallucinated answers depending on the prompt and language used. Our results reveal variations in hallucination rates, especially between high- and low-resource languages.
arXiv Detail & Related papers (2024-10-23T20:41:51Z)
Hallucination Detection: Robustly Discerning Reliable Answers in Large Language Models [70.19081534515371]
Large Language Models (LLMs) have gained widespread adoption in various natural language processing tasks. They generate unfaithful or inconsistent content that deviates from the input source, leading to severe consequences. We propose a robust discriminator named RelD to effectively detect hallucination in LLMs' generated answers.
arXiv Detail & Related papers (2024-07-04T18:47:42Z)
Fine-grained Hallucination Detection and Editing for Language Models [109.56911670376932]
Large language models (LMs) are prone to generate factual errors, which are often called hallucinations. We introduce a comprehensive taxonomy of hallucinations and argue that hallucinations manifest in diverse forms. We propose a novel task of automatic fine-grained hallucination detection and construct a new evaluation benchmark, FavaBench.
arXiv Detail & Related papers (2024-01-12T19:02:48Z)
Enhancing Uncertainty-Based Hallucination Detection with Stronger Focus [99.33091772494751]
Large Language Models (LLMs) have gained significant popularity for their impressive performance across diverse fields. LLMs are prone to hallucinate untruthful or nonsensical outputs that fail to meet user expectations. We propose a novel reference-free, uncertainty-based method for detecting hallucinations in LLMs.
arXiv Detail & Related papers (2023-11-22T08:39:17Z)
A New Benchmark and Reverse Validation Method for Passage-level Hallucination Detection [63.56136319976554]
Large Language Models (LLMs) generate hallucinations, which can cause significant damage when deployed for mission-critical tasks. We propose a self-check approach based on reverse validation to detect factual errors automatically in a zero-resource fashion. We empirically evaluate our method and existing zero-resource detection methods on two datasets.
arXiv Detail & Related papers (2023-10-10T10:14:59Z)
AutoHall: Automated Hallucination Dataset Generation for Large Language Models [56.92068213969036]
This paper introduces AutoHall, a method for automatically constructing model-specific hallucination datasets based on existing fact-checking datasets. We also propose a zero-resource and black-box hallucination detection method based on self-contradiction.
arXiv Detail & Related papers (2023-09-30T05:20:02Z)
Zero-Resource Hallucination Prevention for Large Language Models [45.4155729393135]
"Hallucination" refers to instances where large language models (LLMs) generate factually inaccurate or ungrounded information. We introduce a novel pre-language self-evaluation technique, referred to as SELF-FAMILIARITY, which focuses on evaluating the model's familiarity with the concepts present in the input instruction. We validate SELF-FAMILIARITY across four different large language models, demonstrating consistently superior performance compared to existing techniques.
arXiv Detail & Related papers (2023-09-06T01:57:36Z)
Detecting and Mitigating Hallucinations in Multilingual Summarisation [40.5267502712576]
Hallucinations pose a significant challenge to the reliability of neural models for abstractive summarisation. We develop a novel metric, mFACT, evaluating the faithfulness of non-English summaries. We then propose a simple but effective method to reduce hallucinations with a cross-lingual transfer.
arXiv Detail & Related papers (2023-05-23T02:59:25Z)

This list is automatically generated from the titles and abstracts of the papers in this site.