When Models Lie, We Learn: Multilingual Span-Level Hallucination Detection with PsiloQA
- URL: http://arxiv.org/abs/2510.04849v1
- Date: Mon, 06 Oct 2025 14:36:30 GMT
- Title: When Models Lie, We Learn: Multilingual Span-Level Hallucination Detection with PsiloQA
- Authors: Elisei Rykov, Kseniia Petrushina, Maksim Savkin, Valerii Olisov, Artem Vazhentsev, Kseniia Titova, Alexander Panchenko, Vasily Konovalov, Julia Belikova
- Abstract summary: PsiloQA is a large-scale, multilingual dataset annotated with span-level hallucinations across 14 languages. Our dataset and results advance the development of scalable, fine-grained hallucination detection in multilingual settings.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Hallucination detection remains a fundamental challenge for the safe and reliable deployment of large language models (LLMs), especially in applications requiring factual accuracy. Existing hallucination benchmarks often operate at the sequence level and are limited to English, lacking the fine-grained, multilingual supervision needed for a comprehensive evaluation. In this work, we introduce PsiloQA, a large-scale, multilingual dataset annotated with span-level hallucinations across 14 languages. PsiloQA is constructed through an automated three-stage pipeline: generating question-answer pairs from Wikipedia using GPT-4o, eliciting potentially hallucinated answers from diverse LLMs in a no-context setting, and automatically annotating hallucinated spans using GPT-4o by comparing against golden answers and retrieved context. We evaluate a wide range of hallucination detection methods -- including uncertainty quantification, LLM-based tagging, and fine-tuned encoder models -- and show that encoder-based models achieve the strongest performance across languages. Furthermore, PsiloQA demonstrates effective cross-lingual generalization and supports robust knowledge transfer to other benchmarks, all while being significantly more cost-efficient than human-annotated datasets. Our dataset and results advance the development of scalable, fine-grained hallucination detection in multilingual settings.
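To make the pipeline concrete, here is a minimal sketch of the three stages as the abstract describes them, assuming the OpenAI Python client; the prompts, helper names, and control flow are illustrative placeholders, not the authors' released code.

```python
# Illustrative sketch of the three-stage PsiloQA pipeline (not the authors' code).
from openai import OpenAI

client = OpenAI()

def ask(prompt: str, model: str = "gpt-4o") -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Stage 1: generate a question-answer pair from a Wikipedia passage with GPT-4o.
def generate_qa(passage: str) -> tuple[str, str]:
    out = ask(
        "Write one factual question answerable from the passage below, then "
        "the answer, separated by a line containing only '---'.\n\n"
        f"Passage:\n{passage}"
    )
    question, golden_answer = out.split("---", 1)
    return question.strip(), golden_answer.strip()

# Stage 2: elicit a possibly hallucinated answer from another LLM,
# deliberately withholding the passage (the "no-context" setting).
def elicit_answer(question: str, answering_model: str) -> str:
    return ask(f"Answer concisely: {question}", model=answering_model)

# Stage 3: have GPT-4o mark hallucinated spans by comparing the answer
# against the golden answer and retrieved context.
def annotate_spans(question: str, answer: str, golden: str, context: str) -> str:
    return ask(
        "Wrap every span of the answer that contradicts the golden answer "
        "or the context in <hal>...</hal> tags, and return the answer.\n"
        f"Question: {question}\nGolden answer: {golden}\n"
        f"Context: {context}\nAnswer: {answer}"
    )
```

Span tags produced this way can then be converted to character offsets for training span-level detectors.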
Related papers
- Ask a Local: Detecting Hallucinations With Specialized Model Divergence
We introduce "Ask a Local", a novel hallucination detection method for large language models.<n>Our approach computes divergence between perplexity distributions of language-specialized models to identify potentially hallucinated spans.<n>Our results on a human-annotated question-answer dataset spanning 14 languages demonstrate consistent performance across languages.
arXiv Detail & Related papers (2025-06-03T20:00:49Z) - Poly-FEVER: A Multilingual Fact Verification Benchmark for Hallucination Detection in Large Language Models [10.663446796160567]
Hallucinations in generative AI, particularly in Large Language Models (LLMs), pose a significant challenge to the reliability of multilingual applications.
Existing benchmarks for hallucination detection focus primarily on English and a few widely spoken languages.
We introduce Poly-FEVER, a large-scale multilingual fact verification benchmark.
arXiv Detail & Related papers (2025-03-19T01:46:09Z) - LargePiG: Your Large Language Model is Secretly a Pointer Generator [15.248956952849259]
We introduce relevance hallucination and factuality hallucination as a new typology for the hallucination problems introduced by LLM-based query generation.
We propose an effective way to separate content from form in LLM-generated queries, preserving the factual knowledge extracted and integrated from the inputs while assembling the syntactic structure, including function words, with the LLM's own linguistic capabilities.
arXiv Detail & Related papers (2024-10-15T07:41:40Z) - LongHalQA: Long-Context Hallucination Evaluation for MultiModal Large Language Models [96.64960606650115]
LongHalQA is an LLM-free hallucination benchmark comprising 6K long, complex hallucination texts.
It features GPT4V-generated hallucinatory data that are well aligned with real-world scenarios.
arXiv Detail & Related papers (2024-10-13T18:59:58Z) - ANAH-v2: Scaling Analytical Hallucination Annotation of Large Language Models [65.12177400764506]
Large language models (LLMs) exhibit hallucinations in long-form question-answering tasks across various domains and a wide range of applications.
Current hallucination detection and mitigation datasets are limited in domain and size.
This paper introduces an iterative self-training framework that simultaneously and progressively scales up the hallucination annotation dataset.
arXiv Detail & Related papers (2024-07-05T17:56:38Z) - HypoTermQA: Hypothetical Terms Dataset for Benchmarking Hallucination
Tendency of LLMs [0.0]
Hallucinations pose a significant challenge to the reliability and alignment of Large Language Models (LLMs).
This paper introduces an automated, scalable framework that combines benchmarking LLMs' hallucination tendencies with efficient hallucination detection.
The framework is domain-agnostic, allowing the use of any language model for benchmark creation or evaluation in any domain.
arXiv Detail & Related papers (2024-02-25T22:23:37Z) - Comparing Hallucination Detection Metrics for Multilingual Generation [62.97224994631494]
This paper assesses how well various factual hallucination detection metrics identify hallucinations in generated biographical summaries across languages.
We compare how well automatic metrics correlate to each other and whether they agree with human judgments of factuality.
Our analysis reveals that while the lexical metrics are ineffective, NLI-based metrics perform well, correlating with human annotations in many settings and often outperforming supervised models.
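As a concrete instance of the NLI-based approach that fares well here, a generated sentence can be counted as factual only if the source passage entails it. A minimal sketch, assuming an off-the-shelf multilingual NLI checkpoint (the model name is one public example, not necessarily what the paper evaluated):

```python
# Entailment check: a generated sentence is "supported" only if the
# source premise entails it according to an NLI model.
from transformers import pipeline

nli = pipeline(
    "text-classification",
    model="MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7",
)

def supported(source: str, sentence: str) -> bool:
    pred = nli([{"text": source, "text_pair": sentence}])[0]
    return pred["label"] == "entailment"

# Example: score one biography sentence against its source passage.
print(supported("Marie Curie was born in Warsaw in 1867.",
                "Curie was born in Paris."))  # -> False
```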
arXiv Detail & Related papers (2024-02-16T08:10:34Z) - AutoHall: Automated Hallucination Dataset Generation for Large Language Models [56.92068213969036]
This paper introduces AutoHall, a method for automatically constructing model-specific hallucination datasets from existing fact-checking datasets.
We also propose a zero-resource and black-box hallucination detection method based on self-contradiction.
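A common reading of self-contradiction-based detection is sketched below: resample the black-box model and test whether its answers contradict one another, with an NLI model as judge. The helper names and voting threshold are illustrative, not AutoHall's actual interface.

```python
# Zero-resource, black-box check: if repeated samples of the model's
# answer contradict each other, the original claim is likely ungrounded.
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

def likely_hallucinated(ask_llm, question: str, n_samples: int = 5) -> bool:
    """`ask_llm` is any black-box LLM call returning a string (hypothetical)."""
    answers = [ask_llm(question) for _ in range(n_samples)]
    reference, rest = answers[0], answers[1:]
    contradictions = sum(
        nli([{"text": reference, "text_pair": other}])[0]["label"]
        == "CONTRADICTION"
        for other in rest
    )
    return contradictions >= len(rest) / 2  # majority of resamples disagree
```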
arXiv Detail & Related papers (2023-09-30T05:20:02Z) - Detecting Hallucinated Content in Conditional Neural Sequence Generation [165.68948078624499]
We propose the task of predicting whether each token in the output sequence is hallucinated (not contained in the input).
We also introduce a method for learning to detect hallucinations using pretrained language models fine-tuned on synthetic data.
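The token-level formulation reduces to binary token classification, so a pretrained encoder can be fine-tuned directly on synthetically labeled sequences. A minimal sketch; the checkpoint and the toy example are placeholders rather than the paper's setup.

```python
# Binary token classification: label each output token 0 (supported) or
# 1 (hallucinated) and fine-tune a pretrained encoder on synthetic labels.
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base", num_labels=2
)

def encode(words: list[str], word_labels: list[int]):
    """Align word-level 0/1 labels to subword tokens (-100 = ignored)."""
    enc = tokenizer(words, is_split_into_words=True,
                    truncation=True, return_tensors="pt")
    labels = [-100 if wid is None else word_labels[wid]
              for wid in enc.word_ids(batch_index=0)]
    enc["labels"] = torch.tensor([labels])
    return enc

# One synthetic training step; a real run loops over a labeled corpus.
batch = encode(["Einstein", "was", "born", "in", "Paris"], [0, 0, 0, 0, 1])
loss = model(**batch).loss
loss.backward()
```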
arXiv Detail & Related papers (2020-11-05T00:18:53Z)