Related papers: DSC2025 -- ViHallu Challenge: Detecting Hallucination in Vietnamese LLMs

URL: http://arxiv.org/abs/2601.04711v1
Date: Thu, 08 Jan 2026 08:27:47 GMT
Title: DSC2025 -- ViHallu Challenge: Detecting Hallucination in Vietnamese LLMs
Authors: Anh Thi-Hoang Nguyen, Khanh Quoc Tran, Tin Van Huynh, Phuoc Tan-Hoang Nguyen, Cam Tan Nguyen, Kiet Van Nguyen
Abstract summary: This paper introduces the DSC2025 ViHallu Challenge, the first large-scale shared task for detecting hallucinations in Vietnamese language models. We present the ViHallu dataset, comprising 10,000 annotated triplets of (context, prompt, response) samples. A total of 111 teams participated, with the best-performing system achieving a macro-F1 score of 84.80%, compared to a baseline encoder-only score of 32.83%.
Score: 5.740643252319679
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: The reliability of large language models (LLMs) in production environments remains significantly constrained by their propensity to generate hallucinations--fluent, plausible-sounding outputs that contradict or fabricate information. While hallucination detection has recently emerged as a priority in English-centric benchmarks, low-to-medium resource languages such as Vietnamese remain inadequately covered by standardized evaluation frameworks. This paper introduces the DSC2025 ViHallu Challenge, the first large-scale shared task for detecting hallucinations in Vietnamese LLMs. We present the ViHallu dataset, comprising 10,000 annotated triplets of (context, prompt, response) samples systematically partitioned into three hallucination categories: no hallucination, intrinsic, and extrinsic hallucinations. The dataset incorporates three prompt types--factual, noisy, and adversarial--to stress-test model robustness. A total of 111 teams participated, with the best-performing system achieving a macro-F1 score of 84.80%, compared to a baseline encoder-only score of 32.83%, demonstrating that instruction-tuned LLMs with structured prompting and ensemble strategies substantially outperform generic architectures. However, the gap to perfect performance indicates that hallucination detection remains a challenging problem, particularly for intrinsic (contradiction-based) hallucinations. This work establishes a rigorous benchmark and explores a diverse range of detection methodologies, providing a foundation for future research into the trustworthiness and reliability of Vietnamese language AI systems.
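
The macro-F1 figure quoted in the abstract is, by the usual definition, the unweighted mean of the per-class F1 scores over the three categories. The following is a minimal sketch of that standard computation, not the challenge's official scorer; the label strings and the function name `macro_f1` are hypothetical, chosen here only for illustration.

```python
# Hypothetical label names for the three categories named in the abstract.
LABELS = ("no", "intrinsic", "extrinsic")


def macro_f1(y_true, y_pred, labels=LABELS):
    """Return the unweighted mean of per-class F1 scores."""
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


if __name__ == "__main__":
    gold = ["no", "no", "intrinsic", "extrinsic"]
    pred = ["no", "intrinsic", "intrinsic", "extrinsic"]
    print(f"macro-F1 = {macro_f1(gold, pred):.4f}")
```

Because every class contributes equally to the mean, a system that handles one category poorly (the abstract singles out intrinsic hallucinations) is penalized even if that category is rare in the test set.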

Related papers

HalluGuard: Demystifying Data-Driven and Reasoning-Driven Hallucinations in LLMs [20.59139155257836]
We introduce a unified theoretical framework that decomposes hallucination risk into data-driven and reasoning-driven components. We introduce HalluGuard, an NTK-based score that leverages the induced geometry and captured representations of the NTK to jointly identify data-driven and reasoning-driven hallucinations.
arXiv Detail & Related papers (2026-01-26T18:23:09Z)
HAD: HAllucination Detection Language Models Based on a Comprehensive Hallucination Taxonomy [48.68088917291552]
We introduce a comprehensive hallucination taxonomy with 11 categories across various NLG tasks. We propose the HAllucination Detection (HAD) models, which integrate hallucination detection, span-level identification, and correction into a single inference process.
arXiv Detail & Related papers (2025-10-22T07:28:37Z)
Beyond ROUGE: N-Gram Subspace Features for LLM Hallucination Detection [5.0106565473767075]
Large Language Models (LLMs) have demonstrated effectiveness across a wide variety of tasks involving natural language. A fundamental problem of hallucinations still plagues these models, limiting their trustworthiness in generating consistent, truthful information. We propose a novel approach inspired by ROUGE that constructs an N-Gram frequency tensor from LLM-generated text. This tensor captures richer semantic structure by encoding co-occurrence patterns, enabling better differentiation between factual and hallucinated content.
arXiv Detail & Related papers (2025-09-03T18:52:24Z)
SHALE: A Scalable Benchmark for Fine-grained Hallucination Evaluation in LVLMs [52.03164192840023]
Large Vision-Language Models (LVLMs) still suffer from hallucinations, i.e., generating content inconsistent with input or established world knowledge. We propose an automated data construction pipeline that produces scalable, controllable, and diverse evaluation data. We construct SHALE, a benchmark designed to assess both faithfulness and factuality hallucinations.
arXiv Detail & Related papers (2025-08-13T07:58:01Z)
HIDE and Seek: Detecting Hallucinations in Language Models via Decoupled Representations [17.673293240849787]
Contemporary Language Models (LMs) often generate content that is factually incorrect or unfaithful to the input context. We propose a single-pass, training-free approach for effective Hallucination detectIon via Decoupled rEpresentations (HIDE). Our results demonstrate that HIDE outperforms other single-pass methods in almost all settings.
arXiv Detail & Related papers (2025-06-21T16:02:49Z)
HalluLens: LLM Hallucination Benchmark [49.170128733508335]
Large language models (LLMs) often generate responses that deviate from user input or training data, a phenomenon known as "hallucination". This paper introduces a comprehensive hallucination benchmark, incorporating both new extrinsic and existing intrinsic evaluation tasks.
arXiv Detail & Related papers (2025-04-24T13:40:27Z)
C-FAITH: A Chinese Fine-Grained Benchmark for Automated Hallucination Evaluation [58.40263551616771]
We introduce HaluAgent, an agentic framework that automatically constructs a fine-grained QA dataset based on some knowledge documents. Our experiments demonstrate that the manually designed rules and prompt optimization can improve the quality of generated data.
arXiv Detail & Related papers (2025-04-14T12:21:55Z)
HausaNLP at SemEval-2025 Task 3: Towards a Fine-Grained Model-Aware Hallucination Detection [1.8230982862848586]
We aim to provide a nuanced, model-aware understanding of hallucination occurrences and severity in English. We used natural language inference and fine-tuned a ModernBERT model using a synthetic dataset of 400 samples. Results indicate a moderately positive correlation between the model's confidence scores and the actual presence of hallucinations.
arXiv Detail & Related papers (2025-03-25T13:40:22Z)
MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models [82.30696225661615]
We introduce MedHallu, the first benchmark specifically designed for medical hallucination detection. We show that state-of-the-art LLMs, including GPT-4o, Llama-3.1, and medically fine-tuned UltraMedical, struggle with this binary hallucination detection task. Using bidirectional entailment clustering, we show that harder-to-detect hallucinations are semantically closer to ground truth.
arXiv Detail & Related papers (2025-02-20T06:33:23Z)
Reefknot: A Comprehensive Benchmark for Relation Hallucination Evaluation, Analysis and Mitigation in Multimodal Large Language Models [13.48296910438554]
We introduce Reefknot, a comprehensive benchmark targeting relation hallucinations, comprising over 20,000 real-world samples. We provide a systematic definition of relation hallucinations, integrating perceptive and cognitive perspectives, and construct a relation-based corpus using the Visual Genome scene graph dataset. We propose a novel confidence-based mitigation strategy, which reduces the hallucination rate by an average of 9.75% across three datasets, including Reefknot.
arXiv Detail & Related papers (2024-08-18T10:07:02Z)
AutoHall: Automated Hallucination Dataset Generation for Large Language Models [56.92068213969036]
This paper introduces AutoHall, a method for automatically constructing model-specific hallucination datasets based on existing fact-checking datasets. We also propose a zero-resource and black-box hallucination detection method based on self-contradiction.
arXiv Detail & Related papers (2023-09-30T05:20:02Z)
