Related papers: C-FAITH: A Chinese Fine-Grained Benchmark for Automated Hallucination Evaluation

C-FAITH: A Chinese Fine-Grained Benchmark for Automated Hallucination Evaluation

URL: http://arxiv.org/abs/2504.10167v1
Date: Mon, 14 Apr 2025 12:21:55 GMT
Title: C-FAITH: A Chinese Fine-Grained Benchmark for Automated Hallucination Evaluation
Authors: Xu Zhang, Zhifei Liu, Jiahao Wang, Huixuan Zhang, Fan Xu, Junzhe Zhang, Xiaojun Wan,
Abstract summary: We introduce HaluAgent, an agentic framework that automatically constructs fine-grained QA dataset based on some knowledge documents.<n>Our experiments demonstrate that the manually designed rules and prompt optimization can improve the quality of generated data.
Score: 58.40263551616771
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Despite the rapid advancement of large language models, they remain highly susceptible to generating hallucinations, which significantly hinders their widespread application. Hallucination research requires dynamic and fine-grained evaluation. However, most existing hallucination benchmarks (especially in Chinese language) rely on human annotations, making automatical and cost-effective hallucination evaluation challenging. To address this, we introduce HaluAgent, an agentic framework that automatically constructs fine-grained QA dataset based on some knowledge documents. Our experiments demonstrate that the manually designed rules and prompt optimization can improve the quality of generated data. Using HaluAgent, we construct C-FAITH, a Chinese QA hallucination benchmark created from 1,399 knowledge documents obtained from web scraping, totaling 60,702 entries. We comprehensively evaluate 16 mainstream LLMs with our proposed C-FAITH, providing detailed experimental results and analysis.

Related papers

DSC2025 -- ViHallu Challenge: Detecting Hallucination in Vietnamese LLMs [5.740643252319679]
This paper introduces the DSC2025 ViHallu Challenge, the first large-scale shared task for detecting hallucinations in Vietnamese language models.<n>We present the ViHallu dataset, comprising 10,000 annotated triplets of (context, prompt, response) samples.<n>A total of 111 teams participated, with the best-performing system achieving a macro-F1 score of 84.80%, compared to a baseline encoder-only score of 32.83%.
arXiv Detail & Related papers (2026-01-08T08:27:47Z)
A Systematic Literature Review of Code Hallucinations in LLMs: Characterization, Mitigation Methods, Challenges, and Future Directions for Reliable AI [54.34738767990601]
As Large Language Models become increasingly integrated into software engineering tasks, understanding and mitigating hallucination in code becomes essential.<n>We provide a systematic review of hallucination phenomena in code-oriented LLMs from four key perspectives.
arXiv Detail & Related papers (2025-11-02T02:58:41Z)
SHALE: A Scalable Benchmark for Fine-grained Hallucination Evaluation in LVLMs [52.03164192840023]
Large Vision-Language Models (LVLMs) still suffer from hallucinations, i.e., generating content inconsistent with input or established world knowledge.<n>We propose an automated data construction pipeline that produces scalable, controllable, and diverse evaluation data.<n>We construct SHALE, a benchmark designed to assess both faithfulness and factuality hallucinations.
arXiv Detail & Related papers (2025-08-13T07:58:01Z)
HalluLens: LLM Hallucination Benchmark [49.170128733508335]
Large language models (LLMs) often generate responses that deviate from user input or training data, a phenomenon known as "hallucination" This paper introduces a comprehensive hallucination benchmark, incorporating both new extrinsic and existing intrinsic evaluation tasks.
arXiv Detail & Related papers (2025-04-24T13:40:27Z)
HalluDial: A Large-Scale Benchmark for Automatic Dialogue-Level Hallucination Evaluation [19.318217051269382]
Large Language Models (LLMs) have significantly advanced the field of Natural Language Processing (NLP) HalluDial is the first comprehensive large-scale benchmark for automatic dialogue-level hallucination evaluation. The benchmark includes 4,094 dialogues with a total of 146,856 samples.
arXiv Detail & Related papers (2024-06-11T08:56:18Z)
Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback [40.930238150365795]
We propose detecting and mitigating hallucinations in Large Vision Language Models (LVLMs) via fine-grained AI feedback. We generate a small-size hallucination annotation dataset by proprietary models. Then, we propose a detect-then-rewrite pipeline to automatically construct preference dataset for training hallucination mitigating model.
arXiv Detail & Related papers (2024-04-22T14:46:10Z)
HypoTermQA: Hypothetical Terms Dataset for Benchmarking Hallucination Tendency of LLMs [0.0]
Hallucinations pose a significant challenge to the reliability and alignment of Large Language Models (LLMs) This paper introduces an automated scalable framework that combines benchmarking LLMs' hallucination tendencies with efficient hallucination detection. The framework is domain-agnostic, allowing the use of any language model for benchmark creation or evaluation in any domain.
arXiv Detail & Related papers (2024-02-25T22:23:37Z)
Fine-grained Hallucination Detection and Editing for Language Models [109.56911670376932]
Large language models (LMs) are prone to generate factual errors, which are often called hallucinations. We introduce a comprehensive taxonomy of hallucinations and argue that hallucinations manifest in diverse forms. We propose a novel task of automatic fine-grained hallucination detection and construct a new evaluation benchmark, FavaBench.
arXiv Detail & Related papers (2024-01-12T19:02:48Z)
Alleviating Hallucinations of Large Language Models through Induced Hallucinations [67.35512483340837]
Large language models (LLMs) have been observed to generate responses that include inaccurate or fabricated information. We propose a simple textitInduce-then-Contrast Decoding (ICD) strategy to alleviate hallucinations.
arXiv Detail & Related papers (2023-12-25T12:32:49Z)
AutoHall: Automated Hallucination Dataset Generation for Large Language Models [56.92068213969036]
This paper introduces a method for automatically constructing model-specific hallucination datasets based on existing fact-checking datasets called AutoHall. We also propose a zero-resource and black-box hallucination detection method based on self-contradiction.
arXiv Detail & Related papers (2023-09-30T05:20:02Z)
HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models [146.87696738011712]
Large language models (LLMs) are prone to generate hallucinations, i.e., content that conflicts with the source or cannot be verified by the factual knowledge. To understand what types of content and to which extent LLMs are apt to hallucinate, we introduce the Hallucination Evaluation benchmark for Large Language Models (HaluEval)
arXiv Detail & Related papers (2023-05-19T15:36:27Z)

This list is automatically generated from the titles and abstracts of the papers in this site.