FineDialFact: A benchmark for Fine-grained Dialogue Fact Verification
- URL: http://arxiv.org/abs/2508.05782v1
- Date: Thu, 07 Aug 2025 18:51:03 GMT
- Title: FineDialFact: A benchmark for Fine-grained Dialogue Fact Verification
- Authors: Xiangyan Chen, Yufeng Li, Yujian Gan, Arkaitz Zubiaga, Matthew Purver
- Abstract summary: Large Language Models (LLMs) are known to produce hallucinations - factually incorrect or fabricated information. Current approaches to hallucination detection in dialogue systems primarily focus on verifying the factual consistency of generated responses. We introduce a benchmark, FineDialFact, for fine-grained dialogue fact verification.
- Score: 45.2458418225596
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large Language Models (LLMs) are known to produce hallucinations - factually incorrect or fabricated information - which poses significant challenges for many Natural Language Processing (NLP) applications, such as dialogue systems. As a result, detecting hallucinations has become a critical area of research. Current approaches to hallucination detection in dialogue systems primarily focus on verifying the factual consistency of generated responses. However, these responses often contain a mix of accurate, inaccurate, or unverifiable facts, making a single factual label overly simplistic and coarse-grained. In this paper, we introduce a benchmark, FineDialFact, for fine-grained dialogue fact verification, which involves verifying atomic facts extracted from dialogue responses. To support this, we construct a dataset based on publicly available dialogue datasets and evaluate it using various baseline methods. Experimental results demonstrate that methods incorporating Chain-of-Thought (CoT) reasoning can enhance performance in dialogue fact verification. Despite this, the best F1-score achieved on HybriDialogue, an open-domain dialogue dataset, is only 0.75, indicating that the benchmark remains a challenging task for future research. Our dataset and code will be made public on GitHub.
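The fine-grained setting described in the abstract can be illustrated with a minimal sketch (the three-way label set, helper name, and example data below are assumptions for illustration, not the benchmark's actual schema): each response is decomposed into atomic facts, each fact receives its own verdict, and performance is scored per fact with a macro-averaged F1 rather than with one response-level label.

```python
# Sketch of fact-level evaluation: each atomic fact in a response gets
# its own label, and F1 is macro-averaged over the label set.
LABELS = ("supported", "refuted", "unverifiable")  # assumed label names

def macro_f1(gold, pred):
    """Macro-averaged F1 over per-fact labels."""
    f1s = []
    for label in LABELS:
        tp = sum(1 for g, p in zip(gold, pred) if g == p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(LABELS)

# A single response may mix correct, incorrect and unverifiable atomic
# facts, so each fact is judged independently of its neighbours.
gold = ["supported", "refuted", "unverifiable", "supported"]
pred = ["supported", "refuted", "supported", "supported"]
print(round(macro_f1(gold, pred), 3))
```

Scoring per fact exposes partial errors that a single response-level label would hide: here three of four facts are labelled correctly, yet the macro F1 is pulled down by the missed "unverifiable" class.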
Related papers
- Fine-Refine: Iterative Fine-grained Refinement for Mitigating Dialogue Hallucination [6.907950142408847]
Hallucinations produce factually incorrect responses that may mislead users and undermine system trust. Existing refinement methods for dialogue systems typically operate at the response level, overlooking the fact that a single response may contain multiple verifiable or unverifiable facts. We propose Fine-Refine, a fine-grained refinement framework that decomposes responses into atomic units, verifies each unit using external knowledge, assesses fluency via perplexity, and iteratively corrects granular errors.
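The decompose-verify-correct loop summarized above can be sketched as follows. Every helper here is a stand-in (the toy in-memory knowledge base, the string-based decomposition, and the hard-coded correction are all assumptions for illustration); the actual framework relies on external knowledge retrieval and a perplexity-based fluency check.

```python
# Toy sketch of an iterative fine-grained refinement loop: decompose a
# response into atomic units, verify each unit against a knowledge
# source, and rewrite only the unsupported units until convergence.
KNOWLEDGE = {  # stand-in for an external knowledge source
    "paris is the capital of france",
    "the seine flows through paris",
}

def decompose(response):
    # Stand-in for LLM-based atomic-fact decomposition.
    return [u.strip().lower() for u in response.split(".") if u.strip()]

def verify(unit):
    return unit in KNOWLEDGE

def correct(unit):
    # Stand-in for a knowledge-grounded rewrite of a refuted unit.
    return "the seine flows through paris" if "rhine" in unit else unit

def refine(response, max_iters=3):
    units = decompose(response)
    for _ in range(max_iters):
        flagged = [u for u in units if not verify(u)]
        if not flagged:  # converged: every atomic unit is supported
            break
        units = [correct(u) if u in flagged else u for u in units]
    return ". ".join(units)

print(refine("Paris is the capital of France. The Rhine flows through Paris."))
```

The key design point the sketch captures is granularity: only the flagged units are rewritten, so the supported parts of the response survive refinement unchanged.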
arXiv Detail & Related papers (2026-02-17T11:33:23Z) - Hallucination Detection with Small Language Models [1.9181612035055007]
This paper proposes a framework that integrates multiple small language models to verify responses generated by large language models. The results demonstrate a 10% improvement in F1 scores for detecting correct responses compared to hallucinations.
arXiv Detail & Related papers (2025-06-24T02:19:26Z) - Improving Factuality for Dialogue Response Generation via Graph-Based Knowledge Augmentation [8.423723358002539]
Large Language Models (LLMs) generate plausible but inconsistent or factually incorrect text. We propose two novel graph knowledge-augmented frameworks, Dialogue Response Generation via Textualised Graphs (TG-DRG) and Graph-Aware Dialogue Response Generation (GA-DRG). TG-DRG combines reasoning-guided dialogue reformulation, dialogue sense knowledge selection, and graph-enhanced response generation to improve the factuality of dialogue responses.
arXiv Detail & Related papers (2025-06-14T13:17:27Z) - CoPrUS: Consistency Preserving Utterance Synthesis towards more realistic benchmark dialogues [0.27309692684728604]
We investigate the creation of synthetic communication errors in an automatic pipeline. We focus on three types of miscommunications that could happen in real-world dialogues but are underrepresented in the benchmark dataset. Our two-step approach uses a state-of-the-art Large Language Model (LLM) to first create the error and then the repairing utterance.
arXiv Detail & Related papers (2024-12-10T13:51:55Z) - Detecting Response Generation Not Requiring Factual Judgment [14.921007421043198]
This study aims to achieve both attractiveness and factuality in dialogue responses by setting the task of predicting sentences that do not require factual correctness judgment.
We created a dataset for this task via crowdsourcing, a dialogue dataset annotated with fact-check-needed labels (DDFC), and ran classification tasks on several models using it.
The best model achieves about 88% classification accuracy.
arXiv Detail & Related papers (2024-06-14T04:03:24Z) - $\textit{Dial BeInfo for Faithfulness}$: Improving Factuality of Information-Seeking Dialogue via Behavioural Fine-Tuning [55.96744451743273]
We introduce BeInfo, a method that applies behavioural tuning to aid information-seeking dialogue systems.
We show that models tuned with BeInfo become considerably more faithful to the knowledge source.
We also show that models with 3B parameters tuned with BeInfo demonstrate strong performance on data from real 'production' conversations.
arXiv Detail & Related papers (2023-11-16T11:25:44Z) - FactCHD: Benchmarking Fact-Conflicting Hallucination Detection [64.4610684475899]
FactCHD is a benchmark designed for the detection of fact-conflicting hallucinations from LLMs.
FactCHD features a diverse dataset that spans various factuality patterns, including vanilla, multi-hop, comparison, and set operation.
We introduce Truth-Triangulator, which synthesizes reflective considerations from a tool-enhanced ChatGPT and a LoRA-tuned Llama2.
arXiv Detail & Related papers (2023-10-18T16:27:49Z) - RefGPT: Dialogue Generation of GPT, by GPT, and for GPT [61.451780081612974]
Large Language Models (LLMs) have attained the impressive capability to resolve a wide range of NLP tasks by fine-tuning on high-quality instruction data.
However, collecting human-written data of high quality, especially multi-turn dialogues, is expensive and unattainable for most people.
We propose a method called RefGPT to generate large volumes of truthful and customized dialogues without worrying about factual errors caused by model hallucination.
arXiv Detail & Related papers (2023-05-24T10:30:42Z) - Weakly Supervised Data Augmentation Through Prompting for Dialogue Understanding [103.94325597273316]
We present a novel approach that iterates on augmentation quality by applying weakly-supervised filters.
We evaluate our methods on the emotion and act classification tasks in DailyDialog and the intent classification task in Facebook Multilingual Task-Oriented Dialogue.
For DailyDialog specifically, using 10% of the ground truth data we outperform the current state-of-the-art model which uses 100% of the data.
arXiv Detail & Related papers (2022-10-25T17:01:30Z) - SPACE-2: Tree-Structured Semi-Supervised Contrastive Pre-training for Task-Oriented Dialog Understanding [68.94808536012371]
We propose a tree-structured pre-trained conversation model, which learns dialog representations from limited labeled dialogs and large-scale unlabeled dialog corpora.
Our method can achieve new state-of-the-art results on the DialoGLUE benchmark consisting of seven datasets and four popular dialog understanding tasks.
arXiv Detail & Related papers (2022-09-14T13:42:50Z) - DialFact: A Benchmark for Fact-Checking in Dialogue [56.63709206232572]
We construct DialFact, a benchmark dataset of 22,245 annotated conversational claims, paired with pieces of evidence from Wikipedia.
We find that existing fact-checking models trained on non-dialogue data like FEVER fail to perform well on our task.
We propose a simple yet data-efficient solution to effectively improve fact-checking performance in dialogue.
arXiv Detail & Related papers (2021-10-15T17:34:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.