DecMetrics: Structured Claim Decomposition Scoring for Factually Consistent LLM Outputs
- URL: http://arxiv.org/abs/2509.04483v1
- Date: Sun, 31 Aug 2025 10:22:54 GMT
- Title: DecMetrics: Structured Claim Decomposition Scoring for Factually Consistent LLM Outputs
- Authors: Minghui Huang
- Abstract summary: We introduce DecMetrics, which comprises three new metrics: COMPLETENESS, CORRECTNESS, and SEMANTIC ENTROPY, designed to automatically assess the quality of claims produced by decomposition models. Our approach aims to set a benchmark for claim decomposition, enhancing both the reliability and effectiveness of fact-checking systems.
- Score: 0.609170287691728
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Claim decomposition plays a crucial role in the fact-checking process by breaking down complex claims into simpler atomic components and identifying their non-factual elements. Despite its importance, current research primarily focuses on generative methods for decomposition, with insufficient emphasis on evaluating the quality of these decomposed atomic claims. To bridge this gap, we introduce \textbf{DecMetrics}, which comprises three new metrics: \texttt{COMPLETENESS}, \texttt{CORRECTNESS}, and \texttt{SEMANTIC ENTROPY}, designed to automatically assess the quality of claims produced by decomposition models. Utilizing these metrics, we develop a lightweight claim decomposition model, optimizing its performance through the integration of these metrics as a reward function. Through automatic evaluation, our approach aims to set a benchmark for claim decomposition, enhancing both the reliability and effectiveness of fact-checking systems.
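The abstract describes scoring decomposed atomic claims with three metrics and using their combination as a reward signal. The paper's formulas are not given here, so the sketch below is a minimal illustration only: the token-overlap proxies for COMPLETENESS and CORRECTNESS, the entropy normalization, and the reward weights are all assumptions for illustration, not DecMetrics' actual definitions.

```python
# Hypothetical sketch of a three-metric reward for claim decomposition.
# All metric definitions are illustrative proxies, not the paper's.
import math


def _tokens(text):
    return set(text.lower().split())


def completeness(claim, atomic_claims):
    """Fraction of the source claim's tokens covered by the union of
    atomic claims (a real system would use entailment, not overlap)."""
    source = _tokens(claim)
    covered = set()
    for atom in atomic_claims:
        covered |= _tokens(atom)
    return len(source & covered) / len(source) if source else 0.0


def correctness(claim, atomic_claims):
    """Mean fraction of each atomic claim's tokens found in the source,
    a crude stand-in for checking each atom is supported by the claim."""
    source = _tokens(claim)
    if not atomic_claims:
        return 0.0
    return sum(len(_tokens(a) & source) / max(len(_tokens(a)), 1)
               for a in atomic_claims) / len(atomic_claims)


def semantic_entropy(atomic_claims):
    """Normalized Shannon entropy of the token distribution across atoms;
    higher values suggest atoms carry distinct, non-redundant content."""
    counts = {}
    for atom in atomic_claims:
        for tok in _tokens(atom):
            counts[tok] = counts.get(tok, 0) + 1
    total = sum(counts.values())
    if total == 0:
        return 0.0
    probs = [c / total for c in counts.values()]
    h = -sum(p * math.log(p) for p in probs)
    max_h = math.log(len(counts)) if len(counts) > 1 else 1.0
    return h / max_h


def reward(claim, atomic_claims, weights=(1.0, 1.0, 0.5)):
    """Weighted sum used as an RL reward signal (weights are assumed)."""
    w_comp, w_corr, w_ent = weights
    return (w_comp * completeness(claim, atomic_claims)
            + w_corr * correctness(claim, atomic_claims)
            + w_ent * semantic_entropy(atomic_claims))
```

A decomposition model fine-tuned with reinforcement learning would receive `reward(...)` for each generated set of atomic claims, so decompositions that cover the source, stay supported by it, and avoid redundant atoms score highest.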
Related papers
- Distill and Align Decomposition for Enhanced Claim Verification [51.93960785128124]
Complex claim verification requires decomposing sentences into verifiable subclaims. We propose a reinforcement learning approach that optimises decomposition quality and verifier alignment. Our framework enables smaller language models to achieve state-of-the-art claim verification.
arXiv Detail & Related papers (2026-02-25T12:32:04Z) - DREAM: Deep Research Evaluation with Agentic Metrics [21.555357444628044]
We propose DREAM (Deep Research Evaluation with Agentic Metrics), a framework that makes evaluation itself agentic. DREAM structures assessment through an evaluation protocol combining query-agnostic metrics with adaptive metrics generated by a tool-calling agent. Controlled evaluations demonstrate DREAM is significantly more sensitive to factual and temporal decay than existing benchmarks.
arXiv Detail & Related papers (2026-02-21T19:14:31Z) - Beyond Raw Detection Scores: Markov-Informed Calibration for Boosting Machine-Generated Text Detection [105.14032334647932]
Machine-generated texts (MGTs) pose risks such as disinformation and phishing, highlighting the need for reliable detection. Metric-based methods, which extract statistically distinguishable features of MGTs, are often more practical than complex model-based methods that are prone to overfitting. We propose a Markov-informed score calibration strategy that models two relationships among contextual detection scores to aid calibration.
arXiv Detail & Related papers (2026-02-08T16:06:12Z) - Procrustean Bed for AI-Driven Retrosynthesis: A Unified Framework for Reproducible Evaluation [0.0]
RetroCast is a unified evaluation suite that standardizes heterogeneous model outputs into a common schema. We evaluate leading search-based and sequence-based algorithms on a new suite of standardized benchmarks.
arXiv Detail & Related papers (2025-12-08T01:26:39Z) - Mixture of Ranks with Degradation-Aware Routing for One-Step Real-World Image Super-Resolution [76.66229730098759]
In real-world image super-resolution (Real-ISR), existing approaches mainly rely on fine-tuning pre-trained diffusion models. We propose a Mixture-of-Ranks (MoR) architecture for single-step image super-resolution. We introduce a fine-grained expert partitioning strategy that treats each rank in LoRA as an independent expert.
arXiv Detail & Related papers (2025-11-20T04:11:44Z) - Automated Skill Decomposition Meets Expert Ontologies: Bridging the Granularity Gap with LLMs [1.2891210250935148]
This paper investigates automated skill decomposition using Large Language Models (LLMs). Our framework standardizes the pipeline from prompting and generation to normalization and alignment with ontology nodes. To evaluate outputs, we introduce two metrics: an F1-score that uses optimal embedding-based matching to assess content accuracy, and a hierarchy-aware F1-score that credits structurally correct placements to assess granularity.
arXiv Detail & Related papers (2025-10-13T12:03:06Z) - Source-Free Object Detection with Detection Transformer [59.33653163035064]
Source-Free Object Detection (SFOD) enables knowledge transfer from a source domain to an unsupervised target domain for object detection without access to source data. Most existing SFOD approaches are either confined to conventional object detection (OD) models like Faster R-CNN or designed as general solutions without tailored adaptations for novel OD architectures, especially Detection Transformer (DETR). In this paper, we introduce Feature Reweighting ANd Contrastive Learning NetworK (FRANCK), a novel SFOD framework specifically designed to perform query-centric feature enhancement for DETRs.
arXiv Detail & Related papers (2025-10-13T07:35:04Z) - ERIS: An Energy-Guided Feature Disentanglement Framework for Out-of-Distribution Time Series Classification [51.07970070817353]
An ideal time series classification (TSC) model should be able to capture invariant representations. Current methods are largely unguided, lacking the semantic direction required to isolate truly universal features. We propose an end-to-end Energy-Regularized Information for Shift-Robustness framework to enable guided and reliable feature disentanglement.
arXiv Detail & Related papers (2025-08-19T12:13:41Z) - Arg-LLaDA: Argument Summarization via Large Language Diffusion Models and Sufficiency-Aware Refinement [27.673022970833163]
We introduce Arg-LLaDA, a novel large language diffusion framework that iteratively improves summaries. Our method combines a flexible masking controller with a sufficiency-checking module to identify and revise unsupported, redundant, or incomplete spans. Empirical results on two benchmark datasets demonstrate that Arg-LLaDA surpasses state-of-the-art baselines in 7 out of 10 automatic evaluation metrics.
arXiv Detail & Related papers (2025-07-25T09:07:52Z) - AlignRAG: Leveraging Critique Learning for Evidence-Sensitive Retrieval-Augmented Reasoning [61.28113271728859]
RAG has become a widely adopted paradigm for enabling knowledge-grounded large language models (LLMs). Standard RAG pipelines often fail to ensure that model reasoning remains consistent with the evidence retrieved, leading to factual inconsistencies or unsupported conclusions. In this work, we reinterpret RAG as Retrieval-Augmented Reasoning and identify a central but underexplored problem: Reasoning Misalignment.
arXiv Detail & Related papers (2025-04-21T04:56:47Z) - MoC: Mixtures of Text Chunking Learners for Retrieval-Augmented Generation System [11.793639794583498]
This paper introduces a dual-metric evaluation method, comprising Boundary Clarity and Chunk Stickiness. We highlight the inherent limitations of traditional and semantic chunking in handling complex contextual nuances. We devise the Mixture-aware Mixture-of-Chunkers (MoC) framework, which consists of a three-stage processing mechanism.
arXiv Detail & Related papers (2025-03-12T17:59:42Z) - StructTest: Benchmarking LLMs' Reasoning through Compositional Structured Outputs [78.84060166851805]
StructTest is a novel benchmark that evaluates large language models (LLMs) on their ability to follow compositional instructions and generate structured outputs. Assessments are conducted deterministically using a rule-based evaluator, which can be easily extended to new tasks and datasets. We demonstrate that StructTest remains challenging even for top-performing models like Deepseek-V3/R1 and GPT-4o.
arXiv Detail & Related papers (2024-12-23T22:08:40Z) - Atomic Fact Decomposition Helps Attributed Question Answering [29.67882325906939]
Attributed Question Answering (AQA) aims to provide both a trustworthy answer and a reliable attribution report for a question. This paper proposes an Atomic fact decomposition-based Retrieval and Editing framework. It decomposes the generated long-form answers into molecular clauses and atomic facts by the instruction-tuned LLMs.
arXiv Detail & Related papers (2024-10-22T05:25:54Z) - Localizing Factual Inconsistencies in Attributable Text Generation [74.11403803488643]
We introduce QASemConsistency, a new formalism for localizing factual inconsistencies in attributable text generation. We show that QASemConsistency yields factual consistency scores that correlate well with human judgments.
arXiv Detail & Related papers (2024-10-09T22:53:48Z) - A Closer Look at Claim Decomposition [42.07832693585166]
We investigate how various methods of claim decomposition affect the result of an evaluation approach such as the recently proposed FActScore.
We then propose an LLM-based approach to generating decompositions inspired by Bertrand Russell's theory of logical atomism and neo-Davidsonian semantics.
arXiv Detail & Related papers (2024-03-18T16:03:45Z) - DCR-Consistency: Divide-Conquer-Reasoning for Consistency Evaluation and Improvement of Large Language Models [4.953092503184905]
This work proposes DCR, an automated framework for evaluating and improving the consistency of Large Language Models (LLMs) generated texts.
We introduce an automatic metric converter (AMC) that translates the output from DCE into an interpretable numeric score.
Our approach also reduces output inconsistencies by nearly 90%, showing promise for effective hallucination mitigation.
arXiv Detail & Related papers (2024-01-04T08:34:16Z) - Multi-Fact Correction in Abstractive Text Summarization [98.27031108197944]
Span-Fact is a suite of two factual correction models that leverages knowledge learned from question answering models to make corrections in system-generated summaries via span selection.
Our models employ single or multi-masking strategies to either iteratively or auto-regressively replace entities in order to ensure semantic consistency w.r.t. the source text.
Experiments show that our models significantly boost the factual consistency of system-generated summaries without sacrificing summary quality in terms of both automatic metrics and human evaluation.
arXiv Detail & Related papers (2020-10-06T02:51:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.