Related papers: CRAFT: Calibrated Reasoning with Answer-Faithful Traces via Reinforcement Learning for Multi-Hop Question Answering

URL: http://arxiv.org/abs/2602.01348v1
Date: Sun, 01 Feb 2026 17:33:39 GMT
Title: CRAFT: Calibrated Reasoning with Answer-Faithful Traces via Reinforcement Learning for Multi-Hop Question Answering
Authors: Yu Liu, Wenxiao Zhang, Cong Cao, Fangfang Yuan, Weizhuo Chen, Cheng Hu, Pin Xu, Yuling Yang, Kun Peng, Diandian Guo, Qiang Sun, Yanbing Liu, Jin B. Hong, Zhiyuan Ma
Abstract summary: Retrieval-augmented generation (RAG) is widely used to ground Large Language Models (LLMs) for multi-hop question answering. Reasoning in multi-hop QA is inherently complex due to multi-hop composition and is further destabilized by noisy retrieval. We propose CRAFT, a reinforcement learning framework that trains models to perform faithful reasoning during response generation.
Score: 19.391824811629125
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Retrieval-augmented generation (RAG) is widely used to ground Large Language Models (LLMs) for multi-hop question answering. Recent work has mainly focused on improving answer accuracy via fine-tuning and structured or reinforcement-based optimization. However, reliable reasoning in response generation faces three challenges: 1) Reasoning collapse. Reasoning in multi-hop QA is inherently complex due to multi-hop composition and is further destabilized by noisy retrieval. 2) Reasoning-answer inconsistency. Due to the intrinsic uncertainty of LLM generation and exposure to evidence-distractor mixtures, models may produce correct answers that are not faithfully supported by their intermediate reasoning or evidence. 3) Loss of format control. Traditional chain-of-thought generation often deviates from required structured output formats, leading to incomplete or malformed structured content. To address these challenges, we propose CRAFT (Calibrated Reasoning with Answer-Faithful Traces), a Group Relative Policy Optimization (GRPO) based reinforcement learning framework that trains models to perform faithful reasoning during response generation. CRAFT employs dual reward mechanisms to optimize multi-hop reasoning: deterministic rewards ensure structural correctness while judge-based rewards verify semantic faithfulness. This optimization framework supports controllable trace variants that enable systematic analysis of how structure and scale affect reasoning performance and faithfulness. Experiments on three multi-hop QA benchmarks show that CRAFT improves both answer accuracy and reasoning faithfulness across model scales, with the CRAFT 7B model achieving competitive performance with closed-source LLMs across multiple reasoning trace settings.
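
The dual-reward idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the `<think>`/`<answer>` tag format, the equal reward weights, and the `toy_judge` string-matching stand-in for an LLM judge are all assumptions made for illustration, and the abstract does not say how the two rewards are combined. The sketch only shows a deterministic format check, a judge-style score, and the group-relative advantage normalization that GRPO applies to a group of sampled responses.

```python
import re
from statistics import mean, pstdev

# Hypothetical trace format: these tag names are an assumption, not CRAFT's schema.
TRACE_PATTERN = re.compile(
    r"^<think>(.+?)</think>\s*<answer>(.+?)</answer>$", re.DOTALL
)


def format_reward(response: str) -> float:
    """Deterministic reward: 1.0 if the response has the expected structure."""
    return 1.0 if TRACE_PATTERN.match(response.strip()) else 0.0


def toy_judge(response: str, evidence: str) -> float:
    """Stand-in for an LLM judge (assumption): 1.0 if the answer text
    appears in the evidence, else 0.0."""
    match = TRACE_PATTERN.match(response.strip())
    if match is None:
        return 0.0
    answer = match.group(2).strip().lower()
    return 1.0 if answer in evidence.lower() else 0.0


def combined_reward(response: str, evidence: str,
                    w_format: float = 0.5, w_judge: float = 0.5) -> float:
    """Weighted sum of both rewards; the weights are illustrative only."""
    return (w_format * format_reward(response)
            + w_judge * toy_judge(response, evidence))


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantage: each reward normalized by the group's mean and
    standard deviation (population std is used in this sketch)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


evidence = "Marie Curie was born in Warsaw."
responses = [
    "<think>The evidence states her birthplace.</think><answer>Warsaw</answer>",
    "<think>Guessing.</think><answer>Paris</answer>",
    "Warsaw",  # no structure, so both rewards are zero
]
rewards = [combined_reward(r, evidence) for r in responses]
advantages = group_relative_advantages(rewards)
```

In this toy group, the well-formed and supported response receives a positive advantage, the well-formed but unsupported one sits at the group mean, and the unstructured one receives a negative advantage.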

Related papers

Balancing Faithfulness and Performance in Reasoning via Multi-Listener Soft Execution [79.98699884805636]
Reasoning Execution by Multiple Listeners (REMUL) is a multi-party reinforcement learning approach. REMUL builds on the hypothesis that reasoning traces which other parties can follow will be more faithful. Speakers are rewarded for producing reasoning that is clear to listeners.
arXiv Detail & Related papers (2026-02-18T02:55:55Z)
Search-R2: Enhancing Search-Integrated Reasoning via Actor-Refiner Collaboration [49.9937230730202]
We propose Search-R2, a novel Actor-Refiner collaboration framework that enhances reasoning through targeted intervention. Our approach decomposes the generation process into an Actor, which produces initial reasoning trajectories. We show that Search-R2 consistently outperforms strong RAG and RL-based baselines across model scales.
arXiv Detail & Related papers (2026-02-03T15:32:09Z)
Structured Reasoning for Large Language Models [59.215789462977206]
We propose Structured Reasoning (SCR), a framework that decouples reasoning trajectories into explicit, evaluable, and trainable components. SCR substantially improves reasoning efficiency and self-verification. Compared with existing reasoning paradigms, it reduces output token length by up to 50%.
arXiv Detail & Related papers (2026-01-12T04:04:01Z)
PRISMA: Reinforcement Learning Guided Two-Stage Policy Optimization in Multi-Agent Architecture for Open-Domain Multi-Hop Question Answering [26.994531058178982]
Answering real-world open-domain questions over massive corpora is a critical challenge in Retrieval-Augmented Generation (RAG) systems. Recent research employs reinforcement learning (RL) to end-to-end optimize the retrieval-augmented reasoning process. We propose PRISMA, a decoupled-guided framework featuring a Plan-Retrieve-Inspect-Memoize architecture.
arXiv Detail & Related papers (2026-01-09T01:38:38Z)
Adversarial Yet Cooperative: Multi-Perspective Reasoning in Retrieved-Augmented Language Models [72.4149653187766]
We propose a Reasoner-Verifier framework named Adversarial Reasoning RAG (ARR). The Reasoner and Verifier engage in reasoning on retrieved evidence and critiquing each other's logic while being guided by process-aware advantage. Experiments on multiple benchmarks demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2026-01-08T06:57:03Z)
From <Answer> to <Think>: Multidimensional Supervision of Reasoning Process for LLM Optimization [62.07990937720985]
Dimension-level Reward Model (DRM) is a new supervision framework for Large Language Models. DRM evaluates the quality of a reasoning process along three fundamental, complementary, and interpretable dimensions. Experimental results show that DRM provides effective supervision signals, guides the optimization of LLMs and enhances their reasoning ability.
arXiv Detail & Related papers (2025-10-13T14:29:15Z)
Answer-Consistent Chain-of-thought Reinforcement Learning For Multi-modal Large Language Models [33.398631680508814]
We propose Answer-Consistent Reinforcement Learning (ACRE) that modifies the GRPO algorithm with an auxiliary consistency check. We design a consistency-verification reward that grants a high reward only if both the original and the post-shuffle answers agree and are correct. We evaluate ACRE on challenging Video Reasoning benchmarks and multimodal math reasoning benchmarks, achieving an average 2.2% and 1.5% improvement.
arXiv Detail & Related papers (2025-10-11T08:32:52Z)
Boosting Process-Correct CoT Reasoning by Modeling Solvability of Multiple-Choice QA [10.122669382758122]
We show that when questions are effectively unsolvable for a model, spurious chains of thought (CoTs) are more likely to appear. We adapt outcome-supervised reward models and reinforcement learning with group-relative advantage to incorporate solvability into their objectives. Our results highlight solvability as a key factor for reducing hallucinations and increasing reliability in CoT reasoning.
arXiv Detail & Related papers (2025-09-30T08:34:16Z)
From Sufficiency to Reflection: Reinforcement-Guided Thinking Quality in Retrieval-Augmented Reasoning for LLMs [13.410543801811992]
This paper analyzes existing RAG reasoning models and identifies three main failure patterns. We propose TIRESRAG-R1, a novel framework using a think-retrieve-reflect process and a multi-dimensional reward system. Experiments on four multi-hop QA datasets show that TIRESRAG-R1 outperforms prior RAG methods and generalizes well to single-hop tasks.
arXiv Detail & Related papers (2025-07-30T14:29:44Z)
ComposeRAG: A Modular and Composable RAG for Corpus-Grounded Multi-Hop Question Answering [42.238086712267396]
ComposeRAG is a novel modular abstraction that decomposes RAG pipelines into atomic, composable modules. It consistently outperforms strong baselines in both accuracy and grounding fidelity. Its verification-first design reduces ungrounded answers by over 10% in low-quality retrieval settings.
arXiv Detail & Related papers (2025-05-30T21:10:30Z)
DualRAG: A Dual-Process Approach to Integrate Reasoning and Retrieval for Multi-Hop Question Answering [57.875992666888855]
Multi-Hop Question Answering (MHQA) tasks pose challenges in orchestrating multi-step reasoning across diverse knowledge domains. We propose DualRAG, a synergistic dual-process framework that seamlessly integrates reasoning and retrieval.
arXiv Detail & Related papers (2025-04-25T10:43:53Z)
Aggregation of Reasoning: A Hierarchical Framework for Enhancing Answer Selection in Large Language Models [84.15513004135576]
Current research enhances the reasoning performance of Large Language Models (LLMs) by sampling multiple reasoning chains and ensembling based on the answer frequency. This approach fails in scenarios where the correct answers are in the minority. We introduce a hierarchical reasoning aggregation framework AoR, which selects answers based on the evaluation of reasoning chains.
arXiv Detail & Related papers (2024-05-21T17:12:19Z)

This list is automatically generated from the titles and abstracts of the papers on this site.