Beyond Correctness: Rewarding Faithful Reasoning in Retrieval-Augmented Generation
- URL: http://arxiv.org/abs/2510.13272v1
- Date: Wed, 15 Oct 2025 08:17:52 GMT
- Title: Beyond Correctness: Rewarding Faithful Reasoning in Retrieval-Augmented Generation
- Authors: Zhichao Xu, Zongyu Wu, Yun Zhou, Aosong Feng, Kang Zhou, Sangmin Woo, Kiran Ramnath, Yijun Tian, Xuan Qi, Weikang Qiu, Lin Lee Cheong, Haibo Ding,
- Abstract summary: We introduce a comprehensive evaluation framework for evaluating RL-based search agents.<n>To foster faithful reasoning, we introduce VERITAS, a novel framework that integrates fine-grained faithfulness rewards into the reinforcement learning process.
- Score: 21.72639961371058
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Inspired by the success of reinforcement learning (RL) in Large Language Model (LLM) training for domains like math and code, recent works have begun exploring how to train LLMs to use search engines more effectively as tools for retrieval-augmented generation. Although these methods achieve performance improvement across QA benchmarks, many prioritize final answer correctness while overlooking the quality of intermediate reasoning steps, which may lead to chain-of-thought unfaithfulness. In this paper, we first introduce a comprehensive evaluation framework for evaluating RL-based search agents, covering three distinct faithfulness metrics: information-think faithfulness, think-answer faithfulness, and think-search faithfulness. Our evaluations reveal that a prototypical RL-based search agent, Search-R1, has significant room for improvement in this regard. To foster faithful reasoning, we introduce VERITAS (Verifying Entailed Reasoning through Intermediate Traceability in Agentic Search), a novel framework that integrates fine-grained faithfulness rewards into the reinforcement learning process. Our experiments show that models trained with VERITAS not only significantly improve reasoning faithfulness, but also achieve comparable task performance across seven QA benchmarks.
Related papers
- Search-R2: Enhancing Search-Integrated Reasoning via Actor-Refiner Collaboration [49.9937230730202]
We propose Search-R2, a novel Actor-Refiner collaboration framework that enhances reasoning through targeted intervention.<n>Our approach decomposes the generation process into an Actor, which produces initial reasoning trajectories.<n>We show that Search-R2 consistently outperforms strong RAG and RL-based baselines across model scales.
arXiv Detail & Related papers (2026-02-03T15:32:09Z) - Exploring Reasoning Reward Model for Agents [30.458783880389216]
Agentic Reinforcement Learning (Agentic RL) has achieved notable success in enabling agents to perform complex reasoning and tool use.<n>Most methods still relies on sparse outcome-based reward for training.<n>We introduce Agent Reasoning Reward Model (Agent-RRM), a multi-faceted reward model that produces structured feedback for agentic trajectories.
arXiv Detail & Related papers (2026-01-29T18:59:52Z) - Tailored Primitive Initialization is the Secret Key to Reinforcement Learning [61.29280885291581]
Reinforcement learning (RL) has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs)<n>We argue that initializing LLMs with diverse, high-quality reasoning primitives is essential for achieving stable and sample-efficient RL training.<n>We propose Tailor, a finetuning pipeline that automatically discovers and curates novel reasoning primitives.
arXiv Detail & Related papers (2025-11-16T03:12:40Z) - CriticSearch: Fine-Grained Credit Assignment for Search Agents via a Retrospective Critic [24.371889836599138]
CriticSearch is a fine-grained credit-assignment framework that supplies dense, turn-level feedback via a retrospective critic mechanism.<n> Experimental results across diverse multi-hop reasoning benchmarks demonstrate that CriticSearch consistently outperforms existing baselines.
arXiv Detail & Related papers (2025-11-15T11:06:57Z) - Confidence as a Reward: Transforming LLMs into Reward Models [54.98336080630691]
Confidence-as-a-Reward (CRew) is a training-free method that utilizes token-level confidence in the model's final answers as a proxy for reward.<n>We show that CRew outperforms existing training-free reward approaches on the MATH500 and RewardMATH benchmarks.<n>We propose CRew-DPO, a training strategy that constructs preference data from confidence scores combined with correctness signals.
arXiv Detail & Related papers (2025-10-15T12:51:47Z) - Demystifying Reinforcement Learning in Agentic Reasoning [90.3737088727791]
We conduct a comprehensive and systematic investigation to demystify reinforcement learning in agentic reasoning.<n>We highlight our key insights: (i) replacing stitched synthetic trajectories with real end-to-end tool-use trajectories yields a far stronger SFT.<n> Exploration-friendly techniques are crucial for agentic RL, such as clip higher, overlong reward shaping, and maintaining adequate policy entropy could improve the training efficiency.
arXiv Detail & Related papers (2025-10-13T17:57:15Z) - Beyond Outcome Reward: Decoupling Search and Answering Improves LLM Agents [19.31471304268234]
We introduce DeSA (Decoupling Search-and-Answering), a simple two-stage training framework that explicitly separates search optimization from answer generation.<n>Across seven QA benchmarks, DeSA-trained agents consistently improve search behaviors, delivering substantially higher search recall and answer accuracy than outcome-only baselines.
arXiv Detail & Related papers (2025-10-06T11:09:45Z) - Veri-R1: Toward Precise and Faithful Claim Verification via Online Reinforcement Learning [53.05161493434908]
Claim verification with large language models (LLMs) has recently attracted growing attention, due to their strong reasoning capabilities and transparent verification processes.<n>We introduce Veri-R1, an online reinforcement learning framework that enables an LLM to interact with a search engine and to receive reward signals that explicitly shape its planning, retrieval, and reasoning behaviors.<n> Empirical results show that Veri-R1 improves joint accuracy by up to 30% and doubles the evidence score, often surpassing its larger-scale model counterparts.
arXiv Detail & Related papers (2025-10-02T11:49:48Z) - ReSeek: A Self-Correcting Framework for Search Agents with Instructive Rewards [18.92867715736209]
We propose ReSeek, a novel self-correcting framework for training search agents.<n>Our framework introduces a self-correction mechanism that empowers the agent to dynamically identify and recover from erroneous search paths.<n>To mitigate the risk of data contamination in existing datasets, we introduce FictionalHot.
arXiv Detail & Related papers (2025-10-01T06:44:28Z) - Improving Context Fidelity via Native Retrieval-Augmented Reasoning [35.50952279309109]
Large language models (LLMs) often struggle with context fidelity, producing inconsistent answers when responding to questions based on provided information.<n>We propose CARE, a novel native retrieval-augmented reasoning framework that teaches LLMs to explicitly integrate in-context evidence within their reasoning process with the model's own retrieval capabilities.<n>Our method requires limited labeled evidence data while significantly enhancing both retrieval accuracy and answer generation performance through strategically retrieved in-context tokens in the reasoning chain.
arXiv Detail & Related papers (2025-09-17T04:28:07Z) - Consistent Paths Lead to Truth: Self-Rewarding Reinforcement Learning for LLM Reasoning [87.7836502955847]
We propose a novel self-rewarding reinforcement learning framework to enhance Large Language Model (LLM) reasoning.<n>Our key insight is that correct responses often exhibit consistent trajectory patterns in terms of model likelihood.<n>We introduce CoVo, an intrinsic reward mechanism that integrates Consistency and Volatility via a robust vector-space aggregation strategy.
arXiv Detail & Related papers (2025-06-10T12:40:39Z) - LeTS: Learning to Think-and-Search via Process-and-Outcome Reward Hybridization [30.95342819013663]
Large language models (LLMs) have demonstrated impressive capabilities in reasoning.<n>Recent research focuses on integrating reasoning capabilities into the realm of retrieval-augmented generation (RAG) via outcome-supervised reinforcement learning (RL) approaches.<n>We propose Learning to Think-and-Search (LeTS), a novel framework that hybridizes stepwise process reward and outcome-based reward to current RL methods for RAG.
arXiv Detail & Related papers (2025-05-23T04:04:05Z) - J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning [54.85131761693927]
We introduce J1, a reinforcement learning framework for teaching LLM judges to think before making decisions.<n>Our core contribution lies in converting all judgment tasks for non-verifiable and verifiable prompts into a unified format with verifiable rewards.<n>We then use RL to train thinking-judges at scales of 8B, 32B, and 70B and show that they obtain state-of-the-art performance.
arXiv Detail & Related papers (2025-05-15T14:05:15Z) - ReMA: Learning to Meta-think for LLMs with Multi-Agent Reinforcement Learning [53.817538122688944]
We introduce Reinforced Meta-thinking Agents (ReMA) to elicit meta-thinking behaviors from Reasoning of Large Language Models (LLMs)<n>ReMA decouples the reasoning process into two hierarchical agents: a high-level meta-thinking agent responsible for generating strategic oversight and plans, and a low-level reasoning agent for detailed executions.<n> Empirical results from single-turn experiments demonstrate that ReMA outperforms single-agent RL baselines on complex reasoning tasks.
arXiv Detail & Related papers (2025-03-12T16:05:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.