Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training
- URL: http://arxiv.org/abs/2602.22576v1
- Date: Thu, 26 Feb 2026 03:31:00 GMT
- Title: Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training
- Authors: Tianle Xia, Ming Xu, Lingxiang Hu, Yiding Sun, Wenwei Li, Linfang Shang, Liqun Liu, Peng Shu, Huan Yu, Jie Jiang
- Abstract summary: Agentic RAG enhances large language models by incorporating external knowledge. Current RL-based training methods suffer from sparse outcome rewards that discard intermediate signals. We propose Search-P1, a framework that introduces path-centric reward shaping for agentic RAG training.
- Score: 11.136092421166097
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by incorporating external knowledge, yet traditional single-round retrieval struggles with complex multi-step reasoning. Agentic RAG addresses this by enabling LLMs to dynamically decide when and what to retrieve, but current RL-based training methods suffer from sparse outcome rewards that discard intermediate signals, and from low sample efficiency, since failed samples contribute nothing to learning. We propose Search-P1, a framework that introduces path-centric reward shaping for agentic RAG training, comprising two key components: (1) Path-Centric Reward, which evaluates the structural quality of reasoning trajectories through order-agnostic step coverage and soft scoring that extracts learning signals even from failed samples, and (2) Dual-Track Path Scoring, which uses offline-generated reference planners to assess paths from both self-consistency and reference-alignment perspectives. Experiments on multiple QA benchmarks demonstrate that Search-P1 achieves significant improvements over Search-R1 and other strong baselines, with an average accuracy gain of 7.7 points.
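To make the shaping concrete, here is a minimal Python sketch of an order-agnostic step-coverage reward blended with the outcome reward, plus the two-track blend. The function names, the linear combinations with weights `alpha`/`beta`, and the token-overlap matcher are illustrative assumptions, not the paper's actual formulation.

```python
from typing import Callable, List

def step_coverage(path: List[str], reference: List[str],
                  match: Callable[[str, str], bool]) -> float:
    """Order-agnostic coverage: fraction of reference steps matched
    by ANY step in the sampled path, regardless of position."""
    if not reference:
        return 0.0
    hit = sum(1 for ref in reference if any(match(s, ref) for s in path))
    return hit / len(reference)

def dual_track_score(self_consistency: float, reference_alignment: float,
                     beta: float = 0.5) -> float:
    """Blend the two scoring tracks the abstract names."""
    return beta * self_consistency + (1.0 - beta) * reference_alignment

def path_centric_reward(path: List[str], reference: List[str],
                        answer_correct: bool,
                        match: Callable[[str, str], bool],
                        alpha: float = 0.5) -> float:
    """Soft shaping: a failed sample (answer_correct=False) still earns a
    partial reward proportional to the quality of its reasoning path."""
    outcome = 1.0 if answer_correct else 0.0
    return (1.0 - alpha) * outcome + alpha * step_coverage(path, reference, match)

# Toy usage with a crude token-overlap matcher (an assumption, not the paper's).
overlap = lambda a, b: len(set(a.lower().split()) & set(b.lower().split())) >= 3
r = path_centric_reward(
    path=["search for the film's director", "search the director's birth year"],
    reference=["find the director of the film", "find the director's birth year"],
    answer_correct=False, match=overlap)
print(f"shaped reward for a failed sample: {r:.2f}")  # > 0, unlike a sparse reward
```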
Related papers
- Truncated Step-Level Sampling with Process Rewards for Retrieval-Augmented Reasoning [32.295907409325615]
Training large language models to reason with search engines via reinforcement learning is hindered by a credit assignment problem. We propose SLATE, a framework built on two complementary ideas. Experiments on seven QA benchmarks confirm that SLATE consistently outperforms both sparse-reward and process-reward baselines.
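The abstract does not spell out the two ideas, but going by the title, one plausible reading is step-level credit assignment via truncated rollouts scored by a process reward. A sketch under that assumption (all names and the toy stand-ins are ours):

```python
import random
from typing import Callable, List

def step_credit(continue_fn: Callable[[List[str], int], List[str]],
                process_reward: Callable[[List[str]], float],
                prefix: List[str], n_rollouts: int = 4,
                max_extra_steps: int = 3) -> float:
    """Estimate the credit of the last step in `prefix` by sampling a few
    TRUNCATED continuations from it and scoring them with a process reward,
    instead of waiting for a full trajectory and a sparse outcome."""
    scores = [process_reward(continue_fn(prefix, max_extra_steps))
              for _ in range(n_rollouts)]
    return sum(scores) / len(scores)

# Toy stand-ins so the sketch runs end to end (assumptions, not SLATE's).
def toy_continue(prefix, k):
    return prefix + [f"step-{len(prefix) + i}" for i in range(random.randint(1, k))]

def toy_process_reward(traj):
    return min(1.0, len(traj) / 5.0)  # pretend longer structured paths score higher

print(step_credit(toy_continue, toy_process_reward, ["retrieve docs", "compare dates"]))
```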
arXiv Detail & Related papers (2026-02-26T19:05:40Z)
- Where Relevance Emerges: A Layer-Wise Study of Internal Attention for Zero-Shot Re-Ranking [40.652380579951206]
In-Context Re-ranking (ICR) has recently been proposed as an $O(1)$ alternative method. ICR extracts internal attention signals directly, avoiding the overhead of text generation. No unified study has compared internal attention with traditional generative and likelihood-based mechanisms.
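A minimal sketch of attention-based re-ranking with Hugging Face transformers is below. The prompt template, the choice of a single layer, and mean-pooling over heads are our assumptions; which layer and which aggregation work best is precisely what a layer-wise study investigates.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def rerank_by_attention(model, tokenizer, query: str, docs: list[str],
                        layer: int = -1) -> list[int]:
    """Single forward pass: score each document by the mean attention its
    token span receives from the query tokens placed at the end of the
    prompt, then return document indices sorted best-first."""
    pieces = [f"Doc {i}: {d}\n" for i, d in enumerate(docs)] + [f"Query: {query}"]
    ids, spans, off = [], [], 0
    for p in pieces:                      # track each piece's token span
        tok = tokenizer(p, add_special_tokens=False)["input_ids"]
        spans.append((off, off + len(tok)))
        ids.extend(tok)
        off += len(tok)
    with torch.no_grad():
        out = model(input_ids=torch.tensor([ids]), output_attentions=True)
    attn = out.attentions[layer][0].mean(dim=0)        # average heads -> (seq, seq)
    qs, qe = spans[-1]                                 # query token span
    scores = [attn[qs:qe, s:e].mean().item() for s, e in spans[:-1]]
    return sorted(range(len(docs)), key=lambda i: -scores[i])

# Usage with any small causal LM, e.g.:
#   tok = AutoTokenizer.from_pretrained("gpt2")
#   lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()
#   print(rerank_by_attention(lm, tok, "capital of France?",
#                             ["Berlin is in Germany.", "Paris is in France."]))
```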
arXiv Detail & Related papers (2026-02-26T03:51:31Z)
- Search-R2: Enhancing Search-Integrated Reasoning via Actor-Refiner Collaboration [49.9937230730202]
We propose Search-R2, a novel Actor-Refiner collaboration framework that enhances reasoning through targeted intervention. Our approach decomposes the generation process into an Actor, which produces initial reasoning trajectories, and a Refiner, which applies targeted interventions to them. We show that Search-R2 consistently outperforms strong RAG and RL-based baselines across model scales.
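A minimal sketch of such an actor-refiner loop follows; the check-then-intervene control flow is the only part taken from the abstract, and every component name and toy stand-in here is an assumption.

```python
from typing import Callable

def actor_refiner_generate(actor: Callable[[str], str],
                           needs_fix: Callable[[str], bool],
                           refiner: Callable[[str, str], str],
                           question: str, max_rounds: int = 3) -> str:
    """The Actor drafts a reasoning trajectory; the Refiner intervenes only
    when a check flags it (targeted intervention) rather than rewriting
    everything unconditionally."""
    trajectory = actor(question)
    for _ in range(max_rounds):
        if not needs_fix(trajectory):
            break
        trajectory = refiner(question, trajectory)
    return trajectory

# Toy stand-ins so the loop runs (assumptions, not Search-R2's components).
actor = lambda q: f"<think>draft answer for: {q}</think>"
needs_fix = lambda t: "draft" in t          # pretend a verifier flagged the draft
refiner = lambda q, t: t.replace("draft", "revised")
print(actor_refiner_generate(actor, needs_fix, refiner, "who founded arXiv?"))
```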
arXiv Detail & Related papers (2026-02-03T15:32:09Z)
- ProRAG: Process-Supervised Reinforcement Learning for Retrieval-Augmented Generation [54.071574153853994]
ProRAG is a process-supervised reinforcement learning framework designed to integrate learned step-level supervision into the online optimization loop. Our framework consists of four stages: (1) Supervised Policy Warmup to initialize the model with a structured reasoning format; (2) construction of an MCTS-based Process Reward Model (PRM) to quantify intermediate reasoning quality; (3) PRM-Guided Reasoning Refinement to align the policy with fine-grained process preferences; and (4) Process-Supervised Reinforcement Learning with a dual-granularity advantage mechanism.
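The abstract names a "dual-granularity advantage" without defining it. One plausible reading, sketched below purely as an illustration (the broadcast-plus-centered-step-term form and the weight `lam` are our assumptions), is a trajectory-level advantage combined with a PRM step-level term:

```python
def dual_granularity_advantage(outcome_adv: float, step_scores: list[float],
                               lam: float = 0.5) -> list[float]:
    """Broadcast the trajectory-level advantage to every step, then add a
    step-level term from the PRM, centered so the step track is zero-mean
    within the trajectory. Assumes step_scores is non-empty."""
    mean_s = sum(step_scores) / len(step_scores)
    return [(1 - lam) * outcome_adv + lam * (s - mean_s) for s in step_scores]

# Steps above the trajectory's own average get extra credit:
print(dual_granularity_advantage(1.0, [0.2, 0.9, 0.4]))  # [0.35, 0.7, 0.45]
```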
arXiv Detail & Related papers (2026-01-29T16:04:59Z)
- Enhancing Agentic RL with Progressive Reward Shaping and Value-based Sampling Policy Optimization [13.475938754147625]
Large Language Models (LLMs) empowered with Tool-Integrated Reasoning (TIR) can iteratively plan, call external tools, and integrate returned information to solve complex, long-horizon reasoning tasks. Agentic Reinforcement Learning (Agentic RL) optimizes such models over full tool-interaction trajectories. Two key challenges hinder effectiveness: (1) sparse, non-instructive rewards, such as binary 0-1 verifiable signals, provide limited guidance for intermediate steps and slow convergence. We propose two complementary techniques: Progressive Reward Shaping (PRS) and Value-based Sampling Policy Optimization (VSPO).
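One way to make a binary 0-1 reward instructive is to pay dense auxiliary signals early and anneal them away; the sketch below shows that general idea, though the linear schedule and the particular auxiliary terms (format and tool-call validity) are our assumptions, not necessarily what PRS does.

```python
def progressive_reward(answer_correct: bool, format_ok: bool, tool_call_ok: bool,
                       step: int, total_steps: int) -> float:
    """Blend dense auxiliary signals with the verifiable outcome reward,
    annealing the shaping weight to zero so training ends on the pure
    sparse reward."""
    w = max(0.0, 1.0 - step / total_steps)          # shaping weight: 1 -> 0
    shaped = 0.5 * float(format_ok) + 0.5 * float(tool_call_ok)
    return (1.0 - w) * float(answer_correct) + w * shaped

# Early in training, a well-formed but wrong rollout still gets signal:
print(progressive_reward(False, True, True, step=100, total_steps=10_000))    # 0.99
print(progressive_reward(False, True, True, step=9_900, total_steps=10_000))  # 0.01
```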
arXiv Detail & Related papers (2025-12-08T11:59:25Z)
- From Sufficiency to Reflection: Reinforcement-Guided Thinking Quality in Retrieval-Augmented Reasoning for LLMs [13.410543801811992]
This paper analyzes existing RAG reasoning models and identifies three main failure patterns. We propose TIRESRAG-R1, a novel framework using a think-retrieve-reflect process and a multi-dimensional reward system. Experiments on four multi-hop QA datasets show that TIRESRAG-R1 outperforms prior RAG methods and generalizes well to single-hop tasks.
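The title suggests two of the reward dimensions (evidence sufficiency and reflection quality) alongside answer correctness. A sketch of such a combiner; the exact dimensions, weights, and scoring functions are our assumptions:

```python
def multi_dimensional_reward(answer_correct: bool, sufficiency: float,
                             reflection_quality: float,
                             weights=(0.6, 0.2, 0.2)) -> float:
    """Illustrative weighted sum of reward dimensions: answer correctness,
    a score for whether the retrieved evidence sufficed, and a score for
    the reflect step. Each auxiliary score is assumed to lie in [0, 1]."""
    wa, ws, wr = weights
    return wa * float(answer_correct) + ws * sufficiency + wr * reflection_quality

print(multi_dimensional_reward(True, sufficiency=0.8, reflection_quality=0.5))  # 0.86
```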
arXiv Detail & Related papers (2025-07-30T14:29:44Z)
- GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning [53.894789613838654]
We introduce SEED-Bench-R1, a benchmark with complex real-world videos requiring balanced perception and reasoning. Using SEED-Bench-R1, we find that standard GRPO, while improving answer accuracy, often reduces logical coherence between reasoning steps and answers, with only a 57.9% consistency rate. We propose GRPO-CARE, a consistency-aware RL framework optimizing both answer correctness and reasoning coherence without explicit supervision.
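A sketch of the reward shape this implies: correctness plus a consistency bonus. Proxying consistency by the likelihood a reference model assigns to the answer given only the reasoning is our assumption, as is the weight `lam`.

```python
def care_reward(answer_correct: bool, p_answer_given_reasoning: float,
                lam: float = 0.3) -> float:
    """Consistency-aware shaping: reward answer correctness, plus a bonus
    when the stated reasoning actually supports the answer (here proxied
    by a reference model's likelihood of the answer given the reasoning)."""
    return float(answer_correct) + lam * p_answer_given_reasoning

# A correct answer with incoherent reasoning scores below one with coherent
# reasoning, the failure mode the 57.9% consistency figure points at:
print(care_reward(True, p_answer_given_reasoning=0.1))  # 1.03
print(care_reward(True, p_answer_given_reasoning=0.9))  # 1.27
```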
arXiv Detail & Related papers (2025-06-19T08:49:13Z)
- RAG-Zeval: Towards Robust and Interpretable Evaluation on RAG Responses through End-to-End Rule-Guided Reasoning [64.46921169261852]
RAG-Zeval is a novel end-to-end framework that formulates faithfulness and correctness evaluation as a rule-guided reasoning task. Our approach trains evaluators with reinforcement learning, enabling compact models to generate comprehensive and sound assessments. Experiments demonstrate RAG-Zeval's superior performance, achieving the strongest correlation with human judgments.
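A crude procedural analogue of rule-guided faithfulness checking is sketched below. RAG-Zeval trains an LLM evaluator to do this end to end; the claim decomposition, the `supported` rule, and the aggregation here are placeholders of our own.

```python
from typing import Callable, List

def rule_guided_eval(claims: List[str], evidence: List[str],
                     supported: Callable[[str, List[str]], bool]) -> dict:
    """Decompose a response into atomic claims, check each claim against the
    retrieved evidence under a rule, and aggregate into a faithfulness score."""
    verdicts = [supported(c, evidence) for c in claims]
    return {"per_claim": verdicts,
            "faithfulness": sum(verdicts) / len(verdicts) if claims else 1.0}

# Naive substring rule, purely for illustration:
naive = lambda claim, ev: any(claim.lower() in e.lower() for e in ev)
print(rule_guided_eval(["paris is in france"],
                       ["Paris is in France and is its capital."], naive))
```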
arXiv Detail & Related papers (2025-05-28T14:55:33Z)
- J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning [54.85131761693927]
We introduce J1, a reinforcement learning framework for teaching LLM judges to think before making decisions. Our core contribution lies in converting all judgment tasks for non-verifiable and verifiable prompts into a unified format with verifiable rewards. We then use RL to train thinking-judges at scales of 8B, 32B, and 70B and show that they obtain state-of-the-art performance.
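One concrete instance of a verifiable judging reward, sketched under our own assumptions about the pair construction: build pairs where the better side is known in advance, randomize position, and reward the judge only when its verdict matches the known label.

```python
import random

def pairwise_verifiable_reward(judge_verdict: str, gold_side: str) -> float:
    """Verifiable reward for a judgment: 1 if the verdict ('A' or 'B')
    matches the side known in advance to be better, else 0."""
    return 1.0 if judge_verdict == gold_side else 0.0

def make_pair(good: str, bad: str):
    """Randomize position so the judge cannot exploit position bias."""
    if random.random() < 0.5:
        return (good, bad), "A"
    return (bad, good), "B"

pair, label = make_pair("correct, grounded answer", "hallucinated answer")
print(pair, "gold side:", label, "reward:", pairwise_verifiable_reward("A", label))
```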
arXiv Detail & Related papers (2025-05-15T14:05:15Z)
- Chain-of-Retrieval Augmented Generation [91.02950964802454]
This paper introduces an approach for training o1-like RAG models that retrieve and reason over relevant information step by step before generating the final answer. Our proposed method, CoRAG, allows the model to dynamically reformulate the query based on the evolving state.
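The retrieve-and-reformulate control flow the abstract describes can be sketched as a simple loop; the component interfaces and toy stand-ins below are our assumptions, not CoRAG's actual implementation.

```python
from typing import Callable, List

def chain_of_retrieval(reformulate: Callable[[str, List[str]], str],
                       retrieve: Callable[[str], str],
                       answer_ready: Callable[[List[str]], bool],
                       question: str, max_hops: int = 4) -> List[str]:
    """Step-by-step retrieval: after each hop, rewrite the query from the
    evolving state instead of reusing the original question."""
    state: List[str] = []
    query = question
    for _ in range(max_hops):
        state.append(retrieve(query))
        if answer_ready(state):
            break
        query = reformulate(question, state)   # dynamic query reformulation
    return state

# Toy stand-ins so the loop runs (assumptions, not CoRAG's components).
retrieve = lambda q: f"passage about: {q}"
reformulate = lambda q, st: f"{q} (given {len(st)} passages)"
answer_ready = lambda st: len(st) >= 2
print(chain_of_retrieval(reformulate, retrieve, answer_ready, "who mentored Plato?"))
```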
arXiv Detail & Related papers (2025-01-24T09:12:52Z)