Related papers: Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards

Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards

URL: http://arxiv.org/abs/2601.06021v1
Date: Fri, 09 Jan 2026 18:57:53 GMT
Title: Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards
Authors: Jiajie Zhang, Xin Lv, Ling Feng, Lei Hou, Juanzi Li,
Abstract summary: Reinforcement learning (RL) has emerged as a critical technique for enhancing LLM-based deep search agents.<n>Existing approaches primarily rely on binary outcome rewards, which fail to capture the comprehensiveness and factuality of agents' reasoning process.<n>We propose textbfCitation-aware RL Rewards (CaRR), a fine-grained reward framework for deep search agents.
Score: 60.0970117192627
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement learning (RL) has emerged as a critical technique for enhancing LLM-based deep search agents. However, existing approaches primarily rely on binary outcome rewards, which fail to capture the comprehensiveness and factuality of agents' reasoning process, and often lead to undesirable behaviors such as shortcut exploitation and hallucinations. To address these limitations, we propose \textbf{Citation-aware Rubric Rewards (CaRR)}, a fine-grained reward framework for deep search agents that emphasizes reasoning comprehensiveness, factual grounding, and evidence connectivity. CaRR decomposes complex questions into verifiable single-hop rubrics and requires agents to satisfy these rubrics by explicitly identifying hidden entities, supporting them with correct citations, and constructing complete evidence chains that link to the predicted answer. We further introduce \textbf{Citation-aware Group Relative Policy Optimization (C-GRPO)}, which combines CaRR and outcome rewards for training robust deep search agents. Experiments show that C-GRPO consistently outperforms standard outcome-based RL baselines across multiple deep search benchmarks. Our analysis also validates that C-GRPO effectively discourages shortcut exploitation, promotes comprehensive, evidence-grounded reasoning, and exhibits strong generalization to open-ended deep research tasks. Our code and data are available at https://github.com/THUDM/CaRR.

Related papers

SRR-Judge: Step-Level Rating and Refinement for Enhancing Search-Integrated Reasoning in Search Agents [30.92763154920672]
We introduce SRR-Judge, a framework for reliable step-level assessment of reasoning and search actions.<n>SRR-Judge provides fine-grained guidance for search-integrated reasoning and enables efficient post-training annotation.<n> Empirically, SRR-Judge delivers more reliable step-level evaluations than much larger models such as DeepSeek-V3.1.
arXiv Detail & Related papers (2026-02-08T02:07:41Z)
Search-R2: Enhancing Search-Integrated Reasoning via Actor-Refiner Collaboration [49.9937230730202]
We propose Search-R2, a novel Actor-Refiner collaboration framework that enhances reasoning through targeted intervention.<n>Our approach decomposes the generation process into an Actor, which produces initial reasoning trajectories.<n>We show that Search-R2 consistently outperforms strong RAG and RL-based baselines across model scales.
arXiv Detail & Related papers (2026-02-03T15:32:09Z)
Evidence-Augmented Policy Optimization with Reward Co-Evolution for Long-Context Reasoning [52.144281362465996]
We propose EAPO (Evidence-Augmented Policy Optimization) to apply Reinforcement Learning to long-context scenarios.<n>We first establish the Evidence-Augmented Reasoning paradigm, validating via Tree-Structured Evidence Sampling.<n>We then introduce a specialized RL algorithm where a reward model computes a Group-Relative Evidence Reward.<n>To sustain accurate supervision throughout training, we further incorporate an Adaptive Reward-Policy Co-Evolution mechanism.
arXiv Detail & Related papers (2026-01-15T11:40:57Z)
Repurposing Synthetic Data for Fine-grained Search Agent Supervision [81.95597592711688]
LLM-based search agents are increasingly trained on entity-centric synthetic data.<n> prevailing training methods discard this rich entity information, relying instead on sparse, outcome-based rewards.<n>We introduce Entity-aware Group Relative Policy Optimization (E-GRPO), a novel framework that formulates a dense entity-aware reward function.
arXiv Detail & Related papers (2025-10-28T17:50:40Z)
PoU: Proof-of-Use to Counter Tool-Call Hacking in DeepResearch Agents [24.502121097996294]
Retrieval-augmented generation (RAG) agents extend large language models with autonomous information-seeking capabilities through external tools.<n>We identify a previously overlooked failure mode, Tool-Call Hacking, where agents inflate reward signals by issuing superficially correct tool calls.<n>We propose Proof-of-Use (PoU), an evidence-grounded RL framework that enforces verifiable causal links between retrieved evidence, reasoning traces, and final answers.
arXiv Detail & Related papers (2025-10-13T02:45:37Z)
ReSeek: A Self-Correcting Framework for Search Agents with Instructive Rewards [18.92867715736209]
We propose ReSeek, a novel self-correcting framework for training search agents.<n>Our framework introduces a self-correction mechanism that empowers the agent to dynamically identify and recover from erroneous search paths.<n>To mitigate the risk of data contamination in existing datasets, we introduce FictionalHot.
arXiv Detail & Related papers (2025-10-01T06:44:28Z)
Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs [69.10441885629787]
Retrieval-Augmented Generation (RAG) lifts the factuality of Large Language Models (LLMs) by injecting external knowledge.<n>It falls short on problems that demand multi-step inference; conversely, purely reasoning-oriented approaches often hallucinate or mis-ground facts.<n>This survey synthesizes both strands under a unified reasoning-retrieval perspective.
arXiv Detail & Related papers (2025-07-13T03:29:41Z)
ClueAnchor: Clue-Anchored Knowledge Reasoning Exploration and Optimization for Retrieval-Augmented Generation [82.54090885503287]
Retrieval-Augmented Generation augments Large Language Models with external knowledge to improve factuality.<n>Existing RAG systems fail to extract and integrate the key clues needed to support faithful and interpretable reasoning.<n>We propose ClueAnchor, a novel framework for enhancing RAG via clue-anchored reasoning exploration and optimization.
arXiv Detail & Related papers (2025-05-30T09:18:08Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.