CriticSearch: Fine-Grained Credit Assignment for Search Agents via a Retrospective Critic
- URL: http://arxiv.org/abs/2511.12159v1
- Date: Sat, 15 Nov 2025 11:06:57 GMT
- Title: CriticSearch: Fine-Grained Credit Assignment for Search Agents via a Retrospective Critic
- Authors: Yaocheng Zhang, Haohuan Huang, Zijun Song, Yuanheng Zhu, Qichao Zhang, Zijie Zhao, Dongbin Zhao
- Abstract summary: CriticSearch is a fine-grained credit-assignment framework that supplies dense, turn-level feedback via a retrospective critic mechanism. Experimental results across diverse multi-hop reasoning benchmarks demonstrate that CriticSearch consistently outperforms existing baselines.
- Score: 24.371889836599138
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Tool-Integrated Reasoning (TIR) with search engines enables large language models to iteratively retrieve up-to-date external knowledge, enhancing adaptability and generalization in complex question-answering tasks. However, existing search agent pipelines typically depend on reinforcement learning based optimization, which often suffers from sparse outcome rewards, leading to inefficient exploration and unstable training. We introduce CriticSearch, a fine-grained credit-assignment framework that supplies dense, turn-level feedback via a retrospective critic mechanism. During training, a frozen, asymmetric critique LLM retrospectively evaluates each turn using privileged information from the full trajectory and gold answers, converting these assessments into stable, dense rewards that guide policy improvement. Experimental results across diverse multi-hop reasoning benchmarks demonstrate that CriticSearch consistently outperforms existing baselines, achieving faster convergence, improved training stability, and higher performance.
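The mechanism described in the abstract lends itself to a compact illustration. Below is a minimal sketch of the retrospective-critic reward shaping, assuming the frozen critic LLM is exposed as a plain scoring callable; all names (`Turn`, `Trajectory`, `retrospective_turn_rewards`) and the prompt wording are hypothetical, since the paper's actual prompts, scoring scale, and RL algorithm are not reproduced here.

```python
# Hypothetical sketch of retrospective, turn-level credit assignment.
# The critic is "asymmetric": it scores each turn with privileged hindsight
# (the full trajectory and the gold answer), which the policy never sees.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Turn:
    thought: str        # the agent's reasoning for this turn
    query: str          # the search query it issued
    observation: str    # retrieved documents / tool output


@dataclass
class Trajectory:
    question: str
    turns: List[Turn]
    final_answer: str


def retrospective_turn_rewards(
    traj: Trajectory,
    gold_answer: str,
    critic: Callable[[str], float],  # frozen critic LLM: prompt -> score in [0, 1]
) -> List[float]:
    """Score every turn after the episode ends, using privileged information."""
    rewards = []
    for i, turn in enumerate(traj.turns):
        prompt = (
            f"Question: {traj.question}\n"
            f"Gold answer (privileged): {gold_answer}\n"
            f"Full trajectory (privileged): {traj.turns}\n"
            f"Final answer: {traj.final_answer}\n"
            f"Evaluate turn {i}: thought={turn.thought!r}, "
            f"query={turn.query!r}, observation={turn.observation!r}\n"
            "How much did this turn contribute to reaching the gold answer? "
            "Reply with a score between 0 and 1."
        )
        # The critic stays frozen; only the policy is updated from these rewards.
        rewards.append(critic(prompt))
    return rewards
```

In an RL pipeline, these per-turn scores would replace or augment the single outcome reward, so advantage estimates can be computed per turn rather than once per episode, which is what yields the dense learning signal the abstract credits for faster convergence and more stable training.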
Related papers
- SRR-Judge: Step-Level Rating and Refinement for Enhancing Search-Integrated Reasoning in Search Agents [30.92763154920672]
We introduce SRR-Judge, a framework for reliable step-level assessment of reasoning and search actions. SRR-Judge provides fine-grained guidance for search-integrated reasoning and enables efficient post-training annotation. Empirically, SRR-Judge delivers more reliable step-level evaluations than much larger models such as DeepSeek-V3.1.
arXiv Detail & Related papers (2026-02-08T02:07:41Z)
- Search-R2: Enhancing Search-Integrated Reasoning via Actor-Refiner Collaboration [49.9937230730202]
We propose Search-R2, a novel Actor-Refiner collaboration framework that enhances reasoning through targeted intervention. Our approach decomposes the generation process into an Actor, which produces initial reasoning trajectories, and a Refiner, which applies targeted interventions to them. We show that Search-R2 consistently outperforms strong RAG and RL-based baselines across model scales.
arXiv Detail & Related papers (2026-02-03T15:32:09Z)
- No More Stale Feedback: Co-Evolving Critics for Open-World Agent Learning [21.237273221334963]
ECHO is a framework that jointly optimizes the policy and critic through a synchronized co-evolutionary loop. ECHO yields more stable training and higher long-horizon task success across open-world environments.
arXiv Detail & Related papers (2026-01-11T07:29:08Z)
- Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards [60.0970117192627]
Reinforcement learning (RL) has emerged as a critical technique for enhancing LLM-based deep search agents. Existing approaches primarily rely on binary outcome rewards, which fail to capture the comprehensiveness and factuality of agents' reasoning processes. We propose Citation-aware Rubric Rewards (CaRR), a fine-grained reward framework for deep search agents.
arXiv Detail & Related papers (2026-01-09T18:57:53Z)
- RLAC: Reinforcement Learning with Adversarial Critic for Free-Form Generation Tasks [75.52891348667491]
Open-ended generation tasks require outputs to satisfy diverse and often implicit task-specific evaluation rubrics. The sheer number of relevant rubrics leads to prohibitively high verification costs and incomplete assessments of a response. We propose Reinforcement Learning with Adversarial Critic (RLAC), a post-training approach that addresses these challenges via dynamic rubric verification.
arXiv Detail & Related papers (2025-11-03T17:15:05Z)
- Critique-RL: Training Language Models for Critiquing through Two-Stage Reinforcement Learning [89.60378227969643]
We propose Critique-RL, an online RL approach for developing critiquing language models without stronger supervision. Our approach operates on a two-player paradigm: the actor generates a response, the critic provides feedback, and the actor refines the response accordingly. Experiments across various tasks and models show that Critique-RL delivers substantial performance improvements.
arXiv Detail & Related papers (2025-10-28T11:37:01Z)
- Beyond Correctness: Rewarding Faithful Reasoning in Retrieval-Augmented Generation [21.72639961371058]
We introduce a comprehensive framework for evaluating RL-based search agents. To foster faithful reasoning, we introduce VERITAS, a novel framework that integrates fine-grained faithfulness rewards into the reinforcement learning process.
arXiv Detail & Related papers (2025-10-15T08:17:52Z)
- Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback [59.078756231841574]
Critique-GRPO is an online RL framework that integrates both natural language and numerical feedback for effective policy optimization. We show that Critique-GRPO consistently outperforms supervised learning and RL-based fine-tuning methods across eight challenging mathematical, STEM, and general reasoning tasks.
arXiv Detail & Related papers (2025-06-03T17:39:02Z)
- Contextual Candor: Enhancing LLM Trustworthiness Through Hierarchical Unanswerability Detection [0.0]
This paper introduces Reinforced Unanswerability Learning (RUL), a novel hybrid training paradigm for large language models (LLMs). RUL integrates a discriminative unanswerability prediction head with the LLM's generative core, guided by a multi-stage learning strategy. Experiments demonstrate RUL's superior performance, achieving significantly higher accuracy in unanswerability detection across sentence, paragraph, and ranking levels.
arXiv Detail & Related papers (2025-06-01T17:59:27Z) - RAG-Zeval: Towards Robust and Interpretable Evaluation on RAG Responses through End-to-End Rule-Guided Reasoning [64.46921169261852]
RAG-Zeval is a novel end-to-end framework that formulates faithfulness and correctness evaluation as a rule-guided reasoning task. Our approach trains evaluators with reinforcement learning, enabling compact models to generate comprehensive and sound assessments. Experiments demonstrate RAG-Zeval's superior performance, achieving the strongest correlation with human judgments.
arXiv Detail & Related papers (2025-05-28T14:55:33Z) - Search and Refine During Think: Facilitating Knowledge Refinement for Improved Retrieval-Augmented Reasoning [35.35813310224967]
Large language models have demonstrated impressive reasoning capabilities but are inherently limited by their knowledge reservoir. Retrieval-augmented reasoning mitigates this limitation by allowing LLMs to query external resources. We propose AutoRefine, a reinforcement learning framework that adopts a new "search-and-refine-during-think" paradigm.
arXiv Detail & Related papers (2025-05-16T14:11:29Z) - Self-Evolving Critique Abilities in Large Language Models [59.861013614500024]
This paper explores enhancing the critique abilities of Large Language Models (LLMs). We introduce SCRIT, a framework that trains LLMs with self-generated data to evolve their critique abilities. Our analysis reveals that SCRIT's performance scales positively with data and model size.
arXiv Detail & Related papers (2025-01-10T05:51:52Z) - RaCT: Ranking-aware Chain-of-Thought Optimization for LLMs [30.216174551427443]
Large language models (LLMs) have demonstrated remarkable potential in text reranking tasks. Conventional supervised fine-tuning approaches for specializing LLMs in ranking tasks often lead to significant degradation of the models' general-purpose abilities. This paper presents a novel methodology that strategically combines Chain-of-Thought (CoT) prompting techniques with an innovative two-stage training pipeline.
arXiv Detail & Related papers (2024-12-18T23:24:15Z) - Can We Further Elicit Reasoning in LLMs? Critic-Guided Planning with Retrieval-Augmentation for Solving Challenging Tasks [68.49251303172674]
State-of-the-art large language models (LLMs) exhibit impressive problem-solving capabilities but may struggle with complex reasoning and factual correctness.
Existing methods harness the strengths of chain-of-thought and retrieval-augmented generation (RAG) to decompose a complex problem into simpler steps and apply retrieval to improve factual correctness.
We introduce Critic-guided planning with Retrieval-augmentation, CR-Planner, a novel framework that leverages fine-tuned critic models to guide both reasoning and retrieval processes through planning.
arXiv Detail & Related papers (2024-10-02T11:26:02Z)
- Progress or Regress? Self-Improvement Reversal in Post-training [26.051637877066327]
We propose a comprehensive evaluative framework to scrutinize the underlying enhancements of post-training paradigms for self-improvement.
We show that models with improved performance across benchmarks can paradoxically exhibit declines in broader, essential capabilities.
These findings indicate that current self-improvement practices through post-training are inadequate for equipping models to tackle more complex problems.
arXiv Detail & Related papers (2024-07-06T09:07:11Z)