Repurposing Synthetic Data for Fine-grained Search Agent Supervision
- URL: http://arxiv.org/abs/2510.24694v1
- Date: Tue, 28 Oct 2025 17:50:40 GMT
- Title: Repurposing Synthetic Data for Fine-grained Search Agent Supervision
- Authors: Yida Zhao, Kuan Li, Xixi Wu, Liwen Zhang, Dingchu Zhang, Baixuan Li, Maojia Song, Zhuo Chen, Chenxi Wang, Xinyu Wang, Kewei Tu, Pengjun Xie, Jingren Zhou, Yong Jiang,
- Abstract summary: LLM-based search agents are increasingly trained on entity-centric synthetic data.<n> prevailing training methods discard this rich entity information, relying instead on sparse, outcome-based rewards.<n>We introduce Entity-aware Group Relative Policy Optimization (E-GRPO), a novel framework that formulates a dense entity-aware reward function.
- Score: 81.95597592711688
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: LLM-based search agents are increasingly trained on entity-centric synthetic data to solve complex, knowledge-intensive tasks. However, prevailing training methods like Group Relative Policy Optimization (GRPO) discard this rich entity information, relying instead on sparse, outcome-based rewards. This critical limitation renders them unable to distinguish informative "near-miss" samples-those with substantially correct reasoning but a flawed final answer-from complete failures, thus discarding valuable learning signals. We address this by leveraging the very entities discarded during training. Our empirical analysis reveals a strong positive correlation between the number of ground-truth entities identified during an agent's reasoning process and final answer accuracy. Building on this insight, we introduce Entity-aware Group Relative Policy Optimization (E-GRPO), a novel framework that formulates a dense entity-aware reward function. E-GRPO assigns partial rewards to incorrect samples proportional to their entity match rate, enabling the model to effectively learn from these "near-misses". Experiments on diverse question-answering (QA) and deep research benchmarks show that E-GRPO consistently and significantly outperforms the GRPO baseline. Furthermore, our analysis reveals that E-GRPO not only achieves superior accuracy but also induces more efficient reasoning policies that require fewer tool calls, demonstrating a more effective and sample-efficient approach to aligning search agents.
Related papers
- Unbiased Dynamic Pruning for Efficient Group-Based Policy Optimization [60.87651283510059]
Group Relative Policy Optimization (GRPO) effectively scales LLM reasoning but incurs prohibitive computational costs.<n>We propose Dynamic Pruning Policy Optimization (DPPO), a framework that enables dynamic pruning while preserving unbiased gradient estimation.<n>To mitigate the data sparsity induced by pruning, we introduce Dense Prompt Packing, a window-based greedy strategy.
arXiv Detail & Related papers (2026-03-04T14:48:53Z) - Graph-GRPO: Stabilizing Multi-Agent Topology Learning via Group Relative Policy Optimization [7.961090665261694]
We propose Graph-GRPO, a novel topology optimization framework that integrates Group Relative Policy Optimization.<n>By normalizing rewards across the sampled group, our method effectively mitigates the noise derived from task difficulty variance and enables fine-grained credit assignment.
arXiv Detail & Related papers (2026-03-03T07:45:40Z) - Optimizing Agentic Reasoning with Retrieval via Synthetic Semantic Information Gain Reward [24.738836592075927]
We introduce a unified framework that incentivizes effective information seeking via a synthetic semantic information gain reward.<n>Experiments across seven question-answering benchmarks demonstrate that InfoReasoner consistently outperforms strong retrieval-augmented baselines.<n>Our work provides a theoretically grounded and scalable path toward agentic reasoning with retrieval.
arXiv Detail & Related papers (2026-01-31T18:15:50Z) - Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards [60.0970117192627]
Reinforcement learning (RL) has emerged as a critical technique for enhancing LLM-based deep search agents.<n>Existing approaches primarily rely on binary outcome rewards, which fail to capture the comprehensiveness and factuality of agents' reasoning process.<n>We propose textbfCitation-aware RL Rewards (CaRR), a fine-grained reward framework for deep search agents.
arXiv Detail & Related papers (2026-01-09T18:57:53Z) - DaGRPO: Rectifying Gradient Conflict in Reasoning via Distinctiveness-Aware Group Relative Policy Optimization [20.66452395111739]
We propose Distinctiveness-aware Group Relative Policy Optimization (DaGRPO)<n>DaGRPO incorporates two core mechanisms: (1) Sequence-level Gradient Rectification, which utilizes fine-grained scoring to dynamically mask sample pairs with low distinctiveness; and (2) Off-policy Data Augmentation, which introduces high-quality anchors to recover training signals for challenging tasks.<n>In-depth analysis confirms that DaGRPO effectively mitigates gradient explosion and accelerates the emergence of long-chain reasoning capabilities.
arXiv Detail & Related papers (2025-12-06T07:51:36Z) - Multimodal Reinforcement Learning with Agentic Verifier for AI Agents [131.46008226323423]
Argos is a principled multimodal reward agent to train reasoning models for agentic tasks.<n>By leveraging our agentic verifier across both SFT data and RL training, our model achieves state-of-the-art results.
arXiv Detail & Related papers (2025-12-03T04:42:47Z) - Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn LLM Agents [28.145430029174577]
Large language model (LLM)-based agents are increasingly trained with reinforcement learning (RL) to enhance their ability to interact with external environments.<n>Existing approaches typically rely on outcome-based rewards that are only provided at the final answer.<n>In this paper, we propose Information Gain-based Policy Optimization (IGPO), a simple yet effective RL framework that provides dense and intrinsic supervision for multi-turn agent training.
arXiv Detail & Related papers (2025-10-16T17:59:32Z) - DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization [50.91849555841057]
Group Relative Policy Optimization is a reinforcement learning method for large reasoning models (LRMs)<n>We introduce a new Discriminative Constrained Optimization framework for reinforcing LRMs, grounded in the principle of discriminative learning.<n>DisCO significantly outperforms GRPO and its improved variants such as DAPO, achieving average gains of 7% over GRPO and 6% over DAPO.
arXiv Detail & Related papers (2025-05-18T11:08:32Z) - A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce [68.99924691391048]
We revisit GRPO from a reinforce-like algorithm perspective and analyze its core components.<n>We find that a simple rejection sampling baseline, RAFT, yields competitive performance than GRPO and PPO.<n>Motivated by this insight, we propose Reinforce-Rej, a minimal extension of policy gradient that filters both entirely incorrect and entirely correct samples.
arXiv Detail & Related papers (2025-04-15T16:15:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.