Training Multi-Turn Search Agent via Contrastive Dynamic Branch Sampling
- URL: http://arxiv.org/abs/2602.03719v1
- Date: Tue, 03 Feb 2026 16:43:09 GMT
- Title: Training Multi-Turn Search Agent via Contrastive Dynamic Branch Sampling
- Authors: Yubao Zhao, Weiquan Huang, Sudong Wang, Ruochen Zhao, Chen Chen, Yao Shu, Chengwei Qin
- Abstract summary: Branching Relative Policy Optimization (BranPO) is a value-free method that provides step-level contrastive supervision without dense rewards. BranPO truncates trajectories near the tail and resamples alternative continuations to construct contrastive suffixes over shared prefixes. To further boost efficiency and stabilize training, we introduce difficulty-aware branch sampling to adapt branching frequency across tasks, and redundant step masking to suppress uninformative actions.
- Score: 29.182538022605627
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Agentic reinforcement learning has enabled large language models to perform complex multi-turn planning and tool use. However, learning in long-horizon settings remains challenging due to sparse, trajectory-level outcome rewards. While prior tree-based methods attempt to mitigate this issue, they often suffer from high variance and computational inefficiency. Through empirical analysis of search agents, we identify a common pattern: performance diverges mainly due to decisions near the tail. Motivated by this observation, we propose Branching Relative Policy Optimization (BranPO), a value-free method that provides step-level contrastive supervision without dense rewards. BranPO truncates trajectories near the tail and resamples alternative continuations to construct contrastive suffixes over shared prefixes, reducing credit ambiguity in long-horizon rollouts. To further boost efficiency and stabilize training, we introduce difficulty-aware branch sampling to adapt branching frequency across tasks, and redundant step masking to suppress uninformative actions. Extensive experiments on various question answering benchmarks demonstrate that BranPO consistently outperforms strong baselines, achieving significant accuracy gains on long-horizon tasks without increasing the overall training budget. Our code is available at https://github.com/YubaoZhao/BranPO.
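The abstract's core mechanism can be sketched in a few lines. The snippet below is a minimal, hypothetical illustration of tail-truncated branch resampling with a group-relative contrastive signal; the function name `rollout_continuation`, the `tail_fraction` value, and the advantage formula are illustrative assumptions, not the authors' implementation.

```python
import random
from typing import Callable, List, Tuple

def branch_sample(
    trajectory: List[str],
    rollout_continuation: Callable[[List[str]], Tuple[List[str], float]],
    num_branches: int = 4,
    tail_fraction: float = 0.25,
) -> List[Tuple[List[str], float, float]]:
    """Truncate a trajectory near its tail and resample alternative
    continuations, yielding contrastive suffixes over a shared prefix."""
    assert len(trajectory) >= 2, "need at least one step before the tail"
    # Pick a truncation point in the last `tail_fraction` of the steps,
    # where (per the paper's analysis) performance tends to diverge.
    tail_start = max(1, int(len(trajectory) * (1.0 - tail_fraction)))
    cut = random.randint(tail_start, len(trajectory) - 1)
    prefix = trajectory[:cut]

    # Resample alternative continuations from the current policy.
    branches = [rollout_continuation(prefix) for _ in range(num_branches)]
    rewards = [r for _, r in branches]

    # Value-free, step-level contrastive signal: each suffix is scored
    # relative to the mean outcome of its sibling branches.
    baseline = sum(rewards) / len(rewards)
    return [(suffix, r, r - baseline) for suffix, r in branches]
```

Because the advantage `r - baseline` is applied only to suffix tokens, credit assignment stays local to the diverging tail rather than being smeared over the whole rollout.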
Related papers
- Search-R2: Enhancing Search-Integrated Reasoning via Actor-Refiner Collaboration [49.9937230730202]
We propose Search-R2, a novel Actor-Refiner collaboration framework that enhances reasoning through targeted intervention. Our approach decomposes the generation process into an Actor, which produces initial reasoning trajectories, and a Refiner, which intervenes to correct them. We show that Search-R2 consistently outperforms strong RAG and RL-based baselines across model scales.
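One way such a collaboration could be wired up is sketched below, under assumed interfaces: `actor` drafts or continues a trajectory from a prefix, and `find_flaw` stands in for the Refiner's targeted intervention by locating a flawed step. Both names and the loop structure are hypothetical, not the paper's actual design.

```python
from typing import Callable, List, Optional

def actor_refiner_generate(
    actor: Callable[[str, List[str]], List[str]],
    find_flaw: Callable[[str, List[str]], Optional[int]],
    question: str,
    max_rounds: int = 3,
) -> List[str]:
    """Minimal Actor-Refiner loop (hypothetical interface): the actor
    drafts a trajectory; the refiner locates a flawed step; the actor
    regenerates from that point -- a targeted intervention rather than
    a full re-generation."""
    trajectory = actor(question, [])  # initial draft from an empty prefix
    for _ in range(max_rounds):
        flaw_index = find_flaw(question, trajectory)  # None if acceptable
        if flaw_index is None:
            break
        # Keep the verified prefix; regenerate only the flawed suffix.
        prefix = trajectory[:flaw_index]
        trajectory = prefix + actor(question, prefix)
    return trajectory
```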
arXiv Detail & Related papers (2026-02-03T15:32:09Z) - Spark: Strategic Policy-Aware Exploration via Dynamic Branching for Long-Horizon Agentic Learning [31.17280303212164]
We propose Spark (Strategic Policy-Aware exploRation via Key-state dynamic branching). Our key insight is to activate adaptive branching exploration at critical decision points to probe promising trajectories. Spark achieves superior success rates with significantly fewer training samples, exhibiting robust generalization even in unseen scenarios.
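A simple proxy for "critical decision point" is high policy entropy over the next action; the sketch below uses that heuristic as an assumed stand-in for Spark's key-state criterion, which the abstract does not specify.

```python
import math
from typing import Dict

def is_key_state(action_probs: Dict[str, float], entropy_threshold: float = 1.0) -> bool:
    """Heuristic stand-in for key-state detection: treat a state as a
    critical decision point when the policy's action distribution has
    high entropy (the paper's actual criterion may differ)."""
    entropy = -sum(p * math.log(p) for p in action_probs.values() if p > 0)
    return entropy >= entropy_threshold

# Example: a near-uniform distribution over tools marks a key state,
# so extra branch rollouts would be launched from here.
probs = {"search": 0.35, "browse": 0.33, "answer": 0.32}
print(is_key_state(probs))  # True
```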
arXiv Detail & Related papers (2026-01-28T03:15:34Z) - Reinforced Efficient Reasoning via Semantically Diverse Exploration [73.41112984160992]
Reinforcement learning with verifiable rewards (RLVR) has proven effective in enhancing the reasoning of large language models (LLMs). We propose reinforced efficient reasoning via semantically diverse exploration, i.e., ROSE, for LLMs. Our method incorporates a semantic-entropy-based branching strategy and an $\varepsilon$-exploration mechanism.
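The two ingredients named in the summary compose naturally, as in the hedged sketch below: entropy over semantic clusters of sampled continuations drives branching, with an $\varepsilon$ term forcing occasional exploratory branches. How continuations are clustered (e.g., by meaning equivalence) is assumed to happen upstream, and the threshold is illustrative.

```python
import math
import random
from collections import Counter
from typing import List

def semantic_entropy(cluster_ids: List[int]) -> float:
    """Entropy over semantic clusters of sampled continuations."""
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def should_branch(cluster_ids: List[int], threshold: float = 0.5, epsilon: float = 0.1) -> bool:
    """Branch when sampled continuations are semantically diverse, plus
    an epsilon-exploration term that occasionally forces a branch."""
    if random.random() < epsilon:
        return True  # exploratory branch regardless of entropy
    return semantic_entropy(cluster_ids) >= threshold
```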
arXiv Detail & Related papers (2026-01-08T15:56:44Z) - Anti-Length Shift: Dynamic Outlier Truncation for Training Efficient Reasoning Models [29.56923793047279]
We introduce Dynamic Outlier Truncation (DOT), a training-time intervention that selectively suppresses redundant tokens. DOT targets only the extreme tail of response lengths within fully correct rollout groups while preserving long-horizon reasoning capabilities. Our method reduces inference token usage by 78% while simultaneously increasing accuracy compared to the initial policy.
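The stated mechanism, dropping only length outliers in all-correct groups, can be sketched as a keep-mask over a rollout group; the 0.9 quantile below is an illustrative choice, not the paper's setting.

```python
from typing import List

def dot_keep_mask(lengths: List[int], rewards: List[float], quantile: float = 0.9) -> List[bool]:
    """Sketch of Dynamic Outlier Truncation: within a rollout group where
    every response is correct, mask out of the loss the responses whose
    lengths fall in the extreme tail."""
    if not all(r > 0 for r in rewards):
        # Mixed-correctness groups are left untouched, so long-horizon
        # reasoning on hard problems is preserved.
        return [True] * len(lengths)
    cutoff = sorted(lengths)[int(quantile * (len(lengths) - 1))]
    return [length <= cutoff for length in lengths]

# Example: the 4000-token outlier in an all-correct group gets masked.
print(dot_keep_mask([300, 350, 320, 4000], [1, 1, 1, 1]))  # [True, True, True, False]
```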
arXiv Detail & Related papers (2026-01-07T14:31:07Z) - BOTS: A Unified Framework for Bayesian Online Task Selection in LLM Reinforcement Finetuning [82.925106913459]
Reinforcement finetuning (RFT) is a key technique for aligning Large Language Models (LLMs) with human preferences and enhancing reasoning. We introduce BOTS, a unified framework for Bayesian Online Task Selection in RFT.
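One standard way to realize Bayesian online task selection is Thompson sampling over per-task solve rates, targeting intermediate difficulty where RFT gradients are typically most informative. The sketch below is that generic recipe, stated as an assumption: the actual BOTS objective and posterior family may differ.

```python
import random
from typing import Dict, Tuple

def select_task(posteriors: Dict[str, Tuple[float, float]], target: float = 0.5) -> str:
    """Keep a Beta(a, b) posterior over each task's solve rate,
    Thompson-sample one rate per task, and pick the task closest to an
    intermediate difficulty (`target`)."""
    samples = {t: random.betavariate(a, b) for t, (a, b) in posteriors.items()}
    return min(samples, key=lambda t: abs(samples[t] - target))

def update(posteriors: Dict[str, Tuple[float, float]], task: str, solved: bool) -> None:
    """Online posterior update after observing one rollout outcome."""
    a, b = posteriors[task]
    posteriors[task] = (a + solved, b + (not solved))

posteriors = {"easy": (9.0, 1.0), "medium": (5.0, 5.0), "hard": (1.0, 9.0)}
print(select_task(posteriors))  # usually "medium"
```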
arXiv Detail & Related papers (2025-10-30T11:15:23Z) - Reinforcing Video Reasoning with Focused Thinking [65.85683941058916]
We propose TW-GRPO, a novel framework that enhances visual reasoning with focused thinking and dense reward granularity. Specifically, we employ a token weighting mechanism that prioritizes tokens with high informational density. We also reformulate RL training by shifting from single-choice to multi-choice QA tasks.
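The token-weighting idea can be expressed as a weighted policy-gradient objective; the sketch below assumes per-token weights are supplied, abstracting away the paper's informational-density measure.

```python
import torch

def token_weighted_pg_loss(
    logprobs: torch.Tensor,    # (T,) log-probs of the taken tokens
    advantages: torch.Tensor,  # (T,) per-token (e.g., group-relative) advantages
    weights: torch.Tensor,     # (T,) informational-density weights in [0, 1]
) -> torch.Tensor:
    """Sketch in the spirit of TW-GRPO: tokens judged more informative
    contribute more to the update. How `weights` are computed is
    abstracted away here."""
    weighted = weights * advantages * logprobs
    return -(weighted.sum() / weights.sum().clamp(min=1e-8))
```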
arXiv Detail & Related papers (2025-05-30T15:42:19Z) - Reinforcing Multi-Turn Reasoning in LLM Agents via Turn-Level Reward Design [35.544075583073685]
We present the first systematic study of turn-level reward design for multi-turn RL algorithms and agent applications. We conduct case studies on multi-turn reasoning-augmented search agents, where we carefully design two types of turn-level rewards: verifiable and LLM-as-judge. Our experiments on multi-turn search tasks demonstrate that incorporating well-designed turn-level rewards enables RL algorithms to significantly outperform baseline methods with trajectory-level rewards.
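As a concrete example of what a verifiable turn-level reward for a search agent might look like, the sketch below credits a search turn whose retrieved passages contain a gold answer string. This criterion is illustrative only; the paper designs its own verifiable and LLM-as-judge variants.

```python
from typing import List

def verifiable_turn_reward(retrieved: List[str], gold_answers: List[str]) -> float:
    """A search turn earns credit if any retrieved passage contains a
    gold answer string (hypothetical criterion)."""
    hit = any(ans.lower() in doc.lower() for doc in retrieved for ans in gold_answers)
    return 1.0 if hit else 0.0

# Example: the second passage mentions the gold answer.
print(verifiable_turn_reward(
    ["The Nile flows north.", "Paris is the capital of France."],
    ["Paris"],
))  # 1.0
```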
arXiv Detail & Related papers (2025-05-17T04:09:46Z) - The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs [40.35884943268004]
We show that for very long sequences, larger and highly sparse models are preferable to smaller and dense ones. There is no clear strategy that performs best across tasks and phases, with different units of sparsification or budget adaptivity needed for different scenarios. We introduce and validate novel scaling laws specifically tailored for sparse attention, providing evidence that our findings are likely to hold true beyond our range of experiments.
arXiv Detail & Related papers (2025-04-24T17:39:25Z) - Path-Consistency with Prefix Enhancement for Efficient Inference in LLMs [3.6696973040141034]
Our empirical results demonstrate that path-consistency improves inference latency by up to 40.5%, while maintaining task accuracy across various tasks.
arXiv Detail & Related papers (2024-08-25T01:45:53Z) - Deep Boosting Learning: A Brand-new Cooperative Approach for Image-Text Matching [53.05954114863596]
We propose a brand-new Deep Boosting Learning (DBL) algorithm for image-text matching.
An anchor branch is first trained to provide insights into the data properties.
A target branch is concurrently tasked with more adaptive margin constraints to further enlarge the relative distance between matched and unmatched samples.
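The anchor/target split described above can be illustrated with a standard triplet-style ranking loss, where the target branch is trained under a larger margin; the margin values and loss form below are illustrative assumptions, not DBL's exact formulation.

```python
import torch

def branch_margin_loss(
    sim_pos: torch.Tensor,  # similarities of matched image-text pairs
    sim_neg: torch.Tensor,  # similarities of hardest unmatched pairs
    margin: float,
) -> torch.Tensor:
    """Hinge (triplet-style) ranking loss used by both branches."""
    return torch.clamp(margin + sim_neg - sim_pos, min=0.0).mean()

# Sketch of the idea: the anchor branch learns with a fixed margin,
# while the target branch is pushed with a larger margin to further
# separate matched from unmatched samples.
sim_pos = torch.tensor([0.8, 0.7])
sim_neg = torch.tensor([0.3, 0.5])
anchor_loss = branch_margin_loss(sim_pos, sim_neg, margin=0.2)
target_loss = branch_margin_loss(sim_pos, sim_neg, margin=0.4)
print(anchor_loss.item(), target_loss.item())
```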
arXiv Detail & Related papers (2024-04-28T08:44:28Z) - Reinforcement Learning for Branch-and-Bound Optimisation using Retrospective Trajectories [72.15369769265398]
Machine learning has emerged as a promising paradigm for branching.
We propose retro branching; a simple yet effective approach to RL for branching.
We outperform the current state-of-the-art RL branching algorithm by 3-5x and come within 20% of the best IL method's performance on MILPs with 500 constraints and 1000 variables.
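The "retrospective" part can be sketched as extracting training episodes from the finished branch-and-bound tree, for instance by walking back from the leaf holding the incumbent solution. This is a hedged sketch of the idea; the paper's actual path-selection rules differ in detail.

```python
from typing import Dict, List, Optional

def retrospective_trajectory(
    parent: Dict[int, Optional[int]],  # child node id -> parent node id
    chosen_leaf: int,
) -> List[int]:
    """Once branch-and-bound has finished, walk back from a chosen leaf
    (e.g., the one holding the incumbent solution) to the root, yielding
    a coherent sequence of branching decisions to train on."""
    path = [chosen_leaf]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return list(reversed(path))  # root -> ... -> chosen leaf

# Tiny example tree: 0 is the root; leaf 4 holds the best solution.
parent = {0: None, 1: 0, 2: 0, 3: 1, 4: 1}
print(retrospective_trajectory(parent, 4))  # [0, 1, 4]
```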
arXiv Detail & Related papers (2022-05-28T06:08:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.