Related papers: Lookahead Tree-Based Rollouts for Enhanced Trajectory-Level Exploration in Reinforcement Learning with Verifiable Rewards

URL: http://arxiv.org/abs/2510.24302v2
Date: Wed, 29 Oct 2025 06:08:17 GMT
Title: Lookahead Tree-Based Rollouts for Enhanced Trajectory-Level Exploration in Reinforcement Learning with Verifiable Rewards
Authors: Shangyu Xing, Siyuan Wang, Chenyuan Yang, Xinyu Dai, Xiang Ren
Abstract summary: Lookahead Tree-Based Rollouts (LATR) is a novel rollout strategy designed to explicitly promote trajectory-level diversity. LATR accelerates policy learning by 131% on average and improves final pass@1 performance by 4.2%.
Score: 48.321707628011005
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR), particularly with algorithms like Group Relative Policy Optimization (GRPO), has proven highly effective in enhancing the reasoning capabilities of large language models. However, a critical bottleneck in current pipelines lies in the limited diversity of sampled trajectories during group rollouts. Homogeneous trajectories and their associated rewards diminish the return signals for policy updates, thereby hindering effective policy learning. This lack of diversity stems primarily from token-level stochastic sampling, where local variations are likely to collapse into near-identical reasoning paths. To address this limitation, we propose Lookahead Tree-Based Rollouts (LATR), a novel rollout strategy designed to explicitly promote trajectory-level diversity by enforcing branching into different candidate tokens likely to yield distinct continuations. Specifically, LATR iteratively operates in three stages: (1) branching at high-uncertainty generation steps, (2) performing lookahead simulation for each new branch, and (3) pruning branches that exhibit prolonged similarity during simulation. Compared with stochastic sampling, LATR accelerates policy learning by 131% on average and improves final pass@1 performance by 4.2% on both GRPO and Dynamic sAmpling Policy Optimization (DAPO) algorithms across different reasoning tasks. Our code and data are publicly available at https://github.com/starreeze/latr.
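The abstract's point that homogeneous trajectories weaken the update signal can be illustrated with a minimal sketch of group-relative advantage normalization in the style of GRPO. This is not taken from the LATR code base: the function name `group_relative_advantages`, the `eps` value, and the use of the population standard deviation are assumptions made for illustration, and real implementations may differ in these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own rollout group.

    Hypothetical helper: population std and eps=1e-6 are assumptions,
    not details confirmed by the paper.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every reward in the group is equal
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group whose rollouts all receive the same verifiable reward:
# every advantage is zero, so the group contributes no learning signal.
homogeneous = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

# A group with mixed outcomes: correct rollouts get positive advantages,
# incorrect ones get negative advantages.
mixed = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

print(homogeneous)
print(mixed)
```

Under these assumptions, a group of identical rewards yields all-zero advantages, which is the situation a diversity-promoting rollout strategy aims to make less frequent.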

Related papers

TopoCurate: Modeling Interaction Topology for Tool-Use Agent Training [53.93696896939915]
Training tool-use agents typically relies on Supervised Fine-Tuning (SFT) on successful trajectories and Reinforcement Learning (RL) on pass-rate-selected tasks. We propose TopoCurate, an interaction-aware framework that projects multi-trial rollouts from the same task into a unified semantic quotient topology. TopoCurate achieves consistent gains of 4.2% (SFT) and 6.9% (RL) over state-of-the-art baselines.
arXiv Detail & Related papers (2026-03-02T10:38:54Z)
R^3: Replay, Reflection, and Ranking Rewards for LLM Reinforcement Learning [32.16683059021539]
Large reasoning models (LRMs) aim to solve diverse and complex problems through structured reasoning. Recent advances in group-based policy optimization methods have shown promise in enabling stable advantage estimation without reliance on process-level annotations. We propose a reinforcement learning mechanism named R^3 that proceeds along three directions: (1) a cross-context Replay strategy that maintains the intra-group advantage, (2) an in-context self-Reflection mechanism
arXiv Detail & Related papers (2026-01-27T13:55:34Z)
TreeGRPO: Tree-Advantage GRPO for Online RL Post-Training of Diffusion Models [14.130608036489336]
Reinforcement learning (RL) post-training is crucial for aligning generative models with human preferences, but its prohibitive computational cost remains a major barrier to widespread adoption. We introduce TreeGRPO, a novel RL framework that dramatically improves training efficiency by recasting the denoising process as a search tree.
arXiv Detail & Related papers (2025-12-09T01:17:34Z)
Empowering Multi-Turn Tool-Integrated Reasoning with Group Turn Policy Optimization [20.004150645050537]
Group Turn Policy Optimization (GTPO) is a novel reinforcement learning algorithm designed for training Large Language Models (LLMs) on multi-turn Tool-Integrated Reasoning tasks. GTPO introduces three key innovations: turn-level reward assignment that provides fine-grained feedback for individual turns, return-based advantage estimation, and self-supervised reward shaping. Our comprehensive evaluation demonstrates that GTPO outperforms GRPO by 3.0% on average across diverse reasoning benchmarks.
arXiv Detail & Related papers (2025-11-18T19:01:16Z)
One-Token Rollout: Guiding Supervised Fine-Tuning of LLMs with Policy Gradient [16.05489579792086]
We introduce one-token rollout (OTR), a novel fine-tuning algorithm that guides SFT with the policy gradient method. OTR reframes the autoregressive learning process by treating each token generation as a single-step reinforcement learning trajectory. Our findings establish OTR as a powerful and practical alternative for fine-tuning LLMs.
arXiv Detail & Related papers (2025-09-30T14:25:56Z)
Random Policy Valuation is Enough for LLM Reasoning with Verifiable Rewards [47.557539197058496]
We introduce Random Policy Valuation for Diverse Reasoning (ROVER). ROVER is a minimalist yet highly effective RL method that samples actions from a softmax over uniform-policy Q-values. It demonstrates superior performance in both quality (+8.2 on pass@1, +16.8 on pass@256) and diversity (+17.6%).
arXiv Detail & Related papers (2025-09-29T16:09:07Z)
Learning More with Less: A Dynamic Dual-Level Down-Sampling Framework for Efficient Policy Optimization [42.2119634259269]
Critic-free methods like GRPO reduce memory demands by estimating advantages from multiple rollouts but tend to converge slowly. We propose the Dynamic Dual-Level Down-Sampling (D$^3$S) framework that prioritizes the most informative samples and tokens across groups to improve the efficiency of policy optimization.
arXiv Detail & Related papers (2025-09-26T09:36:53Z)
TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling [65.46347858249295]
TreePO is a self-guided rollout algorithm that views sequence generation as a tree-structured searching process. TreePO essentially reduces the per-update compute burden while preserving or enhancing exploration diversity.
arXiv Detail & Related papers (2025-08-24T16:52:37Z)
RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning [125.96848846966087]
Training large language models (LLMs) as interactive agents presents unique challenges. While reinforcement learning has enabled progress in static tasks, multi-turn agent RL training remains underexplored. We propose StarPO, a general framework for trajectory-level agent RL, and introduce RAGEN, a modular system for training and evaluating LLM agents.
arXiv Detail & Related papers (2025-04-24T17:57:08Z)
Not All Rollouts are Useful: Down-Sampling Rollouts in LLM Reinforcement Learning [55.15106182268834]
Reinforcement learning with verifiable rewards (RLVR) has emerged as the leading approach for enhancing reasoning capabilities in large language models. It faces a fundamental compute and memory asymmetry: rollout generation is embarrassingly parallel and memory-light, whereas policy updates are communication-heavy and memory-intensive. We introduce PODS (Policy Optimization with Down-Sampling), which decouples rollout generation from policy updates by training only on a strategically selected subset of rollouts.
arXiv Detail & Related papers (2025-04-18T17:49:55Z)
REBEL: Reinforcement Learning via Regressing Relative Rewards [59.68420022466047]
We propose REBEL, a minimalist RL algorithm for the era of generative models. In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL. We find that REBEL provides a unified approach to language modeling and image generation, with performance stronger than or similar to PPO and DPO.
arXiv Detail & Related papers (2024-04-25T17:20:45Z)
Harnessing Mixed Offline Reinforcement Learning Datasets via Trajectory Weighting [29.21380944341589]
We show that state-of-the-art offline RL algorithms are overly restrained by low-return trajectories and fail to exploit trajectories to the fullest. This reweighted sampling strategy may be combined with any offline RL algorithm. We empirically show that while CQL, IQL, and TD3+BC achieve only a part of this potential policy improvement, these same algorithms fully exploit the dataset.
arXiv Detail & Related papers (2023-06-22T17:58:02Z)

This list is automatically generated from the titles and abstracts of the papers on this site.