Tree Search for LLM Agent Reinforcement Learning
- URL: http://arxiv.org/abs/2509.21240v1
- Date: Thu, 25 Sep 2025 14:37:09 GMT
- Title: Tree Search for LLM Agent Reinforcement Learning
- Authors: Yuxiang Ji, Ziyu Ma, Yong Wang, Guanhua Chen, Xiangxiang Chu, Liaoni Wu
- Abstract summary: Tree-based Group Relative Policy Optimization (Tree-GRPO) is a grouped agent RL method based on tree search. By sharing common prefixes, the tree search sampling increases the number of rollouts achievable within a fixed budget. We demonstrate that the objective of intra-tree-level group relative policy optimization is equivalent to that of step-level direct preference learning.
- Score: 23.7084695563981
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in reinforcement learning (RL) have significantly enhanced the agentic capabilities of large language models (LLMs). In long-term and multi-turn agent tasks, existing approaches driven solely by outcome rewards often suffer from the problem of sparse supervision. To address the challenge, we propose Tree-based Group Relative Policy Optimization (Tree-GRPO), a grouped agent RL method based on tree search, where each tree node represents a complete agent interaction step. By sharing common prefixes, the tree search sampling increases the number of rollouts achievable within a fixed budget of tokens or tool calls. Moreover, we find that the tree-structured trajectory naturally allows the construction of step-wise process supervised signals even using only the outcome reward. Based on this, Tree-GRPO estimates the grouped relative advantages at both the intra-tree and inter-tree levels. Through theoretical analysis, we demonstrate that the objective of intra-tree-level group relative policy optimization is equivalent to that of step-level direct preference learning. Experiments across 11 datasets and 3 types of QA tasks demonstrate the superiority of the proposed tree-based RL over the chain-based RL method.
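The advantage estimation the abstract describes can be sketched in a few lines. This is a minimal illustrative reading, not the authors' implementation: the data structure and function names below are assumptions. The idea shown is that branches of a rollout tree sharing a common prefix can be compared against their siblings (intra-tree), while whole trees are normalized against each other GRPO-style (inter-tree), so a single outcome reward at each leaf yields step-wise relative signals.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev


@dataclass
class Node:
    """One complete agent interaction step in a rollout tree."""
    reward: float = 0.0                       # outcome reward (meaningful at leaves)
    children: list = field(default_factory=list)


def leaf_rewards(node):
    """Collect the outcome rewards of all leaves below a node."""
    if not node.children:
        return [node.reward]
    out = []
    for child in node.children:
        out.extend(leaf_rewards(child))
    return out


def intra_tree_advantages(root):
    """Advantage of each branch relative to its siblings: the mean leaf
    reward of the branch, centred on the sibling-group mean. Because
    siblings share the prefix up to the branch point, one outcome reward
    per leaf produces a step-wise preference signal at every fork."""
    adv = {}

    def walk(node):
        if len(node.children) > 1:
            branch_means = [mean(leaf_rewards(c)) for c in node.children]
            group_mean = mean(branch_means)
            for child, m in zip(node.children, branch_means):
                adv[id(child)] = m - group_mean
        for child in node.children:
            walk(child)

    walk(root)
    return adv


def inter_tree_advantages(roots):
    """Normalise whole-tree returns across a group of trees (GRPO-style)."""
    returns = [mean(leaf_rewards(r)) for r in roots]
    mu, sigma = mean(returns), pstdev(returns) or 1.0
    return [(r - mu) / sigma for r in returns]
```

For a fork whose two branches end with leaf rewards 1.0 and 0.0, the intra-tree advantages are +0.5 and -0.5; the same rewards treated as two separate trees normalize to +1.0 and -1.0 inter-tree.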
Related papers
- TreePS-RAG: Tree-based Process Supervision for Reinforcement Learning in Agentic RAG [71.06073770344732]
Agentic retrieval-augmented generation (RAG) formulates question answering as a multi-step interaction between reasoning and information retrieval. We present TreePS-RAG, an online, tree-based RL framework for agentic RAG that enables step-wise credit assignment while retaining outcome-only rewards.
arXiv Detail & Related papers (2026-01-11T14:07:30Z) - TreeAdv: Tree-Structured Advantage Redistribution for Group-Based RL [7.149629501486536]
Reinforcement learning with group-based objectives is a common framework for aligning large language models on complex reasoning tasks. Standard GRPO treats each rollout trajectory as an independent flat sequence and assigns a single sequence-level advantage to all tokens. We introduce TreeAdv, which makes the tree structure of group rollouts explicit for both exploration and advantage assignment.
arXiv Detail & Related papers (2026-01-07T08:42:14Z) - TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling [65.46347858249295]
TreePO is a self-guided rollout algorithm that views sequence generation as a tree-structured searching process. TreePO essentially reduces the per-update compute burden while preserving or enhancing exploration diversity.
arXiv Detail & Related papers (2025-08-24T16:52:37Z) - TreeRL: LLM Reinforcement Learning with On-Policy Tree Search [36.08914596340525]
Reinforcement learning (RL) with tree search has demonstrated superior performance in traditional reasoning tasks. We propose TreeRL, a reinforcement learning framework that directly incorporates on-policy tree search for RL training.
arXiv Detail & Related papers (2025-06-13T15:52:37Z) - TreeLoRA: Efficient Continual Learning via Layer-Wise LoRAs Guided by a Hierarchical Gradient-Similarity Tree [52.44403214958304]
In this paper, we introduce TreeLoRA, a novel approach that constructs layer-wise adapters by leveraging hierarchical gradient similarity. To reduce the computational burden of task similarity estimation, we employ bandit techniques to develop an algorithm based on lower confidence bounds. Experiments on both vision transformers (ViTs) and large language models (LLMs) demonstrate the effectiveness and efficiency of our approach.
arXiv Detail & Related papers (2025-06-12T05:25:35Z) - TreeRPO: Tree Relative Policy Optimization [55.97385410074841]
TreeRPO is a novel method that estimates the mathematical expectations of rewards at various reasoning steps using tree sampling. Building on the group-relative reward training mechanism of GRPO, TreeRPO innovatively computes rewards based on step-level groups generated during tree sampling.
arXiv Detail & Related papers (2025-06-05T15:56:38Z) - RL-LLM-DT: An Automatic Decision Tree Generation Method Based on RL Evaluation and LLM Enhancement [82.02155942106877]
We propose RL-LLM-DT, an automatic decision tree generation method based on RL Evaluation and LLM Enhancement. To evaluate the effectiveness of this integrated approach, we conducted experiments in a curling game.
arXiv Detail & Related papers (2024-12-16T03:33:49Z) - LiteSearch: Efficacious Tree Search for LLM [70.29796112457662]
This study introduces a novel guided tree search algorithm with dynamic node selection and node-level exploration budget.
Experiments conducted on the GSM8K and TabMWP datasets demonstrate that our approach enjoys significantly lower computational costs compared to baseline methods.
arXiv Detail & Related papers (2024-06-29T05:14:04Z) - Conditionally Optimistic Exploration for Cooperative Deep Multi-Agent Reinforcement Learning [24.05715475457959]
Efficient exploration is critical in cooperative deep Multi-Agent Reinforcement Learning (MARL).
In this work, we propose an exploration method that effectively encourages cooperative exploration based on the idea of sequential action-computation.
arXiv Detail & Related papers (2023-03-16T02:05:16Z) - RLET: A Reinforcement Learning Based Approach for Explainable QA with Entailment Trees [47.745218107037786]
We propose RLET, a Reinforcement Learning based Entailment Tree generation framework.
RLET iteratively performs single step reasoning with sentence selection and deduction generation modules.
Experiments on three settings of the EntailmentBank dataset demonstrate the strength of using the RL framework.
arXiv Detail & Related papers (2022-10-31T06:45:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.