TreeAdv: Tree-Structured Advantage Redistribution for Group-Based RL
- URL: http://arxiv.org/abs/2601.03703v1
- Date: Wed, 07 Jan 2026 08:42:14 GMT
- Title: TreeAdv: Tree-Structured Advantage Redistribution for Group-Based RL
- Authors: Lang Cao, Hui Ruan, Yongqian Li, Peng Chao, Wu Ning, Haonan Song, Renhong Chen, Yitong Li,
- Abstract summary: Reinforcement learning with group-based objectives is a common framework for aligning large language models on complex reasoning tasks. Standard GRPO treats each rollout trajectory as an independent flat sequence and assigns a single sequence-level advantage to all tokens. We introduce TreeAdv, which makes the tree structure of group rollouts explicit for both exploration and advantage assignment.
- Score: 7.149629501486536
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement learning with group-based objectives, such as Group Relative Policy Optimization (GRPO), is a common framework for aligning large language models on complex reasoning tasks. However, standard GRPO treats each rollout trajectory as an independent flat sequence and assigns a single sequence-level advantage to all tokens, which leads to sample inefficiency and a length bias toward verbose, redundant chains of thought without improving logical depth. We introduce TreeAdv (Tree-Structured Advantage Redistribution for Group-Based RL), which makes the tree structure of group rollouts explicit for both exploration and advantage assignment. Specifically, TreeAdv builds a group of trees (a forest) with an entropy-driven sampling method in which each tree branches at high-uncertainty decisions while sharing low-uncertainty tokens across rollouts. TreeAdv then aggregates token-level advantages for internal tree segments by redistributing the advantages of complete rollouts (all leaf nodes), and it can be readily applied to group-based objectives such as GRPO or GSPO. Across 10 math reasoning benchmarks, TreeAdv consistently outperforms GRPO and GSPO while using substantially fewer generated tokens under identical supervision, data, and decoding budgets.
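The key step described in the abstract is the redistribution: only complete rollouts (leaf nodes) receive a group-normalized advantage, and internal segments shared by several rollouts inherit an aggregate of their descendants' advantages. Below is a minimal Python sketch of that step, not the authors' implementation: it assumes GRPO-style normalization over all leaves in the forest and mean aggregation over each subtree, and the node layout, field names, and branching details are illustrative assumptions.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev


@dataclass
class Node:
    tokens: list                    # token ids shared by every rollout passing through this segment
    children: list = field(default_factory=list)
    reward: float | None = None     # set only on leaves (complete rollouts)
    advantage: float = 0.0          # filled in by redistribute()


def leaves(node):
    """Return all complete rollouts (leaf nodes) in the subtree rooted at `node`."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


def redistribute(forest):
    """Assign GRPO-style advantages to leaves, then propagate them to internal
    segments by averaging over the leaves reachable from each segment."""
    all_leaves = [leaf for root in forest for leaf in leaves(root)]
    rewards = [leaf.reward for leaf in all_leaves]
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # guard against a zero-variance group

    for leaf in all_leaves:
        leaf.advantage = (leaf.reward - mu) / sigma

    def propagate(node):
        if node.children:
            for child in node.children:
                propagate(child)
            node.advantage = mean(leaf.advantage for leaf in leaves(node))

    for root in forest:
        propagate(root)


# Tiny example: one tree that branched once at a high-entropy decision.
root = Node(tokens=[1, 2, 3])
root.children = [Node(tokens=[4, 5], reward=1.0),   # rollout judged correct
                 Node(tokens=[6, 7], reward=0.0)]   # rollout judged incorrect
redistribute([root])
print(root.advantage, [c.advantage for c in root.children])  # 0.0 [1.0, -1.0]
```

In this toy example the shared prefix receives the average of its two leaves' advantages (here zero), while each continuation keeps its own signed advantage, so gradient credit concentrates on the tokens generated after the high-uncertainty branch point.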
Related papers
- TreePS-RAG: Tree-based Process Supervision for Reinforcement Learning in Agentic RAG [71.06073770344732]
Agentic retrieval-augmented generation (RAG) formulates question answering as a multi-step interaction between reasoning and information retrieval. We present TreePS-RAG, an online, tree-based RL framework for agentic RAG that enables step-wise credit assignment while retaining outcome-only rewards.
arXiv Detail & Related papers (2026-01-11T14:07:30Z) - Tree Search for LLM Agent Reinforcement Learning [23.7084695563981]
Tree-based Group Relative Policy Optimization (Tree-GRPO) is a grouped agent RL method based on tree search. By sharing common prefixes, the tree search sampling increases the number of rollouts achievable. We demonstrate that the objective of intra-tree level group relative policy optimization is equivalent to that of step-level direct preference learning.
arXiv Detail & Related papers (2025-09-25T14:37:09Z) - TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling [65.46347858249295]
TreePO is a self-guided rollout algorithm that views sequence generation as a tree-structured search process. TreePO essentially reduces the per-update compute burden while preserving or enhancing exploration diversity.
arXiv Detail & Related papers (2025-08-24T16:52:37Z) - GPT-HTree: A Decision Tree Framework Integrating Hierarchical Clustering and Large Language Models for Explainable Classification [0.0]
GPT-HTree is a framework combining hierarchical clustering, decision trees, and large language models (LLMs). LLMs enhance the framework by generating human-readable cluster descriptions, bridging quantitative analysis with actionable insights.
arXiv Detail & Related papers (2025-01-23T15:18:22Z) - Learning a Decision Tree Algorithm with Transformers [75.96920867382859]
We introduce MetaTree, a transformer-based model trained via meta-learning to directly produce strong decision trees.
We fit both greedy decision trees and globally optimized decision trees on a large number of datasets, and train MetaTree to produce only the trees that achieve strong generalization performance.
arXiv Detail & Related papers (2024-02-06T07:40:53Z) - Hierarchical clustering with dot products recovers hidden tree structure [53.68551192799585]
In this paper we offer a new perspective on the well established agglomerative clustering algorithm, focusing on recovery of hierarchical structure.
We recommend a simple variant of the standard algorithm, in which clusters are merged by maximum average dot product and not, for example, by minimum distance or within-cluster variance.
We demonstrate that the tree output by this algorithm provides a bona fide estimate of generative hierarchical structure in data, under a generic probabilistic graphical model.
arXiv Detail & Related papers (2023-05-24T11:05:12Z) - RLET: A Reinforcement Learning Based Approach for Explainable QA with Entailment Trees [47.745218107037786]
We propose RLET, a Reinforcement Learning based Entailment Tree generation framework.
RLET iteratively performs single step reasoning with sentence selection and deduction generation modules.
Experiments on three settings of the EntailmentBank dataset demonstrate the strength of using RL framework.
arXiv Detail & Related papers (2022-10-31T06:45:05Z) - Robustifying Algorithms of Learning Latent Trees with Vector Variables [92.18777020401484]
We present the sample complexities of Recursive Grouping (RG) and Chow-Liu Recursive Grouping (CLRG).
We robustify RG, CLRG, Neighbor Joining (NJ) and Spectral NJ (SNJ) by using the truncated inner product.
We derive the first known instance-dependent impossibility result for structure learning of latent trees.
arXiv Detail & Related papers (2021-06-02T01:37:52Z)