AREAL-DTA: Dynamic Tree Attention for Efficient Reinforcement Learning of Large Language Models
- URL: http://arxiv.org/abs/2602.00482v1
- Date: Sat, 31 Jan 2026 03:05:34 GMT
- Title: AREAL-DTA: Dynamic Tree Attention for Efficient Reinforcement Learning of Large Language Models
- Authors: Jiarui Zhang, Yuchen Yang, Ran Yan, Zhiyu Mei, Liyuan Zhang, Daifeng Li, Wei Fu, Jiaxuan Gao, Shusheng Xu, Yi Wu, Binhang Yuan,
- Abstract summary: We introduce AREAL-DTA to efficiently exploit prefix sharing inReinforcement learning (RL) training.<n>AREAL-DTA employs a depth-first-search (DFS)-based execution strategy that dynamically traverses the rollout tree during both forward and backward.<n>AREAL-DTA achieves up to $831times$ in $2$-bench higher training throughput.
- Score: 30.413941705590933
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement learning (RL) based post-training for large language models (LLMs) is computationally expensive, as it generates many rollout sequences that could frequently share long token prefixes. Existing RL frameworks usually process these sequences independently, repeatedly recomputing identical prefixes during forward and backward passes during policy model training, leading to substantial inefficiencies in computation and memory usage. Although prefix sharing naturally induces a tree structure over rollouts, prior tree-attention-based solutions rely on fully materialized attention masks and scale poorly in RL settings. In this paper, we introduce AREAL-DTA to efficiently exploit prefix sharing in RL training. AREAL-DTA employs a depth-first-search (DFS)-based execution strategy that dynamically traverses the rollout prefix tree during both forward and backward computation, materializing only a single root-to-leaf path at a time. To further improve scalability, AREAL-DTA incorporates a load-balanced distributed batching mechanism that dynamically constructs and processes prefix trees across multiple GPUs. Across the popular RL post-training workload, AREAL-DTA achieves up to $8.31\times$ in $τ^2$-bench higher training throughput.
Related papers
- RollArt: Scaling Agentic RL Training via Disaggregated Infrastructure [49.88201789074532]
Agentic Reinforcement Learning (RL) enables Large Language Models (LLMs) to perform autonomous decision-making and long-term planning.<n>We present RollArc, a distributed system designed to maximize throughput for multi-task agentic RL on disaggregated infrastructure.
arXiv Detail & Related papers (2025-12-27T11:14:23Z) - TreeGRPO: Tree-Advantage GRPO for Online RL Post-Training of Diffusion Models [14.130608036489336]
Reinforcement learning (RL) post-training is crucial for aligning generative models with human preferences, but its prohibitive computational cost remains a major barrier to widespread adoption.<n>We introduce textbfTreeGRPO, a novel RL framework that dramatically improves training efficiency by recasting the denoising process as a search tree.
arXiv Detail & Related papers (2025-12-09T01:17:34Z) - Discover, Learn, and Reinforce: Scaling Vision-Language-Action Pretraining with Diverse RL-Generated Trajectories [33.872433985210876]
Scaling vision-language-action (VLA) model pre-training requires large volumes of diverse, high-quality manipulation trajectories.<n>We propose Discover, Lea rn and Reinforce, which generates multiple distinct, high-success behavioral patterns for VLA pretraining.<n>When adapted to unseen downstream task suites, VLA models pretrained on our diverse RL data surpass counterparts trained on equal-sized standard RL datasets.
arXiv Detail & Related papers (2025-11-24T07:54:49Z) - Tree Training: Accelerating Agentic LLMs Training via Shared Prefix Reuse [21.642997639835396]
We propose Tree Training, a paradigm that computes each shared prefix only once and reuses its intermediate results across related branches during both forward and backward passes.<n>Experiments on multiple open-source models demonstrate up to 3.9x reduction in total training time, enabling more efficient agentic LLM SFT and RL training.
arXiv Detail & Related papers (2025-11-01T05:56:49Z) - RLBoost: Harvesting Preemptible Resources for Cost-Efficient Reinforcement Learning on LLMs [48.94639777633359]
We present RLBoost, a systematic solution for cost-efficient RL training that harvests preemptible GPU resources.<n> RLBoost increases training throughput by 1.51x-1.97x while improving cost efficiency by 28%-49% compared to using only on-demand GPU resources.
arXiv Detail & Related papers (2025-10-22T04:19:37Z) - Learning to Reason as Action Abstractions with Scalable Mid-Training RL [55.24192942739207]
An effective mid-training phase should identify a compact set of useful actions and enable fast selection.<n>We propose Reasoning as Action Abstractions (RA3), a scalable mid-training algorithm.
arXiv Detail & Related papers (2025-09-30T05:34:20Z) - ActiveVLN: Towards Active Exploration via Multi-Turn RL in Vision-and-Language Navigation [57.399685080574756]
Existing MLLM-based VLN methods rely on imitation learning (IL) and often use DAgger for post-training.<n>We propose ActiveVLN, a VLN framework that explicitly enables active exploration through multi-turn RL.<n>Experiments show that ActiveVLN achieves the largest performance gains over IL baselines compared to both DAgger-based and prior RL-based post-training methods.
arXiv Detail & Related papers (2025-09-16T03:31:46Z) - AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning [23.24949857136035]
Reinforcement learning (RL) has become a dominant paradigm for training large language models (LLMs)<n>We present AReaL, a fully asynchronous RL system that completely decouples generation from training.
arXiv Detail & Related papers (2025-05-30T07:18:25Z) - StreamRL: Scalable, Heterogeneous, and Elastic RL for LLMs with Disaggregated Stream Generation [55.75008325187133]
Reinforcement learning (RL) has become the core post-training technique for large language models (LLMs)<n>StreamRL is designed with disaggregation from first principles to address two types of performance bottlenecks.<n> Experiments show that StreamRL improves throughput by up to 2.66x compared to existing state-of-the-art systems.
arXiv Detail & Related papers (2025-04-22T14:19:06Z) - Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training [71.16258800411696]
Reinforcement learning (RL) is a critical component of large language model (LLM) post-training.<n>Existing on-policy algorithms used for post-training are inherently incompatible with the use of experience replay buffers.<n>We propose efficiently obtaining this benefit of replay buffers via Trajectory Balance with Asynchrony (TBA)
arXiv Detail & Related papers (2025-03-24T17:51:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.