Related papers: A-3PO: Accelerating Asynchronous LLM Training with Staleness-aware Proximal Policy Approximation

A-3PO: Accelerating Asynchronous LLM Training with Staleness-aware Proximal Policy Approximation

URL: http://arxiv.org/abs/2512.06547v1
Date: Sat, 06 Dec 2025 19:37:39 GMT
Title: A-3PO: Accelerating Asynchronous LLM Training with Staleness-aware Proximal Policy Approximation
Authors: Xiaocan Li, Shiliang Wu, Zheng Shen,
Abstract summary: We introduce a proximal policy to decouple the off-policy corrections from the controlling policy updates.<n>The policy requires an extra forward pass through the network at each training step, creating a computational bottleneck.<n>We observe that since the proximal policy only serves as a trust region anchor between the behavior and target policies, we can approximate it through simple without explicit computation.
Score: 2.5291809836356998
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Decoupled loss has been a successful reinforcement learning (RL) algorithm to deal with the high data staleness under the asynchronous RL setting. Decoupled loss improves coupled-loss style of algorithms' (e.g., PPO, GRPO) learning stability by introducing a proximal policy to decouple the off-policy corrections (importance weight) from the controlling policy updates (trust region). However, the proximal policy requires an extra forward pass through the network at each training step, creating a computational bottleneck for large language models. We observe that since the proximal policy only serves as a trust region anchor between the behavior and target policies, we can approximate it through simple interpolation without explicit computation. We call this approach A-3PO (APproximated Proximal Policy Optimization). A-3PO eliminates this overhead, reducing training time by 18% while maintaining comparable performance. Code & off-the-shelf example are available at: https://github.com/inclusionAI/AReaL/blob/main/docs/algorithms/prox_approx.md

Related papers

Align and Filter: Improving Performance in Asynchronous On-Policy RL [27.989398323927393]
We identify the sources of policy lag caused by distributed learning and high update frequency.<n>We propose textittotal Variation-based Advantage aligned Constrained policy Optimization as a practical approach to mitigate policy lag.
arXiv Detail & Related papers (2026-03-02T01:52:34Z)
LLMs Can Learn to Reason Via Off-Policy RL [17.2941334301927]
Reinforcement learning approaches for Large Language Models (LLMs) frequently use on-policy algorithms, such as PPO or GRPO.<n>We propose a novel off-policy RL algorithm that does not require these modifications: Optimal Advantage-based Policy Optimization with Lagged Inference policy (OAPL)<n>OAPL allows for efficient, effective post-training even with lags of more than 400 gradient steps between the training and inference policies, 100x more off-policy than prior approaches.
arXiv Detail & Related papers (2026-02-22T22:12:51Z)
Rethinking the Trust Region in LLM Reinforcement Learning [72.25890308541334]
Proximal Policy Optimization (PPO) serves as the de facto standard algorithm for Large Language Models (LLMs)<n>We propose Divergence Proximal Policy Optimization (DPPO), which substitutes clipping with a more principled constraint.<n>DPPO achieves superior training and efficiency compared to existing methods, offering a more robust foundation for RL-based fine-tuning.
arXiv Detail & Related papers (2026-02-04T18:59:04Z)
Coverage Improvement and Fast Convergence of On-policy Preference Learning [67.36750525893514]
Online on-policy preference learning algorithms for language model alignment can significantly outperform their offline counterparts.<n>We analyze how the sampling policy's coverage evolves throughout on-policy training.<n>We develop principled on-policy schemes for reward distillation in the general function class setting.
arXiv Detail & Related papers (2026-01-13T10:46:06Z)
Policy Agnostic RL: Offline RL and Online RL Fine-Tuning of Any Class and Backbone [72.17534881026995]
We develop an offline and online fine-tuning approach called policy-agnostic RL (PA-RL)<n>We show the first result that successfully fine-tunes OpenVLA, a 7B generalist robot policy, autonomously with Cal-QL, an online RL fine-tuning algorithm.
arXiv Detail & Related papers (2024-12-09T17:28:03Z)
Projected Off-Policy Q-Learning (POP-QL) for Stabilizing Offline Reinforcement Learning [57.83919813698673]
Projected Off-Policy Q-Learning (POP-QL) is a novel actor-critic algorithm that simultaneously reweights off-policy samples and constrains the policy to prevent divergence and reduce value-approximation error. In our experiments, POP-QL not only shows competitive performance on standard benchmarks, but also out-performs competing methods in tasks where the data-collection policy is significantly sub-optimal.
arXiv Detail & Related papers (2023-11-25T00:30:58Z)
Boosting Offline Reinforcement Learning via Data Rebalancing [104.3767045977716]
offline reinforcement learning (RL) is challenged by the distributional shift between learning policies and datasets. We propose a simple yet effective method to boost offline RL algorithms based on the observation that resampling a dataset keeps the distribution support unchanged. We dub our method ReD (Return-based Data Rebalance), which can be implemented with less than 10 lines of code change and adds negligible running time.
arXiv Detail & Related papers (2022-10-17T16:34:01Z)
EnTRPO: Trust Region Policy Optimization Method with Entropy Regularization [1.599072005190786]
Trust Region Policy Optimization (TRPO) is a popular and empirically successful policy search algorithm in reinforcement learning. In this work, we use a replay buffer to borrow from the off-policy learning setting to TRPO. We add an Entropy regularization term to advantage over pi, accumulated over time steps, in TRPO.
arXiv Detail & Related papers (2021-10-26T03:04:00Z)
Neural Network Compatible Off-Policy Natural Actor-Critic Algorithm [16.115903198836694]
Learning optimal behavior from existing data is one of the most important problems in Reinforcement Learning (RL) This is known as "off-policy control" in RL where an agent's objective is to compute an optimal policy based on the data obtained from the given policy (known as the behavior policy) This work proposes an off-policy natural actor-critic algorithm that utilizes state-action distribution correction for handling the off-policy behavior and the natural policy gradient for sample efficiency.
arXiv Detail & Related papers (2021-10-19T14:36:45Z)
Online Sub-Sampling for Reinforcement Learning with General Function Approximation [111.01990889581243]
In this paper, we establish an efficient online sub-sampling framework that measures the information gain of data points collected by an RL algorithm. For a value-based method with complexity-bounded function class, we show that the policy only needs to be updated for $proptooperatornamepolylog(K)$ times. In contrast to existing approaches that update the policy for at least $Omega(K)$ times, our approach drastically reduces the number of optimization calls in solving for a policy.
arXiv Detail & Related papers (2021-06-14T07:36:25Z)
Breaking the Deadly Triad with a Target Network [80.82586530205776]
The deadly triad refers to the instability of a reinforcement learning algorithm when it employs off-policy learning, function approximation, and bootstrapping simultaneously. We provide the first convergent linear $Q$-learning algorithms under nonrestrictive and changing behavior policies without bi-level optimization.
arXiv Detail & Related papers (2021-01-21T21:50:10Z)
POPO: Pessimistic Offline Policy Optimization [6.122342691982727]
We study why off-policy RL methods fail to learn in offline setting from the value function view. We propose Pessimistic Offline Policy Optimization (POPO), which learns a pessimistic value function to get a strong policy. We find that POPO performs surprisingly well and scales to tasks with high-dimensional state and action space.
arXiv Detail & Related papers (2020-12-26T06:24:34Z)
Queueing Network Controls via Deep Reinforcement Learning [0.0]
We develop a Proximal policy optimization algorithm for queueing networks. The algorithm consistently generates control policies that outperform state-of-arts in literature. A key to the successes of our PPO algorithm is the use of three variance reduction techniques in estimating the relative value function.
arXiv Detail & Related papers (2020-07-31T01:02:57Z)
Mirror Descent Policy Optimization [41.46894905097985]
We propose an efficient RL algorithm, called em mirror descent policy optimization (MDPO) MDPO iteratively updates the policy by em approximately solving a trust-region problem. We highlight the connections between on-policy MDPO and two popular trust-region RL algorithms: TRPO and PPO, and show that explicitly enforcing the trust-region constraint is in fact em not a necessity for high performance gains in TRPO.
arXiv Detail & Related papers (2020-05-20T01:30:43Z)

This list is automatically generated from the titles and abstracts of the papers in this site.