Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training
- URL: http://arxiv.org/abs/2503.18929v1
- Date: Mon, 24 Mar 2025 17:51:39 GMT
- Title: Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training
- Authors: Brian R. Bartoldson, Siddarth Venkatraman, James Diffenderfer, Moksh Jain, Tal Ben-Nun, Seanie Lee, Minsu Kim, Johan Obando-Ceron, Yoshua Bengio, Bhavya Kailkhura
- Abstract summary: Reinforcement learning (RL) is a critical component of large language model (LLM) post-training. Existing on-policy algorithms used for post-training are inherently incompatible with the use of experience replay buffers. We propose efficiently obtaining this benefit of replay buffers via Trajectory Balance with Asynchrony (TBA).
- Score: 71.16258800411696
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement learning (RL) is a critical component of large language model (LLM) post-training. However, existing on-policy algorithms used for post-training are inherently incompatible with the use of experience replay buffers, which can be populated scalably by distributed off-policy actors to enhance exploration as compute increases. We propose efficiently obtaining this benefit of replay buffers via Trajectory Balance with Asynchrony (TBA), a massively scalable LLM RL system. In contrast to existing approaches, TBA uses a larger fraction of compute on search, constantly generating off-policy data for a central replay buffer. A training node simultaneously samples data from this buffer based on reward or recency to update the policy using Trajectory Balance (TB), a diversity-seeking RL objective introduced for GFlowNets. TBA offers three key advantages: (1) decoupled training and search, speeding up training wall-clock time by 4x or more; (2) improved diversity through large-scale off-policy sampling; and (3) scalable search for sparse reward settings. On mathematical reasoning, preference-tuning, and automated red-teaming (diverse and representative post-training tasks), TBA produces speed and performance improvements over strong baselines.
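To make the system description above concrete, the following is a minimal, hypothetical sketch of a TBA-style loop: searcher processes push off-policy trajectories into a shared replay buffer, and a training node samples them by reward or recency and minimizes a Trajectory Balance objective. The class and function names, buffer policy, and hyperparameters are illustrative assumptions, not the authors' implementation; the TB residual is written assuming, as is common for autoregressive samplers, a deterministic backward policy and a tempered reward.

```python
# Hypothetical sketch of a TBA-style asynchronous actor/learner setup.
# TB residual (backward policy omitted for autoregressive generation):
#   L_TB = (log Z_phi + log pi_theta(y | x) - log R(x, y) / beta)^2
import random
import torch


class ReplayBuffer:
    """Central buffer filled by off-policy searcher nodes."""

    def __init__(self, capacity=50_000):
        self.items = []          # each item: (log_reward, prompt, response)
        self.capacity = capacity

    def add(self, item):
        self.items.append(item)
        self.items = self.items[-self.capacity:]

    def sample(self, k, mode="reward"):
        if mode == "reward":     # bias toward high-reward trajectories
            pool = sorted(self.items, key=lambda t: t[0], reverse=True)[: 4 * k]
        else:                    # "recency": bias toward freshly generated data
            pool = self.items[-4 * k:]
        return random.sample(pool, min(k, len(pool)))


def tb_loss(log_z, log_pi, log_reward, beta=1.0):
    # Squared Trajectory Balance residual over a batch of sampled trajectories.
    return (log_z + log_pi - log_reward / beta).pow(2).mean()
```

In the full system described above, many searcher nodes would call `buffer.add` concurrently while a single training node repeatedly draws a batch and takes a gradient step on `tb_loss`; decoupling generation from learning in this way is what the abstract credits for the 4x or greater wall-clock speedup.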
Related papers
- RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning [125.65034908728828]
Training large language models (LLMs) as interactive agents presents unique challenges.
While reinforcement learning has enabled progress in static tasks, multi-turn agent RL training remains underexplored.
We propose StarPO, a general framework for trajectory-level agent RL, and introduce RAGEN, a modular system for training and evaluating LLM agents.
arXiv Detail & Related papers (2025-04-24T17:57:08Z) - StreamRL: Scalable, Heterogeneous, and Elastic RL for LLMs with Disaggregated Stream Generation [55.75008325187133]
Reinforcement learning (RL) has become the core post-training technique for large language models (LLMs).
StreamRL is designed with disaggregation from first principles to address two types of performance bottlenecks.
Experiments show that StreamRL improves throughput by up to 2.66x compared to existing state-of-the-art systems.
arXiv Detail & Related papers (2025-04-22T14:19:06Z) - Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models [11.624678008637623]
We propose separating generation and learning in RLHF. Online DPO is found to be most robust to off-policy data. Asynchronous training relies on an underexplored regime: online but off-policy RLHF (a minimal DPO sketch follows this list).
arXiv Detail & Related papers (2024-10-23T19:59:50Z) - Simplifying Deep Temporal Difference Learning [3.458933902627673]
We investigate whether it is possible to accelerate and simplify off-policy TD training while maintaining its stability. Our key theoretical result demonstrates for the first time that regularisation techniques such as LayerNorm can yield provably convergent TD algorithms. We propose PQN, our simplified deep online Q-Learning algorithm.
arXiv Detail & Related papers (2024-07-05T18:49:07Z) - Augmenting Replay in World Models for Continual Reinforcement Learning [0.0]
Continual RL requires an agent to learn new tasks without forgetting previous ones, while improving on both past and future tasks.
The most common approaches use model-free algorithms and replay buffers to mitigate catastrophic forgetting.
We introduce WMAR (World Models with Augmented Replay), a model-based RL algorithm with a memory-efficient replay buffer.
arXiv Detail & Related papers (2024-01-30T00:48:26Z) - Language Reward Modulation for Pretraining Reinforcement Learning [61.76572261146311]
We propose leveraging the capabilities of LRFs as a pretraining signal for reinforcement learning.
Our VLM pretraining approach, which is a departure from previous attempts to use LRFs, can warmstart sample-efficient learning on robot manipulation tasks.
arXiv Detail & Related papers (2023-08-23T17:37:51Z) - BatchGFN: Generative Flow Networks for Batch Active Learning [80.73649229919454]
BatchGFN is a novel approach for pool-based active learning that uses generative flow networks to sample sets of data points proportional to a batch reward.
We show our approach enables principled sampling of near-optimal-utility batches at inference time, with a single forward pass per point in the batch, on toy regression problems.
arXiv Detail & Related papers (2023-06-26T20:41:36Z) - Rethinking Population-assisted Off-policy Reinforcement Learning [7.837628433605179]
Off-policy reinforcement learning algorithms struggle with convergence to local optima due to limited exploration.
Population-based algorithms offer a natural exploration strategy, but their black-box operators are inefficient.
Recent algorithms have integrated these two methods, connecting them through a shared replay buffer.
arXiv Detail & Related papers (2023-05-04T15:53:00Z) - Jump-Start Reinforcement Learning [68.82380421479675]
We present a meta algorithm that can use offline data, demonstrations, or a pre-existing policy to initialize an RL policy.
In particular, we propose Jump-Start Reinforcement Learning (JSRL), an algorithm that employs two policies to solve tasks.
We show via experiments that JSRL is able to significantly outperform existing imitation and reinforcement learning algorithms.
arXiv Detail & Related papers (2022-04-05T17:25:22Z) - Pretraining & Reinforcement Learning: Sharpening the Axe Before Cutting the Tree [2.0142516017086165]
Pretraining is a common technique in deep learning for increasing performance and reducing training time.
We evaluate the effectiveness of pretraining for RL tasks, with and without distracting backgrounds, using both large, publicly available datasets and case-by-case generated datasets.
Results suggest that filters learned during training on less relevant datasets render pretraining ineffective, while filters learned on in-distribution datasets reliably reduce RL training time and improve performance after 80k RL training steps.
arXiv Detail & Related papers (2021-10-06T04:25:14Z) - Parallel Actors and Learners: A Framework for Generating Scalable RL Implementations [14.432131909590824]
Reinforcement Learning (RL) has achieved significant success in application domains such as robotics, games, health care and others.
Current implementations exhibit poor performance due to challenges such as irregular memory accesses and synchronization overheads.
We propose a framework for generating scalable reinforcement learning implementations on multicore systems.
arXiv Detail & Related papers (2021-10-03T21:00:53Z) - Text Generation with Efficient (Soft) Q-Learning [91.47743595382758]
Reinforcement learning (RL) offers a more flexible solution by allowing users to plug in arbitrary task metrics as reward.
We introduce a new RL formulation for text generation from the soft Q-learning perspective.
We apply the approach to a wide range of tasks, including learning from noisy/negative examples, adversarial attacks, and prompt generation.
arXiv Detail & Related papers (2021-06-14T18:48:40Z)
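For the Asynchronous RLHF entry above, which reports online DPO as the most robust objective under off-policy data, a minimal sketch of the standard DPO loss is shown here; the tensor names and the `beta` value are illustrative assumptions, and this is not that paper's code.

```python
import torch
import torch.nn.functional as F


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO objective: prefer the chosen response over the rejected one,
    measured by policy log-probabilities relative to a frozen reference model."""
    chosen_margin = logp_chosen - ref_logp_chosen        # log pi(y_w|x) - log pi_ref(y_w|x)
    rejected_margin = logp_rejected - ref_logp_rejected  # log pi(y_l|x) - log pi_ref(y_l|x)
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```

In the asynchronous setting, the preference pairs are produced by a generation process that lags the current policy, which is exactly the online-but-off-policy regime that entry describes.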