ESRL: Efficient Sampling-based Reinforcement Learning for Sequence
Generation
- URL: http://arxiv.org/abs/2308.02223v1
- Date: Fri, 4 Aug 2023 09:35:45 GMT
- Title: ESRL: Efficient Sampling-based Reinforcement Learning for Sequence
Generation
- Authors: Chenglong Wang, Hang Zhou, Yimin Hu, Yifu Huo, Bei Li, Tongran Liu,
Tong Xiao, Jingbo Zhu
- Abstract summary: We introduce two-stage sampling and dynamic sampling approaches to improve sampling efficiency when training sequence generation models via RL.
Experimental results show that the efficient sampling-based RL, referred to as ESRL, can outperform all baselines in terms of both training efficiency and memory consumption.
- Score: 43.506732624371786
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Applying Reinforcement Learning (RL) to sequence generation models enables
the direct optimization of long-term rewards (e.g., BLEU and human
feedback), but typically requires large-scale sampling over a space of action
sequences. This poses a computational challenge in practical sequence
generation problems, such as machine translation, where we often deal
with a large action space (e.g., a vocabulary) and a long action
sequence (e.g., a translation). In this work, we introduce two-stage
sampling and dynamic sampling approaches to improve the sampling efficiency
during training sequence generation models via RL. We experiment with our
approaches on the traditional sequence generation tasks, including machine
translation and abstractive summarization. Furthermore, we evaluate our
approaches in RL from human feedback (RLHF) through training a large language
model using the reward model. Experimental results show that the efficient
sampling-based RL, referred to as ESRL, can outperform all baselines in terms
of both training efficiency and memory consumption. Notably, ESRL yields
consistent performance gains over the strong REINFORCE, minimum risk training,
and proximal policy optimization methods.
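Code sketch (not from the paper): the abstract describes policy-gradient training whose cost is dominated by sampling candidate sequences, plus a dynamic sampling strategy that adapts how much is sampled. The Python/PyTorch snippet below is a minimal, hypothetical illustration of that loop, assuming a toy policy, a stand-in reward, and an invented entropy-based `dynamic_sample_size` heuristic; ESRL's actual two-stage and dynamic sampling rules are not given in the abstract and may differ.

```python
# Hedged sketch: REINFORCE-style training for sequence generation with a
# *hypothetical* dynamic sampling schedule (fewer sampled candidates once the
# policy becomes confident). Not ESRL's actual algorithm.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, HIDDEN, MAX_LEN, BOS = 32, 64, 10, 0

class TinyPolicy(nn.Module):
    """Minimal autoregressive policy over a toy vocabulary."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.rnn = nn.GRUCell(HIDDEN, HIDDEN)
        self.out = nn.Linear(HIDDEN, VOCAB)

    def step(self, tok, h):
        h = self.rnn(self.embed(tok), h)
        return self.out(h), h

def toy_reward(seqs):
    # Stand-in for a sequence-level reward such as BLEU or a reward model score.
    return (seqs % 2 == 0).float().mean(dim=-1)

def dynamic_sample_size(entropy, n_max=8, n_min=2):
    # Assumption: sample fewer candidate sequences when policy entropy is low.
    frac = (entropy / torch.log(torch.tensor(float(VOCAB)))).clamp(0.0, 1.0)
    return max(n_min, int(n_min + frac.item() * (n_max - n_min)))

policy = TinyPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

for _ in range(100):
    # Probe the first-step entropy to choose how many sequences to sample.
    logits0, _ = policy.step(torch.tensor([BOS]), torch.zeros(1, HIDDEN))
    probs0 = F.softmax(logits0, dim=-1)
    n = dynamic_sample_size(-(probs0 * probs0.log()).sum(dim=-1).mean())

    # Sample n candidate sequences and accumulate their log-probabilities.
    tok = torch.full((n,), BOS, dtype=torch.long)
    h = torch.zeros(n, HIDDEN)
    logps, toks = [], []
    for _ in range(MAX_LEN):
        logits, h = policy.step(tok, h)
        dist = torch.distributions.Categorical(logits=logits)
        tok = dist.sample()
        logps.append(dist.log_prob(tok))
        toks.append(tok)
    seqs = torch.stack(toks, dim=-1)            # (n, MAX_LEN)
    logp = torch.stack(logps, dim=-1).sum(-1)   # (n,)

    # REINFORCE update with the batch-mean reward as a simple baseline.
    reward = toy_reward(seqs)
    loss = -((reward - reward.mean()) * logp).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The only point of the sketch is the loop structure: sample candidate sequences, score them with a sequence-level reward, and apply a policy-gradient update, with the number of samples chosen adaptively rather than fixed.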
Related papers
- Leveraging Genetic Algorithms for Efficient Demonstration Generation in Real-World Reinforcement Learning Environments [0.8602553195689513]
Reinforcement Learning (RL) has demonstrated significant potential in certain real-world industrial applications.
This study investigates the utilization of Genetic Algorithms (GAs) as a mechanism for improving RL performance.
We propose a novel approach in which GA-generated expert demonstrations are used to enhance policy learning.
arXiv Detail & Related papers (2025-07-01T14:04:17Z) - Ring-lite: Scalable Reasoning via C3PO-Stabilized Reinforcement Learning for LLMs [51.21041884010009]
Ring-lite is a Mixture-of-Experts (MoE)-based large language model optimized via reinforcement learning (RL).
Our approach matches the performance of state-of-the-art (SOTA) small-scale reasoning models on challenging benchmarks.
arXiv Detail & Related papers (2025-06-17T17:12:34Z) - Scaling Offline RL via Efficient and Expressive Shortcut Models [13.050231036248338]
Offline reinforcement learning (RL) with expressive policy models remains challenging due to the iterative nature of their noise sampling processes.
We introduce Scalable Offline Reinforcement Learning (SORL), a new offline RL algorithm that leverages shortcut models to scale both training and inference.
We demonstrate that SORL achieves strong performance across a range of offline RL tasks and exhibits positive scaling behavior with increased test-time compute.
arXiv Detail & Related papers (2025-05-28T20:59:22Z) - Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining [74.83412846804977]
Reinforcement learning (RL)-based fine-tuning has become a crucial step in post-training language models.
We present a systematic end-to-end study of RL fine-tuning for mathematical reasoning by training models entirely from scratch.
arXiv Detail & Related papers (2025-04-10T17:15:53Z) - Exploring Training and Inference Scaling Laws in Generative Retrieval [50.82554729023865]
We investigate how model size, training data scale, and inference-time compute jointly influence generative retrieval performance.
Our experiments show that n-gram-based methods demonstrate strong alignment with both training and inference scaling laws.
We find that LLaMA models consistently outperform T5 models, suggesting a particular advantage for larger decoder-only models in generative retrieval.
arXiv Detail & Related papers (2025-03-24T17:59:03Z) - Provably Efficient RLHF Pipeline: A Unified View from Contextual Bandits [59.30310692855397]
We propose a unified framework for the RLHF pipeline from the view of contextual bandits.
We decompose the RLHF process into two distinct stages: (post-)training and deployment.
We then develop novel algorithms for each stage, demonstrating significant improvements in both statistical and computational efficiency.
arXiv Detail & Related papers (2025-02-11T02:36:01Z) - Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models [11.624678008637623]
We propose separating generation and learning in RLHF.
Asynchronous training relies on an underexplored regime, online but off-policy RLHF.
We study further compute optimizations for asynchronous RLHF but find that they come at a performance cost.
arXiv Detail & Related papers (2024-10-23T19:59:50Z) - Teaching Large Language Models to Reason with Reinforcement Learning [38.17625148525193]
Reinforcement Learning from Human Feedback (RLHF) has emerged as a dominant approach for aligning LLM outputs with human preferences.
Inspired by the success of RLHF, we study the performance of multiple algorithms that learn from feedback.
arXiv Detail & Related papers (2024-03-07T16:36:29Z) - Language Reward Modulation for Pretraining Reinforcement Learning [61.76572261146311]
We propose leveraging the capabilities of LRFs as a pretraining signal for reinforcement learning.
Our VLM pretraining approach, which is a departure from previous attempts to use LRFs, can warmstart sample-efficient learning on robot manipulation tasks.
arXiv Detail & Related papers (2023-08-23T17:37:51Z) - Reinforced Self-Training (ReST) for Language Modeling [56.75447441157628]
Reinforcement learning from human feedback (RLHF) can improve the quality of large language model (LLM) outputs by aligning them with human preferences.
We propose a simple algorithm for aligning LLMs with human preferences, inspired by growing batch reinforcement learning (RL), which we call Reinforced Self-Training (ReST).
Our results show that ReST can substantially improve translation quality, as measured by automated metrics and human evaluation on machine translation benchmarks in a compute and sample-efficient manner.
arXiv Detail & Related papers (2023-08-17T14:12:48Z) - KRLS: Improving End-to-End Response Generation in Task Oriented Dialog
with Reinforced Keywords Learning [25.421649004269373]
In task-oriented dialogs (TOD), reinforcement learning algorithms train a model to directly optimize responses for task-related metrics.
We investigate an approach to create a more efficient RL-based algorithm to improve TOD performance in an offline setting.
Experiments on the MultiWoZ dataset show our new training algorithm, Keywords Reinforcement Learning with Next-word Sampling (KRLS), achieves state-of-the-art performance.
arXiv Detail & Related papers (2022-11-30T06:27:46Z) - Teacher Forcing Recovers Reward Functions for Text Generation [21.186397113834506]
We propose a task-agnostic approach that derives a step-wise reward function directly from a model trained with teacher forcing.
We additionally propose a simple modification to stabilize the RL training on non-parallel datasets with our induced reward function.
arXiv Detail & Related papers (2022-10-17T02:48:58Z) - Simplifying Model-based RL: Learning Representations, Latent-space
Models, and Policies with One Objective [142.36200080384145]
We propose a single objective which jointly optimizes a latent-space model and policy to achieve high returns while remaining self-consistent.
We demonstrate that the resulting algorithm matches or improves the sample-efficiency of the best prior model-based and model-free RL methods.
arXiv Detail & Related papers (2022-09-18T03:51:58Z) - Improving Non-autoregressive Generation with Mixup Training [51.61038444990301]
We present a non-autoregressive generation model based on pre-trained transformer models.
We propose a simple and effective iterative training method called MIx Source and pseudo Target.
Our experiments on three generation benchmarks including question generation, summarization and paraphrase generation, show that the proposed framework achieves the new state-of-the-art results.
arXiv Detail & Related papers (2021-10-21T13:04:21Z) - Text Generation with Efficient (Soft) Q-Learning [91.47743595382758]
Reinforcement learning (RL) offers a more flexible solution by allowing users to plug in arbitrary task metrics as reward.
We introduce a new RL formulation for text generation from the soft Q-learning perspective.
We apply the approach to a wide range of tasks, including learning from noisy/negative examples, adversarial attacks, and prompt generation.
arXiv Detail & Related papers (2021-06-14T18:48:40Z)
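As a side note on the final entry above (Text Generation with Efficient (Soft) Q-Learning), the snippet below is a generic, hypothetical sketch of token-level soft Q-learning for text generation: the model's output logits are read as Q-values and regressed toward a one-step soft Bellman target, with a sequence-level reward paid only at the last token. It is a textbook-style illustration under those assumptions, not the paper's exact objective; all names (`QModel`, `sequence_reward`, the temperature `TAU`) are illustrative.

```python
# Hedged sketch: generic token-level soft Q-learning for text generation.
# Logits are treated as Q(s, a); targets use a one-step soft Bellman backup.
import torch
import torch.nn as nn

VOCAB, HIDDEN, MAX_LEN, TAU, GAMMA = 32, 64, 8, 1.0, 1.0

class QModel(nn.Module):
    """Autoregressive model whose output logits are read as Q(s, a)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.rnn = nn.GRU(HIDDEN, HIDDEN, batch_first=True)
        self.out = nn.Linear(HIDDEN, VOCAB)

    def forward(self, tokens):                  # tokens: (B, T)
        h, _ = self.rnn(self.embed(tokens))
        return self.out(h)                      # Q-values: (B, T, VOCAB)

def sequence_reward(tokens):
    # Stand-in for a sequence-level reward (e.g., a task metric) paid at the end.
    return (tokens % 2 == 0).float().mean(dim=-1)

model = QModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for _ in range(100):
    # Roll out sequences by sampling from the soft policy softmax(Q / TAU).
    with torch.no_grad():
        tokens = torch.zeros(16, 1, dtype=torch.long)            # BOS token = 0
        for _ in range(MAX_LEN):
            q = model(tokens)[:, -1, :]
            nxt = torch.distributions.Categorical(logits=q / TAU).sample()
            tokens = torch.cat([tokens, nxt.unsqueeze(-1)], dim=-1)

    q_all = model(tokens[:, :-1])                                # (B, T, VOCAB)
    q_taken = q_all.gather(-1, tokens[:, 1:].unsqueeze(-1)).squeeze(-1)

    with torch.no_grad():
        # Soft state value V(s) = TAU * logsumexp(Q(s, .) / TAU); bootstrap from
        # the online network (no target network in this sketch).
        v_next = TAU * torch.logsumexp(q_all[:, 1:, :] / TAU, dim=-1)
        target = torch.zeros_like(q_taken)
        target[:, -1] = sequence_reward(tokens[:, 1:])           # terminal reward only
        target[:, :-1] += GAMMA * v_next                         # bootstrap elsewhere

    loss = nn.functional.mse_loss(q_taken, target)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Using the logsumexp of the Q-values as the soft state value is what makes the induced sampling policy softmax(Q / TAU); a target network and intermediate rewards are omitted for brevity.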
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.