Text Generation with Efficient (Soft) Q-Learning
- URL: http://arxiv.org/abs/2106.07704v1
- Date: Mon, 14 Jun 2021 18:48:40 GMT
- Title: Text Generation with Efficient (Soft) Q-Learning
- Authors: Han Guo, Bowen Tan, Zhengzhong Liu, Eric P. Xing, Zhiting Hu
- Abstract summary: Reinforcement learning (RL) offers a more flexible solution by allowing users to plug in arbitrary task metrics as reward.
We introduce a new RL formulation for text generation from the soft Q-learning perspective.
We apply the approach to a wide range of tasks, including learning from noisy/negative examples, adversarial attacks, and prompt generation.
- Score: 91.47743595382758
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Maximum likelihood estimation (MLE) is the predominant algorithm for training
text generation models. This paradigm relies on direct supervision examples,
which is not applicable to many applications, such as generating adversarial
attacks or generating prompts to control language models. Reinforcement
learning (RL) on the other hand offers a more flexible solution by allowing
users to plug in arbitrary task metrics as reward. Yet previous RL algorithms
for text generation, such as policy gradient (on-policy RL) and Q-learning
(off-policy RL), are often notoriously inefficient or unstable to train due to
the large sequence space and the sparse reward received only at the end of
sequences. In this paper, we introduce a new RL formulation for text generation
from the soft Q-learning perspective. It further enables us to draw from the
latest RL advances, such as path consistency learning, to combine the best of
on-/off-policy updates, and learn effectively from sparse reward. We apply the
approach to a wide range of tasks, including learning from noisy/negative
examples, adversarial attacks, and prompt generation. Experiments show our
approach consistently outperforms both task-specialized algorithms and the
previous RL methods. On standard supervised tasks where MLE prevails, our
approach also achieves competitive performance and stability by training text
generation from scratch.
Related papers
- Q-SFT: Q-Learning for Language Models via Supervised Fine-Tuning [62.984693936073974]
Value-based reinforcement learning can learn effective policies for a wide range of multi-turn problems.
Current value-based RL methods have proven particularly challenging to scale to the setting of large language models.
We propose a novel offline RL algorithm that addresses these drawbacks, casting Q-learning as a modified supervised fine-tuning problem.
arXiv Detail & Related papers (2024-11-07T21:36:52Z) - Reinforcement Learning with Token-level Feedback for Controllable Text Generation [16.117006822479407]
We propose a novel reinforcement learning algorithm named TOLE which formulates TOken-LEvel rewards for controllable text generation.
Experimental results show that our algorithm can achieve superior performance on both single-attribute and multi-attribute control tasks.
arXiv Detail & Related papers (2024-03-18T08:18:37Z) - How Can LLM Guide RL? A Value-Based Approach [68.55316627400683]
Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback.
Recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities.
We develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning.
arXiv Detail & Related papers (2024-02-25T20:07:13Z) - ESRL: Efficient Sampling-based Reinforcement Learning for Sequence
Generation [43.506732624371786]
We introduce two-stage sampling and dynamic sampling approaches to improve the sampling efficiency during training sequence generation models via RL.
Experimental results show that the efficient sampling-based RL, referred to as ESRL, can outperform all baselines in terms of both training efficiency and memory consumption.
arXiv Detail & Related papers (2023-08-04T09:35:45Z) - KRLS: Improving End-to-End Response Generation in Task Oriented Dialog
with Reinforced Keywords Learning [25.421649004269373]
In task-oriented dialogs (TOD), reinforcement learning algorithms train a model to directly optimize response for task-related metrics.
We investigate an approach to create a more efficient RL-based algorithm to improve TOD performance in an offline setting.
Experiments on the MultiWoZ dataset show our new training algorithm, Keywords Reinforcement Learning with Next-word Sampling (KRLS), achieves state-of-the-art performance.
arXiv Detail & Related papers (2022-11-30T06:27:46Z) - Is Reinforcement Learning (Not) for Natural Language Processing?:
Benchmarks, Baselines, and Building Blocks for Natural Language Policy
Optimization [73.74371798168642]
We introduce an open-source modular library, RL4LMs, for optimizing language generators with reinforcement learning.
Next, we present the GRUE benchmark, a set of 6 language generation tasks which are supervised not by target strings, but by reward functions.
Finally, we introduce an easy-to-use, performant RL algorithm, NLPO, that learns to effectively reduce the action space in language generation.
arXiv Detail & Related papers (2022-10-03T21:38:29Z) - Jump-Start Reinforcement Learning [68.82380421479675]
We present a meta algorithm that can use offline data, demonstrations, or a pre-existing policy to initialize an RL policy.
In particular, we propose Jump-Start Reinforcement Learning (JSRL), an algorithm that employs two policies to solve tasks.
We show via experiments that JSRL is able to significantly outperform existing imitation and reinforcement learning algorithms.
arXiv Detail & Related papers (2022-04-05T17:25:22Z) - SUNRISE: A Simple Unified Framework for Ensemble Learning in Deep
Reinforcement Learning [102.78958681141577]
We present SUNRISE, a simple unified ensemble method, which is compatible with various off-policy deep reinforcement learning algorithms.
SUNRISE integrates two key ingredients: (a) ensemble-based weighted Bellman backups, which re-weight target Q-values based on uncertainty estimates from a Q-ensemble, and (b) an inference method that selects actions using the highest upper-confidence bounds for efficient exploration.
arXiv Detail & Related papers (2020-07-09T17:08:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.