Teacher Forcing Recovers Reward Functions for Text Generation
- URL: http://arxiv.org/abs/2210.08708v1
- Date: Mon, 17 Oct 2022 02:48:58 GMT
- Title: Teacher Forcing Recovers Reward Functions for Text Generation
- Authors: Yongchang Hao, Yuxin Liu, Lili Mou
- Abstract summary: We propose a task-agnostic approach that derives a step-wise reward function directly from a model trained with teacher forcing.
We additionally propose a simple modification to stabilize the RL training on non-parallel datasets with our induced reward function.
- Score: 21.186397113834506
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement learning (RL) has been widely used in text generation to
alleviate the exposure bias issue or to utilize non-parallel datasets. The
reward function plays an important role in making RL training successful.
However, previous reward functions are typically task-specific and sparse,
restricting the use of RL. In our work, we propose a task-agnostic approach
that derives a step-wise reward function directly from a model trained with
teacher forcing. We additionally propose a simple modification to stabilize the
RL training on non-parallel datasets with our induced reward function.
Empirical results show that our method outperforms self-training and reward
regression methods on several text generation tasks, confirming the
effectiveness of our reward function.
Related papers
- Exploring RL-based LLM Training for Formal Language Tasks with Programmed Rewards [49.7719149179179]
This paper investigates the feasibility of using PPO for reinforcement learning (RL) from explicitly programmed reward signals.
We focus on tasks expressed through formal languages, such as programming, where explicit reward functions can be programmed to automatically assess the quality of generated outputs.
Our results show that pure RL-based training for the two formal language tasks is challenging, with success being limited even for the simple arithmetic task.
arXiv Detail & Related papers (2024-10-22T15:59:58Z)
- Efficient Preference-based Reinforcement Learning via Aligned Experience Estimation [37.36913210031282]
Preference-based reinforcement learning (PbRL) has shown impressive capabilities in training agents without reward engineering.
We propose SEER, an efficient PbRL method that integrates label smoothing and policy regularization techniques.
arXiv Detail & Related papers (2024-05-29T01:49:20Z)
- Leveraging Sub-Optimal Data for Human-in-the-Loop Reinforcement Learning [7.07264650720021]
Sub-optimal Data Pre-training, SDP, is an approach that leverages reward-free, sub-optimal data to improve HitL RL algorithms.
We show SDP can significantly improve or achieve competitive performance with state-of-the-art HitL RL algorithms.
arXiv Detail & Related papers (2024-04-30T18:58:33Z)
- Improving Large Language Models via Fine-grained Reinforcement Learning with Minimum Editing Constraint [104.53687944498155]
Reinforcement learning (RL) has been widely used in training large language models (LLMs).
We propose a new RL method named RLMEC that incorporates a generative model as the reward model.
Based on the generative reward model, we design a token-level RL objective for training and an imitation-based regularization for stabilizing the RL process.
arXiv Detail & Related papers (2024-01-11T17:58:41Z)
- ESRL: Efficient Sampling-based Reinforcement Learning for Sequence Generation [43.506732624371786]
We introduce two-stage sampling and dynamic sampling approaches to improve sampling efficiency when training sequence generation models via RL.
Experimental results show that the efficient sampling-based RL, referred to as ESRL, can outperform all baselines in terms of both training efficiency and memory consumption.
arXiv Detail & Related papers (2023-08-04T09:35:45Z)
- Basis for Intentions: Efficient Inverse Reinforcement Learning using Past Experience [89.30876995059168]
This paper addresses the problem of IRL -- inferring the reward function of an agent from observing its behavior.
arXiv Detail & Related papers (2022-08-09T17:29:49Z)
- Reward Uncertainty for Exploration in Preference-based Reinforcement Learning [88.34958680436552]
We present an exploration method specifically for preference-based reinforcement learning algorithms.
Our main idea is to design an intrinsic reward that measures novelty based on the learned reward.
Our experiments show that an exploration bonus from uncertainty in the learned reward improves both the feedback- and sample-efficiency of preference-based RL algorithms.
arXiv Detail & Related papers (2022-05-24T23:22:10Z)
- Text Generation with Efficient (Soft) Q-Learning [91.47743595382758]
Reinforcement learning (RL) offers a more flexible solution by allowing users to plug in arbitrary task metrics as reward.
We introduce a new RL formulation for text generation from the soft Q-learning perspective.
We apply the approach to a wide range of tasks, including learning from noisy/negative examples, adversarial attacks, and prompt generation.
arXiv Detail & Related papers (2021-06-14T18:48:40Z)
- Rewriting History with Inverse RL: Hindsight Inference for Policy Improvement [137.29281352505245]
We show that hindsight relabeling is inverse RL, an observation suggesting that inverse RL can be used in tandem with RL algorithms to efficiently solve many tasks.
Our experiments confirm that relabeling data using inverse RL accelerates learning in general multi-task settings.
arXiv Detail & Related papers (2020-02-25T18:36:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.