Off-Policy Self-Critical Training for Transformer in Visual Paragraph
Generation
- URL: http://arxiv.org/abs/2006.11714v1
- Date: Sun, 21 Jun 2020 05:10:17 GMT
- Title: Off-Policy Self-Critical Training for Transformer in Visual Paragraph
Generation
- Authors: Shiyang Yan, Yang Hua, Neil M. Robertson
- Abstract summary: The Transformer is currently the state-of-the-art seq-to-seq model in language generation.
We propose an off-policy RL algorithm in which a behaviour policy represented by GRUs performs the sampling.
The proposed algorithm achieves state-of-the-art performance on visual paragraph generation and improved results on image captioning.
- Score: 20.755764654229047
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, several approaches have been proposed to solve language
generation problems. The Transformer is currently the state-of-the-art
seq-to-seq model in language generation. Reinforcement Learning (RL) is useful
for addressing exposure bias and for optimising non-differentiable metrics in
seq-to-seq language learning. However, the Transformer is hard to combine with
RL because sampling from it requires costly computing resources. We tackle this
problem by proposing an off-policy RL algorithm in which a behaviour policy
represented by GRUs performs the sampling. We reduce the high variance of
importance sampling (IS) by applying the truncated relative importance sampling
(TRIS) technique and the Kullback-Leibler (KL)-control concept. TRIS is a
simple yet effective technique, and there is a theoretical proof that
KL-control helps to reduce the variance of IS. We formulate this off-policy RL
based on self-critical sequence training. Specifically, we use a
Transformer-based captioning model as the target policy and an image-guided
language auto-encoder as the behaviour policy to explore the environment. The
proposed algorithm achieves state-of-the-art performance on visual paragraph
generation and improved results on image captioning.
Related papers
- How Can LLM Guide RL? A Value-Based Approach [68.55316627400683]
Reinforcement learning (RL) has become the de facto standard for sequential decision-making problems, improving future acting policies with feedback.
Recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities.
We develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning.
arXiv Detail & Related papers (2024-02-25T20:07:13Z) - Prompt Optimization via Adversarial In-Context Learning [51.18075178593142]
adv-ICL is implemented as a two-player game between a generator and a discriminator.
The generator tries to generate realistic enough output to fool the discriminator.
We show that adv-ICL results in significant improvements over state-of-the-art prompt optimization techniques.
arXiv Detail & Related papers (2023-12-05T09:44:45Z) - Supervised Pretraining Can Learn In-Context Reinforcement Learning [96.62869749926415]
In this paper, we study the in-context learning capabilities of transformers in decision-making problems.
We introduce and study Decision-Pretrained Transformer (DPT), a supervised pretraining method where the transformer predicts an optimal action.
We find that the pretrained transformer can be used to solve a range of RL problems in-context, exhibiting both exploration online and conservatism offline.
arXiv Detail & Related papers (2023-06-26T17:58:50Z) - Is Reinforcement Learning (Not) for Natural Language Processing?:
Benchmarks, Baselines, and Building Blocks for Natural Language Policy
Optimization [73.74371798168642]
We introduce an open-source modular library, RL4LMs, for optimizing language generators with reinforcement learning.
Next, we present the GRUE benchmark, a set of 6 language generation tasks which are supervised not by target strings, but by reward functions.
Finally, we introduce an easy-to-use, performant RL algorithm, NLPO, that learns to effectively reduce the action space in language generation.
arXiv Detail & Related papers (2022-10-03T21:38:29Z) - Robust Predictable Control [149.71263296079388]
We show that our method achieves much tighter compression than prior methods, achieving up to 5x higher reward than a standard information bottleneck.
We also demonstrate that our method learns policies that are more robust and generalize better to new tasks.
arXiv Detail & Related papers (2021-09-07T17:29:34Z) - Text Generation with Efficient (Soft) Q-Learning [91.47743595382758]
Reinforcement learning (RL) offers a more flexible solution by allowing users to plug in arbitrary task metrics as rewards.
We introduce a new RL formulation for text generation from the soft Q-learning perspective (see the sketch below).
We apply the approach to a wide range of tasks, including learning from noisy/negative examples, adversarial attacks, and prompt generation.
arXiv Detail & Related papers (2021-06-14T18:48:40Z)
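As a companion to the last entry, here is a hedged sketch of the soft Q-learning view of text generation, treating the language model's logits as Q-values and regressing them toward a soft Bellman target. It is a generic formulation under that assumption, not the cited paper's exact efficient objective; names, shapes, and terminal handling are illustrative.

```python
# Hedged sketch: token-level soft Q-learning for text generation, assuming the
# LM logits are read as Q-values. End-of-sequence masking and the paper's
# efficiency tricks are omitted for brevity.
import torch
import torch.nn.functional as F

def soft_q_token_loss(logits, next_logits, tokens, reward, tau=1.0, gamma=1.0):
    """
    logits:      (B, T, V) Q(s_t, .) from the model at each position
    next_logits: (B, T, V) Q(s_{t+1}, .) from a detached/target copy
    tokens:      (B, T)    generated token ids a_t
    reward:      (B, T)    per-step reward (often zero except at sequence end)
    """
    # Q(s_t, a_t): the logit of the token that was actually generated.
    q_taken = logits.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)

    # Soft state value V(s_{t+1}) = tau * logsumexp(Q(s_{t+1}, .) / tau).
    v_next = tau * torch.logsumexp(next_logits.detach() / tau, dim=-1)

    # Regress Q(s_t, a_t) toward the soft Bellman target r_t + gamma * V(s_{t+1}).
    target = reward + gamma * v_next
    return F.mse_loss(q_taken, target)
```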