Human-centric Dialog Training via Offline Reinforcement Learning
- URL: http://arxiv.org/abs/2010.05848v1
- Date: Mon, 12 Oct 2020 16:53:00 GMT
- Title: Human-centric Dialog Training via Offline Reinforcement Learning
- Authors: Natasha Jaques, Judy Hanwen Shen, Asma Ghandeharioun, Craig Ferguson,
Agata Lapedriza, Noah Jones, Shixiang Shane Gu, and Rosalind Picard
- Abstract summary: We develop a novel class of offline reinforcement learning algorithms.
We test the resulting dialog model with ratings from 80 users in an open-domain setting.
- Score: 16.525761580699257
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: How can we train a dialog model to produce better conversations by learning
from human feedback, without the risk of humans teaching it harmful chat
behaviors? We start by hosting models online, and gather human feedback from
real-time, open-ended conversations, which we then use to train and improve the
models using offline reinforcement learning (RL). We identify implicit
conversational cues including language similarity, elicitation of laughter,
sentiment, and more, which indicate positive human feedback, and embed these in
multiple reward functions. A well-known challenge is that learning an RL policy
in an offline setting usually fails due to the lack of ability to explore and
the tendency to make over-optimistic estimates of future reward. These problems
become even harder when using RL for language models, which can easily have a
20,000 action vocabulary and many possible reward functions. We solve the
challenge by developing a novel class of offline RL algorithms. These
algorithms use KL-control to penalize divergence from a pre-trained prior
language model, and use a new strategy to make the algorithm pessimistic,
instead of optimistic, in the face of uncertainty. We test the resulting dialog
model with ratings from 80 users in an open-domain setting and find it achieves
significant improvements over existing deep offline RL approaches. The novel
offline RL method is viable for improving any existing generative dialog model
using a static dataset of human feedback.
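
To make the implicit-feedback rewards described in the abstract concrete, here is a minimal sketch of scoring a bot utterance from the human's next reply. The specific cue detectors, weights, and helper callables (similarity_fn, sentiment_fn) are illustrative assumptions, not the paper's exact reward functions; the paper combines several such cues into multiple reward functions.

```python
import re

# Hypothetical cue detector and weights: the paper embeds cues such as
# language similarity, elicited laughter, and sentiment in multiple reward
# functions; these particular choices are illustrative assumptions only.
LAUGHTER = re.compile(r"\b(haha+|lol|lmao)\b", re.IGNORECASE)

def implicit_reward(bot_utterance: str,
                    user_reply: str,
                    similarity_fn,   # e.g. cosine similarity of sentence embeddings (assumed helper)
                    sentiment_fn):   # maps text to a sentiment score in [-1, 1] (assumed helper)
    """Score a bot utterance from the human's next reply, using implicit
    conversational cues instead of explicit ratings."""
    reward = 0.0
    # Language similarity: the user mirroring the bot's wording is a positive cue.
    reward += 1.0 * similarity_fn(bot_utterance, user_reply)
    # Elicited laughter is a positive cue.
    reward += 0.5 * (1.0 if LAUGHTER.search(user_reply) else 0.0)
    # Positive sentiment in the user's reply is a positive cue.
    reward += 1.0 * sentiment_fn(user_reply)
    return reward
```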
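The offline RL update itself can be sketched as a single TD step that subtracts a KL penalty toward a frozen pre-trained prior language model and replaces the usual optimistic bootstrap with a pessimistic lower-bound value. The PyTorch-style sketch below is written under stated assumptions: the network names, the minimum-over-ensemble form of pessimism, and the hyperparameters are illustrative stand-ins, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def kl_control_pessimistic_td_loss(
    q_net,           # current Q-network over the token vocabulary (illustrative name)
    target_q_nets,   # list of target Q-networks; their minimum gives a pessimistic value (assumption)
    prior_lm,        # frozen pre-trained prior language model
    states, actions, rewards, next_states,
    gamma=0.99,
    kl_weight=0.1,   # strength of the KL penalty toward the prior (illustrative value)
):
    """One TD step that (i) penalizes KL divergence from the prior language
    model and (ii) uses a lower-bound (minimum over an ensemble) estimate of
    the next-state value instead of an optimistic one."""
    q_all = q_net(states)                                        # [B, vocab]
    q_taken = q_all.gather(1, actions.unsqueeze(1)).squeeze(1)   # Q(s, a) for the logged actions

    with torch.no_grad():
        # Pessimism: take the smallest next-state value across the target networks.
        next_qs = torch.stack([tq(next_states) for tq in target_q_nets])   # [K, B, vocab]
        next_v = next_qs.max(dim=-1).values.min(dim=0).values              # [B]

        # KL-control: divergence of the Q-induced policy from the prior language model.
        log_pi = F.log_softmax(q_all, dim=-1)
        log_prior = F.log_softmax(prior_lm(states), dim=-1)
        kl = (log_pi.exp() * (log_pi - log_prior)).sum(dim=-1)             # [B]

        target = rewards - kl_weight * kl + gamma * next_v

    return F.mse_loss(q_taken, target)
```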
Related papers
- Q-SFT: Q-Learning for Language Models via Supervised Fine-Tuning [62.984693936073974]
Value-based reinforcement learning can learn effective policies for a wide range of multi-turn problems.
Current value-based RL methods have proven particularly challenging to scale to the setting of large language models.
We propose a novel offline RL algorithm that addresses these drawbacks, casting Q-learning as a modified supervised fine-tuning problem.
arXiv Detail & Related papers (2024-11-07T21:36:52Z)
- Online Bandit Learning with Offline Preference Data [15.799929216215672]
We propose a posterior sampling algorithm for online learning that can be warm-started with an offline dataset with noisy preference feedback.
We show that by modeling the 'competence' of the expert that generated it, we are able to use such a dataset most effectively.
arXiv Detail & Related papers (2024-06-13T20:25:52Z)
- Zero-Shot Goal-Directed Dialogue via RL on Imagined Conversations [70.7884839812069]
Large language models (LLMs) have emerged as powerful and general solutions to many natural language tasks.
However, many of the most important applications of language generation are interactive, where an agent has to talk to a person to reach a desired outcome.
In this work, we explore a new method for adapting LLMs with RL for such goal-directed dialogue.
arXiv Detail & Related papers (2023-11-09T18:45:16Z)
- Building Persona Consistent Dialogue Agents with Offline Reinforcement Learning [23.149638288383347]
Current state-of-the-art systems maintain persona consistency by training agents with supervised learning or online reinforcement learning (RL).
We propose an offline RL framework to improve the persona consistency of dialogue systems.
arXiv Detail & Related papers (2023-10-16T18:05:54Z)
- Replicating Complex Dialogue Policy of Humans via Offline Imitation Learning with Supervised Regularization [7.151589223349882]
Policy learning (PL) is the module of a task-oriented dialogue system that trains an agent to take actions in each dialogue turn.
Neither supervised learning (SL) nor reinforcement learning (RL) frameworks imitate humans well.
This study proposes an offline imitation learning model that learns a policy from real dialogue datasets.
arXiv Detail & Related papers (2023-05-06T09:27:58Z)
- Offline Reinforcement Learning for Mixture-of-Expert Dialogue Management [36.254564021059515]
Reinforcement learning (RL) has shown great promise for developing dialogue management (DM) agents that are non-myopic.
We develop a variety of RL algorithms, specialized to dialogue planning, that leverage recent Mixture-of-Expert Language Models (MoE-LMs).
By exploiting MoE-LM structure, our methods significantly reduce the size of the action space and improve the efficacy of RL-based DM.
arXiv Detail & Related papers (2023-02-21T18:02:20Z)
- Chain of Hindsight Aligns Language Models with Feedback [62.68665658130472]
We propose a novel technique, Chain of Hindsight, that is easy to optimize and can learn from any form of feedback, regardless of its polarity.
We convert all types of feedback into sequences of sentences, which are then used to fine-tune the model.
By doing so, the model is trained to generate outputs based on feedback, while learning to identify and correct negative attributes or errors.
arXiv Detail & Related papers (2023-02-06T10:28:16Z)
- CHAI: A CHatbot AI for Task-Oriented Dialogue with Offline Reinforcement Learning [85.3987745097806]
Offline reinforcement learning can be used to train dialogue agents entirely from static datasets collected from human speakers.
Experiments show that recently developed offline RL methods can be combined with language models to yield realistic dialogue agents.
arXiv Detail & Related papers (2022-04-18T17:43:21Z)
- Text Generation with Efficient (Soft) Q-Learning [91.47743595382758]
Reinforcement learning (RL) offers a more flexible solution by allowing users to plug in arbitrary task metrics as reward.
We introduce a new RL formulation for text generation from the soft Q-learning perspective.
We apply the approach to a wide range of tasks, including learning from noisy/negative examples, adversarial attacks, and prompt generation.
arXiv Detail & Related papers (2021-06-14T18:48:40Z)
- Guided Dialog Policy Learning without Adversarial Learning in the Loop [103.20723982440788]
A number of adversarial learning methods have been proposed to learn the reward function together with the dialogue policy.
We propose to decompose the adversarial training into two steps.
First, we train the discriminator with an auxiliary dialogue generator and then incorporate a derived reward model into a common RL method to guide the dialogue policy learning.
arXiv Detail & Related papers (2020-04-07T11:03:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.