Building Persona Consistent Dialogue Agents with Offline Reinforcement
Learning
- URL: http://arxiv.org/abs/2310.10735v1
- Date: Mon, 16 Oct 2023 18:05:54 GMT
- Title: Building Persona Consistent Dialogue Agents with Offline Reinforcement
Learning
- Authors: Ryan Shea and Zhou Yu
- Abstract summary: Current state-of-the-art systems maintain a consistent persona by training agents with supervised learning or online reinforcement learning (RL).
We propose an offline RL framework to improve the persona consistency of dialogue systems.
- Score: 23.149638288383347
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Maintaining a consistent persona is a key quality for any open domain
dialogue system. Current state-of-the-art systems do this by training agents
with supervised learning or online reinforcement learning (RL). However,
systems trained with supervised learning often lack consistency as they are
never punished for uttering contradictions. Additional training with RL can
alleviate some of these issues; however, the training process is expensive.
Instead, we propose an offline RL framework to improve the persona consistency
of dialogue systems. Our framework allows us to combine the advantages of
previous methods as we can inexpensively train our model on existing data as in
supervised learning, while punishing and rewarding specific utterances as in
RL. We also introduce a simple importance sampling method to reduce the
variance of importance weights in offline RL training which we call
Variance-Reducing MLE-Initialized (VaRMI) importance sampling. Our automatic
and human evaluations show that our framework improves both the persona
consistency and dialogue quality of a state-of-the-art social chatbot.
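The abstract describes the training objective only at a high level. As a rough illustration of the core idea (training on existing dialogue data as in supervised learning while rewarding persona-consistent utterances and punishing contradictions through importance-weighted policy gradients from an MLE-initialized policy), here is a minimal PyTorch sketch. The reward values, the weight-clipping heuristic, and all function names are illustrative assumptions and do not reproduce the paper's VaRMI estimator:

# Minimal sketch (not the paper's code): an importance-weighted offline
# policy-gradient update where the policy is initialized from an MLE
# (supervised) model, so importance ratios start near 1 and their
# variance is low early in training. Rewards are illustrative:
# +1 for persona-consistent utterances, -1 for contradictions.
import torch

def offline_pg_loss(policy_logprobs, behavior_logprobs, rewards, clip=5.0):
    """policy_logprobs, behavior_logprobs: summed log-probabilities of each
    utterance under the current policy and the behavior (data) policy.
    rewards: per-utterance consistency rewards (e.g. from an NLI critic).
    Clipping is an assumed variance-control heuristic, not VaRMI itself."""
    weights = torch.exp(policy_logprobs - behavior_logprobs).clamp(max=clip)
    # REINFORCE-style objective: reward good utterances, punish contradictions.
    return -(weights.detach() * rewards * policy_logprobs).mean()

# Illustrative usage with dummy tensors; in practice these would come from a
# dialogue language model scored on a static dialogue dataset.
policy_lp = torch.tensor([-12.3, -8.7, -15.1], requires_grad=True)
behavior_lp = torch.tensor([-12.0, -9.0, -14.8])
rewards = torch.tensor([1.0, -1.0, 1.0])
loss = offline_pg_loss(policy_lp, behavior_lp, rewards)
loss.backward()

Starting the policy from the MLE (supervised) solution keeps the importance ratios close to one early in training, which is the intuition behind the variance reduction; the exact VaRMI correction is defined in the paper itself.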
Related papers
- Replicating Complex Dialogue Policy of Humans via Offline Imitation Learning with Supervised Regularization [7.151589223349882]
Policy learning (PL) is the module of a task-oriented dialogue system that trains an agent to take actions at each dialogue turn.
Neither supervised learning (SL) nor reinforcement learning (RL) frameworks imitate humans well on their own.
This study proposes an offline imitation learning model that learns a policy from real dialogue datasets.
arXiv Detail & Related papers (2023-05-06T09:27:58Z)
- CHAI: A CHatbot AI for Task-Oriented Dialogue with Offline Reinforcement Learning [85.3987745097806]
Offline reinforcement learning can be used to train dialogue agents entirely from static datasets collected from human speakers.
Experiments show that recently developed offline RL methods can be combined with language models to yield realistic dialogue agents.
arXiv Detail & Related papers (2022-04-18T17:43:21Z)
- Offline-to-Online Reinforcement Learning via Balanced Replay and Pessimistic Q-Ensemble [135.6115462399788]
Deep offline reinforcement learning has made it possible to train strong robotic agents from offline datasets.
State-action distribution shift may lead to severe bootstrap error during fine-tuning.
We propose a balanced replay scheme that prioritizes samples encountered online while also encouraging the use of near-on-policy samples.
arXiv Detail & Related papers (2021-07-01T16:26:54Z)
- PEBBLE: Feedback-Efficient Interactive Reinforcement Learning via Relabeling Experience and Unsupervised Pre-training [94.87393610927812]
We present an off-policy, interactive reinforcement learning algorithm that capitalizes on the strengths of both feedback and off-policy learning.
We demonstrate that our approach is capable of learning tasks of higher complexity than previously considered by human-in-the-loop methods.
arXiv Detail & Related papers (2021-06-09T14:10:50Z)
- A bandit approach to curriculum generation for automatic speech recognition [7.008190762572486]
We present an approach to mitigate the lack of training data by employing Automated Curriculum Learning.
The goal of the approach is to optimize the training sequence of mini-batches ranked by level of difficulty.
We test our approach on a truly low-resource language and show that the bandit framework yields a clear improvement over the baseline transfer-learning model.
arXiv Detail & Related papers (2021-02-06T20:32:10Z)
- Automatic Curriculum Learning With Over-repetition Penalty for Dialogue Policy Learning [8.744026064255337]
We propose a novel framework, Automatic Curriculum Learning-based Deep Q-Network (ACL-DQN), to realize automatic curriculum learning for dialogue policy learning.
The teacher model arranges a meaningful ordered curriculum and automatically adjusts it by monitoring the learning progress of the dialogue agent.
Experiments show that ACL-DQN improves the effectiveness and stability of dialogue tasks by a statistically significant margin.
arXiv Detail & Related papers (2020-12-28T02:44:49Z)
- Human-centric Dialog Training via Offline Reinforcement Learning [16.525761580699257]
We develop a novel class of offline reinforcement learning algorithms.
We test the resulting dialog model with ratings from 80 users in an open-domain setting.
arXiv Detail & Related papers (2020-10-12T16:53:00Z)
- Rethinking Supervised Learning and Reinforcement Learning in Task-Oriented Dialogue Systems [58.724629408229205]
We demonstrate how traditional supervised learning and a simulator-free adversarial learning method can be used to achieve performance comparable to state-of-the-art RL-based methods.
Our main goal is not to beat reinforcement learning with supervised learning, but to demonstrate the value of rethinking the roles of reinforcement learning and supervised learning in optimizing task-oriented dialogue systems.
arXiv Detail & Related papers (2020-09-21T12:04:18Z)
- Modelling Hierarchical Structure between Dialogue Policy and Natural Language Generator with Option Framework for Task-oriented Dialogue System [49.39150449455407]
HDNO is an option framework for designing latent dialogue acts, which avoids hand-designing specific dialogue act representations.
We test HDNO on MultiWOZ 2.0 and MultiWOZ 2.1, multi-domain dialogue datasets, in comparison with a word-level E2E model trained with RL, LaRL, and HDSA.
arXiv Detail & Related papers (2020-06-11T20:55:28Z)
- Guided Dialog Policy Learning without Adversarial Learning in the Loop [103.20723982440788]
A number of adversarial learning methods have been proposed to learn the reward function together with the dialogue policy.
We propose to decompose the adversarial training into two steps.
First, we train the discriminator with an auxiliary dialogue generator, and then incorporate the derived reward model into a common RL method to guide dialogue policy learning.
arXiv Detail & Related papers (2020-04-07T11:03:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.