Guided Dialog Policy Learning without Adversarial Learning in the Loop
- URL: http://arxiv.org/abs/2004.03267v2
- Date: Wed, 16 Sep 2020 20:26:31 GMT
- Title: Guided Dialog Policy Learning without Adversarial Learning in the Loop
- Authors: Ziming Li, Sungjin Lee, Baolin Peng, Jinchao Li, Julia Kiseleva,
Maarten de Rijke, Shahin Shayandeh, Jianfeng Gao
- Abstract summary: A number of adversarial learning methods have been proposed to learn the reward function together with the dialogue policy.
We propose to decompose the adversarial training into two steps.
First, we train the discriminator with an auxiliary dialogue generator and then incorporate a derived reward model into a common RL method to guide the dialogue policy learning.
- Score: 103.20723982440788
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement Learning (RL) methods have emerged as a popular choice for
training an efficient and effective dialogue policy. However, these methods
suffer from sparse and unstable reward signals returned by a user simulator
only when a dialogue finishes. Besides, the reward signal is manually designed
by human experts, which requires domain knowledge. Recently, a number of
adversarial learning methods have been proposed to learn the reward function
together with the dialogue policy. However, to alternately update the
dialogue policy and the reward model on the fly, we are limited to
policy-gradient-based algorithms, such as REINFORCE and PPO. Moreover, the
alternating training of a dialogue agent and the reward model can easily get
stuck in local optima or result in mode collapse. To overcome the listed
issues, we propose to decompose the adversarial training into two steps. First,
we train the discriminator with an auxiliary dialogue generator and then
incorporate a derived reward model into a common RL method to guide the
dialogue policy learning. This approach is applicable to both on-policy and
off-policy RL methods. Based on our extensive experimentation, we conclude that
the proposed method (1) achieves a remarkable task success rate using both
on-policy and off-policy RL methods; and (2) has the potential to transfer
knowledge from existing domains to a new domain.
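Below is a minimal, hypothetical sketch of this two-step decomposition: a discriminator is first trained offline against an auxiliary dialogue generator, then frozen and used as a turn-level reward model inside an ordinary RL loop. The `env`, `policy`, and generator interfaces are illustrative assumptions, and the log-ratio reward is one common way to derive a reward from a discriminator; neither is taken from the paper's actual implementation.
```python
import math

# Step 1: train the discriminator offline against an auxiliary dialogue generator.
# It learns to separate human (state, action) pairs from generated ones; the
# "training" below just memorizes positives as a stand-in for a real classifier.
def train_discriminator(human_pairs, generator, num_fake=1000):
    positives = set(human_pairs)
    negatives = {generator.sample() for _ in range(num_fake)}  # hypothetical generator API

    def d(state, action):
        # Probability estimate that (state, action) is human-like.
        if (state, action) in positives:
            return 0.95
        if (state, action) in negatives:
            return 0.05
        return 0.5

    return d

# Step 2: freeze the discriminator and derive a dense, turn-level reward from it,
# e.g. r(s, a) = log d(s, a) - log(1 - d(s, a)), then feed it to any on-policy or
# off-policy learner (REINFORCE, PPO, DQN, ...) in place of the sparse
# end-of-dialogue signal.
def derived_reward(d, state, action):
    p = d(state, action)
    return math.log(p) - math.log(1.0 - p)

def rl_loop(env, policy, d, episodes=100):
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            action = policy.act(state)
            next_state, done = env.step(action)          # hypothetical env interface
            reward = derived_reward(d, state, action)     # dense reward at every turn
            policy.update(state, action, reward, next_state, done)
            state = next_state
```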
Related papers
- JoTR: A Joint Transformer and Reinforcement Learning Framework for
Dialog Policy Learning [53.83063435640911]
Dialogue policy learning (DPL) is a crucial component of dialogue modelling.
We introduce a novel framework, JoTR, to generate flexible dialogue actions.
Unlike traditional methods, JoTR formulates a word-level policy that allows for more dynamic and adaptable dialogue action generation.
arXiv Detail & Related papers (2023-09-01T03:19:53Z)
- Why Guided Dialog Policy Learning performs well? Understanding the role of adversarial learning and its alternative [0.44267358790081573]
In recent years, reinforcement learning has emerged as a promising option for dialog policy learning (DPL).
One way to estimate rewards from collected data is to train the reward estimator and dialog policy simultaneously using adversarial learning (AL).
This paper identifies the role of AL in DPL through detailed analyses of the objective functions of dialog policy and reward estimator.
We propose a method that eliminates AL from reward estimation and DPL while retaining its advantages.
arXiv Detail & Related papers (2023-07-13T12:29:29Z)
- CHAI: A CHatbot AI for Task-Oriented Dialogue with Offline Reinforcement Learning [85.3987745097806]
Offline reinforcement learning can be used to train dialogue agents entirely from static datasets collected from human speakers.
Experiments show that recently developed offline RL methods can be combined with language models to yield realistic dialogue agents.
arXiv Detail & Related papers (2022-04-18T17:43:21Z)
- Jump-Start Reinforcement Learning [68.82380421479675]
We present a meta algorithm that can use offline data, demonstrations, or a pre-existing policy to initialize an RL policy.
In particular, we propose Jump-Start Reinforcement Learning (JSRL), an algorithm that employs two policies to solve tasks.
We show via experiments that JSRL is able to significantly outperform existing imitation and reinforcement learning algorithms.
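As a rough illustration of the two-policy idea, here is a hypothetical sketch in which a guide policy rolls in for the first part of each episode before handing control to the learning policy; the exact roll-in split, environment interface, and annealing rule are illustrative assumptions rather than details from this summary.
```python
def jump_start_episode(env, guide_policy, exploration_policy, guide_steps, max_steps=40):
    """Roll in with the guide policy for `guide_steps` turns, then hand control to the learner."""
    state = env.reset()
    trajectory = []
    for t in range(max_steps):
        policy = guide_policy if t < guide_steps else exploration_policy
        action = policy(state)
        next_state, reward, done = env.step(action)   # hypothetical env interface
        trajectory.append((state, action, reward, next_state, done))
        state = next_state
        if done:
            break
    return trajectory

def anneal_guide_steps(guide_steps, recent_success_rate, threshold=0.8):
    """Shrink the guide's share once the learner does well enough; the 0.8 threshold is an arbitrary choice."""
    return max(0, guide_steps - 1) if recent_success_rate >= threshold else guide_steps
```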
arXiv Detail & Related papers (2022-04-05T17:25:22Z)
- Integrating Pretrained Language Model for Dialogue Policy Learning [23.453017883791237]
Reinforcement Learning (RL) has shown potential for training a dialogue policy agent to maximize the accumulated rewards given by users.
We decompose the adversarial training into two steps: 1) we integrate a pre-trained language model as a discriminator to judge whether the current system action is good enough for the last user action.
The experimental results demonstrate that our method significantly improves the completion rate (4.4%) and success rate (8.0%) of the dialogue system.
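A minimal sketch of how a pre-trained language model could serve as such a turn-level discriminator, scoring whether a system action responds appropriately to the last user action; the model name, the plain-text encoding of actions, and the use of the positive-class probability as a reward are illustrative assumptions, not the paper's actual setup.
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical setup: a pre-trained LM (fine-tuned as a binary classifier in practice)
# judges whether the system action is an appropriate reply to the last user action.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.eval()

def action_reward(user_action: str, system_action: str) -> float:
    """Positive-class probability, used as a turn-level reward for the dialogue policy."""
    inputs = tokenizer(user_action, system_action, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

# Example (hypothetical dialogue-act strings):
# reward = action_reward("request(hotel, area=centre)", "inform(hotel, name=Acorn Guest House)")
```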
arXiv Detail & Related papers (2021-11-02T07:16:03Z)
- Causal-aware Safe Policy Improvement for Task-oriented dialogue [45.88777832381149]
We propose a batch RL framework for task-oriented dialogue policy learning: Causal-aware Safe Policy Improvement (CASPI).
We demonstrate the effectiveness of this framework on the dialogue-context-to-text generation and end-to-end dialogue tasks of the MultiWOZ 2.0 dataset.
arXiv Detail & Related papers (2021-03-10T22:34:28Z)
- Rethinking Supervised Learning and Reinforcement Learning in Task-Oriented Dialogue Systems [58.724629408229205]
We demonstrate how traditional supervised learning and a simulator-free adversarial learning method can be used to achieve performance comparable to state-of-the-art RL-based methods.
Our main goal is not to beat reinforcement learning with supervised learning, but to demonstrate the value of rethinking the role of reinforcement learning and supervised learning in optimizing task-oriented dialogue systems.
arXiv Detail & Related papers (2020-09-21T12:04:18Z)
- Semi-Supervised Dialogue Policy Learning via Stochastic Reward Estimation [33.688270031454095]
We introduce reward learning to learn from state-action pairs of an optimal policy to provide turn-by-turn rewards. However, this approach requires complete state-action annotations of human-to-human dialogues. We therefore propose a novel reward learning approach for semi-supervised policy learning.
arXiv Detail & Related papers (2020-05-09T06:28:44Z)
- Efficient Deep Reinforcement Learning via Adaptive Policy Transfer [50.51637231309424]
A Policy Transfer Framework (PTF) is proposed to accelerate Reinforcement Learning (RL).
Our framework learns when and which source policy is best to reuse for the target policy, and when to terminate it.
Experimental results show it significantly accelerates the learning process and surpasses state-of-the-art policy transfer methods.
arXiv Detail & Related papers (2020-02-19T07:30:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.