Guided Dialog Policy Learning without Adversarial Learning in the Loop
- URL: http://arxiv.org/abs/2004.03267v2
- Date: Wed, 16 Sep 2020 20:26:31 GMT
- Title: Guided Dialog Policy Learning without Adversarial Learning in the Loop
- Authors: Ziming Li, Sungjin Lee, Baolin Peng, Jinchao Li, Julia Kiseleva,
Maarten de Rijke, Shahin Shayandeh, Jianfeng Gao
- Abstract summary: A number of adversarial learning methods have been proposed to learn the reward function together with the dialogue policy.
We propose to decompose the adversarial training into two steps.
First, we train the discriminator with an auxiliary dialogue generator; then, we incorporate the derived reward model into a common RL method to guide dialogue policy learning.
- Score: 103.20723982440788
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement Learning (RL) methods have emerged as a popular choice for
training an efficient and effective dialogue policy. However, these methods
suffer from sparse and unstable reward signals returned by a user simulator
only when a dialogue finishes. In addition, the reward signal is manually designed
by human experts, which requires domain knowledge. Recently, a number of
adversarial learning methods have been proposed to learn the reward function
together with the dialogue policy. However, to alternately update the
dialogue policy and the reward model on the fly, we are limited to
policy-gradient-based algorithms, such as REINFORCE and PPO. Moreover, the
alternating training of a dialogue agent and the reward model can easily get
stuck in local optima or result in mode collapse. To overcome the listed
issues, we propose to decompose the adversarial training into two steps. First,
we train the discriminator with an auxiliary dialogue generator and then
incorporate a derived reward model into a common RL method to guide the
dialogue policy learning. This approach is applicable to both on-policy and
off-policy RL methods. Based on our extensive experimentation, we conclude
that the proposed method: (1) achieves a remarkable task success rate using both
on-policy and off-policy RL methods; and (2) has the potential to transfer
knowledge from existing domains to a new domain.
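To make the two-step decomposition concrete, below is a minimal sketch in PyTorch. It is an illustrative assumption, not the authors' implementation: the network sizes, the `generator.sample` interface, and the log-sigmoid reward shaping are all placeholders. Step 1 trains a discriminator against an auxiliary dialogue generator before any policy optimization; Step 2 freezes the discriminator and exposes it as a per-turn reward that any on-policy or off-policy RL algorithm can consume.

```python
# Illustrative sketch of the two-step decomposition described in the abstract.
# All module names, feature sizes, and hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, ACTION_DIM = 128, 32  # assumed dialogue-state / action feature sizes


class Discriminator(nn.Module):
    """Scores a (state, action) pair; a high score means 'human-like' turn."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))


def pretrain_discriminator(disc, human_batches, generator, steps=1000):
    """Step 1: train the discriminator against an auxiliary dialogue generator,
    before (rather than interleaved with) policy optimization.

    `human_batches` is assumed to yield (state, action) tensors from human
    dialogues; `generator` is a hypothetical auxiliary generator with a
    `sample(batch_size)` method returning fake (state, action) pairs."""
    opt = torch.optim.Adam(disc.parameters(), lr=1e-4)
    for _, (h_state, h_action) in zip(range(steps), human_batches):
        g_state, g_action = generator.sample(h_state.size(0))
        logits_real = disc(h_state, h_action)
        logits_fake = disc(g_state, g_action)
        # Standard GAN-style binary cross-entropy: real -> 1, generated -> 0.
        loss = (
            F.binary_cross_entropy_with_logits(
                logits_real, torch.ones_like(logits_real))
            + F.binary_cross_entropy_with_logits(
                logits_fake, torch.zeros_like(logits_fake))
        )
        opt.zero_grad()
        loss.backward()
        opt.step()


def reward_from_discriminator(disc, state, action):
    """Step 2: the frozen discriminator provides a dense, per-turn reward that
    a common RL method (e.g., PPO on-policy or DQN off-policy) can use in
    place of the sparse, hand-crafted end-of-dialogue reward."""
    with torch.no_grad():
        return torch.log(torch.sigmoid(disc(state, action)) + 1e-8)
```

Because the reward model is fixed during policy learning, the policy update no longer needs to be policy-gradient-based, which is what makes the approach applicable to off-policy methods as well.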