Why Guided Dialog Policy Learning performs well? Understanding the role
of adversarial learning and its alternative
- URL: http://arxiv.org/abs/2307.06721v1
- Date: Thu, 13 Jul 2023 12:29:29 GMT
- Title: Why Guided Dialog Policy Learning performs well? Understanding the role
of adversarial learning and its alternative
- Authors: Sho Shimoyama, Tetsuro Morimura, Kenshi Abe, Toda Takamichi, Yuta
Tomomatsu, Masakazu Sugiyama, Asahi Hentona, Yuuki Azuma, Hirotaka Ninomiya
- Abstract summary: In recent years, reinforcement learning has emerged as a promising option for dialog policy learning (DPL).
One way to estimate rewards from collected data is to train the reward estimator and dialog policy simultaneously using adversarial learning (AL).
This paper identifies the role of AL in DPL through detailed analyses of the objective functions of dialog policy and reward estimator.
We propose a method that eliminates AL from reward estimation and DPL while retaining its advantages.
- Score: 0.44267358790081573
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dialog policies, which determine a system's action based on the current state
at each dialog turn, are crucial to the success of the dialog. In recent years,
reinforcement learning (RL) has emerged as a promising option for dialog policy
learning (DPL). In RL-based DPL, dialog policies are updated according to
rewards. The manual construction of fine-grained rewards, such as
state-action-based ones, to effectively guide the dialog policy is challenging
in multi-domain task-oriented dialog scenarios with numerous state-action pair
combinations. One way to estimate rewards from collected data is to train the
reward estimator and dialog policy simultaneously using adversarial learning
(AL). Although this method has demonstrated superior performance
experimentally, it is fraught with the inherent problems of AL, such as mode
collapse. This paper first identifies the role of AL in DPL through detailed
analyses of the objective functions of dialog policy and reward estimator.
Next, based on these analyses, we propose a method that eliminates AL from
reward estimation and DPL while retaining its advantages. We evaluate our
method using MultiWOZ, a multi-domain task-oriented dialog corpus.
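The adversarial setup the abstract refers to is, in essence, a GAIL-style loop: the reward estimator is trained as a discriminator between human and policy-generated state-action pairs while the dialog policy is updated on the discriminator's score. The sketch below illustrates only that loop; the dimensions, synthetic one-hot states and actions, and single REINFORCE step are illustrative assumptions, not the paper's actual model or the MultiWOZ pipeline.

```python
# Minimal sketch of GAIL-style adversarial reward estimation for DPL.
# All shapes and data are toy assumptions, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, ACTION_DIM, BATCH = 32, 8, 64

policy = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                       nn.Linear(64, ACTION_DIM))
# The reward estimator / discriminator scores (state, action) pairs.
reward_estimator = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64),
                                 nn.ReLU(), nn.Linear(64, 1))
opt_pi = torch.optim.Adam(policy.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(reward_estimator.parameters(), lr=1e-3)

def human_batch(n=BATCH):
    # Stand-in for state-action pairs collected from human-human dialogs.
    s = torch.randn(n, STATE_DIM)
    a = F.one_hot(torch.randint(ACTION_DIM, (n,)), ACTION_DIM).float()
    return s, a

for step in range(200):
    # Roll out the current policy (one turn per sample for brevity).
    s_pi = torch.randn(BATCH, STATE_DIM)
    dist = torch.distributions.Categorical(logits=policy(s_pi))
    a_idx = dist.sample()
    a_pi = F.one_hot(a_idx, ACTION_DIM).float()

    # Discriminator step: human pairs labeled 1, policy pairs labeled 0.
    s_h, a_h = human_batch()
    d_h = reward_estimator(torch.cat([s_h, a_h], dim=-1))
    d_p = reward_estimator(torch.cat([s_pi, a_pi], dim=-1))
    loss_d = (F.binary_cross_entropy_with_logits(d_h, torch.ones_like(d_h))
              + F.binary_cross_entropy_with_logits(d_p, torch.zeros_like(d_p)))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Policy step: the discriminator output serves as the estimated reward;
    # -log(1 - D(s, a)) rewards actions the discriminator mistakes for human.
    with torch.no_grad():
        r = -F.logsigmoid(-reward_estimator(torch.cat([s_pi, a_pi], dim=-1)))
    loss_pi = -(dist.log_prob(a_idx) * r.squeeze(-1)).mean()
    opt_pi.zero_grad(); loss_pi.backward(); opt_pi.step()
```

Because the discriminator and the policy chase each other inside the same loop, this training can suffer from the instabilities the abstract mentions, such as mode collapse.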
Related papers
- Plug-and-Play Policy Planner for Large Language Model Powered Dialogue Agents [121.46051697742608]
We introduce a new dialogue policy planning paradigm to strategize dialogue problems with a tunable language model plug-in named PPDPP.
Specifically, we develop a novel training framework to facilitate supervised fine-tuning over available human-annotated data.
PPDPP consistently and substantially outperforms existing approaches on three different proactive dialogue applications.
arXiv Detail & Related papers (2023-11-01T03:20:16Z)
- JoTR: A Joint Transformer and Reinforcement Learning Framework for Dialog Policy Learning [53.83063435640911]
Dialogue policy learning (DPL) is a crucial component of dialogue modelling.
We introduce a novel framework, JoTR, to generate flexible dialogue actions.
Unlike traditional methods, JoTR formulates a word-level policy that allows for a more dynamic and adaptable dialogue action generation.
arXiv Detail & Related papers (2023-09-01T03:19:53Z)
- "Think Before You Speak": Improving Multi-Action Dialog Policy by Planning Single-Action Dialogs [33.78889030078026]
Multi-action dialog policy (MADP) generates multiple atomic dialog actions per turn.
We propose Planning Enhanced Dialog Policy (PEDP), a novel multi-task learning framework that learns single-action dialog dynamics.
Our fully supervised learning-based method achieves a solid task success rate of 90.6%, a 3% improvement over state-of-the-art methods.
arXiv Detail & Related papers (2022-04-25T07:55:53Z)
- In-Context Learning for Few-Shot Dialogue State Tracking [55.91832381893181]
We propose an in-context (IC) learning framework for few-shot dialogue state tracking (DST).
A large pre-trained language model (LM) takes a test instance and a few annotated examples as input, and directly decodes the dialogue states without any parameter updates.
This makes the LM more flexible and scalable compared to prior few-shot DST work when adapting to new domains and scenarios.
arXiv Detail & Related papers (2022-03-16T11:58:24Z)
- Causal-aware Safe Policy Improvement for Task-oriented dialogue [45.88777832381149]
We propose a batch RL framework for task-oriented dialogue policy learning: causal safe policy improvement (CASPI).
We demonstrate the effectiveness of this framework on the dialogue-context-to-text generation and end-to-end dialogue tasks of the MultiWOZ 2.0 dataset.
arXiv Detail & Related papers (2021-03-10T22:34:28Z)
- Distributed Structured Actor-Critic Reinforcement Learning for Universal Dialogue Management [29.57382819573169]
We focus on devising a policy that chooses which dialogue action to take in response to the user.
The sequential system decision-making process can be abstracted into a partially observable Markov decision process.
In the past few years, many deep reinforcement learning (DRL) algorithms have been proposed that use neural networks (NNs) as function approximators.
arXiv Detail & Related papers (2020-09-22T05:39:31Z)
- Rethinking Dialogue State Tracking with Reasoning [76.0991910623001]
This paper proposes to track dialogue states gradually with reasoning over dialogue turns with the help of the back-end data.
Empirical results demonstrate that our method significantly outperforms the state-of-the-art methods by 38.6% in terms of joint belief accuracy for MultiWOZ 2.1.
arXiv Detail & Related papers (2020-05-27T02:05:33Z)
- Semi-Supervised Dialogue Policy Learning via Stochastic Reward Estimation [33.688270031454095]
Reward learning has been introduced to learn from state-action pairs of an optimal policy to provide turn-by-turn rewards.
This approach, however, requires complete state-action annotations of human-to-human dialogues.
We propose a novel reward learning approach for semi-supervised policy learning.
arXiv Detail & Related papers (2020-05-09T06:28:44Z)
- Guided Dialog Policy Learning without Adversarial Learning in the Loop [103.20723982440788]
A number of adversarial learning methods have been proposed to learn the reward function together with the dialogue policy.
We propose to decompose the adversarial training into two steps.
First, we train the discriminator with an auxiliary dialogue generator and then incorporate a derived reward model into a common RL method to guide the dialogue policy learning.
arXiv Detail & Related papers (2020-04-07T11:03:17Z)
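The last related paper above already moves adversarial training out of the RL loop by splitting it into two steps. A compact sketch of that decomposition, under the same toy assumptions as the earlier example (the auxiliary generator is left untrained here purely for brevity), might look like:

```python
# Hedged sketch of the two-step decomposition: (1) train the discriminator
# against an auxiliary dialog generator offline, (2) freeze it as a reward
# model inside a common RL update. Dimensions and data are toy assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, ACTION_DIM, BATCH = 32, 8, 64
reward_model = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64),
                             nn.ReLU(), nn.Linear(64, 1))
aux_generator = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                              nn.Linear(64, ACTION_DIM))  # untrained stand-in
opt_r = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

def human_batch(n=BATCH):
    # Stand-in for human-human state-action pairs from the corpus.
    s = torch.randn(n, STATE_DIM)
    a = F.one_hot(torch.randint(ACTION_DIM, (n,)), ACTION_DIM).float()
    return s, a

# Step 1: fit the discriminator offline, before any policy learning.
for _ in range(200):
    s_h, a_h = human_batch()
    s_g = torch.randn(BATCH, STATE_DIM)
    a_g = F.one_hot(aux_generator(s_g).argmax(dim=-1), ACTION_DIM).float()
    d_h = reward_model(torch.cat([s_h, a_h], dim=-1))
    d_g = reward_model(torch.cat([s_g, a_g], dim=-1))
    loss = (F.binary_cross_entropy_with_logits(d_h, torch.ones_like(d_h))
            + F.binary_cross_entropy_with_logits(d_g, torch.zeros_like(d_g)))
    opt_r.zero_grad(); loss.backward(); opt_r.step()

# Step 2: the frozen discriminator now serves as the reward for ordinary RL
# (REINFORCE here); no adversarial update runs inside this loop.
policy = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                       nn.Linear(64, ACTION_DIM))
opt_pi = torch.optim.Adam(policy.parameters(), lr=1e-3)
for _ in range(200):
    s = torch.randn(BATCH, STATE_DIM)
    dist = torch.distributions.Categorical(logits=policy(s))
    a_idx = dist.sample()
    a = F.one_hot(a_idx, ACTION_DIM).float()
    with torch.no_grad():
        r = torch.sigmoid(reward_model(torch.cat([s, a], dim=-1))).squeeze(-1)
    loss_pi = -(dist.log_prob(a_idx) * r).mean()
    opt_pi.zero_grad(); loss_pi.backward(); opt_pi.step()
```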