Causal-aware Safe Policy Improvement for Task-oriented dialogue
- URL: http://arxiv.org/abs/2103.06370v1
- Date: Wed, 10 Mar 2021 22:34:28 GMT
- Title: Causal-aware Safe Policy Improvement for Task-oriented dialogue
- Authors: Govardana Sachithanandam Ramachandran, Kazuma Hashimoto, Caiming Xiong
- Abstract summary: We propose a batch RL framework for task-oriented dialogue policy learning: causal-aware safe policy improvement (CASPI).
We demonstrate the effectiveness of this framework on the dialogue-context-to-text generation and end-to-end dialogue tasks of the Multiwoz2.0 dataset.
- Score: 45.88777832381149
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recent success of reinforcement learning (RL) in solving complex tasks is most often attributed to its capacity to explore and exploit an environment in which it has been trained. Sample efficiency is usually not an issue, since cheap simulators are available to sample data on-policy. On the other hand, task-oriented dialogues are usually learnt from offline data collected using human demonstrations, and collecting diverse demonstrations and annotating them is expensive. Unfortunately, RL methods trained on off-policy data are prone to issues of bias and generalization, which are further exacerbated by stochasticity in human responses and the non-Markovian belief state of a dialogue management system. To this end, we propose a batch RL framework for task-oriented dialogue policy learning: causal-aware safe policy improvement (CASPI). This method gives guarantees on the dialogue policy's performance and also learns to shape rewards according to the intentions behind human responses, rather than just mimicking demonstration data; coupled with batch RL, this improves the overall sample efficiency of the framework. We demonstrate the effectiveness of this framework on the dialogue-context-to-text generation and end-to-end dialogue tasks of the Multiwoz2.0 dataset. The proposed method outperforms the current state of the art on both tasks. In the end-to-end case, our method trained on only 10% of the data outperforms the current state of the art on three of the four evaluation metrics.
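The abstract points to two ingredients: a reward shaped from the intent behind human responses, and a policy update that only deviates from the demonstrations when the improvement can be trusted. As a rough illustration of the second ingredient, here is a minimal, hypothetical sketch of pessimistic batch policy improvement in a toy tabular setting; the tabular setup, the lower-confidence-bound rule, and all names are assumptions made for illustration, not CASPI's actual algorithm.

```python
import numpy as np

# Toy, hypothetical sketch of safe (pessimistic) batch policy improvement with
# a learned reward. Everything here -- the tabular setup, the confidence bound,
# and the names -- is an illustrative assumption, not CASPI's implementation.

rng = np.random.default_rng(0)
n_states, n_actions = 5, 3

# Offline demonstration counts per (state, action), the behaviour policy they
# imply, and a stand-in for a reward model learned from human intent.
counts = rng.integers(1, 20, size=(n_states, n_actions))
behavior_policy = counts / counts.sum(axis=1, keepdims=True)
learned_reward = rng.normal(size=(n_states, n_actions))

# Uncertainty shrinks where the offline data provides more support.
uncertainty = 1.0 / np.sqrt(counts)


def safe_improve(behavior, reward, unc, kappa=1.0):
    """Switch to the best action under a pessimistic (lower-bound) reward only
    when that lower bound beats the behaviour policy's expected reward;
    otherwise keep imitating the demonstrations."""
    pessimistic = reward - kappa * unc
    new_policy = behavior.copy()
    for s in range(behavior.shape[0]):
        best_a = int(np.argmax(pessimistic[s]))
        baseline = float(behavior[s] @ reward[s])
        if pessimistic[s, best_a] > baseline:
            new_policy[s] = np.eye(behavior.shape[1])[best_a]
    return new_policy


improved_policy = safe_improve(behavior_policy, learned_reward, uncertainty)
print(improved_policy)
```

The design choice being illustrated is the "safe" part: the improved policy falls back to the behaviour (demonstration) policy wherever the offline data is too sparse to support a confident switch.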
Related papers
- Why Guided Dialog Policy Learning performs well? Understanding the role of adversarial learning and its alternative [0.44267358790081573]
In recent years, reinforcement learning has emerged as a promising option for dialog policy learning (DPL).
One way to estimate rewards from collected data is to train the reward estimator and dialog policy simultaneously using adversarial learning (AL).
This paper identifies the role of AL in DPL through detailed analyses of the objective functions of dialog policy and reward estimator.
We propose a method that eliminates AL from reward estimation and DPL while retaining its advantages.
arXiv Detail & Related papers (2023-07-13T12:29:29Z)
- FCC: Fusing Conversation History and Candidate Provenance for Contextual Response Ranking in Dialogue Systems [53.89014188309486]
We present a flexible neural framework that can integrate contextual information from multiple channels.
We evaluate our model on the MSDialog dataset widely used for evaluating conversational response ranking tasks.
arXiv Detail & Related papers (2023-03-31T23:58:28Z)
- Weakly Supervised Data Augmentation Through Prompting for Dialogue Understanding [103.94325597273316]
We present a novel approach that iterates on augmentation quality by applying weakly-supervised filters.
We evaluate our methods on the emotion and act classification tasks in DailyDialog and the intent classification task in Facebook Multilingual Task-Oriented Dialogue.
For DailyDialog specifically, using 10% of the ground-truth data we outperform the current state-of-the-art model, which uses 100% of the data.
arXiv Detail & Related papers (2022-10-25T17:01:30Z)
- CHAI: A CHatbot AI for Task-Oriented Dialogue with Offline Reinforcement Learning [85.3987745097806]
Offline reinforcement learning can be used to train dialogue agents entirely using static datasets collected from human speakers.
Experiments show that recently developed offline RL methods can be combined with language models to yield realistic dialogue agents.
arXiv Detail & Related papers (2022-04-18T17:43:21Z)
- What Does The User Want? Information Gain for Hierarchical Dialogue Policy Optimisation [3.1433893853959605]
Dialogue policy optimisation via reinforcement learning (RL) is susceptible to sample inefficiency and instability.
We propose the usage of an intrinsic reward based on information gain to address this issue.
Our algorithm, which we call FeudalGain, achieves state-of-the-art results in most environments of the PyDial framework.
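The information-gain idea above lends itself to a short illustration. Below is a minimal, hypothetical sketch of such an intrinsic reward, computed as the drop in entropy of a discrete belief over user goals between turns; the belief representation and scaling factor are assumptions for illustration, not FeudalGain's actual reward.

```python
import math

# Hypothetical sketch of an information-gain intrinsic reward in the spirit of
# FeudalGain: reward the policy for reducing uncertainty (entropy) over the
# user's goal between dialogue turns. The belief representation and the scale
# factor are illustrative assumptions, not the paper's implementation.

def entropy(belief):
    """Shannon entropy (in bits) of a discrete belief over user goals."""
    return -sum(p * math.log2(p) for p in belief if p > 0)

def information_gain_reward(belief_before, belief_after, scale=1.0):
    """Intrinsic reward = scaled drop in entropy after the system's turn."""
    return scale * (entropy(belief_before) - entropy(belief_after))

# Example: a clarifying question sharpens the belief over four candidate goals
# from near-uniform to fairly peaked, yielding a positive intrinsic reward.
before = [0.25, 0.25, 0.25, 0.25]
after = [0.70, 0.10, 0.10, 0.10]
print(round(information_gain_reward(before, after), 3))
```

A clarifying system turn that sharpens the belief therefore earns a positive intrinsic reward even before the task itself succeeds.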
arXiv Detail & Related papers (2021-09-15T07:21:26Z)
- Data-Efficient Methods for Dialogue Systems [4.061135251278187]
Conversational User Interfaces (CUIs) have become ubiquitous in everyday life through consumer-focused products like Siri and Alexa.
Deep learning underlies many recent breakthroughs in dialogue systems but requires very large amounts of training data, often annotated by experts.
In this thesis, we introduce a series of methods for training robust dialogue systems from minimal data.
arXiv Detail & Related papers (2020-12-05T02:51:09Z)
- Learning Dialog Policies from Weak Demonstrations [32.149932955715705]
Building upon Deep Q-learning from Demonstrations (DQfD), we leverage dialog data to guide the agent to successfully respond to a user's requests.
We make progressively fewer assumptions about the data needed, using labeled, reduced-labeled, and even unlabeled data.
Experiments in a challenging multi-domain dialog system framework validate our approaches, achieving high success rates even when trained on out-of-domain data.
arXiv Detail & Related papers (2020-04-23T10:22:16Z)
- Guided Dialog Policy Learning without Adversarial Learning in the Loop [103.20723982440788]
A number of adversarial learning methods have been proposed to learn the reward function together with the dialogue policy.
We propose to decompose the adversarial training into two steps.
First, we train the discriminator with an auxiliary dialogue generator and then incorporate a derived reward model into a common RL method to guide the dialogue policy learning.
arXiv Detail & Related papers (2020-04-07T11:03:17Z)
- Improving Multi-Turn Response Selection Models with Complementary Last-Utterance Selection by Instance Weighting [84.9716460244444]
We consider utilizing the underlying correlation in the data resource itself to derive different kinds of supervision signals.
We conduct extensive experiments on two public datasets and obtain significant improvements on both.
arXiv Detail & Related papers (2020-02-18T06:29:01Z)