Semi-Supervised Dialogue Policy Learning via Stochastic Reward
Estimation
- URL: http://arxiv.org/abs/2005.04379v1
- Date: Sat, 9 May 2020 06:28:44 GMT
- Title: Semi-Supervised Dialogue Policy Learning via Stochastic Reward
Estimation
- Authors: Xinting Huang, Jianzhong Qi, Yu Sun, Rui Zhang
- Abstract summary: Reward learning has been introduced to learn from state-action pairs of an optimal policy and provide turn-by-turn rewards.
This approach requires complete state-action annotations of human-to-human dialogues.
We propose a novel reward learning approach for semi-supervised policy learning.
- Score: 33.688270031454095
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In task-oriented dialogue systems, dialogue policy optimization often receives feedback only upon task completion. This is insufficient for training intermediate
dialogue turns since supervision signals (or rewards) are only provided at the
end of dialogues. To address this issue, reward learning has been introduced to
learn from state-action pairs of an optimal policy to provide turn-by-turn
rewards. This approach requires complete state-action annotations of
human-to-human dialogues (i.e., expert demonstrations), which is labor
intensive. To overcome this limitation, we propose a novel reward learning
approach for semi-supervised policy learning. The proposed approach learns a
dynamics model as the reward function which models dialogue progress (i.e.,
state-action sequences) based on expert demonstrations, either with or without
annotations. The dynamics model computes rewards by predicting whether the
dialogue progress is consistent with expert demonstrations. We further propose
to learn action embeddings for a better generalization of the reward function.
The proposed approach outperforms competitive policy learning baselines on
MultiWOZ, a benchmark multi-domain dataset.
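To make the idea concrete, here is a minimal PyTorch sketch of a dynamics-model reward estimator in the spirit of the abstract: a recurrent model over state-action sequences is fit to expert demonstrations, learned action embeddings are used for generalization, and the per-turn reward scores how consistent the observed dialogue progress is with the learned dynamics. All module choices, dimensions, and the consistency-based score are illustrative assumptions, not the authors' implementation.
```python
# Minimal sketch of a dynamics-model reward estimator for dialogue policy
# learning. Illustrative only: module choices, dimensions, and the
# consistency-based reward below are assumptions, not the paper's method.
import torch
import torch.nn as nn


class DynamicsRewardModel(nn.Module):
    def __init__(self, state_dim: int, num_actions: int,
                 act_emb_dim: int = 32, hidden_dim: int = 128):
        super().__init__()
        # Learned action embeddings (proposed in the abstract for better
        # generalization of the reward function).
        self.action_emb = nn.Embedding(num_actions, act_emb_dim)
        # Recurrent model of dialogue progress, i.e. state-action sequences.
        self.rnn = nn.GRU(state_dim + act_emb_dim, hidden_dim, batch_first=True)
        # Predict the next dialogue state from the progress so far.
        self.next_state_head = nn.Linear(hidden_dim, state_dim)

    def forward(self, states, actions):
        """states: (B, T, state_dim) floats, actions: (B, T) int64."""
        a = self.action_emb(actions)                     # (B, T, act_emb_dim)
        h, _ = self.rnn(torch.cat([states, a], dim=-1))  # (B, T, hidden_dim)
        return self.next_state_head(h)                   # predicted next states

    def reward(self, states, actions):
        """Turn-by-turn reward: high when the observed progress matches the
        dynamics fit on expert demonstrations (assumed scoring rule)."""
        pred_next = self.forward(states, actions)[:, :-1]      # predictions for t+1
        err = ((pred_next - states[:, 1:]) ** 2).mean(dim=-1)  # (B, T-1)
        return -err


# Fitting the dynamics model on expert demonstrations by next-state prediction
# (toy random tensors stand in for annotated expert dialogues).
model = DynamicsRewardModel(state_dim=64, num_actions=100)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
states = torch.randn(8, 10, 64)
actions = torch.randint(0, 100, (8, 10))
loss = ((model(states, actions)[:, :-1] - states[:, 1:]) ** 2).mean()
opt.zero_grad(); loss.backward(); opt.step()
```
The paper's stochastic reward estimation and its use of demonstrations without annotations are omitted from this sketch; the point is only the overall pattern of deriving turn-level rewards from a dynamics model rather than from task-completion feedback.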
Related papers
- Improving Dialogue Agents by Decomposing One Global Explicit Annotation with Local Implicit Multimodal Feedback [71.55265615594669]
We describe an approach for aligning an LLM-based dialogue agent based on global (i.e., dialogue-level) rewards, while also taking into account naturally-occurring multimodal signals.
We run quantitative and qualitative human studies to evaluate the performance of our GELI approach, and find that it shows consistent improvements across various conversational metrics compared to baseline methods.
arXiv Detail & Related papers (2024-03-17T20:21:26Z)
- Plug-and-Play Policy Planner for Large Language Model Powered Dialogue Agents [121.46051697742608]
We introduce a new dialogue policy planning paradigm to strategize dialogue problems with a tunable language model plug-in named PPDPP.
Specifically, we develop a novel training framework to facilitate supervised fine-tuning over available human-annotated data.
PPDPP consistently and substantially outperforms existing approaches on three different proactive dialogue applications.
arXiv Detail & Related papers (2023-11-01T03:20:16Z)
- JoTR: A Joint Transformer and Reinforcement Learning Framework for Dialog Policy Learning [53.83063435640911]
Dialogue policy learning (DPL) is a crucial component of dialogue modelling.
We introduce a novel framework, JoTR, to generate flexible dialogue actions.
Unlike traditional methods, JoTR formulates a word-level policy that allows for more dynamic and adaptable dialogue action generation.
arXiv Detail & Related papers (2023-09-01T03:19:53Z)
- Why Guided Dialog Policy Learning performs well? Understanding the role of adversarial learning and its alternative [0.44267358790081573]
In recent years, reinforcement learning has emerged as a promising option for dialog policy learning (DPL).
One way to estimate rewards from collected data is to train the reward estimator and dialog policy simultaneously using adversarial learning (AL).
This paper identifies the role of AL in DPL through detailed analyses of the objective functions of dialog policy and reward estimator.
We propose a method that eliminates AL from reward estimation and DPL while retaining its advantages.
arXiv Detail & Related papers (2023-07-13T12:29:29Z)
- Taming Continuous Posteriors for Latent Variational Dialogue Policies [1.0312968200748118]
We revisit Gaussian variational posteriors for latent-action RL and show that they can yield even better performance than categoricals.
We achieve this by simplifying the training procedure and proposing ways to regularize the latent dialogue policy.
arXiv Detail & Related papers (2022-05-16T12:50:32Z)
- Integrating Pretrained Language Model for Dialogue Policy Learning [23.453017883791237]
Reinforcement Learning (RL) has shown potential for training a dialogue policy agent to maximize the accumulated rewards given by users.
We decompose the adversarial training into two steps: 1) we integrate a pre-trained language model as a discriminator to judge whether the current system action is good enough given the last user action.
The experimental results demonstrate that our method significantly improves the completion rate (by 4.4%) and success rate (by 8.0%) of the dialogue system.
arXiv Detail & Related papers (2021-11-02T07:16:03Z)
- WeaSuL: Weakly Supervised Dialogue Policy Learning: Reward Estimation for Multi-turn Dialogue [17.663449579168297]
We simulate a dialogue between an agent and a user (modelled similarly to the agent, with a supervised learning objective) interacting with each other.
The agent uses dynamic blocking to generate ranked diverse responses and exploration-exploitation to select among the Top-K responses.
Empirical studies on two benchmarks indicate that our model significantly outperforms baselines in response quality and leads to successful conversations.
arXiv Detail & Related papers (2021-08-01T08:00:45Z)
- Rethinking Supervised Learning and Reinforcement Learning in Task-Oriented Dialogue Systems [58.724629408229205]
We demonstrate how traditional supervised learning and a simulator-free adversarial learning method can be used to achieve performance comparable to state-of-the-art RL-based methods.
Our main goal is not to beat reinforcement learning with supervised learning, but to demonstrate the value of rethinking the role of reinforcement learning and supervised learning in optimizing task-oriented dialogue systems.
arXiv Detail & Related papers (2020-09-21T12:04:18Z)
- Learning an Effective Context-Response Matching Model with Self-Supervised Tasks for Retrieval-based Dialogues [88.73739515457116]
We introduce four self-supervised tasks including next session prediction, utterance restoration, incoherence detection and consistency discrimination.
We jointly train the PLM-based response selection model with these auxiliary tasks in a multi-task manner.
Experiment results indicate that the proposed auxiliary self-supervised tasks bring significant improvement for multi-turn response selection.
arXiv Detail & Related papers (2020-09-14T08:44:46Z)
- Guided Dialog Policy Learning without Adversarial Learning in the Loop [103.20723982440788]
A number of adversarial learning methods have been proposed to learn the reward function together with the dialogue policy.
We propose to decompose the adversarial training into two steps.
First, we train the discriminator with an auxiliary dialogue generator and then incorporate a derived reward model into a common RL method to guide the dialogue policy learning.
arXiv Detail & Related papers (2020-04-07T11:03:17Z)
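The two-step pattern described in the last two entries (train a discriminator offline, then use it as a fixed reward model inside a standard RL update) can be sketched as follows. This is a hedged illustration under simplifying assumptions: a small MLP stands in for the discriminator (the papers use an auxiliary dialogue generator or a pre-trained language model), and a REINFORCE-style update stands in for the RL method.
```python
# Hedged sketch of the two-step decomposition: (1) pre-train a discriminator
# to separate expert state-action pairs from generated ones, (2) freeze it and
# use its score as the turn-level reward in an ordinary RL update.
# Shapes, the MLP discriminator, and the REINFORCE update are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

state_dim, num_actions = 64, 100

def disc_input(state, action):
    # Concatenate the state with a one-hot encoding of the action.
    return torch.cat([state, F.one_hot(action, num_actions).float()], dim=-1)

# Step 1: train the discriminator offline (expert vs. generated pairs).
disc = nn.Sequential(nn.Linear(state_dim + num_actions, 128), nn.ReLU(),
                     nn.Linear(128, 1))
disc_opt = torch.optim.Adam(disc.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()
expert_s = torch.randn(32, state_dim)                  # toy stand-ins for
expert_a = torch.randint(0, num_actions, (32,))        # expert demonstrations
fake_s = torch.randn(32, state_dim)                    # toy stand-ins for the
fake_a = torch.randint(0, num_actions, (32,))          # auxiliary generator
loss_d = bce(disc(disc_input(expert_s, expert_a)), torch.ones(32, 1)) + \
         bce(disc(disc_input(fake_s, fake_a)), torch.zeros(32, 1))
disc_opt.zero_grad(); loss_d.backward(); disc_opt.step()

# Step 2: the frozen discriminator supplies rewards for policy learning.
policy = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                       nn.Linear(128, num_actions))
pol_opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
s = torch.randn(16, state_dim)                         # states from a rollout
dist = torch.distributions.Categorical(logits=policy(s))
a = dist.sample()
with torch.no_grad():                                  # reward model stays fixed
    r = torch.sigmoid(disc(disc_input(s, a))).squeeze(-1)
loss_pi = -(dist.log_prob(a) * r).mean()               # REINFORCE-style update
pol_opt.zero_grad(); loss_pi.backward(); pol_opt.step()
```
Because the discriminator is trained before policy optimization and then held fixed, no adversarial loop is needed during dialogue policy learning, which is the point made in the two entries above.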