Imperfect also Deserves Reward: Multi-Level and Sequential Reward
Modeling for Better Dialog Management
- URL: http://arxiv.org/abs/2104.04748v1
- Date: Sat, 10 Apr 2021 12:20:23 GMT
- Title: Imperfect also Deserves Reward: Multi-Level and Sequential Reward
Modeling for Better Dialog Management
- Authors: Zhengxu Hou, Bang Liu, Ruihui Zhao, Zijing Ou, Yafei Liu, Xi Chen,
Yefeng Zheng
- Abstract summary: For task-oriented dialog systems, training a Reinforcement Learning based Dialog Management module suffers from low sample efficiency and slow convergence speed due to the sparse rewards in RL.
We propose a multi-level reward modeling approach that factorizes a reward into a three-level hierarchy: domain, act, and slot.
- Score: 17.168214640974337
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: For task-oriented dialog systems, training a Reinforcement Learning (RL)
based Dialog Management module suffers from low sample efficiency and slow
convergence speed due to the sparse rewards in RL. To solve this problem, many
strategies have been proposed to provide appropriate rewards during RL training, but
their rewards lack interpretability and cannot accurately estimate the
distribution of state-action pairs in real dialogs. In this paper, we propose a
multi-level reward modeling approach that factorizes a reward into a
three-level hierarchy: domain, act, and slot. Based on inverse adversarial
reinforcement learning, our designed reward model can provide more accurate and
explainable reward signals for state-action pairs. Extensive evaluations show
that our approach can be applied to a wide range of reinforcement
learning-based dialog systems and significantly improves both the performance
and the speed of convergence.
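As a reading aid, the following is a minimal PyTorch-style sketch of how a reward could be factorized into domain-, act-, and slot-level components, each scored by its own adversarially trained discriminator so that a partially correct action still earns partial credit. This is an illustrative assumption, not the authors' released implementation; the class names, network sizes, and the simple log-score combination are placeholders.

```python
# Minimal sketch (not the paper's code): a multi-level reward model that scores
# a dialog state-action pair at the domain, act, and slot levels with separate
# discriminators, in the spirit of adversarially learned rewards.
# All names, dimensions, and the log-score combination are illustrative.
import torch
import torch.nn as nn


class LevelDiscriminator(nn.Module):
    """Scores how 'expert-like' a (state, action-part) pair is at one level."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        logit = self.net(torch.cat([state, action], dim=-1))
        return torch.sigmoid(logit)  # probability the pair looks like expert data


class MultiLevelReward(nn.Module):
    """Combines domain-, act-, and slot-level scores into one dense reward."""

    def __init__(self, state_dim: int, domain_dim: int, act_dim: int, slot_dim: int):
        super().__init__()
        self.domain = LevelDiscriminator(state_dim, domain_dim)
        self.act = LevelDiscriminator(state_dim, act_dim)
        self.slot = LevelDiscriminator(state_dim, slot_dim)

    def forward(self, state, domain_vec, act_vec, slot_vec):
        d = self.domain(state, domain_vec)
        a = self.act(state, act_vec)
        s = self.slot(state, slot_vec)
        # An imperfect action still earns partial credit: each level adds its
        # own log-score instead of contributing to a single 0/1 success signal.
        reward = torch.log(d + 1e-8) + torch.log(a + 1e-8) + torch.log(s + 1e-8)
        return reward, (d, a, s)  # per-level scores make the reward explainable
```

In an adversarial (inverse-RL-style) training loop, each discriminator would be pushed to separate human state-action pairs from those generated by the current dialogue policy, and the combined log-score would serve as a dense, per-turn reward for the RL dialog manager.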
Related papers
- Rewarding What Matters: Step-by-Step Reinforcement Learning for Task-Oriented Dialogue [17.47550065558479]
Reinforcement learning (RL) is a powerful approach to enhance task-oriented dialogue (TOD) systems.
Existing RL methods tend to focus mainly on generation tasks while neglecting dialogue state tracking (DST) for understanding.
We introduce step-by-step rewards throughout token generation to extend RL to both understanding and generation tasks.
arXiv Detail & Related papers (2024-06-20T16:15:40Z)
- Improving Dialogue Agents by Decomposing One Global Explicit Annotation with Local Implicit Multimodal Feedback [71.55265615594669]
We describe an approach for aligning an LLM-based dialogue agent based on global (i.e., dialogue-level) rewards, while also taking into account naturally-occurring multimodal signals.
We run quantitative and qualitative human studies to evaluate the performance of our GELI approach, and find that it shows consistent improvements across various conversational metrics compared to baseline methods.
arXiv Detail & Related papers (2024-03-17T20:21:26Z)
- Enhancing End-to-End Multi-Task Dialogue Systems: A Study on Intrinsic Motivation Reinforcement Learning Algorithms for Improved Training and Adaptability [1.0985060632689174]
The goal of this study is to investigate intrinsic motivation reinforcement learning algorithms.
We adapt random network distillation and curiosity-driven reinforcement learning techniques to measure the frequency of state visits.
Experimental results on MultiWOZ, a heterogeneous dataset, show that intrinsic motivation-based dialogue systems outperform policies that depend on extrinsic incentives.
arXiv Detail & Related papers (2024-01-31T18:03:39Z)
- Deep Reinforcement Learning from Hierarchical Preference Design [99.46415116087259]
This paper shows that, by exploiting certain structures, one can ease the reward design process.
We propose a hierarchical reward modeling framework, HERON, for scenarios where (I) the feedback signals naturally present a hierarchy, or (II) the reward is sparse but less important surrogate feedback is available to help policy learning.
arXiv Detail & Related papers (2023-09-06T00:44:29Z)
- Provable Reward-Agnostic Preference-Based Reinforcement Learning [61.39541986848391]
Preference-based Reinforcement Learning (PbRL) is a paradigm in which an RL agent learns to optimize a task using pair-wise preference-based feedback over trajectories.
We propose a theoretical reward-agnostic PbRL framework where exploratory trajectories that enable accurate learning of hidden reward functions are acquired.
arXiv Detail & Related papers (2023-05-29T15:00:09Z)
- Taming Continuous Posteriors for Latent Variational Dialogue Policies [1.0312968200748118]
We revisit Gaussian variational posteriors for latent-action RL and show that they can yield even better performance than categoricals.
We achieve this by simplifying the training procedure, and we propose ways to regularize the latent dialogue policy.
arXiv Detail & Related papers (2022-05-16T12:50:32Z)
- Integrating Pretrained Language Model for Dialogue Policy Learning [23.453017883791237]
Reinforcement Learning (RL) has shown potential for training a dialogue policy agent to maximize the accumulated rewards given by users.
We decompose the adversarial training into two steps: first, we integrate a pre-trained language model as a discriminator to judge whether the current system action is good enough for the last user action.
The experimental result demonstrates that our method significantly improves the complete rate (4.4%) and success rate (8.0%) of the dialogue system.
arXiv Detail & Related papers (2021-11-02T07:16:03Z)
- Rethinking Supervised Learning and Reinforcement Learning in Task-Oriented Dialogue Systems [58.724629408229205]
We demonstrate how traditional supervised learning and a simulator-free adversarial learning method can be used to achieve performance comparable to state-of-the-art RL-based methods.
Our main goal is not to beat reinforcement learning with supervised learning, but to demonstrate the value of rethinking the role of reinforcement learning and supervised learning in optimizing task-oriented dialogue systems.
arXiv Detail & Related papers (2020-09-21T12:04:18Z)
- Modelling Hierarchical Structure between Dialogue Policy and Natural Language Generator with Option Framework for Task-oriented Dialogue System [49.39150449455407]
HDNO is an option framework that uses latent dialogue acts to avoid designing specific dialogue act representations.
We test HDNO on MultiWOZ 2.0 and MultiWOZ 2.1, multi-domain dialogue datasets, in comparison with a word-level E2E model trained with RL, LaRL, and HDSA.
arXiv Detail & Related papers (2020-06-11T20:55:28Z)
- Semi-Supervised Dialogue Policy Learning via Stochastic Reward Estimation [33.688270031454095]
Reward learning has been introduced to learn from state-action pairs of an optimal policy and provide turn-by-turn rewards, but this approach requires complete state-action annotations of human-to-human dialogues.
We propose a novel reward learning approach for semi-supervised policy learning.
arXiv Detail & Related papers (2020-05-09T06:28:44Z)
- Guided Dialog Policy Learning without Adversarial Learning in the Loop [103.20723982440788]
A number of adversarial learning methods have been proposed to learn the reward function together with the dialogue policy.
We propose to decompose the adversarial training into two steps.
First, we train the discriminator with an auxiliary dialogue generator and then incorporate a derived reward model into a common RL method to guide the dialogue policy learning.
arXiv Detail & Related papers (2020-04-07T11:03:17Z)
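To make the two-step recipe in the entry directly above concrete, here is a small sketch with the adversarial step taken out of the RL loop. It is an assumption-laden illustration rather than that paper's implementation: `expert_pairs`, `generator`, `env`, `policy`, and `update_fn` are hypothetical placeholders standing in for the expert data, the auxiliary dialogue generator, the user-simulator environment, the dialogue policy, and whatever policy-gradient update is used.

```python
# Sketch only (hypothetical names throughout): step 1 trains a reward
# discriminator offline against an auxiliary dialogue generator; step 2 reuses
# the frozen discriminator as a dense reward inside an ordinary RL loop.
import torch


def train_discriminator(discriminator, expert_pairs, generator, steps: int = 1000):
    """Step 1: learn to separate human state-action pairs from generated ones."""
    opt = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
    bce = torch.nn.BCELoss()
    for _ in range(steps):
        s_e, a_e = expert_pairs.sample()   # human (expert) state-action batch
        s_g, a_g = generator.sample()      # batch from the auxiliary generator
        scores = torch.cat([discriminator(s_e, a_e), discriminator(s_g, a_g)])
        labels = torch.cat([torch.ones(len(s_e), 1), torch.zeros(len(s_g), 1)])
        loss = bce(scores, labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return discriminator


def train_policy(policy, discriminator, env, update_fn, episodes: int = 500):
    """Step 2: use the frozen discriminator as a turn-level reward for RL."""
    for _ in range(episodes):
        state, done, trajectory = env.reset(), False, []
        while not done:
            action = policy.act(state)
            next_state, done = env.step(action)
            with torch.no_grad():
                reward = torch.log(discriminator(state, action) + 1e-8)
            trajectory.append((state, action, reward))
            state = next_state
        update_fn(policy, trajectory)      # placeholder for the actual RL update
    return policy
```

The design point this illustrates is that the discriminator is trained once, outside the policy-learning loop, so the RL stage sees a fixed, dense reward signal instead of an adversary that shifts at every step.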
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.