Is Your Goal-Oriented Dialog Model Performing Really Well? Empirical
Analysis of System-wise Evaluation
- URL: http://arxiv.org/abs/2005.07362v1
- Date: Fri, 15 May 2020 05:20:06 GMT
- Title: Is Your Goal-Oriented Dialog Model Performing Really Well? Empirical
Analysis of System-wise Evaluation
- Authors: Ryuichi Takanobu, Qi Zhu, Jinchao Li, Baolin Peng, Jianfeng Gao,
Minlie Huang
- Abstract summary: This paper presents an empirical analysis of different types of dialog systems composed of different modules in different settings.
Our results show that a pipeline dialog system trained using fine-grained supervision signals at different component levels often obtains better performance than the systems that use joint or end-to-end models trained on coarse-grained labels.
- Score: 114.48767388174218
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There is a growing interest in developing goal-oriented dialog systems which
serve users in accomplishing complex tasks through multi-turn conversations.
Although many methods are devised to evaluate and improve the performance of
individual dialog components, there is a lack of comprehensive empirical study
on how different components contribute to the overall performance of a dialog
system. In this paper, we perform a system-wise evaluation and present an
empirical analysis on different types of dialog systems which are composed of
different modules in different settings. Our results show that (1) a pipeline
dialog system trained using fine-grained supervision signals at different
component levels often obtains better performance than the systems that use
joint or end-to-end models trained on coarse-grained labels, (2)
component-wise, single-turn evaluation results are not always consistent with
the overall performance of a dialog system, and (3) despite the discrepancy
between simulators and human users, simulated evaluation is still a valid
alternative to the costly human evaluation especially in the early stage of
development.
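To make finding (1) concrete, here is a minimal sketch, in Python, of the two architectures the paper compares: a modular pipeline whose components are each trained with fine-grained supervision, versus an end-to-end model trained only on coarse-grained utterance-response pairs. All class and method names are illustrative placeholders, not code from the paper or from any particular toolkit.

```python
# Illustrative sketch only: component interfaces are hypothetical
# placeholders, not the paper's implementation.

class PipelineDialogSystem:
    """Modular pipeline; each component is trained separately with
    fine-grained labels (intents, belief states, dialog acts)."""

    def __init__(self, nlu, dst, policy, nlg):
        self.nlu = nlu        # utterance -> user dialog acts
        self.dst = dst        # user acts -> belief state
        self.policy = policy  # belief state -> system dialog acts
        self.nlg = nlg        # system acts -> response text

    def respond(self, user_utterance):
        user_acts = self.nlu.parse(user_utterance)
        belief_state = self.dst.update(user_acts)
        system_acts = self.policy.select_action(belief_state)
        return self.nlg.generate(system_acts)


class EndToEndDialogSystem:
    """Single model trained on coarse-grained utterance-response
    pairs, with no intermediate supervision signals."""

    def __init__(self, model):
        self.model = model
        self.history = []

    def respond(self, user_utterance):
        self.history.append(user_utterance)
        response = self.model.generate(self.history)
        self.history.append(response)
        return response
```

The pipeline exposes intermediate predictions (acts, belief states) that can each receive their own supervision signal, which is exactly the fine-grained training the paper finds advantageous.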
Related papers
- ComperDial: Commonsense Persona-grounded Dialogue Dataset and Benchmark [26.100299485985197]
ComperDial consists of human-scored responses for 10,395 dialogue turns in 1,485 conversations collected from 99 dialogue agents.
In addition to single-turn response scores, ComperDial also contains dialogue-level human-annotated scores.
Building off ComperDial, we devise a new automatic evaluation metric to measure the general similarity of model-generated dialogues to human conversations.
arXiv Detail & Related papers (2024-06-17T05:51:04Z)
- VDialogUE: A Unified Evaluation Benchmark for Visually-grounded Dialogue [70.64560638766018]
We propose VDialogUE, a Visually-grounded Dialogue benchmark for Unified Evaluation.
It defines five core multi-modal dialogue tasks and covers six datasets.
We also present a straightforward yet efficient baseline model, named VISIT (VISually-grounded dIalog Transformer), to promote the advancement of the field.
arXiv Detail & Related papers (2023-09-14T02:09:20Z)
- Don't Forget Your ABC's: Evaluating the State-of-the-Art in Chat-Oriented Dialogue Systems [12.914512702731528]
This paper presents a novel human evaluation method to estimate the rates of many dialogue system behaviors.
Our method is used to evaluate four state-of-the-art open-domain dialogue systems and is compared with existing approaches.
arXiv Detail & Related papers (2022-12-18T22:07:55Z)
- Is MultiWOZ a Solved Task? An Interactive TOD Evaluation Framework with User Simulator [37.590563896382456]
We propose an interactive evaluation framework for Task-Oriented Dialogue (TOD) systems.
We first build a goal-oriented user simulator based on pre-trained models and then use the user simulator to interact with the dialogue system to generate dialogues.
Experimental results show that RL-based TOD systems trained by our proposed user simulator can achieve nearly 98% inform and success rates.
arXiv Detail & Related papers (2022-10-26T07:41:32Z)
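Both the main paper's finding (3) and the framework above rest on the same protocol: let a user simulator converse with the system and score the resulting dialogs. A minimal sketch of that loop, with hypothetical simulator and system interfaces, could look like this:

```python
# Hypothetical sketch of simulator-in-the-loop evaluation; the
# simulator/system interfaces are assumptions, not a real API.

def evaluate_with_simulator(system, simulator, n_dialogs=1000, max_turns=20):
    successes = informs = 0
    for _ in range(n_dialogs):
        goal = simulator.sample_goal()          # e.g. "book a cheap hotel"
        user_utt = simulator.first_utterance(goal)
        for _ in range(max_turns):
            sys_utt = system.respond(user_utt)
            user_utt, done = simulator.step(sys_utt)
            if done:                            # goal met or user gave up
                break
        successes += simulator.goal_completed(goal)   # task success
        informs += simulator.requests_answered(goal)  # requested slots filled
        system.reset()
        simulator.reset()
    return {"success_rate": successes / n_dialogs,
            "inform_rate": informs / n_dialogs}
```

Because the simulator is cheap to run at scale, this loop is what makes evaluation feasible in the early stages of development, before costly human trials.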
- Dialogue Evaluation with Offline Reinforcement Learning [2.580163308334609]
Task-oriented dialogue systems aim to fulfill user goals through natural language interactions.
They are ideally evaluated with human users, which is infeasible at every iteration of the development phase.
We propose the use of offline reinforcement learning for dialogue evaluation based on a static corpus.
arXiv Detail & Related papers (2022-09-02T08:32:52Z)
- FlowEval: A Consensus-Based Dialogue Evaluation Framework Using Segment Act Flows [63.116280145770006]
We propose segment act, an extension of dialog act from utterance level to segment level, and crowdsource a large-scale dataset for it.
To utilize segment act flows, sequences of segment acts, for evaluation, we develop the first consensus-based dialogue evaluation framework, FlowEval.
arXiv Detail & Related papers (2022-02-14T11:37:20Z)
- DynaEval: Unifying Turn and Dialogue Level Evaluation [60.66883575106898]
We propose DynaEval, a unified automatic evaluation framework.
It not only performs turn-level evaluation but also holistically considers the quality of the entire dialogue.
Experiments show that DynaEval significantly outperforms the state-of-the-art dialogue coherence model.
arXiv Detail & Related papers (2021-06-02T12:23:18Z)
- Assessing Dialogue Systems with Distribution Distances [48.61159795472962]
We propose to measure the performance of a dialogue system by computing the distribution-wise distance between its generated conversations and real-world conversations.
Experiments on several dialogue corpora show that our proposed metrics correlate better with human judgments than existing metrics.
arXiv Detail & Related papers (2021-05-06T10:30:13Z)
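The summary above does not say which distribution-wise distances the authors use; as one concrete possibility, the sketch below computes a Fréchet-style distance between Gaussian fits of conversation embeddings. The embedding inputs are placeholders for the output of any sentence or dialogue encoder.

```python
import numpy as np
from scipy.linalg import sqrtm

# One possible distribution-wise distance (Frechet-style), shown for
# illustration; the paper's actual metrics may differ. Inputs are
# (n, d) arrays of conversation embeddings from any encoder.

def frechet_distance(real_embs, gen_embs):
    mu_r, mu_g = real_embs.mean(axis=0), gen_embs.mean(axis=0)
    cov_r = np.cov(real_embs, rowvar=False)
    cov_g = np.cov(gen_embs, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):   # sqrtm can return tiny imaginary noise
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```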
- Attention over Parameters for Dialogue Systems [69.48852519856331]
We learn a dialogue system that independently parameterizes different dialogue skills, and learns to select and combine each of them through Attention over Parameters (AoP).
The experimental results show that this approach achieves competitive performance on a combined dataset of MultiWOZ, In-Car Assistant, and Persona-Chat.
arXiv Detail & Related papers (2020-01-07T03:10:42Z)
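A schematic reading of the AoP idea above: keep one parameter set per dialogue skill and mix them with attention weights computed from the dialogue context. The shapes and names below are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Schematic AoP sketch: attention is computed over per-skill parameter
# sets rather than over tokens. Shapes and names are assumptions.

class AttentionOverParameters(nn.Module):
    def __init__(self, n_skills, d_model, d_out):
        super().__init__()
        # one weight matrix per skill (stand-in for a full decoder)
        self.skill_weights = nn.Parameter(torch.randn(n_skills, d_out, d_model))
        self.skill_keys = nn.Parameter(torch.randn(n_skills, d_model))

    def forward(self, context):                 # context: (batch, d_model)
        scores = context @ self.skill_keys.t()  # (batch, n_skills)
        alpha = torch.softmax(scores, dim=-1)   # attention over skills
        # convex combination of the per-skill parameter sets
        mixed = torch.einsum("bk,kom->bom", alpha, self.skill_weights)
        return torch.einsum("bom,bm->bo", mixed, context)
```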