Evaluating Task-oriented Dialogue Systems: A Systematic Review of Measures, Constructs and their Operationalisations
- URL: http://arxiv.org/abs/2312.13871v2
- Date: Mon, 8 Apr 2024 07:36:48 GMT
- Title: Evaluating Task-oriented Dialogue Systems: A Systematic Review of Measures, Constructs and their Operationalisations
- Authors: Anouck Braggaar, Christine Liebrecht, Emiel van Miltenburg, Emiel Krahmer, et al.
- Abstract summary: This review provides an overview of the constructs and metrics used in previous work.
It also discusses challenges in the context of dialogue system evaluation.
It develops a research agenda for the future of dialogue system evaluation.
- Score: 2.6122764214161363
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This review gives an extensive overview of evaluation methods for task-oriented dialogue systems, paying special attention to practical applications of dialogue systems, for example for customer service. The review (1) provides an overview of the constructs and metrics used in previous work, (2) discusses challenges in the context of dialogue system evaluation and (3) develops a research agenda for the future of dialogue system evaluation. We conducted a systematic review of four databases (ACL, ACM, IEEE and Web of Science), which after screening resulted in 122 studies. Those studies were carefully analysed for the constructs and methods they proposed for evaluation. We found wide variation in both constructs and methods; in particular, the operationalisation is often not clearly reported. Newer developments concerning large language models are discussed in two contexts: to power dialogue systems and to use in the evaluation process. We hope that future work will take a more critical approach to the operationalisation and specification of the constructs used. To work towards this aim, this review ends with recommendations for evaluation and suggestions for outstanding questions.
Related papers
- Manifesto from Dagstuhl Perspectives Workshop 24352 -- Conversational Agents: A Framework for Evaluation (CAFE) [59.64777874324281]
We defined the Conversational Agents Framework for Evaluation (CAFE) for the evaluation of CONIAC systems. CAFE consists of six major components: 1) goals of the system's stakeholders, 2) user tasks to be studied in the evaluation, 3) aspects of the users carrying out the tasks, 4) evaluation criteria to be considered, 5) evaluation methodology to be applied, and 6) measures for the quantitative criteria chosen.
arXiv Detail & Related papers (2025-06-08T16:25:35Z)
- Evaluating LLM-based Agents for Multi-Turn Conversations: A Survey [64.08485471150486]
This survey examines evaluation methods for large language model (LLM)-based agents in multi-turn conversational settings.
We systematically reviewed nearly 250 scholarly sources, capturing the state of the art from various venues of publication.
arXiv Detail & Related papers (2025-03-28T14:08:40Z)
- Large Language Models for Outpatient Referral: Problem Definition, Benchmarking and Challenges [34.10494503049667]
Large language models (LLMs) are increasingly applied to outpatient referral tasks across healthcare systems. There is a lack of standardized evaluation criteria to assess their effectiveness. We propose a comprehensive evaluation framework specifically designed for such systems.
arXiv Detail & Related papers (2025-03-11T11:05:42Z)
- Large Language Models as Evaluators for Conversational Recommender Systems: Benchmarking System Performance from a User-Centric Perspective [38.940283784200005]
This study proposes an automated LLM-based CRS evaluation framework.
It builds upon existing research in human-computer interaction and psychology.
We use this framework to evaluate four different conversational recommender systems.
arXiv Detail & Related papers (2025-01-16T12:06:56Z)
- FCC: Fusing Conversation History and Candidate Provenance for Contextual Response Ranking in Dialogue Systems [53.89014188309486]
We present a flexible neural framework that can integrate contextual information from multiple channels.
We evaluate our model on the MSDialog dataset widely used for evaluating conversational response ranking tasks.
arXiv Detail & Related papers (2023-03-31T23:58:28Z)
- Don't Forget Your ABC's: Evaluating the State-of-the-Art in Chat-Oriented Dialogue Systems [12.914512702731528]
This paper presents a novel human evaluation method to estimate the rates of many dialogue system behaviors.
Our method is used to evaluate four state-of-the-art open-domain dialogue systems and compared with existing approaches.
arXiv Detail & Related papers (2022-12-18T22:07:55Z)
- User Satisfaction Estimation with Sequential Dialogue Act Modeling in Goal-oriented Conversational Systems [65.88679683468143]
We propose a novel framework, USDA, which leverages the sequential dynamics of dialogue acts to predict user satisfaction.
USDA incorporates the sequential transitions of both content and act features in the dialogue to predict user satisfaction.
Experimental results on four benchmark goal-oriented dialogue datasets show that the proposed method substantially and consistently outperforms existing methods on user satisfaction estimation (USE).
arXiv Detail & Related papers (2022-02-07T02:50:07Z)
- How to Evaluate Your Dialogue Models: A Review of Approaches [2.7834038784275403]
We are the first to divide the evaluation methods into three classes: automatic evaluation, human-involved evaluation and user-simulator-based evaluation.
Existing benchmarks suitable for evaluating dialogue techniques are also discussed in detail.
arXiv Detail & Related papers (2021-08-03T08:52:33Z)
- Assessing Dialogue Systems with Distribution Distances [48.61159795472962]
We propose to measure the performance of a dialogue system by computing the distribution-wise distance between its generated conversations and real-world conversations.
Experiments on several dialogue corpora show that our proposed metrics correlate better with human judgments than existing metrics.
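To make the distribution-wise idea concrete: one standard way to operationalise such a distance is to fit a Gaussian to sentence embeddings of each conversation set and compute the Fréchet distance between the two Gaussians (the approach behind Fréchet-style metrics such as this paper's FBD). The sketch below illustrates only that final computation; the embedding arrays are assumed to come from some sentence encoder and are placeholders here.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two sets of embeddings.

    feats_a, feats_b: (n_samples, dim) arrays, e.g. sentence embeddings of
    generated vs. real conversations. Lower values mean closer distributions.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Matrix square root of the covariance product; tiny imaginary parts
    # arising from numerical error are discarded.
    covmean = linalg.sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))
```

Comparing a system's generated-conversation embeddings against real-conversation embeddings with such a metric yields a single corpus-level score, which is what enables the correlation-with-human-judgments comparison the summary describes.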
arXiv Detail & Related papers (2021-05-06T10:30:13Z)
- Evaluate On-the-job Learning Dialogue Systems and a Case Study for Natural Language Understanding [3.557633666039596]
We propose a first general methodology for evaluating on-the-job learning dialogue systems.
We describe a task-oriented dialogue system that improves its natural language component on the job through its user interactions.
arXiv Detail & Related papers (2021-02-26T16:54:16Z)
- Modelling Hierarchical Structure between Dialogue Policy and Natural Language Generator with Option Framework for Task-oriented Dialogue System [49.39150449455407]
HDNO is an option framework that models latent dialogue acts, avoiding the need for hand-crafted dialogue act representations.
We test HDNO on MultiWOZ 2.0 and MultiWOZ 2.1, multi-domain dialogue datasets, in comparison with a word-level E2E model trained with RL, LaRL and HDSA.
arXiv Detail & Related papers (2020-06-11T20:55:28Z)
- Towards Unified Dialogue System Evaluation: A Comprehensive Analysis of Current Evaluation Protocols [17.14709845342071]
Various evaluation protocols are currently in use to assess chat-oriented dialogue management systems.
This paper presents a comprehensive synthesis of both automated and human evaluation methods on dialogue systems.
arXiv Detail & Related papers (2020-06-10T23:29:05Z)
- Is Your Goal-Oriented Dialog Model Performing Really Well? Empirical Analysis of System-wise Evaluation [114.48767388174218]
This paper presents an empirical analysis on different types of dialog systems composed of different modules in different settings.
Our results show that a pipeline dialog system trained using fine-grained supervision signals at different component levels often obtains better performance than the systems that use joint or end-to-end models trained on coarse-grained labels.
arXiv Detail & Related papers (2020-05-15T05:20:06Z)
- Recent Advances and Challenges in Task-oriented Dialog System [63.82055978899631]
Task-oriented dialog systems are attracting more and more attention in academic and industrial communities.
We discuss three critical topics for task-oriented dialog systems: (1) improving data efficiency to facilitate dialog modeling in low-resource settings, (2) modeling multi-turn dynamics for dialog policy learning, and (3) integrating domain knowledge into the dialog model.
arXiv Detail & Related papers (2020-03-17T01:34:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.