Toward More Accurate and Generalizable Evaluation Metrics for
Task-Oriented Dialogs
- URL: http://arxiv.org/abs/2306.03984v2
- Date: Fri, 9 Jun 2023 01:17:39 GMT
- Title: Toward More Accurate and Generalizable Evaluation Metrics for
Task-Oriented Dialogs
- Authors: Abishek Komma, Nagesh Panyam Chandrasekarasastry, Timothy Leffel, Anuj
Goyal, Angeliki Metallinou, Spyros Matsoukas, Aram Galstyan
- Abstract summary: We introduce a new dialog-level annotation workflow called Dialog Quality Annotation (DQA).
DQA expert annotators evaluate the quality of dialogs as a whole, and also label dialogs for attributes such as goal completion and user sentiment.
We argue that having high-quality human-annotated data is an important component of evaluating interaction quality for large industrial-scale voice assistant platforms.
- Score: 19.43845920149182
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Measurement of interaction quality is a critical task for the improvement of
spoken dialog systems. Existing approaches to dialog quality estimation either
focus on evaluating the quality of individual turns, or collect dialog-level
quality measurements from end users immediately following an interaction. In
contrast to these approaches, we introduce a new dialog-level annotation
workflow called Dialog Quality Annotation (DQA). DQA expert annotators evaluate
the quality of dialogs as a whole, and also label dialogs for attributes such
as goal completion and user sentiment. In this contribution, we show that: (i)
while dialog quality cannot be completely decomposed into dialog-level
attributes, there is a strong relationship between some objective dialog
attributes and judgments of dialog quality; (ii) for the task of dialog-level
quality estimation, a supervised model trained on dialog-level annotations
outperforms methods based purely on aggregating turn-level features; and (iii)
the proposed evaluation model shows better domain generalization ability
compared to the baselines. On the basis of these results, we argue that having
high-quality human-annotated data is an important component of evaluating
interaction quality for large industrial-scale voice assistant platforms.
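To make claim (ii) concrete, the sketch below contrasts a baseline that simply aggregates turn-level quality scores with a supervised model trained directly on dialog-level labels. This is a minimal, hypothetical illustration using scikit-learn and invented feature names (goal completion, user sentiment, turn count, reprompt rate); it is not the authors' DQA model.

```python
# Minimal sketch (not the authors' implementation): contrast a turn-level
# aggregation baseline with a supervised dialog-level quality model.
# Feature names, values, and labels below are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

def aggregate_turn_scores(turn_scores: np.ndarray) -> float:
    """Baseline: collapse per-turn quality scores into a single dialog
    score by averaging, with no dialog-level supervision."""
    return float(turn_scores.mean())

# Hypothetical dialog-level features: [goal_completed, user_sentiment,
# num_turns, reprompt_rate].
X_train = np.array([
    [1,  0.8,  5, 0.0],
    [0, -0.4, 12, 0.3],
    [1,  0.2,  7, 0.1],
    [0, -0.9, 15, 0.5],
])
# Dialog-level quality labels from expert annotators (1 = good, 0 = bad).
y_train = np.array([1, 0, 1, 0])

# Supervised model trained on dialog-level annotations.
dialog_model = LogisticRegression().fit(X_train, y_train)

new_dialog = np.array([[1, 0.5, 6, 0.0]])
print("baseline (turn aggregation):", aggregate_turn_scores(np.array([0.9, 0.7, 0.8])))
print("dialog-level P(good quality):", dialog_model.predict_proba(new_dialog)[0, 1])
```

The paper's finding is that the latter style of model, trained on dialog-level annotations, outperforms turn-level aggregation and generalizes better across domains.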
Related papers
- Interaction Matters: An Evaluation Framework for Interactive Dialogue Assessment on English Second Language Conversations [22.56326809612278]
We present an evaluation framework for interactive dialogue assessment in the context of English as a Second Language speakers.
Our framework collects dialogue-level interactivity labels and micro-level span features.
We study how the micro-level features influence the (higher level) interactivity quality of ESL dialogues by constructing various machine learning-based models.
arXiv Detail & Related papers (2024-07-09T00:56:59Z)
- ComperDial: Commonsense Persona-grounded Dialogue Dataset and Benchmark [26.100299485985197]
ComperDial consists of human-scored responses for 10,395 dialogue turns in 1,485 conversations collected from 99 dialogue agents.
In addition to single-turn response scores, ComperDial also contains dialogue-level human-annotated scores.
Building off ComperDial, we devise a new automatic evaluation metric to measure the general similarity of model-generated dialogues to human conversations.
arXiv Detail & Related papers (2024-06-17T05:51:04Z)
- Context Does Matter: Implications for Crowdsourced Evaluation Labels in Task-Oriented Dialogue Systems [57.16442740983528]
Crowdsourced labels play a crucial role in evaluating task-oriented dialogue systems.
Previous studies suggest using only a portion of the dialogue context in the annotation process.
This study investigates the influence of dialogue context on annotation quality.
arXiv Detail & Related papers (2024-04-15T17:56:39Z)
- DiQAD: A Benchmark Dataset for End-to-End Open-domain Dialogue Assessment [38.26039323208791]
We release a large-scale dialogue quality assessment dataset (DiQAD) for automatically assessing open-domain dialogue quality.
Specifically, we establish assessment criteria based on dimensions that conform to human judgements of dialogue quality.
Following these criteria, we also annotate a large-scale set of dialogues held between real users, containing around 100,000 dialogues.
arXiv Detail & Related papers (2023-10-25T03:04:57Z)
- FCC: Fusing Conversation History and Candidate Provenance for Contextual Response Ranking in Dialogue Systems [53.89014188309486]
We present a flexible neural framework that can integrate contextual information from multiple channels.
We evaluate our model on the MSDialog dataset widely used for evaluating conversational response ranking tasks.
arXiv Detail & Related papers (2023-03-31T23:58:28Z)
- FineD-Eval: Fine-grained Automatic Dialogue-Level Evaluation [58.46761798403072]
We propose a dialogue-level metric that consists of three sub-metrics with each targeting a specific dimension.
The sub-metrics are trained with novel self-supervised objectives and exhibit strong correlations with human judgment for their respective dimensions.
Compared to the existing state-of-the-art metric, the combined metrics achieve around 16% relative improvement on average.
arXiv Detail & Related papers (2022-10-25T08:26:03Z)
- GODEL: Large-Scale Pre-Training for Goal-Directed Dialog [119.1397031992088]
We introduce GODEL, a large pre-trained language model for dialog.
We show that GODEL outperforms state-of-the-art pre-trained dialog models in few-shot fine-tuning setups.
A novel feature of our evaluation methodology is the introduction of a notion of utility that assesses the usefulness of responses.
arXiv Detail & Related papers (2022-06-22T18:19:32Z)
- FlowEval: A Consensus-Based Dialogue Evaluation Framework Using Segment Act Flows [63.116280145770006]
We propose segment act, an extension of dialog act from utterance level to segment level, and crowdsource a large-scale dataset for it.
To utilize segment act flows, sequences of segment acts, for evaluation, we develop the first consensus-based dialogue evaluation framework, FlowEval.
arXiv Detail & Related papers (2022-02-14T11:37:20Z)
- DynaEval: Unifying Turn and Dialogue Level Evaluation [60.66883575106898]
We propose DynaEval, a unified automatic evaluation framework.
It not only performs turn-level evaluation but also holistically considers the quality of the entire dialogue.
Experiments show that DynaEval significantly outperforms the state-of-the-art dialogue coherence model.
arXiv Detail & Related papers (2021-06-02T12:23:18Z)
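As a rough illustration of the general idea behind unifying turn- and dialogue-level evaluation, the toy sketch below interpolates averaged per-turn scores with a holistic dialogue score. It is not the DynaEval model itself; the scoring functions and the weight alpha are hypothetical placeholders.

```python
# Toy sketch of combining turn-level and dialogue-level quality signals.
# This is NOT the DynaEval model; the scorers below are placeholders.
from typing import Callable, List

def unified_score(
    turns: List[str],
    turn_scorer: Callable[[str], float],          # per-turn quality score
    dialog_scorer: Callable[[List[str]], float],  # holistic dialogue score
    alpha: float = 0.5,                           # interpolation weight
) -> float:
    """Interpolate the averaged turn-level scores with a dialogue-level score."""
    turn_avg = sum(turn_scorer(t) for t in turns) / len(turns)
    return alpha * turn_avg + (1 - alpha) * dialog_scorer(turns)

# Example with trivial placeholder scorers.
turns = ["Hi, set a timer for 10 minutes.", "Timer set for 10 minutes."]
score = unified_score(
    turns,
    turn_scorer=lambda t: min(len(t) / 40, 1.0),          # placeholder heuristic
    dialog_scorer=lambda d: 1.0 if len(d) >= 2 else 0.5,  # placeholder heuristic
)
print(f"unified quality score: {score:.2f}")
```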
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.