DiQAD: A Benchmark Dataset for End-to-End Open-domain Dialogue
Assessment
- URL: http://arxiv.org/abs/2310.16319v1
- Date: Wed, 25 Oct 2023 03:04:57 GMT
- Title: DiQAD: A Benchmark Dataset for End-to-End Open-domain Dialogue
Assessment
- Authors: Yukun Zhao, Lingyong Yan, Weiwei Sun, Chong Meng, Shuaiqiang Wang,
Zhicong Cheng, Zhaochun Ren, Dawei Yin
- Abstract summary: We release a large-scale dialogue quality assessment dataset (DiQAD) for automatically assessing open-domain dialogue quality.
Specifically, we establish assessment criteria based on dimensions that conform to human judgements of dialogue quality.
We also annotate large-scale dialogues conducted between real users according to these criteria, yielding around 100,000 annotated dialogues.
- Score: 38.26039323208791
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Dialogue assessment plays a critical role in the development of open-domain
dialogue systems. Existing work is incapable of providing an end-to-end,
human-epistemic assessment dataset: it either provides only sub-metrics such as
coherence, or its dialogues are conducted between annotators, far from real user
settings. In this paper, we release a large-scale dialogue quality assessment
dataset (DiQAD) for automatically assessing open-domain dialogue quality.
Specifically, we (1) establish assessment criteria based on dimensions that
conform to human judgements of dialogue quality, and (2) annotate, according to
these criteria, around 100,000 dialogues conducted between real users. We
conduct several experiments and report the performance of the baselines as the
benchmark on DiQAD. The dataset is openly accessible at
https://github.com/yukunZhao/Dataset_Dialogue_quality_evaluation.
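To make the benchmark setup concrete, the sketch below shows one way a quality-assessment dataset of this kind could be consumed: load the annotated dialogues, score them with any assessor model, and measure how well the predicted scores agree with the human quality labels. The file name and field names (`dialogues.json`, `turns`, `quality_label`) and the length-based baseline are assumptions for illustration only, not the actual DiQAD schema or baselines; consult the repository above for the real format.

```python
import json
from typing import Callable, List

from scipy.stats import spearmanr  # rank correlation with human labels


def load_dialogues(path: str) -> List[dict]:
    """Load quality-annotated dialogues from a JSON file.

    Each record is assumed to look like:
    {"turns": ["user: ...", "bot: ..."], "quality_label": 2}
    (hypothetical schema; see the DiQAD repository for the real one).
    """
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def evaluate_assessor(dialogues: List[dict],
                      assessor: Callable[[List[str]], float]) -> float:
    """Score every dialogue with `assessor` and return the Spearman
    correlation between predicted scores and human quality labels."""
    human = [d["quality_label"] for d in dialogues]
    predicted = [assessor(d["turns"]) for d in dialogues]
    corr, _ = spearmanr(predicted, human)
    return corr


if __name__ == "__main__":
    data = load_dialogues("dialogues.json")

    # Trivial length-based scorer, only to show the interface; real
    # baselines would be trained dialogue-quality models.
    length_baseline = lambda turns: float(sum(len(t) for t in turns))

    print(f"Spearman correlation: {evaluate_assessor(data, length_baseline):.3f}")
```

Reporting a rank correlation against human labels is the standard way the papers listed below compare automatic dialogue-evaluation metrics; any dialogue-level quality model can be dropped in place of the toy baseline.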
Related papers
- ComperDial: Commonsense Persona-grounded Dialogue Dataset and Benchmark [26.100299485985197]
ComperDial consists of human-scored responses for 10,395 dialogue turns in 1,485 conversations collected from 99 dialogue agents.
In addition to single-turn response scores, ComperDial also contains dialogue-level human-annotated scores.
Building off ComperDial, we devise a new automatic evaluation metric to measure the general similarity of model-generated dialogues to human conversations.
arXiv Detail & Related papers (2024-06-17T05:51:04Z)
- PairEval: Open-domain Dialogue Evaluation with Pairwise Comparison [38.03304773600225]
PairEval is a novel dialogue evaluation metric for assessing responses by comparing their quality against responses in different conversations.
We show that PairEval exhibits a higher correlation with human judgments than baseline metrics.
We also find that the proposed comparative metric is more robust in detecting common failures from open-domain dialogue systems.
arXiv Detail & Related papers (2024-04-01T09:35:06Z)
- Toward More Accurate and Generalizable Evaluation Metrics for Task-Oriented Dialogs [19.43845920149182]
We introduce a new dialog-level annotation workflow called Dialog Quality Annotation (DQA).
DQA expert annotators evaluate the quality of dialogs as a whole, and also label dialogs for attributes such as goal completion and user sentiment.
We argue that having high-quality human-annotated data is an important component of evaluating interaction quality for large industrial-scale voice assistant platforms.
arXiv Detail & Related papers (2023-06-06T19:43:29Z)
- ACCENT: An Automatic Event Commonsense Evaluation Metric for Open-Domain Dialogue Systems [81.8658402934838]
We propose ACCENT, an event commonsense evaluation metric empowered by commonsense knowledge bases (CSKBs).
Our experiments show that ACCENT is an efficient metric for event commonsense evaluation, which achieves higher correlations with human judgments than existing baselines.
arXiv Detail & Related papers (2023-05-12T23:11:48Z)
- GODEL: Large-Scale Pre-Training for Goal-Directed Dialog [119.1397031992088]
We introduce GODEL, a large pre-trained language model for dialog.
We show that GODEL outperforms state-of-the-art pre-trained dialog models in few-shot fine-tuning setups.
A novel feature of our evaluation methodology is the introduction of a notion of utility that assesses the usefulness of responses.
arXiv Detail & Related papers (2022-06-22T18:19:32Z)
- MME-CRS: Multi-Metric Evaluation Based on Correlation Re-Scaling for Evaluating Open-Domain Dialogue [15.31433922183745]
We propose a Multi-Metric Evaluation based on Correlation Re-Scaling (MME-CRS) for evaluating open-domain dialogue.
MME-CRS ranks first by a large margin on the final test data of the DSTC10 Track 5 Subtask 1 Automatic Open-domain Dialogue Evaluation Challenge.
arXiv Detail & Related papers (2022-06-19T13:43:59Z)
- FlowEval: A Consensus-Based Dialogue Evaluation Framework Using Segment Act Flows [63.116280145770006]
We propose segment act, an extension of dialog act from utterance level to segment level, and crowdsource a large-scale dataset for it.
To utilize segment act flows, sequences of segment acts, for evaluation, we develop the first consensus-based dialogue evaluation framework, FlowEval.
arXiv Detail & Related papers (2022-02-14T11:37:20Z)
- User Response and Sentiment Prediction for Automatic Dialogue Evaluation [69.11124655437902]
We propose to use the sentiment of the next user utterance for turn or dialog level evaluation.
Experiments show our model outperforming existing automatic evaluation metrics on both written and spoken open-domain dialogue datasets.
arXiv Detail & Related papers (2021-11-16T22:19:17Z)
- Rethinking Dialogue State Tracking with Reasoning [76.0991910623001]
This paper proposes to track dialogue states gradually by reasoning over dialogue turns with the help of back-end data.
Empirical results demonstrate that our method significantly outperforms the state-of-the-art methods by 38.6% in terms of joint belief accuracy for MultiWOZ 2.1.
arXiv Detail & Related papers (2020-05-27T02:05:33Z)