Beyond User Self-Reported Likert Scale Ratings: A Comparison Model for
Automatic Dialog Evaluation
- URL: http://arxiv.org/abs/2005.10716v2
- Date: Fri, 12 Jun 2020 04:05:58 GMT
- Title: Beyond User Self-Reported Likert Scale Ratings: A Comparison Model for
Automatic Dialog Evaluation
- Authors: Weixin Liang, James Zou, Zhou Yu
- Abstract summary: Open Domain dialog system evaluation is one of the most important challenges in dialog research.
We propose an automatic evaluation model CMADE that automatically cleans self-reported user ratings as it trains on them.
Our experiments show that CMADE achieves 89.2% accuracy in the dialog comparison task.
- Score: 69.03658685761538
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Open Domain dialog system evaluation is one of the most important challenges
in dialog research. Existing automatic evaluation metrics, such as BLEU, are
mostly reference-based: they calculate the difference between the generated
response and a limited number of available references. Likert-score-based
self-reported user ratings are widely adopted by social conversational systems,
such as Amazon Alexa Prize chatbots. However, self-reported user ratings suffer
from bias and variance among different users. To alleviate this problem, we
formulate dialog evaluation as a comparison task. We also propose an automatic
evaluation model CMADE (Comparison Model for Automatic Dialog Evaluation) that
automatically cleans self-reported user ratings as it trains on them.
Specifically, we first use a self-supervised method to learn better dialog
feature representation, and then use KNN and Shapley to remove confusing
samples. Our experiments show that CMADE achieves 89.2% accuracy in the dialog
comparison task.
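To make the data-cleaning step concrete, below is a minimal, hypothetical Python sketch of KNN-Shapley-style data valuation in the spirit of the abstract's "use KNN and Shapley to remove confusing samples." The feature vectors stand in for the self-supervised dialog representations, and the function name, hyperparameters, and toy data are illustrative assumptions rather than the authors' implementation.

```python
# Hypothetical sketch (not the authors' code): clean noisy dialog ratings with a
# closed-form KNN Shapley valuation (Jia et al., 2019) before training a
# comparison model. Feature vectors stand in for self-supervised dialog
# representations; all names and hyperparameters are illustrative.
import numpy as np


def knn_shapley_values(X_train, y_train, X_val, y_val, k=5):
    """Average KNN Shapley value of each training dialog over a validation set."""
    n = len(X_train)
    values = np.zeros(n)
    for x_v, y_v in zip(X_val, y_val):
        order = np.argsort(np.linalg.norm(X_train - x_v, axis=1))  # nearest first
        match = (y_train[order] == y_v).astype(float)
        s = np.zeros(n)
        s[n - 1] = match[n - 1] / n                    # farthest training point
        for i in range(n - 2, -1, -1):                 # recurse toward the nearest
            s[i] = s[i + 1] + (match[i] - match[i + 1]) / k * min(k, i + 1) / (i + 1)
        values[order] += s                             # scatter back to original indices
    return values / len(X_val)


# Toy data: 200 dialogs with 16-dim features and noisy binary good/bad ratings.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
y = (X[:, 0] > 0).astype(int)
noisy = y.copy()
flip = rng.choice(200, size=30, replace=False)         # simulated rating noise
noisy[flip] = 1 - noisy[flip]

vals = knn_shapley_values(X[:150], noisy[:150], X[150:], y[150:], k=5)
keep = vals > np.quantile(vals, 0.2)                    # drop the lowest-valued 20%
print(f"kept {keep.sum()} of 150 training dialogs after cleaning")
```

Training dialogs whose ratings disagree with their nearest neighbours receive low values and can be dropped; the comparison model would then be trained on pairs drawn from the surviving dialogs.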
Related papers
- Improving Open-Domain Dialogue Evaluation with a Causal Inference Model [8.625569782672663]
Explicit satisfaction ratings can be elicited from users, but users often do not provide ratings when asked.
Post-hoc ratings by experts are an alternative, but these can be both expensive and complex to collect.
Here, we explore the creation of automated methods for predicting both expert and user ratings of open-domain dialogues.
arXiv Detail & Related papers (2023-01-31T02:31:42Z)
- GODEL: Large-Scale Pre-Training for Goal-Directed Dialog [119.1397031992088]
We introduce GODEL, a large pre-trained language model for dialog.
We show that GODEL outperforms state-of-the-art pre-trained dialog models in few-shot fine-tuning setups.
A novel feature of our evaluation methodology is the introduction of a notion of utility that assesses the usefulness of responses.
arXiv Detail & Related papers (2022-06-22T18:19:32Z)
- What is wrong with you?: Leveraging User Sentiment for Automatic Dialog Evaluation [73.03318027164605]
We propose to use information that can be automatically extracted from the next user utterance as a proxy to measure the quality of the previous system response.
Our model generalizes across both spoken and written open-domain dialog corpora collected from real and paid users.
arXiv Detail & Related papers (2022-03-25T22:09:52Z)
- MDD-Eval: Self-Training on Augmented Data for Multi-Domain Dialogue Evaluation [66.60285024216573]
A dialogue evaluator is expected to conduct assessment across multiple domains.
Most of the state-of-the-art automatic dialogue evaluation metrics (ADMs) are not designed for multi-domain evaluation.
We are motivated to design a general and robust framework, MDD-Eval, to address the problem.
arXiv Detail & Related papers (2021-12-14T07:01:20Z)
- User Response and Sentiment Prediction for Automatic Dialogue Evaluation [69.11124655437902]
We propose to use the sentiment of the next user utterance for turn- or dialog-level evaluation (see the illustrative sketch after this list).
Experiments show our model outperforming existing automatic evaluation metrics on both written and spoken open-domain dialogue datasets.
arXiv Detail & Related papers (2021-11-16T22:19:17Z)
- Modeling Performance in Open-Domain Dialogue with PARADISE [7.516971632888974]
We develop a PARADISE model for predicting the performance of Athena, a dialogue system that has participated in thousands of conversations with real users.
Our goal is to learn a general objective function that can be used to optimize the dialogue choices of any Alexa Prize system in real time.
arXiv Detail & Related papers (2021-10-21T14:17:59Z)
- Speaker Sensitive Response Evaluation Model [17.381658875470638]
We propose an automatic evaluation model based on the similarity of the generated response with the conversational context.
We learn the model parameters from an unlabeled conversation corpus.
We show that our model can be applied to movie dialogues without any additional training.
arXiv Detail & Related papers (2020-06-12T08:59:10Z)
- Learning an Unreferenced Metric for Online Dialogue Evaluation [53.38078951628143]
We propose an unreferenced automated evaluation metric that uses large pre-trained language models to extract latent representations of utterances.
We show that our model achieves higher correlation with human annotations in an online setting, while not requiring true responses for comparison during inference.
arXiv Detail & Related papers (2020-05-01T20:01:39Z)
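As referenced above, two of the related papers ("What is wrong with you?" and "User Response and Sentiment Prediction") score a system response by the sentiment of the user's next utterance. The Python sketch below illustrates only that shared idea; the off-the-shelf sentiment pipeline, the score mapping, and the toy dialog are assumptions for illustration, not the models or data used in those papers.

```python
# Hypothetical sketch: rate each system turn by the sentiment of the user's
# next utterance, as a proxy for response quality. Requires the transformers
# library and a backend such as PyTorch.
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")  # default English POSITIVE/NEGATIVE model


def proxy_turn_score(next_user_utterance: str) -> float:
    """Map next-utterance sentiment to a [0, 1] quality proxy for the preceding system turn."""
    result = sentiment(next_user_utterance)[0]  # e.g. {'label': 'POSITIVE', 'score': 0.99}
    return result["score"] if result["label"] == "POSITIVE" else 1.0 - result["score"]


dialog = [
    ("system", "I booked the table for 7pm. Anything else?"),
    ("user", "Perfect, thanks, that's exactly what I wanted!"),
    ("system", "Sorry, I did not understand that."),
    ("user", "Ugh, you keep getting this wrong."),
]

# Score each system turn by the user turn that follows it, then average.
turn_scores = [proxy_turn_score(dialog[i + 1][1])
               for i in range(len(dialog) - 1) if dialog[i][0] == "system"]
print("turn-level:", turn_scores, "dialog-level:", sum(turn_scores) / len(turn_scores))
```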