User Response and Sentiment Prediction for Automatic Dialogue Evaluation
- URL: http://arxiv.org/abs/2111.08808v1
- Date: Tue, 16 Nov 2021 22:19:17 GMT
- Title: User Response and Sentiment Prediction for Automatic Dialogue Evaluation
- Authors: Sarik Ghazarian, Behnam Hedayatnia, Alexandros Papangelis, Yang Liu,
Dilek Hakkani-Tur
- Abstract summary: We propose to use the sentiment of the next user utterance for turn- or dialog-level evaluation.
Experiments show our model outperforming existing automatic evaluation metrics on both written and spoken open-domain dialogue datasets.
- Score: 69.11124655437902
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic evaluation is beneficial for open-domain dialog system development.
However, standard word-overlap metrics (BLEU, ROUGE) do not correlate well with
human judgements of open-domain dialog systems. In this work we propose to use
the sentiment of the next user utterance for turn- or dialog-level evaluation.
Specifically, we propose three methods: one that predicts the next sentiment
directly, and two others that predict the next user utterance using an
utterance or a feedback generator model and then classify its sentiment.
Experiments show our model outperforming existing automatic evaluation metrics
on both written and spoken open-domain dialogue datasets.
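As a rough illustration of the generate-then-classify variant described above, the sketch below chains an off-the-shelf dialogue generator and sentiment classifier from Hugging Face; the DialoGPT and SST-2 models and the `turn_quality` helper are stand-in assumptions, not the authors' trained components or released code.

```python
# Hedged sketch: approximate the next user utterance with a generic dialogue
# LM, then classify its sentiment as a proxy for the quality of the preceding
# system response. Both models are placeholders, not the paper's components.
from transformers import pipeline

generator = pipeline("text-generation", model="microsoft/DialoGPT-medium")
sentiment = pipeline("sentiment-analysis")  # default SST-2 classifier

def turn_quality(context: str, system_response: str) -> float:
    """Rough turn-level score in [-1, 1]; higher means happier predicted user."""
    prompt = context + "\n" + system_response + "\n"
    # Simulate the user's next utterance conditioned on the dialogue so far.
    simulated_user = generator(
        prompt, max_new_tokens=30, return_full_text=False
    )[0]["generated_text"].strip()
    result = sentiment(simulated_user)[0]
    # Positive predicted user sentiment -> higher estimated response quality.
    return result["score"] if result["label"] == "POSITIVE" else -result["score"]

print(turn_quality("User: I just got a new puppy!", "System: That is wonderful, what breed?"))
```

A trained user-utterance or feedback generator and an in-domain sentiment classifier, as in the paper, would replace both pipelines; the overall scoring shape stays the same.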
Related papers
- GODEL: Large-Scale Pre-Training for Goal-Directed Dialog [119.1397031992088]
We introduce GODEL, a large pre-trained language model for dialog.
We show that GODEL outperforms state-of-the-art pre-trained dialog models in few-shot fine-tuning setups.
A novel feature of our evaluation methodology is the introduction of a notion of utility that assesses the usefulness of responses.
arXiv Detail & Related papers (2022-06-22T18:19:32Z)
- What is wrong with you?: Leveraging User Sentiment for Automatic Dialog Evaluation [73.03318027164605]
We propose to use information that can be automatically extracted from the next user utterance as a proxy to measure the quality of the previous system response.
Our model generalizes across both spoken and written open-domain dialog corpora collected from real and paid users.
arXiv Detail & Related papers (2022-03-25T22:09:52Z)
- FlowEval: A Consensus-Based Dialogue Evaluation Framework Using Segment Act Flows [63.116280145770006]
We propose segment act, an extension of dialog act from utterance level to segment level, and crowdsource a large-scale dataset for it.
To utilize segment act flows, sequences of segment acts, for evaluation, we develop the first consensus-based dialogue evaluation framework, FlowEval.
arXiv Detail & Related papers (2022-02-14T11:37:20Z)
- Modeling Performance in Open-Domain Dialogue with PARADISE [7.516971632888974]
We develop a PARADISE model for predicting the performance of Athena, a dialogue system that has participated in thousands of conversations with real users.
Our goal is to learn a general objective function that can be used to optimize the dialogue choices of any Alexa Prize system in real time.
arXiv Detail & Related papers (2021-10-21T14:17:59Z)
- Speaker Sensitive Response Evaluation Model [17.381658875470638]
We propose an automatic evaluation model based on the similarity between the generated response and the conversational context.
We learn the model parameters from an unlabeled conversation corpus.
We show that our model can be applied to movie dialogues without any additional training.
arXiv Detail & Related papers (2020-06-12T08:59:10Z)
- Beyond User Self-Reported Likert Scale Ratings: A Comparison Model for Automatic Dialog Evaluation [69.03658685761538]
Open-domain dialog system evaluation is one of the most important challenges in dialog research.
We propose CMADE, an automatic evaluation model that cleans self-reported user ratings as it trains on them.
Our experiments show that CMADE achieves 89.2% accuracy in the dialog comparison task.
arXiv Detail & Related papers (2020-05-21T15:14:49Z)
- Learning an Unreferenced Metric for Online Dialogue Evaluation [53.38078951628143]
We propose an unreferenced automated evaluation metric that uses large pre-trained language models to extract latent representations of utterances.
We show that our model achieves higher correlation with human annotations in an online setting, while not requiring true responses for comparison during inference.
arXiv Detail & Related papers (2020-05-01T20:01:39Z)
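For the last entry above (the unreferenced metric built on pre-trained language model representations), here is a minimal sketch of the general idea, assuming a frozen BERT encoder with mean pooling and plain cosine similarity standing in for that paper's learned scoring model:

```python
# Hedged sketch of an unreferenced (reference-free) dialogue metric: embed the
# context and the candidate response with a frozen pre-trained encoder and
# score their compatibility. Cosine similarity is a stand-in for a learned head.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased").eval()

def embed(text: str) -> torch.Tensor:
    """Mean-pooled last-layer token embeddings for a single utterance."""
    batch = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)

def unreferenced_score(context: str, response: str) -> float:
    """Higher means the response looks more compatible with its context."""
    return torch.cosine_similarity(embed(context), embed(response), dim=0).item()

print(unreferenced_score("Do you like jazz?", "I listen to it almost every evening."))
```

No gold reference response is consulted at any point, which is what makes metrics of this kind usable for online evaluation.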
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.