Modeling Performance in Open-Domain Dialogue with PARADISE
- URL: http://arxiv.org/abs/2110.11164v1
- Date: Thu, 21 Oct 2021 14:17:59 GMT
- Title: Modeling Performance in Open-Domain Dialogue with PARADISE
- Authors: Marilyn Walker, Colin Harmon, James Graupera, Davan Harrison and Steve Whittaker
- Abstract summary: We develop a PARADISE model for predicting the performance of Athena, a dialogue system that has participated in thousands of conversations with real users.
Our goal is to learn a general objective function that can be used to optimize the dialogue choices of any Alexa Prize system in real time.
- Score: 7.516971632888974
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: There has recently been an explosion of work on spoken dialogue systems,
along with an increased interest in open-domain systems that engage in casual
conversations on popular topics such as movies, books and music. These systems
aim to socially engage, entertain, and even empathize with their users. Since
the achievement of such social goals is hard to measure, recent research has
used dialogue length or human ratings as evaluation metrics, and developed
methods for automatically calculating novel metrics, such as coherence,
consistency, relevance and engagement. Here we develop a PARADISE model for
predicting the performance of Athena, a dialogue system that has participated
in thousands of conversations with real users, while competing as a finalist in
the Alexa Prize. We use both user ratings and dialogue length as metrics for
dialogue quality, and experiment with predicting these metrics using automatic
features that are both system dependent and independent. Our goal is to learn a
general objective function that can be used to optimize the dialogue choices of
any Alexa Prize system in real time and evaluate its performance. Our best
model for predicting user ratings gets an R$^2$ of .136 with a DistilBert
model, and the best model for predicting length with system independent
features gets an R$^2$ of .865, suggesting that conversation length may be a
more reliable measure for automatic training of dialogue systems.
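
The abstract describes a PARADISE-style setup: automatic features extracted from each conversation (both system-dependent and system-independent) are regressed against user ratings or dialogue length, with model quality reported as R$^2$. The sketch below is a minimal illustration of that kind of regression only; the feature names, synthetic data, and use of scikit-learn ridge regression are assumptions for the example, not the authors' pipeline (which also includes a DistilBert-based model for rating prediction).

```python
# Minimal PARADISE-style regression sketch.
# Assumptions: hypothetical system-independent features and synthetic data,
# used only to make the example runnable end to end.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_convs = 500

# Hypothetical per-conversation features: number of user turns, mean user
# utterance length (tokens), and count of affirmative user responses.
# Real features would be computed from conversation logs.
X = np.column_stack([
    rng.integers(2, 60, n_convs),        # user turns
    rng.uniform(1.0, 12.0, n_convs),     # mean utterance length
    rng.integers(0, 20, n_convs),        # affirmative responses
])

# Target: a user rating (or, alternatively, dialogue length in turns).
# Synthetic here, just to exercise the regression.
y = 2.5 + 0.02 * X[:, 0] + 0.05 * X[:, 1] + rng.normal(0, 0.7, n_convs)

model = Ridge(alpha=1.0)
r2_scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(f"mean cross-validated R^2: {r2_scores.mean():.3f}")
```

Replacing the synthetic target with either dialogue length or the 1-5 user rating, and the random features with ones computed from real conversation logs, gives the general shape of the experiment the abstract reports.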
Related papers
- Psychological Metrics for Dialog System Evaluation [16.16116910201279]
We present five interpretable metrics from established psychology that are fundamental to human communication and relationships.
The psychological metrics are compared against seven state-of-the-art traditional metrics.
arXiv Detail & Related papers (2023-05-24T06:02:32Z)
- Let's Get Personal: Personal Questions Improve SocialBot Performance in the Alexa Prize [0.0]
There has been an increased focus on creating conversational open-domain dialogue systems in the spoken dialogue community.
Unlike traditional dialogue systems, these conversational systems cannot assume any specific information need or domain restrictions.
We developed a robust open-domain conversational system, Athena, that real Amazon Echo users access and evaluate at scale.
arXiv Detail & Related papers (2023-03-09T00:10:29Z)
- Dialogue Evaluation with Offline Reinforcement Learning [2.580163308334609]
Task-oriented dialogue systems aim to fulfill user goals through natural language interactions.
They are ideally evaluated with human users, which is not feasible at every iteration of the development phase.
We propose the use of offline reinforcement learning for dialogue evaluation based on a static corpus.
arXiv Detail & Related papers (2022-09-02T08:32:52Z)
- GODEL: Large-Scale Pre-Training for Goal-Directed Dialog [119.1397031992088]
We introduce GODEL, a large pre-trained language model for dialog.
We show that GODEL outperforms state-of-the-art pre-trained dialog models in few-shot fine-tuning setups.
A novel feature of our evaluation methodology is the introduction of a notion of utility that assesses the usefulness of responses.
arXiv Detail & Related papers (2022-06-22T18:19:32Z)
- What is wrong with you?: Leveraging User Sentiment for Automatic Dialog Evaluation [73.03318027164605]
We propose to use information that can be automatically extracted from the next user utterance as a proxy to measure the quality of the previous system response.
Our model generalizes across both spoken and written open-domain dialog corpora collected from real and paid users.
arXiv Detail & Related papers (2022-03-25T22:09:52Z)
- User Response and Sentiment Prediction for Automatic Dialogue Evaluation [69.11124655437902]
We propose to use the sentiment of the next user utterance for turn- or dialog-level evaluation.
Experiments show our model outperforming existing automatic evaluation metrics on both written and spoken open-domain dialogue datasets.
arXiv Detail & Related papers (2021-11-16T22:19:17Z)
- Beyond User Self-Reported Likert Scale Ratings: A Comparison Model for Automatic Dialog Evaluation [69.03658685761538]
Open-domain dialog system evaluation is one of the most important challenges in dialog research.
We propose an automatic evaluation model, CMADE, that automatically cleans self-reported user ratings as it trains on them.
Our experiments show that CMADE achieves 89.2% accuracy in the dialog comparison task.
arXiv Detail & Related papers (2020-05-21T15:14:49Z)
- Learning an Unreferenced Metric for Online Dialogue Evaluation [53.38078951628143]
We propose an unreferenced automated evaluation metric that uses large pre-trained language models to extract latent representations of utterances.
We show that our model achieves higher correlation with human annotations in an online setting, while not requiring true responses for comparison during inference.
arXiv Detail & Related papers (2020-05-01T20:01:39Z)
- Attention over Parameters for Dialogue Systems [69.48852519856331]
We learn a dialogue system that independently parameterizes different dialogue skills, and learns to select and combine each of them through Attention over Parameters (AoP).
The experimental results show that this approach achieves competitive performance on a combined dataset of MultiWOZ, In-Car Assistant, and Persona-Chat.
arXiv Detail & Related papers (2020-01-07T03:10:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.