Psychological Metrics for Dialog System Evaluation
- URL: http://arxiv.org/abs/2305.14757v2
- Date: Sat, 16 Sep 2023 02:35:44 GMT
- Title: Psychological Metrics for Dialog System Evaluation
- Authors: Salvatore Giorgi, Shreya Havaldar, Farhan Ahmed, Zuhaib Akhtar,
Shalaka Vaidya, Gary Pan, Lyle H. Ungar, H. Andrew Schwartz, Joao Sedoc
- Abstract summary: We present five interpretable metrics from established psychology that are fundamental to human communication and relationships.
The psychological metrics are compared against seven state-of-the-art traditional metrics.
- Score: 16.16116910201279
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We present metrics for evaluating dialog systems through a
psychologically-grounded "human" lens in which conversational agents express a
diversity of both states (e.g., emotion) and traits (e.g., personality), just
as people do. We present five interpretable metrics from established psychology
that are fundamental to human communication and relationships: emotional
entropy, linguistic style and emotion matching, agreeableness, and empathy.
These metrics can be applied (1) across dialogs and (2) on turns within
dialogs. The psychological metrics are compared against seven state-of-the-art
traditional metrics (e.g., BARTScore and BLEURT) on seven standard dialog
system data sets. We also introduce a novel data set, the Three Bot Dialog
Evaluation Corpus, which consists of annotated conversations from ChatGPT,
GPT-3, and BlenderBot. We demonstrate that our proposed metrics offer novel
information; they are uncorrelated with traditional metrics, can be used to
meaningfully compare dialog systems, and lead to increased accuracy (beyond
existing traditional metrics) in predicting crowd-sourced dialog judgements.
The interpretability and unique signal of our psychological metrics make them a
valuable tool for evaluating and improving dialog systems.
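For concreteness, here is a minimal sketch of how two of the five metrics could be computed: emotional entropy as the Shannon entropy of an agent's per-turn emotion labels, and linguistic style matching averaged over function-word category usage rates. The emotion labels, category names, and function signatures below are illustrative assumptions, not the authors' released implementation.
```python
import math
from collections import Counter

def emotional_entropy(turn_emotions):
    """Shannon entropy (bits) of the emotion distribution across an agent's turns.

    turn_emotions: one emotion label per turn (e.g., "joy", "neutral"), as
    produced by any utterance-level emotion classifier. Higher entropy means
    the agent expresses a wider, more human-like spread of emotional states.
    """
    counts = Counter(turn_emotions)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def style_matching(user_rates, bot_rates):
    """Linguistic style matching over function-word categories.

    user_rates / bot_rates: dicts mapping a function-word category
    (e.g., "pronouns", "articles") to its usage rate in each speaker's turns.
    Per category the score is 1 - |u - b| / (u + b); the result is the mean
    across categories, so 1.0 means identical relative usage.
    """
    categories = set(user_rates) | set(bot_rates)
    scores = []
    for cat in categories:
        u, b = user_rates.get(cat, 0.0), bot_rates.get(cat, 0.0)
        scores.append(1 - abs(u - b) / (u + b) if (u + b) > 0 else 1.0)
    return sum(scores) / len(scores) if scores else 0.0

# Dialog-level usage: entropy over the bot's turn emotions, and style
# matching between the user's and bot's function-word profiles.
print(emotional_entropy(["joy", "neutral", "joy", "sadness", "anger"]))
print(style_matching({"pronouns": 0.12, "articles": 0.08},
                     {"pronouns": 0.10, "articles": 0.09}))
```
Both functions operate on a whole dialog, but the same computations can be restricted to a window of turns to obtain the turn-level variants described in the abstract.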
Related papers
- ComperDial: Commonsense Persona-grounded Dialogue Dataset and Benchmark [26.100299485985197]
ComperDial consists of human-scored responses for 10,395 dialogue turns in 1,485 conversations collected from 99 dialogue agents.
In addition to single-turn response scores, ComperDial also contains dialogue-level human-annotated scores.
Building off ComperDial, we devise a new automatic evaluation metric to measure the general similarity of model-generated dialogues to human conversations.
arXiv Detail & Related papers (2024-06-17T05:51:04Z)
- User Response and Sentiment Prediction for Automatic Dialogue Evaluation [69.11124655437902]
We propose to use the sentiment of the next user utterance for turn or dialog level evaluation.
Experiments show our model outperforming existing automatic evaluation metrics on both written and spoken open-domain dialogue datasets.
arXiv Detail & Related papers (2021-11-16T22:19:17Z)
- Modeling Performance in Open-Domain Dialogue with PARADISE [7.516971632888974]
We develop a PARADISE model for predicting the performance of Athena, a dialogue system that has participated in thousands of conversations with real users.
Our goal is to learn a general objective function that can be used to optimize the dialogue choices of any Alexa Prize system in real time.
arXiv Detail & Related papers (2021-10-21T14:17:59Z)
- We've had this conversation before: A Novel Approach to Measuring Dialog Similarity [9.218829323265371]
We propose a novel adaptation of the edit distance metric to the scenario of dialog similarity.
Our approach takes into account various conversation aspects such as utterance semantics, conversation flow, and the participants.
arXiv Detail & Related papers (2021-10-12T07:24:12Z)
- A Comprehensive Assessment of Dialog Evaluation Metrics [9.34612743192798]
Standard language evaluation metrics are ineffective for evaluating dialog.
Recent research has proposed a number of novel, dialog-specific metrics that correlate better with human judgements.
This paper provides a comprehensive assessment of recently proposed dialog evaluation metrics on a number of datasets.
arXiv Detail & Related papers (2021-06-07T15:17:03Z)
- Assessing Dialogue Systems with Distribution Distances [48.61159795472962]
We propose to measure the performance of a dialogue system by computing the distribution-wise distance between its generated conversations and real-world conversations.
Experiments on several dialogue corpora show that our proposed metrics correlate better with human judgments than existing metrics.
arXiv Detail & Related papers (2021-05-06T10:30:13Z)
- GRADE: Automatic Graph-Enhanced Coherence Metric for Evaluating Open-Domain Dialogue Systems [133.13117064357425]
We propose a new evaluation metric GRADE, which stands for Graph-enhanced Representations for Automatic Dialogue Evaluation.
Specifically, GRADE incorporates both coarse-grained utterance-level contextualized representations and fine-grained topic-level graph representations to evaluate dialogue coherence.
Experimental results show that our GRADE significantly outperforms other state-of-the-art metrics on measuring diverse dialogue models.
arXiv Detail & Related papers (2020-10-08T14:07:32Z)
- Rethinking Dialogue State Tracking with Reasoning [76.0991910623001]
This paper proposes to track dialogue states gradually, reasoning over dialogue turns with the help of back-end data.
Empirical results demonstrate that our method significantly outperforms the state-of-the-art methods by 38.6% in terms of joint belief accuracy for MultiWOZ 2.1.
arXiv Detail & Related papers (2020-05-27T02:05:33Z)
- Learning an Unreferenced Metric for Online Dialogue Evaluation [53.38078951628143]
We propose an unreferenced automated evaluation metric that uses large pre-trained language models to extract latent representations of utterances.
We show that our model achieves higher correlation with human annotations in an online setting, while not requiring true responses for comparison during inference.
arXiv Detail & Related papers (2020-05-01T20:01:39Z)
- You Impress Me: Dialogue Generation via Mutual Persona Perception [62.89449096369027]
Research in cognitive science suggests that understanding is an essential signal for high-quality chit-chat conversation.
Motivated by this, we propose P2 Bot, a transmitter-receiver based framework with the aim of explicitly modeling understanding.
arXiv Detail & Related papers (2020-04-11T12:51:07Z)