Rethinking Response Evaluation from Interlocutor's Eye for Open-Domain
Dialogue Systems
- URL: http://arxiv.org/abs/2401.02256v1
- Date: Thu, 4 Jan 2024 13:15:41 GMT
- Title: Rethinking Response Evaluation from Interlocutor's Eye for Open-Domain
Dialogue Systems
- Authors: Yuma Tsuta, Naoki Yoshinaga, Shoetsu Sato and Masashi Toyoda
- Abstract summary: We analyzed and examined what features are needed in an automatic response evaluator from the interlocutor's perspective.
The first experiment on the Hazumi dataset revealed that interlocutor awareness plays a critical role in making automatic response evaluation correlate with the interlocutor's judgments.
The second experiment using massive conversations on X (formerly Twitter) confirmed that dialogue continuity prediction can train an interlocutor-aware response evaluator without human feedback.
- Score: 14.98159964397052
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Open-domain dialogue systems have started to engage in continuous
conversations with humans. These dialogue systems need to be adapted to the
human interlocutor and evaluated from the interlocutor's perspective. However,
it is questionable whether the current automatic evaluation methods can
approximate the interlocutor's judgments. In this study, we analyzed and
examined what features are needed in an automatic response evaluator from the
interlocutor's perspective. The first experiment on the Hazumi dataset revealed
that interlocutor awareness plays a critical role in making automatic response
evaluation correlate with the interlocutor's judgments. The second experiment
using massive conversations on X (formerly Twitter) confirmed that dialogue
continuity prediction can train an interlocutor-aware response evaluator
without human feedback while revealing the difficulty in evaluating generated
responses compared to human responses.
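To make the dialogue continuity prediction idea concrete, below is a minimal sketch of how such an evaluator could be trained: a pretrained encoder is fine-tuned to predict, from the dialogue context (including the interlocutor's own earlier utterances) and a candidate response, whether the interlocutor replies again. The encoder choice, the `ContinuityDataset` helper, and the labeling rule are illustrative assumptions, not the authors' implementation.

```python
# Sketch: train a response evaluator with dialogue continuity prediction
# instead of human quality labels. Label = 1 if the interlocutor replied
# again after the candidate response, else 0. Model choice, field names,
# and the scoring interface are assumptions for illustration.

import torch
from torch.utils.data import DataLoader, Dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "bert-base-uncased"  # assumed encoder; the paper may use another


class ContinuityDataset(Dataset):
    """(dialogue context, response) pairs labeled by whether the
    interlocutor continued the conversation afterwards."""

    def __init__(self, examples, tokenizer, max_len=256):
        self.examples = examples  # dicts: {"context", "response", "continued"}
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        ex = self.examples[idx]
        enc = self.tokenizer(
            ex["context"],          # should include the interlocutor's own past turns
            ex["response"],
            truncation=True,
            max_length=self.max_len,
            padding="max_length",
            return_tensors="pt",
        )
        item = {k: v.squeeze(0) for k, v in enc.items()}
        item["labels"] = torch.tensor(int(ex["continued"]))
        return item


def train_evaluator(examples, epochs=1, lr=2e-5, batch_size=16):
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
    loader = DataLoader(ContinuityDataset(examples, tokenizer),
                        batch_size=batch_size, shuffle=True)
    optim = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for batch in loader:
            optim.zero_grad()
            loss = model(**batch).loss  # cross-entropy over continue / not continue
            loss.backward()
            optim.step()
    return model, tokenizer


def score_response(model, tokenizer, context, response):
    """Use P(continue) as an interlocutor-aware quality score."""
    model.eval()
    with torch.no_grad():
        enc = tokenizer(context, response, truncation=True, return_tensors="pt")
        probs = torch.softmax(model(**enc).logits, dim=-1)
    return probs[0, 1].item()
```

At inference time, P(continue) serves as the quality score; because continuity labels can be harvested from logged conversations (e.g., whether a reply chain continues), no human quality annotation is required under this setup.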
Related papers
- Rethinking the Evaluation of Dialogue Systems: Effects of User Feedback on Crowdworkers and LLMs [57.16442740983528]
In ad-hoc retrieval, evaluation relies heavily on user actions, including implicit feedback.
The role of user feedback in annotators' assessment of conversational turns has been little studied.
We focus on how the evaluation of task-oriented dialogue systems (TDSs) is affected by considering user feedback, explicit or implicit, as provided through the follow-up utterance of the turn being evaluated.
arXiv Detail & Related papers (2024-04-19T16:45:50Z) - PairEval: Open-domain Dialogue Evaluation with Pairwise Comparison [38.03304773600225]
PairEval is a novel dialogue evaluation metric for assessing responses by comparing their quality against responses in different conversations.
We show that PairEval exhibits a higher correlation with human judgments than baseline metrics.
We also find that the proposed comparative metric is more robust in detecting common failures from open-domain dialogue systems.
arXiv Detail & Related papers (2024-04-01T09:35:06Z) - An Analysis of User Behaviors for Objectively Evaluating Spoken Dialogue
Systems [26.003947740875482]
We investigate the relationship between user behaviors and subjective evaluation scores in social dialogue tasks.
The results reveal that in dialogue tasks where user utterances are primary, such as attentive listening and job interviews, indicators such as the number of utterances and words play a significant role in evaluation.
arXiv Detail & Related papers (2024-01-10T01:02:26Z) - WHAT, WHEN, and HOW to Ground: Designing User Persona-Aware
Conversational Agents for Engaging Dialogue [4.328280329592151]
We present a method for building a personalized open-domain dialogue system to address the WWH problem for natural response generation in a commercial setting.
The proposed approach involves weighted dataset blending, negative persona information augmentation methods, and the design of personalized conversation datasets.
Our work effectively balances dialogue fluency and the tendency to ground, while also introducing a response-type label to improve the controllability and explainability of the grounded responses.
arXiv Detail & Related papers (2023-06-06T02:28:38Z) - ACCENT: An Automatic Event Commonsense Evaluation Metric for Open-Domain
Dialogue Systems [81.8658402934838]
We propose ACCENT, an event commonsense evaluation metric empowered by commonsense knowledge bases (CSKBs).
Our experiments show that ACCENT is an efficient metric for event commonsense evaluation, which achieves higher correlations with human judgments than existing baselines.
arXiv Detail & Related papers (2023-05-12T23:11:48Z) - Response-act Guided Reinforced Dialogue Generation for Mental Health
Counseling [25.524804770124145]
We present READER, a dialogue-act guided response generator for mental health counseling conversations.
READER is built on a transformer to jointly predict a potential dialogue-act d(t+1) for the next utterance (aka the response-act) and to generate an appropriate response u(t+1).
We evaluate READER on HOPE, a benchmark counseling conversation dataset.
arXiv Detail & Related papers (2023-01-30T08:53:35Z) - User Response and Sentiment Prediction for Automatic Dialogue Evaluation [69.11124655437902]
We propose to use the sentiment of the next user utterance for turn- or dialog-level evaluation (a minimal sketch of this idea appears after this list).
Experiments show our model outperforming existing automatic evaluation metrics on both written and spoken open-domain dialogue datasets.
arXiv Detail & Related papers (2021-11-16T22:19:17Z) - DynaEval: Unifying Turn and Dialogue Level Evaluation [60.66883575106898]
We propose DynaEval, a unified automatic evaluation framework.
It is capable of not only performing turn-level evaluation but also holistically considering the quality of the entire dialogue.
Experiments show that DynaEval significantly outperforms the state-of-the-art dialogue coherence model.
arXiv Detail & Related papers (2021-06-02T12:23:18Z) - Learning an Unreferenced Metric for Online Dialogue Evaluation [53.38078951628143]
We propose an unreferenced automated evaluation metric that uses large pre-trained language models to extract latent representations of utterances.
We show that our model achieves higher correlation with human annotations in an online setting, while not requiring true responses for comparison during inference.
arXiv Detail & Related papers (2020-05-01T20:01:39Z) - You Impress Me: Dialogue Generation via Mutual Persona Perception [62.89449096369027]
Research in cognitive science suggests that understanding is an essential signal for a high-quality chit-chat conversation.
Motivated by this, we propose P2 Bot, a transmitter-receiver based framework with the aim of explicitly modeling understanding.
arXiv Detail & Related papers (2020-04-11T12:51:07Z)
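As referenced in the "User Response and Sentiment Prediction" entry above, here is a minimal sketch of the user-sentiment idea: the sentiment of the user's follow-up utterance is treated as implicit feedback on the preceding system response. The off-the-shelf classifier and the mapping to a scalar score are assumptions for illustration, not the paper's exact setup.

```python
# Sketch: score a system response by the predicted sentiment of the
# user's follow-up utterance (implicit feedback). The sentiment model
# and the [0, 1] score mapping are assumptions, not the paper's setup.

from transformers import pipeline

# Off-the-shelf sentiment classifier as a stand-in for the paper's predictor.
sentiment = pipeline("sentiment-analysis")


def turn_score(followup_user_utterance: str) -> float:
    """Map the follow-up utterance's sentiment to a [0, 1] quality proxy."""
    result = sentiment(followup_user_utterance)[0]
    prob_positive = result["score"] if result["label"] == "POSITIVE" else 1.0 - result["score"]
    return prob_positive


if __name__ == "__main__":
    # The user's reaction to the system's response acts as implicit feedback.
    print(turn_score("That's really helpful, thanks!"))   # close to 1.0
    print(turn_score("That makes no sense at all."))      # close to 0.0
```

Like dialogue continuity, this signal can be read off logged conversations, so it likewise avoids explicit human quality ratings.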