Designing Precise and Robust Dialogue Response Evaluators
- URL: http://arxiv.org/abs/2004.04908v2
- Date: Fri, 24 Apr 2020 04:01:55 GMT
- Title: Designing Precise and Robust Dialogue Response Evaluators
- Authors: Tianyu Zhao, Divesh Lala, Tatsuya Kawahara
- Abstract summary: We propose to build a reference-free evaluator and exploit the power of semi-supervised training and pretrained language models.
Experimental results demonstrate that the proposed evaluator achieves a strong correlation (> 0.6) with human judgement.
- Score: 35.137244385158034
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic dialogue response evaluators have been proposed as an
alternative to automated metrics and human evaluation. However, existing
automatic evaluators achieve only moderate correlation with human judgement
and are not robust.
In this work, we propose to build a reference-free evaluator and exploit the
power of semi-supervised training and pretrained (masked) language models.
Experimental results demonstrate that the proposed evaluator achieves a strong
correlation (> 0.6) with human judgement and generalizes robustly to diverse
responses and corpora. We open-source the code and data at
https://github.com/ZHAOTING/dialog-processing.
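
For concreteness, here is a minimal sketch of what such a reference-free evaluator can look like: a pretrained masked language model encodes the (context, response) pair and a small regression head predicts a scalar quality score. The backbone name and the single linear head are illustrative assumptions, not the authors' exact architecture; see the linked repository for the actual implementation.

```python
# Minimal sketch of a reference-free response evaluator, assuming a
# BERT-style masked LM backbone. Backbone choice and the linear head
# are illustrative assumptions, not the paper's exact model.
import torch
from transformers import AutoModel, AutoTokenizer

class ResponseEvaluator(torch.nn.Module):
    def __init__(self, backbone: str = "bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(backbone)
        # A single linear head regresses a scalar quality score from the
        # pooled [CLS] representation of the (context, response) pair.
        self.head = torch.nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]      # [CLS] token embedding
        return self.head(cls).squeeze(-1)      # one scalar score per pair

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = ResponseEvaluator()

# No reference response is needed: the evaluator scores the candidate
# response directly against the dialogue context.
batch = tokenizer(
    ["how was your weekend ?"],                # context
    ["pretty good , i went hiking ."],         # candidate response
    return_tensors="pt", truncation=True, padding=True,
)
score = model(batch["input_ids"], batch["attention_mask"])
print(score.item())
```

One plausible semi-supervised recipe (an assumption, not necessarily the paper's exact procedure) is to pretrain such a head on automatically constructed positive/negative response pairs before fine-tuning on a small amount of human-annotated data.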
Related papers
- Learning and Evaluating Human Preferences for Conversational Head Generation [101.89332968344102]
We propose a novel learning-based evaluation metric named Preference Score (PS) that fits human preference based on quantitative evaluations across different dimensions.
PS can serve as a quantitative evaluation without the need for human annotation.
arXiv Detail & Related papers (2023-07-20T07:04:16Z)
- Dialogue Evaluation with Offline Reinforcement Learning [2.580163308334609]
Task-oriented dialogue systems aim to fulfill user goals through natural language interactions.
They are ideally evaluated with human users, but this is infeasible at every iteration of the development phase.
We propose the use of offline reinforcement learning for dialogue evaluation based on a static corpus.
arXiv Detail & Related papers (2022-09-02T08:32:52Z)
- What is wrong with you?: Leveraging User Sentiment for Automatic Dialog Evaluation [73.03318027164605]
We propose to use information that can be automatically extracted from the next user utterance as a proxy to measure the quality of the previous system response.
Our model generalizes across both spoken and written open-domain dialog corpora collected from real and paid users.
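
To make the proxy idea above concrete, here is a hypothetical sketch (not the paper's actual model) that scores a system response by running an off-the-shelf sentiment classifier over the next user utterance:

```python
# Hypothetical sketch: use the sentiment of the user's follow-up utterance
# as a proxy quality signal for the preceding system response.
from transformers import pipeline

# Default sentiment pipeline (a DistilBERT model fine-tuned on SST-2).
sentiment = pipeline("sentiment-analysis")

def proxy_quality(next_user_utterance: str) -> float:
    """Map follow-up sentiment to a rough quality signal in [-1, 1]."""
    result = sentiment(next_user_utterance)[0]
    sign = 1.0 if result["label"] == "POSITIVE" else -1.0
    return sign * result["score"]

print(proxy_quality("that's exactly what i needed, thanks!"))  # close to +1
print(proxy_quality("no, that's not what i asked at all."))    # close to -1
```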
arXiv Detail & Related papers (2022-03-25T22:09:52Z)
- User Response and Sentiment Prediction for Automatic Dialogue Evaluation [69.11124655437902]
We propose to use the sentiment of the next user utterance for turn- or dialog-level evaluation.
Experiments show that our model outperforms existing automatic evaluation metrics on both written and spoken open-domain dialogue datasets.
arXiv Detail & Related papers (2021-11-16T22:19:17Z)
- Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA requires only a small amount of pre-collected experience data, and therefore does not involve human interaction with the target policy during evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z)
- How To Evaluate Your Dialogue System: Probe Tasks as an Alternative for Token-level Evaluation Metrics [47.20761880464552]
Generative dialogue modeling is widely seen as a language modeling task.
The task demands that an agent have a complex natural language understanding of its input text to carry out a meaningful interaction with a user.
The automatic metrics used evaluate the quality of the generated text as a proxy for the agent's holistic interaction.
arXiv Detail & Related papers (2020-08-24T13:28:35Z)
- Speaker Sensitive Response Evaluation Model [17.381658875470638]
We propose an automatic evaluation model based on the similarity of the generated response to the conversational context.
We learn the model parameters from an unlabeled conversation corpus.
We show that our model can be applied to movie dialogues without any additional training.
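
As a rough illustration of context-response similarity scoring (an illustrative sketch only; the actual model is speaker-sensitive and trained on an unlabeled conversation corpus, which this omits):

```python
# Illustrative sketch: score a response by the cosine similarity between
# mean-pooled BERT embeddings of the context and the response.
# Backbone and pooling choices are assumptions, not the paper's model.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def embed(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # (1, seq_len, hidden)
    return hidden.mean(dim=1).squeeze(0)              # mean-pool over tokens

def context_similarity(context: str, response: str) -> float:
    c, r = embed(context), embed(response)
    return torch.nn.functional.cosine_similarity(c, r, dim=0).item()

print(context_similarity("do you like horror movies ?",
                         "yes , i love a good scare ."))
```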
arXiv Detail & Related papers (2020-06-12T08:59:10Z)
- Learning an Unreferenced Metric for Online Dialogue Evaluation [53.38078951628143]
We propose an unreferenced automated evaluation metric that uses large pre-trained language models to extract latent representations of utterances.
We show that our model achieves higher correlation with human annotations in an online setting, while not requiring true responses for comparison during inference.
arXiv Detail & Related papers (2020-05-01T20:01:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.