Dialogue Evaluation with Offline Reinforcement Learning
- URL: http://arxiv.org/abs/2209.00876v1
- Date: Fri, 2 Sep 2022 08:32:52 GMT
- Title: Dialogue Evaluation with Offline Reinforcement Learning
- Authors: Nurul Lubis, Christian Geishauser, Hsien-Chin Lin, Carel van Niekerk,
Michael Heck, Shutong Feng, Milica Gašić
- Abstract summary: Task-oriented dialogue systems aim to fulfill user goals through natural language interactions.
They are ideally evaluated with human users, which is not feasible at every iteration of the development phase.
We propose the use of offline reinforcement learning for dialogue evaluation based on a static corpus.
- Score: 2.580163308334609
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Task-oriented dialogue systems aim to fulfill user goals through natural
language interactions. They are ideally evaluated with human users, which,
however, is not feasible at every iteration of the development phase.
Simulated users could be an alternative; however, their development is
nontrivial. Therefore, researchers resort to offline metrics on existing
human-human corpora, which are more practical and easily reproducible. These
metrics are unfortunately limited in reflecting the real performance of
dialogue systems. BLEU, for instance, is poorly correlated with human judgment,
and existing corpus-based metrics such as success rate overlook dialogue
context mismatches.
There is still a need for a reliable metric for task-oriented systems with good
generalization and strong correlation with human judgements. In this paper, we
propose the use of offline reinforcement learning for dialogue evaluation based
on a static corpus. Such an evaluator is typically called a critic and utilized
for policy optimization. We go one step further and show that offline RL
critics can be trained on a static corpus for any dialogue system as external
evaluators, allowing dialogue performance comparisons across various types of
systems. This approach has the benefit of being corpus- and model-independent,
while attaining strong correlation with human judgements, which we confirm via
an interactive user trial.
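The core idea is that a critic, i.e., a value-function estimator, can be trained purely on logged dialogue transitions from a static corpus and then queried to score any system's dialogues. Below is a minimal sketch of such an offline critic trained via fitted Q-evaluation; the feature encodings, network sizes, reward convention, and synthetic data are illustrative assumptions, not the setup used in the paper.

```python
# Minimal sketch of an offline critic trained with fitted Q-evaluation on logged
# dialogue transitions. Dimensions, architecture, reward convention, and the
# synthetic corpus below are illustrative assumptions, not the paper's setup.
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 64, 16  # assumed encodings of dialogue state / system action

class Critic(nn.Module):
    """Q(s, a) estimator over encoded dialogue states and system actions."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1)).squeeze(-1)

def fitted_q_step(critic, target, batch, optimizer, gamma=0.99):
    """One TD-regression step on logged (s, a, r, s', a', done) tuples."""
    s, a, r, s2, a2, done = batch
    with torch.no_grad():
        td_target = r + gamma * (1.0 - done) * target(s2, a2)
    loss = nn.functional.mse_loss(critic(s, a), td_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Synthetic stand-in for a static corpus of dialogue transitions.
N = 256
corpus = (torch.randn(N, STATE_DIM), torch.randn(N, ACTION_DIM),
          torch.randn(N), torch.randn(N, STATE_DIM),
          torch.randn(N, ACTION_DIM), torch.randint(0, 2, (N,)).float())

critic, target = Critic(), Critic()
target.load_state_dict(critic.state_dict())
opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
for step in range(200):
    fitted_q_step(critic, target, corpus, opt)
    if step % 20 == 0:  # periodically refresh the frozen target network
        target.load_state_dict(critic.state_dict())

# The trained critic's value estimates on a system's dialogue states and actions
# can then serve as a corpus-based proxy for interactive evaluation.
```

In practice the state and action features would be derived from the corpus annotations (e.g., belief states and system acts), and the critic's value estimates would then be correlated with interactive human judgements, as in the paper's user trial.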
Related papers
- Approximating Online Human Evaluation of Social Chatbots with Prompting [11.657633779338724]
Existing evaluation metrics aim to automate offline user evaluation and approximate human judgment of pre-curated dialogs.
We propose an approach to approximate online human evaluation leveraging large language models (LLMs) from the GPT family.
We introduce a new Dialog system Evaluation framework based on Prompting (DEP), which enables a fully automatic evaluation pipeline.
arXiv Detail & Related papers (2023-04-11T14:45:01Z)
- Don't Forget Your ABC's: Evaluating the State-of-the-Art in Chat-Oriented Dialogue Systems [12.914512702731528]
This paper presents a novel human evaluation method to estimate the rates of many dialogue system behaviors.
Our method is used to evaluate four state-of-the-art open-domain dialogue systems and compared with existing approaches.
arXiv Detail & Related papers (2022-12-18T22:07:55Z)
- CHAI: A CHatbot AI for Task-Oriented Dialogue with Offline Reinforcement Learning [85.3987745097806]
Offline reinforcement learning can be used to train dialogue agents entirely from static datasets collected from human speakers.
Experiments show that recently developed offline RL methods can be combined with language models to yield realistic dialogue agents.
arXiv Detail & Related papers (2022-04-18T17:43:21Z)
- DynaEval: Unifying Turn and Dialogue Level Evaluation [60.66883575106898]
We propose DynaEval, a unified automatic evaluation framework.
It is not only capable of performing turn-level evaluation, but also holistically considers the quality of the entire dialogue.
Experiments show that DynaEval significantly outperforms the state-of-the-art dialogue coherence model.
arXiv Detail & Related papers (2021-06-02T12:23:18Z)
- Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA requires only a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z)
- Is Your Goal-Oriented Dialog Model Performing Really Well? Empirical Analysis of System-wise Evaluation [114.48767388174218]
This paper presents an empirical analysis of different types of dialog systems composed of different modules in different settings.
Our results show that a pipeline dialog system trained using fine-grained supervision signals at different component levels often obtains better performance than systems that use joint or end-to-end models trained on coarse-grained labels.
arXiv Detail & Related papers (2020-05-15T05:20:06Z)
- Learning an Unreferenced Metric for Online Dialogue Evaluation [53.38078951628143]
We propose an unreferenced automated evaluation metric that uses large pre-trained language models to extract latent representations of utterances.
We show that our model achieves higher correlation with human annotations in an online setting, while not requiring true responses for comparison during inference.
arXiv Detail & Related papers (2020-05-01T20:01:39Z)
- You Impress Me: Dialogue Generation via Mutual Persona Perception [62.89449096369027]
Research in cognitive science suggests that understanding is an essential signal for a high-quality chit-chat conversation.
Motivated by this, we propose P2 Bot, a transmitter-receiver based framework that aims to explicitly model understanding.
arXiv Detail & Related papers (2020-04-11T12:51:07Z)
- Designing Precise and Robust Dialogue Response Evaluators [35.137244385158034]
We propose to build a reference-free evaluator and exploit the power of semi-supervised training and pretrained language models.
Experimental results demonstrate that the proposed evaluator achieves a strong correlation (> 0.6) with human judgement.
arXiv Detail & Related papers (2020-04-10T04:59:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.