Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy
Evaluation Approach
- URL: http://arxiv.org/abs/2102.10242v1
- Date: Sat, 20 Feb 2021 03:29:20 GMT
- Title: Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy
Evaluation Approach
- Authors: Haoming Jiang, Bo Dai, Mengjiao Yang, Wei Wei, Tuo Zhao
- Abstract summary: We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
- Score: 84.02388020258141
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reliable automatic evaluation of dialogue systems under an interactive
environment has long been overdue. An ideal environment for evaluating dialog
systems, also known as the Turing test, needs to involve human interaction,
which is usually not affordable for large-scale experiments. Though researchers
have attempted to use metrics (e.g., perplexity, BLEU) in language generation
tasks or some model-based reinforcement learning methods (e.g., self-play
evaluation) for automatic evaluation, these methods only show a very weak
correlation with the actual human evaluation in practice. To bridge such a gap,
we propose a new framework named ENIGMA for estimating human evaluation scores
based on recent advances of off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore
does not involve human interaction with the target policy during the
evaluation, making automatic evaluations feasible. More importantly, ENIGMA is
model-free and agnostic to the behavior policies for collecting the experience
data (see details in Section 2), which significantly alleviates the technical
difficulties of modeling complex dialogue environments and human behaviors. Our
experiments show that ENIGMA significantly outperforms existing methods in
terms of correlation with human evaluation scores.
Related papers
- Beyond Static Evaluation: A Dynamic Approach to Assessing AI Assistants' API Invocation Capabilities [48.922660354417204]
We propose Automated Dynamic Evaluation (AutoDE) to assess an assistant's API call capability without human involvement.
In our framework, we endeavor to closely mirror genuine human conversation patterns in human-machine interactions.
arXiv Detail & Related papers (2024-03-17T07:34:12Z) - C-PMI: Conditional Pointwise Mutual Information for Turn-level Dialogue
Evaluation [68.59356746305255]
We propose a novel model-agnostic approach to measure the turn-level interaction between the system and the user.
Our approach significantly improves the correlation with human judgment compared with existing evaluation systems.
arXiv Detail & Related papers (2023-06-27T06:58:03Z) - Approximating Online Human Evaluation of Social Chatbots with Prompting [11.657633779338724]
Existing evaluation metrics aim to automate offline user evaluation and approximate human judgment of pre-curated dialogs.
We propose an approach to approximate online human evaluation leveraging large language models (LLMs) from the GPT family.
We introduce a new Dialog system Evaluation framework based on Prompting (DEP), which enables a fully automatic evaluation pipeline.
arXiv Detail & Related papers (2023-04-11T14:45:01Z) - Revisiting the Gold Standard: Grounding Summarization Evaluation with
Robust Human Evaluation [136.16507050034755]
Existing human evaluation studies for summarization either exhibit a low inter-annotator agreement or have insufficient scale.
We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units.
We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems.
arXiv Detail & Related papers (2022-12-15T17:26:05Z) - Dialogue Evaluation with Offline Reinforcement Learning [2.580163308334609]
Task-oriented dialogue systems aim to fulfill user goals through natural language interactions.
They are ideally evaluated with human users, which is unattainable to do at every iteration of the development phase.
We propose the use of offline reinforcement learning for dialogue evaluation based on a static corpus.
arXiv Detail & Related papers (2022-09-02T08:32:52Z) - GODEL: Large-Scale Pre-Training for Goal-Directed Dialog [119.1397031992088]
We introduce GODEL, a large pre-trained language model for dialog.
We show that GODEL outperforms state-of-the-art pre-trained dialog models in few-shot fine-tuning setups.
A novel feature of our evaluation methodology is the introduction of a notion of utility that assesses the usefulness of responses.
arXiv Detail & Related papers (2022-06-22T18:19:32Z) - Achieving Reliable Human Assessment of Open-Domain Dialogue Systems [24.478609926760587]
We present the successful development of human evaluation that is highly reliable while still remaining feasible and low cost.
Due to the lack of appropriate methods of statistical significance testing, the likelihood of potential improvements to systems occurring due to chance is rarely taken into account in dialogue evaluation.
arXiv Detail & Related papers (2022-03-11T13:08:39Z) - Dynamic Human Evaluation for Relative Model Comparisons [8.843915018287476]
We present a dynamic approach to measure the required number of human annotations when evaluating generated outputs in relative comparison settings.
We propose an agent-based framework of human evaluation to assess multiple labelling strategies and methods to decide the better model in a simulation and a crowdsourcing case study.
arXiv Detail & Related papers (2021-12-15T11:32:13Z) - Learning an Unreferenced Metric for Online Dialogue Evaluation [53.38078951628143]
We propose an unreferenced automated evaluation metric that uses large pre-trained language models to extract latent representations of utterances.
We show that our model achieves higher correlation with human annotations in an online setting, while not requiring true responses for comparison during inference.
arXiv Detail & Related papers (2020-05-01T20:01:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.