Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy
Evaluation Approach
- URL: http://arxiv.org/abs/2102.10242v1
- Date: Sat, 20 Feb 2021 03:29:20 GMT
- Title: Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy
Evaluation Approach
- Authors: Haoming Jiang, Bo Dai, Mengjiao Yang, Wei Wei, Tuo Zhao
- Abstract summary: We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
- Score: 84.02388020258141
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reliable automatic evaluation of dialogue systems under an interactive
environment has long been overdue. An ideal environment for evaluating dialog
systems, also known as the Turing test, needs to involve human interaction,
which is usually not affordable for large-scale experiments. Though researchers
have attempted to use metrics (e.g., perplexity, BLEU) in language generation
tasks or some model-based reinforcement learning methods (e.g., self-play
evaluation) for automatic evaluation, these methods only show a very weak
correlation with the actual human evaluation in practice. To bridge such a gap,
we propose a new framework named ENIGMA for estimating human evaluation scores
based on recent advances in off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore
does not involve human interaction with the target policy during the
evaluation, making automatic evaluations feasible. More importantly, ENIGMA is
model-free and agnostic to the behavior policies for collecting the experience
data (see details in Section 2), which significantly alleviates the technical
difficulties of modeling complex dialogue environments and human behaviors. Our
experiments show that ENIGMA significantly outperforms existing methods in
terms of correlation with human evaluation scores.
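To make the framework's premise concrete, below is a minimal sketch of one generic model-free, behavior-agnostic off-policy evaluation estimator (fitted Q evaluation), not ENIGMA's actual algorithm, for estimating a target dialog policy's expected score from pre-collected experience. The synthetic logged data, the linear feature map `phi`, and the uniform placeholder `target_action_probs` are all illustrative assumptions.

```python
# A hedged sketch of behavior-agnostic, model-free off-policy evaluation via
# fitted Q evaluation (FQE). It only illustrates the general recipe described
# in the abstract: estimate a target dialog policy's expected (human) score
# from logged experience, without modeling the environment or the behavior
# policy. All data below is synthetic; the policy and features are placeholders.
import numpy as np

rng = np.random.default_rng(0)

# Logged experience: featurized dialog states, discrete "response" actions,
# per-turn rewards (stand-ins for rescaled human scores), next states, done flags.
n, d, n_actions, gamma = 500, 8, 4, 0.9
states = rng.normal(size=(n, d))
actions = rng.integers(0, n_actions, size=n)
rewards = rng.normal(size=n)
next_states = rng.normal(size=(n, d))
dones = rng.random(n) < 0.2

def phi(s, a):
    """One-hot-by-action feature map for a linear Q-function."""
    out = np.zeros((s.shape[0], d * n_actions))
    for i, ai in enumerate(a):
        out[i, ai * d:(ai + 1) * d] = s[i]
    return out

def target_action_probs(s):
    """Target policy to evaluate: a uniform placeholder here."""
    return np.full((s.shape[0], n_actions), 1.0 / n_actions)

w = np.zeros(d * n_actions)
X = phi(states, actions)
for _ in range(50):  # fitted Q iteration against the target policy
    # Expected next-state value under the *target* policy; no behavior-policy
    # probabilities are needed (behavior-agnostic).
    probs = target_action_probs(next_states)
    q_next = np.stack(
        [phi(next_states, np.full(n, a)) @ w for a in range(n_actions)], axis=1
    )
    v_next = (probs * q_next).sum(axis=1) * (~dones)
    y = rewards + gamma * v_next
    # Ridge-regression step toward the Bellman target.
    w = np.linalg.solve(X.T @ X + 1e-3 * np.eye(X.shape[1]), X.T @ y)

# Estimated policy value: average Q under the target policy over the logged
# states (a crude stand-in for averaging over initial dialog states).
init_probs = target_action_probs(states)
q_init = np.stack(
    [phi(states, np.full(n, a)) @ w for a in range(n_actions)], axis=1
)
print("estimated policy value:", float((init_probs * q_init).sum(axis=1).mean()))
```

In a real dialog setting, the features would come from a pre-trained dialogue encoder and the rewards from human ratings; the point of the sketch is that neither an environment model nor the behavior policy's action probabilities is required.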
Related papers
- MIRROR: A Novel Approach for the Automated Evaluation of Open-Ended Question Generation [0.4857223913212445]
We propose a novel system, MIRROR, to automate the evaluation process for questions generated by automated question generation systems.
We observed that the scores of human evaluation metrics, namely relevance, appropriateness, novelty, complexity, and grammaticality, improved when using the feedback-based approach called MIRROR.
arXiv Detail & Related papers (2024-10-16T12:24:42Z)
- An Automatic and Cost-Efficient Peer-Review Framework for Language Generation Evaluation [29.81362106367831]
Existing evaluation methods often suffer from high costs, limited test formats, the need for human references, and systematic evaluation biases.
In contrast to previous studies that rely on human annotations, Auto-PRE selects evaluators automatically based on their inherent traits.
Experimental results indicate our Auto-PRE achieves state-of-the-art performance at a lower cost.
arXiv Detail & Related papers (2024-10-16T06:06:06Z)
- IQA-EVAL: Automatic Evaluation of Human-Model Interactive Question Answering [10.338962367542331]
In this work, we introduce IQA-EVAL, an automatic evaluation framework for Interactive Question Answering.
More specifically, we introduce an LLM-based Evaluation Agent (LEA) that can: (1) simulate human behaviors to generate interactions with IQA models; (2) automatically evaluate the generated interactions.
We show that our evaluation framework with GPT-4 as the backbone model achieves a high correlation with human evaluations on the IQA task.
arXiv Detail & Related papers (2024-08-24T10:34:20Z)
- Beyond Static Evaluation: A Dynamic Approach to Assessing AI Assistants' API Invocation Capabilities [48.922660354417204]
We propose Automated Dynamic Evaluation (AutoDE) to assess an assistant's API call capability without human involvement.
In our framework, we endeavor to closely mirror genuine human conversation patterns in human-machine interactions.
arXiv Detail & Related papers (2024-03-17T07:34:12Z)
- C-PMI: Conditional Pointwise Mutual Information for Turn-level Dialogue Evaluation [68.59356746305255]
We propose a novel model-agnostic approach to measure the turn-level interaction between the system and the user; a PMI-style sketch of this idea appears after this list.
Our approach significantly improves the correlation with human judgment compared with existing evaluation systems.
arXiv Detail & Related papers (2023-06-27T06:58:03Z)
- From Static Benchmarks to Adaptive Testing: Psychometrics in AI Evaluation [60.14902811624433]
We discuss a paradigm shift from static evaluation methods to adaptive testing.
This involves estimating the characteristics and value of each test item in the benchmark and dynamically adjusting items in real-time.
We analyze the current approaches, advantages, and underlying reasons for adopting psychometrics in AI evaluation.
arXiv Detail & Related papers (2023-06-18T09:54:33Z)
- Approximating Online Human Evaluation of Social Chatbots with Prompting [11.657633779338724]
Existing evaluation metrics aim to automate offline user evaluation and approximate human judgment of pre-curated dialogs.
We propose an approach to approximate online human evaluation leveraging large language models (LLMs) from the GPT family.
We introduce a new Dialog system Evaluation framework based on Prompting (DEP), which enables a fully automatic evaluation pipeline.
arXiv Detail & Related papers (2023-04-11T14:45:01Z)
- Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation [136.16507050034755]
Existing human evaluation studies for summarization either exhibit a low inter-annotator agreement or have insufficient scale.
We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units.
We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems.
arXiv Detail & Related papers (2022-12-15T17:26:05Z)
- Learning an Unreferenced Metric for Online Dialogue Evaluation [53.38078951628143]
We propose an unreferenced automated evaluation metric that uses large pre-trained language models to extract latent representations of utterances.
We show that our model achieves higher correlation with human annotations in an online setting, while not requiring true responses for comparison during inference.
arXiv Detail & Related papers (2020-05-01T20:01:39Z)
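Relating to the C-PMI entry above, here is a minimal sketch of a conditional-PMI-style turn-level score, assuming it is computed from a language model's log-likelihoods. This is not the paper's exact estimator; the `log_prob` interface and the toy stand-in scorer are illustrative assumptions.

```python
# A hedged sketch of a turn-level score in the spirit of conditional pointwise
# mutual information: how much more likely the system response becomes once the
# user's turn is taken into account, relative to the dialog context alone.
from typing import Callable

def conditional_pmi(
    log_prob: Callable[[str, str], float],
    context: str,
    user_turn: str,
    system_response: str,
) -> float:
    """C-PMI(response; user turn | context) =
    log p(response | context, user turn) - log p(response | context)."""
    with_user = log_prob(system_response, context + "\n" + user_turn)
    without_user = log_prob(system_response, context)
    return with_user - without_user

def toy_log_prob(text: str, prefix: str) -> float:
    """Toy stand-in: a real scorer would sum a language model's token
    log-probabilities of `text` conditioned on `prefix`."""
    overlap = len(set(text.lower().split()) & set(prefix.lower().split()))
    return -len(text.split()) + 0.5 * overlap  # crude length/overlap heuristic

score = conditional_pmi(
    toy_log_prob,
    context="A: Hi, how can I help you today?",
    user_turn="B: Can you recommend a science-fiction book?",
    system_response="A: You might enjoy 'Dune' by Frank Herbert.",
)
print(f"turn-level C-PMI-style score: {score:.2f}")
```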
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences.