Measuring the Robustness of Reference-Free Dialogue Evaluation Systems
- URL: http://arxiv.org/abs/2501.06728v1
- Date: Sun, 12 Jan 2025 06:41:52 GMT
- Title: Measuring the Robustness of Reference-Free Dialogue Evaluation Systems
- Authors: Justin Vasselli, Adam Nohejl, Taro Watanabe,
- Abstract summary: We present a benchmark for evaluating the robustness of reference-free dialogue metrics against four categories of adversarial attacks.
We analyze metrics such as DialogRPT, UniEval, and PromptEval across grounded and ungrounded datasets.
- Score: 12.332146893333952
- License:
- Abstract: Advancements in dialogue systems powered by large language models (LLMs) have outpaced the development of reliable evaluation metrics, particularly for diverse and creative responses. We present a benchmark for evaluating the robustness of reference-free dialogue metrics against four categories of adversarial attacks: speaker tag prefixes, static responses, ungrammatical responses, and repeated conversational context. We analyze metrics such as DialogRPT, UniEval, and PromptEval -- a prompt-based method leveraging LLMs -- across grounded and ungrounded datasets. By examining both their correlation with human judgment and susceptibility to adversarial attacks, we find that these two axes are not always aligned; metrics that appear to be equivalent when judged by traditional benchmarks may, in fact, vary in their scores of adversarial responses. These findings motivate the development of nuanced evaluation frameworks to address real-world dialogue challenges.
Related papers
- CausalScore: An Automatic Reference-Free Metric for Assessing Response Relevance in Open-Domain Dialogue Systems [43.5428962271088]
We propose a novel metric, called CausalScore, which assesses the relevance of responses by measuring the causal strength between dialogue histories and responses.
Our experimental results demonstrate that CausalScore significantly surpasses existing state-of-the-art metrics by aligning better with human judgements.
arXiv Detail & Related papers (2024-06-25T06:08:16Z) - Context Does Matter: Implications for Crowdsourced Evaluation Labels in Task-Oriented Dialogue Systems [57.16442740983528]
Crowdsourced labels play a crucial role in evaluating task-oriented dialogue systems.
Previous studies suggest using only a portion of the dialogue context in the annotation process.
This study investigates the influence of dialogue context on annotation quality.
arXiv Detail & Related papers (2024-04-15T17:56:39Z) - PairEval: Open-domain Dialogue Evaluation with Pairwise Comparison [38.03304773600225]
PairEval is a novel dialogue evaluation metric for assessing responses by comparing their quality against responses in different conversations.
We show that PairEval exhibits a higher correlation with human judgments than baseline metrics.
We also find that the proposed comparative metric is more robust in detecting common failures from open-domain dialogue systems.
arXiv Detail & Related papers (2024-04-01T09:35:06Z) - Simple LLM Prompting is State-of-the-Art for Robust and Multilingual
Dialogue Evaluation [7.767020408405403]
We propose a novel framework that takes advantage of the strengths of current evaluation models with the newly-established paradigm of prompting Large Language Models (LLMs)
Empirical results show our framework achieves state of the art results in terms of mean Spearman correlation scores across several benchmarks.
arXiv Detail & Related papers (2023-08-31T15:19:28Z) - Evaluate What You Can't Evaluate: Unassessable Quality for Generated Response [56.25966921370483]
There are challenges in using reference-free evaluators based on large language models.
Reference-free evaluators are more suitable for open-ended examples with different semantics responses.
There are risks in using eference-free evaluators based on LLMs to evaluate the quality of dialogue responses.
arXiv Detail & Related papers (2023-05-24T02:52:48Z) - Don't Forget Your ABC's: Evaluating the State-of-the-Art in
Chat-Oriented Dialogue Systems [12.914512702731528]
This paper presents a novel human evaluation method to estimate the rates of many dialogue system behaviors.
Our method is used to evaluate four state-of-the-art open-domain dialogue systems and compared with existing approaches.
arXiv Detail & Related papers (2022-12-18T22:07:55Z) - FineD-Eval: Fine-grained Automatic Dialogue-Level Evaluation [58.46761798403072]
We propose a dialogue-level metric that consists of three sub-metrics with each targeting a specific dimension.
The sub-metrics are trained with novel self-supervised objectives and exhibit strong correlations with human judgment for their respective dimensions.
Compared to the existing state-of-the-art metric, the combined metrics achieve around 16% relative improvement on average.
arXiv Detail & Related papers (2022-10-25T08:26:03Z) - DynaEval: Unifying Turn and Dialogue Level Evaluation [60.66883575106898]
We propose DynaEval, a unified automatic evaluation framework.
It is capable of performing turn-level evaluation, but also holistically considers the quality of the entire dialogue.
Experiments show that DynaEval significantly outperforms the state-of-the-art dialogue coherence model.
arXiv Detail & Related papers (2021-06-02T12:23:18Z) - Assessing Dialogue Systems with Distribution Distances [48.61159795472962]
We propose to measure the performance of a dialogue system by computing the distribution-wise distance between its generated conversations and real-world conversations.
Experiments on several dialogue corpora show that our proposed metrics correlate better with human judgments than existing metrics.
arXiv Detail & Related papers (2021-05-06T10:30:13Z) - Learning an Unreferenced Metric for Online Dialogue Evaluation [53.38078951628143]
We propose an unreferenced automated evaluation metric that uses large pre-trained language models to extract latent representations of utterances.
We show that our model achieves higher correlation with human annotations in an online setting, while not requiring true responses for comparison during inference.
arXiv Detail & Related papers (2020-05-01T20:01:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.