Relevance in Dialogue: Is Less More? An Empirical Comparison of Existing Metrics, and a Novel Simple Metric
- URL: http://arxiv.org/abs/2206.01823v1
- Date: Fri, 3 Jun 2022 21:23:05 GMT
- Title: Relevance in Dialogue: Is Less More? An Empirical Comparison of Existing Metrics, and a Novel Simple Metric
- Authors: Ian Berlot-Attwell and Frank Rudzicz
- Abstract summary: We propose modifications to reduce data requirements and domain sensitivity while improving correlation.
Our proposed metric achieves state-of-the-art performance on the HUMOD dataset while reducing measured sensitivity to dataset by 37%-66%.
- Score: 18.690461703947047
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In this work, we evaluate various existing dialogue relevance metrics, find
strong dependency on the dataset, often with poor correlation with human scores
of relevance, and propose modifications to reduce data requirements and domain
sensitivity while improving correlation. Our proposed metric achieves
state-of-the-art performance on the HUMOD dataset while reducing measured
sensitivity to dataset by 37%-66%. We achieve this without fine-tuning a
pretrained language model, and using only 3,750 unannotated human dialogues and
a single negative example. Despite these limitations, we demonstrate
competitive performance on four datasets from different domains. Our code,
including our metric and experiments, is open sourced.
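As background for readers new to unreferenced relevance scoring, the sketch below shows a minimal embedding-cosine relevance scorer built on a frozen pretrained sentence encoder. This is not the paper's proposed metric (the abstract does not specify its construction beyond avoiding fine-tuning and using unannotated dialogues plus a single negative example); the encoder name and the function are illustrative assumptions only.

```python
# Illustrative baseline only, NOT the paper's metric: score a dialogue response
# by cosine similarity between frozen sentence embeddings of context and reply.
from sentence_transformers import SentenceTransformer  # assumed dependency
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # any frozen encoder works

def relevance(context: str, response: str) -> float:
    """Cosine similarity between embeddings of the context and the response."""
    ctx, rsp = model.encode([context, response], convert_to_numpy=True)
    return float(np.dot(ctx, rsp) / (np.linalg.norm(ctx) * np.linalg.norm(rsp)))

print(relevance("Do you like jazz?", "I love Coltrane."))    # expected: higher
print(relevance("Do you like jazz?", "My train was late."))  # expected: lower
```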
Related papers
- An Empirical Comparison of Text Summarization: A Multi-Dimensional Evaluation of Large Language Models [2.1945750784330067]
This research evaluates summarization performance across 17 large language models (OpenAI, Google, Anthropic, open-source).
We assessed models on seven diverse datasets using metrics for factual consistency, semantic similarity, lexical overlap, and human-like quality.
arXiv Detail & Related papers (2025-04-06T16:24:22Z)
- Machine Translation Meta Evaluation through Translation Accuracy Challenge Sets [92.38654521870444]
We introduce ACES, a contrastive challenge set spanning 146 language pairs.
This dataset aims to discover whether metrics can identify 68 translation accuracy errors.
We conduct a large-scale study by benchmarking ACES on 50 metrics submitted to the WMT 2022 and 2023 metrics shared tasks.
arXiv Detail & Related papers (2024-01-29T17:17:42Z)
- Measuring and Improving Attentiveness to Partial Inputs with Counterfactuals [91.59906995214209]
We propose a new evaluation method, the Counterfactual Attentiveness Test (CAT).
CAT uses counterfactuals by replacing part of the input with its counterpart from a different example, expecting an attentive model to change its prediction.
We show that GPT-3 becomes less attentive with an increased number of demonstrations, while its accuracy on the test data improves (see the sketch after this entry).
arXiv Detail & Related papers (2023-11-16T06:27:35Z)
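To make the counterfactual swap described above concrete, here is a minimal sketch of the CAT idea applied to an NLI-style input. The field names and the `predict` wrapper are assumptions for illustration, not the authors' interface.

```python
# Hedged sketch of a counterfactual attentiveness check: replace one part of
# the input (here, the premise) with the same part from a different example
# and test whether the model's prediction changes. An attentive model should.
from typing import Callable, Dict

def attentiveness_check(predict: Callable[[str, str], str],
                        example: Dict[str, str],
                        other: Dict[str, str]) -> bool:
    """True if swapping in an unrelated premise flips the model's prediction."""
    original = predict(example["premise"], example["hypothesis"])
    counterfactual = predict(other["premise"], example["hypothesis"])
    return original != counterfactual

# Usage, given any classifier wrapped as predict(premise, hypothesis) -> label:
#   flipped = attentiveness_check(predict, example_a, example_b)
```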
- What is the Best Automated Metric for Text to Motion Generation? [19.71712698183703]
There is growing interest in generating skeleton-based human motions from natural language descriptions.
Human evaluation is the ultimate accuracy measure for this task, and automated metrics should correlate well with human quality judgments.
This paper systematically studies which metrics best align with human evaluations and proposes new metrics that align even better.
arXiv Detail & Related papers (2023-09-19T01:59:54Z)
- Towards Multiple References Era -- Addressing Data Leakage and Limited Reference Diversity in NLG Evaluation [55.92852268168816]
N-gram matching-based evaluation metrics, such as BLEU and chrF, are widely utilized across a range of natural language generation (NLG) tasks.
Recent studies have revealed a weak correlation between these matching-based metrics and human evaluations.
We propose to utilize multiple references to enhance the consistency between these metrics and human evaluations (a multi-reference scoring sketch follows this entry).
arXiv Detail & Related papers (2023-08-06T14:49:26Z)
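As context for the multi-reference proposal above, the snippet below shows one way to score a hypothesis against several reference streams using the off-the-shelf sacrebleu implementations of BLEU and chrF; the toy sentences are made up for illustration and do not come from the paper.

```python
# Minimal multi-reference scoring with sacrebleu's BLEU and chrF.
import sacrebleu

hypotheses = ["the cat sat on the mat"]
# Each inner list is one reference stream; element i pairs with hypothesis i.
references = [
    ["the cat sat on the mat"],        # reference set 1
    ["a cat was sitting on the mat"],  # reference set 2
]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU: {bleu.score:.1f}  chrF: {chrf.score:.1f}")
```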
- Bring Your Own Data! Self-Supervised Evaluation for Large Language Models [52.15056231665816]
We propose a framework for self-supervised evaluation of Large Language Models (LLMs).
We demonstrate self-supervised evaluation strategies for measuring closed-book knowledge, toxicity, and long-range context dependence.
We find strong correlations between self-supervised and human-supervised evaluations.
arXiv Detail & Related papers (2023-06-23T17:59:09Z)
- ED-FAITH: Evaluating Dialogue Summarization on Faithfulness [35.73012379398233]
We first present a systematic study of faithfulness metrics for dialogue summarization.
We observe that most metrics correlate poorly with human judgements despite performing well on news datasets.
We propose T0-Score -- a new metric for faithfulness evaluation.
arXiv Detail & Related papers (2022-11-15T19:33:50Z)
- DEAM: Dialogue Coherence Evaluation using AMR-based Semantic Manipulations [46.942369532632604]
We propose a Dialogue Evaluation metric that relies on AMR-based semantic manipulations for incoherent data generation.
Our experiments show that DEAM achieves higher correlations with human judgments compared to baseline methods.
arXiv Detail & Related papers (2022-03-18T03:11:35Z)
- Process for Adapting Language Models to Society (PALMS) with Values-Targeted Datasets [0.0]
Language models can generate harmful and biased outputs and exhibit undesirable behavior.
We propose a Process for Adapting Language Models to Society (PALMS) with Values-Targeted datasets.
We show that significantly adjusting language model behavior is feasible with a small, hand-curated dataset.
arXiv Detail & Related papers (2021-06-18T19:38:28Z)
- Domain Adaptative Causality Encoder [52.779274858332656]
We leverage the characteristics of dependency trees and adversarial learning to address the tasks of adaptive causality identification and localisation.
We present a new causality dataset, namely MedCaus, which integrates all types of causality in the text.
arXiv Detail & Related papers (2020-11-27T04:14:55Z)
- Learning an Unreferenced Metric for Online Dialogue Evaluation [53.38078951628143]
We propose an unreferenced automated evaluation metric that uses large pre-trained language models to extract latent representations of utterances.
We show that our model achieves higher correlation with human annotations in an online setting, while not requiring true responses for comparison during inference.
arXiv Detail & Related papers (2020-05-01T20:01:39Z)