REAM$\sharp$: An Enhancement Approach to Reference-based Evaluation
Metrics for Open-domain Dialog Generation
- URL: http://arxiv.org/abs/2105.14488v1
- Date: Sun, 30 May 2021 10:04:13 GMT
- Title: REAM$\sharp$: An Enhancement Approach to Reference-based Evaluation
Metrics for Open-domain Dialog Generation
- Authors: Jun Gao, Wei Bi, Ruifeng Xu and Shuming Shi
- Abstract summary: We present an enhancement approach to Reference-based EvAluation Metrics for open-domain dialogue systems.
A prediction model is designed to estimate the reliability of the given reference set.
We show how its predicted results can be helpful to augment the reference set, and thus improve the reliability of the metric.
- Score: 63.46331073232526
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The lack of reliable automatic evaluation metrics is a major impediment to
the development of open-domain dialogue systems. Various reference-based
metrics have been proposed to calculate a score between a predicted response
and a small set of references. However, these metrics show unsatisfactory
correlations with human judgments. For a reference-based metric, its
reliability mainly depends on two factors: its ability to measure the
similarity between the predicted response and the reference response, as well
as the reliability of the given reference set. Yet, there are few discussions
on the latter. Our work attempts to fill this vacancy. We first clarify an
assumption on reference-based metrics that, if more high-quality references are
added into the reference set, the reliability of the metric will increase.
Next, we present REAM$\sharp$: an enhancement approach to Reference-based
EvAluation Metrics for open-domain dialogue systems. A prediction model is
designed to estimate the reliability of the given reference set. We show how
its predicted results can be helpful to augment the reference set, and thus
improve the reliability of the metric. Experiments validate both the
effectiveness of our prediction model and that the reliability of
reference-based metrics improves with the augmented reference sets.
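The abstract rests on the assumption that a reference-based metric compares a predicted response against a *set* of references, and that adding high-quality references makes the metric more reliable. The toy sketch below is not the paper's REAM$\sharp$ model; it is only a minimal Python illustration of that multi-reference assumption, using a simple unigram-overlap F1 similarity, taking the best match over the reference set, and made-up example sentences.

```python
# Minimal sketch (not REAM#): score a predicted response against a reference
# SET by taking its best similarity to any single reference, so adding a
# high-quality reference can only raise the attainable score.
from typing import List


def unigram_f1(prediction: str, reference: str) -> float:
    """Set-based unigram overlap F1 between two whitespace-tokenized strings."""
    pred = set(prediction.lower().split())
    ref = set(reference.lower().split())
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def multi_reference_score(prediction: str, reference_set: List[str]) -> float:
    """Reference-based score = best match against any reference in the set."""
    return max(unigram_f1(prediction, ref) for ref in reference_set)


if __name__ == "__main__":
    references = ["i usually go hiking on weekends"]
    augmented = references + ["mostly i spend my weekends hiking outdoors"]
    prediction = "i spend most weekends hiking outdoors"
    print(multi_reference_score(prediction, references))  # 0.50 with a single reference
    print(multi_reference_score(prediction, augmented))   # ~0.77 once the set is augmented
```

Taking the maximum over references mirrors how multi-reference metrics give credit to any acceptable response; per the abstract, REAM$\sharp$'s contribution is predicting how reliable a given reference set is and using that signal to augment the set.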
Related papers
- Mitigating the Impact of Reference Quality on Evaluation of Summarization Systems with Reference-Free Metrics [4.881135687863645]
We introduce a reference-free metric that correlates well with human-evaluated relevance while being very cheap to compute.
We show that this metric can also be used alongside reference-based metrics to improve their robustness in low quality reference settings.
arXiv Detail & Related papers (2024-10-08T11:09:25Z)
- Towards an Improved Metric for Evaluating Disentangled Representations [0.6946415403594184]
Disentangled representation learning plays a pivotal role in making representations controllable, interpretable and transferable.
Despite its significance in the domain, the quest for a reliable and consistent quantitative disentanglement metric remains a major challenge.
We propose a new framework for quantifying disentanglement, introducing a metric entitled EDI that leverages the intuitive concept of exclusivity and an improved factor-code relationship.
arXiv Detail & Related papers (2024-10-04T00:32:59Z)
- Cobra Effect in Reference-Free Image Captioning Metrics [58.438648377314436]
A proliferation of reference-free methods, leveraging visual-language pre-trained models (VLMs), has emerged.
In this paper, we study if there are any deficiencies in reference-free metrics.
We employ GPT-4V as an evaluative tool to assess generated sentences, and the results reveal that our approach achieves state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2024-02-18T12:36:23Z)
- Towards Multiple References Era -- Addressing Data Leakage and Limited Reference Diversity in NLG Evaluation [55.92852268168816]
N-gram matching-based evaluation metrics, such as BLEU and chrF, are widely utilized across a range of natural language generation (NLG) tasks.
Recent studies have revealed a weak correlation between these matching-based metrics and human evaluations.
We propose to utilize multiple references to enhance the consistency between these metrics and human evaluations (see the illustrative BLEU sketch after this list).
arXiv Detail & Related papers (2023-08-06T14:49:26Z)
- Not All Metrics Are Guilty: Improving NLG Evaluation by Diversifying References [123.39034752499076]
Div-Ref is a method to enhance evaluation benchmarks by enriching the number of references.
We conduct experiments to empirically demonstrate that diversifying the expression of reference can significantly enhance the correlation between automatic evaluation and human evaluation.
arXiv Detail & Related papers (2023-05-24T11:53:29Z)
- DocAsRef: An Empirical Study on Repurposing Reference-Based Summary Quality Metrics Reference-Freely [29.4981129248937]
We propose that some reference-based metrics can be effectively adapted to assess a system summary against its corresponding reference.
After being repurposed reference-freely, the zero-shot BERTScore consistently outperforms its original reference-based version.
It also excels in comparison to most existing reference-free metrics and closely competes with zero-shot summary evaluators based on GPT-3.5.
arXiv Detail & Related papers (2022-12-20T06:01:13Z)
- Spurious Correlations in Reference-Free Evaluation of Text Generation [35.80256755393739]
We show that reference-free evaluation metrics of summarization and dialog generation may be relying on spurious correlations with measures such as word overlap, perplexity, and length.
We demonstrate that these errors can be mitigated by explicitly designing evaluation metrics to avoid spurious features in reference-free evaluation.
arXiv Detail & Related papers (2022-04-21T05:32:38Z)
- Revisiting the Evaluation Metrics of Paraphrase Generation [35.6803390044542]
Most existing paraphrase generation models use reference-based metrics to evaluate their generated paraphrases.
This paper proposes BBScore, a reference-free metric that can reflect the generated paraphrase's quality.
arXiv Detail & Related papers (2022-02-17T07:18:54Z)
- A Statistical Analysis of Summarization Evaluation Metrics using Resampling Methods [60.04142561088524]
We find that the confidence intervals are rather wide, demonstrating high uncertainty in how reliable automatic metrics truly are.
Although many metrics fail to show statistical improvements over ROUGE, two recent works, QAEval and BERTScore, do in some evaluation settings.
arXiv Detail & Related papers (2021-03-31T18:28:14Z)
- GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics.
Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation.
It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z)
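Several entries above (the multiple-references paper and Div-Ref) report that giving n-gram metrics such as BLEU more than one reference improves their agreement with human judgments. The snippet below is only an illustration of that idea, assuming the sacrebleu package is installed; it is not code from any of the listed papers, and the example sentences are invented.

```python
# Illustration only: BLEU against a single reference vs. an enriched reference
# set. sacrebleu clips each n-gram count by its maximum count across the
# references, so added paraphrases give valid rewordings more credit.
import sacrebleu  # assumed dependency: pip install sacrebleu

hypothesis = "The cat is sitting on the mat."
single_reference = ["There is a cat on the mat."]
augmented_references = single_reference + [
    "A cat sits on the mat.",
    "The cat is on the mat.",
]

print(sacrebleu.sentence_bleu(hypothesis, single_reference).score)
print(sacrebleu.sentence_bleu(hypothesis, augmented_references).score)  # >= the single-reference score here
```

The Div-Ref entry pushes the same lever further by diversifying how each reference is expressed before scoring.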
This list is automatically generated from the titles and abstracts of the papers in this site.