Towards Multiple References Era -- Addressing Data Leakage and Limited
Reference Diversity in NLG Evaluation
- URL: http://arxiv.org/abs/2308.03131v4
- Date: Thu, 10 Aug 2023 02:08:04 GMT
- Title: Towards Multiple References Era -- Addressing Data Leakage and Limited
Reference Diversity in NLG Evaluation
- Authors: Xianfeng Zeng, Yijin Liu, Fandong Meng and Jie Zhou
- Abstract summary: N-gram matching-based evaluation metrics, such as BLEU and chrF, are widely utilized across a range of natural language generation (NLG) tasks.
Recent studies have revealed a weak correlation between these matching-based metrics and human evaluations.
We propose to utilize multiple references to enhance the consistency between these metrics and human evaluations.
- Score: 55.92852268168816
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: N-gram matching-based evaluation metrics, such as BLEU and chrF, are widely
utilized across a range of natural language generation (NLG) tasks. However,
recent studies have revealed a weak correlation between these matching-based
metrics and human evaluations, especially when compared with neural-based
metrics like BLEURT. In this paper, we conjecture that the performance
bottleneck in matching-based metrics may be caused by the limited diversity of
references. To address this issue, we propose to utilize \textit{multiple
references} to enhance the consistency between these metrics and human
evaluations. Within the WMT Metrics benchmarks, we observe that the
multi-reference F200spBLEU surpasses the conventional single-reference one by
an accuracy improvement of 7.2\%. Remarkably, it also exceeds the neural-based
BERTscore by an accuracy enhancement of 3.9\%. Moreover, we observe that the
data leakage issue in large language models (LLMs) can be mitigated to a large
extent by our multi-reference metric. We release the code and data at
\url{https://github.com/SefaZeng/LLM-Ref}
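The multi-reference setup is easy to reproduce with standard tooling, since BLEU and chrF implementations already accept several reference streams per hypothesis. Below is a minimal sketch using sacrebleu; the sentences are illustrative placeholders, and F200spBLEU refers (to our reading) to spBLEU computed with the FLORES-200 SentencePiece tokenizer, which recent sacrebleu releases expose as a tokenizer option.
```python
# Minimal sketch: scoring one system output against multiple reference sets
# with sacrebleu. The sentences below are illustrative placeholders.
import sacrebleu

hypotheses = [
    "the cat sat on the mat .",
    "there is a book on the table .",
]

# Each inner list is one complete reference stream, aligned with `hypotheses`.
# Adding more streams is all that is needed to go from single- to
# multi-reference evaluation.
references = [
    ["the cat was sitting on the mat .", "a book lies on the table ."],
    ["a cat sat on the mat .",           "the table has a book on it ."],
]

single_ref_bleu = sacrebleu.corpus_bleu(hypotheses, references[:1])
multi_ref_bleu  = sacrebleu.corpus_bleu(hypotheses, references)
multi_ref_chrf  = sacrebleu.corpus_chrf(hypotheses, references)

print(f"single-reference BLEU: {single_ref_bleu.score:.2f}")
print(f"multi-reference  BLEU: {multi_ref_bleu.score:.2f}")
print(f"multi-reference  chrF: {multi_ref_chrf.score:.2f}")

# F200spBLEU is BLEU computed with the FLORES-200 SentencePiece tokenizer;
# recent sacrebleu releases expose it via BLEU(tokenize="flores200")
# (availability depends on the installed sacrebleu version).
```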
Related papers
- Using Similarity to Evaluate Factual Consistency in Summaries [2.7595794227140056]
Abstractive summarisers generate fluent summaries, but the factuality of the generated text is not guaranteed.
We propose a new zero-shot factuality evaluation metric, Sentence-BERTScore (SBERTScore), which compares sentences between the summary and the source document.
Our experiments indicate that each technique has different strengths, with SBERTScore particularly effective in identifying correct summaries.
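A minimal sketch of the sentence-embedding comparison described above, using the sentence-transformers library; the model choice and the max-over-source-sentences aggregation are illustrative assumptions, not the paper's exact recipe.
```python
# Sketch of an SBERTScore-style check: embed summary and source sentences,
# then score each summary sentence by its best cosine match in the source.
# The model name and the aggregation are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

source_sents = [
    "The company reported a 12% rise in quarterly revenue.",
    "Its cloud division grew fastest, according to the filing.",
]
summary_sents = [
    "Quarterly revenue rose 12%, driven by the cloud division.",
]

src_emb = model.encode(source_sents, convert_to_tensor=True)
sum_emb = model.encode(summary_sents, convert_to_tensor=True)

# cosine similarity matrix: summary sentences x source sentences
sims = util.cos_sim(sum_emb, src_emb)

# score each summary sentence by its most similar source sentence,
# then average over the summary
sbert_style_score = sims.max(dim=1).values.mean().item()
print(f"summary/source similarity: {sbert_style_score:.3f}")
```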
arXiv Detail & Related papers (2024-09-23T15:02:38Z)
- Revisiting Evaluation Metrics for Semantic Segmentation: Optimization and Evaluation of Fine-grained Intersection over Union [113.20223082664681]
We propose the use of fine-grained mIoUs along with corresponding worst-case metrics.
These fine-grained metrics offer less bias towards large objects, richer statistical information, and valuable insights into model and dataset auditing.
Our benchmark study highlights the necessity of not basing evaluations on a single metric and confirms that fine-grained mIoUs reduce the bias towards large objects.
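A minimal sketch of the aggregation difference this line of work targets, shown for a single class: dataset-pooled IoU versus per-image IoU and its worst case. The exact fine-grained definitions in the paper may differ; this only illustrates why pixel pooling favours large objects.
```python
# Sketch contrasting dataset-pooled IoU with a per-image ("fine-grained")
# IoU and a worst-case variant, for a single class. The masks are random
# toys; the paper's exact metric definitions may differ.
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 1.0

rng = np.random.default_rng(0)
# toy binary masks: a large, dense object (image 0) and a small, sparse one (image 1)
preds = [rng.random((64, 64)) > 0.3, rng.random((8, 8)) > 0.7]
gts   = [rng.random((64, 64)) > 0.3, rng.random((8, 8)) > 0.7]

# dataset-pooled IoU: pixel counts from the large image dominate
pooled_inter = sum(np.logical_and(p, g).sum() for p, g in zip(preds, gts))
pooled_union = sum(np.logical_or(p, g).sum() for p, g in zip(preds, gts))
pooled_iou = pooled_inter / pooled_union

# fine-grained view: per-image IoU, its mean, and its worst case
per_image = [iou(p, g) for p, g in zip(preds, gts)]
print(f"pooled IoU:     {pooled_iou:.3f}")
print(f"mean per-image: {np.mean(per_image):.3f}")
print(f"worst-case:     {min(per_image):.3f}")
```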
arXiv Detail & Related papers (2023-10-30T03:45:15Z)
- Not All Metrics Are Guilty: Improving NLG Evaluation by Diversifying References [123.39034752499076]
Div-Ref is a method to enhance evaluation benchmarks by enriching the number of references.
We conduct experiments to empirically demonstrate that diversifying the expression of reference can significantly enhance the correlation between automatic evaluation and human evaluation.
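A minimal sketch of this reference-diversification idea; the paraphrase function here is a hypothetical stand-in for whatever paraphrasing model (e.g. an LLM prompted to rewrite the reference) is actually used.
```python
# Sketch of the Div-Ref idea: enlarge the reference set with paraphrases,
# then score against all of them. `paraphrase` is a hypothetical placeholder
# for a real paraphrasing model.
import sacrebleu

def paraphrase(reference: str) -> list:
    # placeholder: in practice these would come from a paraphrasing model
    return [
        "a cat was sitting on the mat .",
        "on the mat , a cat sat .",
    ]

hypothesis = "the cat sat on the mat ."
original_ref = "the cat was sitting on the mat ."

single = sacrebleu.sentence_bleu(hypothesis, [original_ref])
diversified = sacrebleu.sentence_bleu(
    hypothesis, [original_ref] + paraphrase(original_ref)
)

print(f"single reference:       {single.score:.2f}")
print(f"diversified references: {diversified.score:.2f}")
```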
arXiv Detail & Related papers (2023-05-24T11:53:29Z)
- On the Blind Spots of Model-Based Evaluation Metrics for Text Generation [79.01422521024834]
We explore a useful but often neglected methodology for robustness analysis of text generation evaluation metrics.
We design and synthesize a wide range of potential errors and check whether they result in a commensurate drop in the metric scores.
Our experiments reveal interesting insensitivities, biases, or even loopholes in existing metrics.
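A minimal sketch of such a stress test: apply a synthetic, meaning-changing edit to an otherwise faithful output and check whether the metric score drops accordingly. The perturbations and the choice of chrF here are illustrative, not the paper's error taxonomy.
```python
# Sketch of a blind-spot stress test: meaning-changing edits should produce
# a commensurate drop in the metric score. Perturbations are illustrative.
import sacrebleu

reference = "the company hired 120 new engineers in 2022 ."
faithful = "the company hired 120 new engineers in 2022 ."
number_swap = faithful.replace("120", "210")
negation = faithful.replace("hired", "did not hire")

for name, hyp in [("faithful", faithful),
                  ("number swap", number_swap),
                  ("negation", negation)]:
    score = sacrebleu.sentence_chrf(hyp, [reference]).score
    print(f"{name:12s} chrF = {score:.2f}")
# A metric with a blind spot would barely penalise the perturbed outputs.
```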
arXiv Detail & Related papers (2022-12-20T06:24:25Z)
- Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis [79.18261352971284]
We introduce SESCORE, a model-based metric that is highly correlated with human judgements without requiring human annotation.
We evaluate SESCORE against existing metrics by comparing how their scores correlate with human ratings.
SESCORE even achieves comparable performance to the best supervised metric COMET, despite receiving no human-annotated training data.
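A minimal sketch of the meta-evaluation step mentioned here, i.e. correlating segment-level metric scores with human ratings; the numbers are illustrative placeholders.
```python
# Sketch: compare a metric's segment scores with human ratings via
# Kendall and Pearson correlation. Values below are placeholders.
from scipy.stats import kendalltau, pearsonr

human_ratings = [0.90, 0.20, 0.75, 0.40, 0.60]   # e.g. normalised DA/MQM scores
metric_scores = [0.85, 0.30, 0.70, 0.35, 0.65]   # scores from the metric under test

tau, _ = kendalltau(human_ratings, metric_scores)
r, _ = pearsonr(human_ratings, metric_scores)
print(f"Kendall tau = {tau:.3f}, Pearson r = {r:.3f}")
```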
arXiv Detail & Related papers (2022-10-10T22:30:26Z)
- MENLI: Robust Evaluation Metrics from Natural Language Inference [26.53850343633923]
Recently proposed BERT-based evaluation metrics for text generation perform well on standard benchmarks but are vulnerable to adversarial attacks.
We develop evaluation metrics based on Natural Language Inference (NLI).
We show that our NLI-based metrics are much more robust to these attacks than the recent BERT-based metrics.
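A minimal sketch of an NLI-style metric in this spirit: treat the reference as premise and the system output as hypothesis, and score by entailment probability. The checkpoint and label order below are assumptions for a standard MNLI model, not the paper's exact setup.
```python
# Sketch of an NLI-based score: entailment probability of the system output
# given the reference. Checkpoint and label order are assumptions; verify
# them via model.config.id2label.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

premise = "The cat was sitting on the mat."   # reference
hypothesis = "A cat sat on the mat."          # system output

inputs = tokenizer(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)[0]

# assumed label order for this checkpoint: 0 = contradiction, 1 = neutral, 2 = entailment
entailment_score = probs[2].item()
print(f"entailment-based score: {entailment_score:.3f}")
```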
arXiv Detail & Related papers (2022-08-15T16:30:14Z)
- Language Model Augmented Relevance Score [2.8314622515446835]
Language Model Augmented Relevance Score (MARS) is a new context-aware metric for NLG evaluation.
MARS uses off-the-shelf language models, guided by reinforcement learning, to create augmented references that consider both the generation context and available human references.
arXiv Detail & Related papers (2021-08-19T03:59:23Z)
- REAM$\sharp$: An Enhancement Approach to Reference-based Evaluation Metrics for Open-domain Dialog Generation [63.46331073232526]
We present an enhancement approach to Reference-based EvAluation Metrics for open-domain dialogue systems.
A prediction model is designed to estimate the reliability of the given reference set.
We show how its predicted results can be helpful to augment the reference set, and thus improve the reliability of the metric.
arXiv Detail & Related papers (2021-05-30T10:04:13Z)
- BLEU might be Guilty but References are not Innocent [34.817010352734]
We study different methods to collect references and compare their value in automated evaluation.
Motivated by the finding that typical references exhibit poor diversity, concentrating around translationese language, we develop a paraphrasing task.
Our method yields higher correlation with human judgment not only for the submissions of WMT 2019 English to German, but also for Back-translation and APE augmented MT output.
arXiv Detail & Related papers (2020-04-13T16:49:09Z)
This list is automatically generated from the titles and abstracts of the papers listed on this site.