Sentence-level Aggregation of Lexical Metrics Correlate Stronger with Human Judgements than Corpus-level Aggregation
- URL: http://arxiv.org/abs/2407.12832v1
- Date: Wed, 3 Jul 2024 13:46:24 GMT
- Title: Sentence-level Aggregation of Lexical Metrics Correlate Stronger with Human Judgements than Corpus-level Aggregation
- Authors: Paulo Cavalin, Pedro Henrique Domingues, Claudio Pinhanez
- Abstract summary: Averaging individual segment-level scores can make metrics such as BLEU and chrF correlate much more strongly with human judgements, which can help make the evaluation of machine translation systems for low-resource languages more trustworthy.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper we show that corpus-level aggregation considerably hinders the ability of lexical metrics to accurately evaluate machine translation (MT) systems. Through empirical experiments we demonstrate that averaging individual segment-level scores can make metrics such as BLEU and chrF correlate much more strongly with human judgements and behave considerably more like neural metrics such as COMET and BLEURT. We show that this difference exists because corpus- and segment-level aggregation differ considerably, owing to the classical average-of-ratios versus ratio-of-averages problem. Moreover, as we also show, this difference considerably affects the statistical robustness of corpus-level aggregation. Considering that neural metrics currently cover only a small set of sufficiently-resourced languages, the results in this paper can help make the evaluation of MT systems for low-resource languages more trustworthy.
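A minimal sketch of the two aggregation schemes, assuming the sacrebleu library; the sentences and variable names below are illustrative, not from the paper's data:

```python
# Corpus-level vs. averaged sentence-level BLEU with sacrebleu.
# Example sentences are invented for illustration.
import sacrebleu

hypotheses = ["the cat sat on the mat", "he reads a book"]
references = ["the cat sits on the mat", "she reads a book"]

# Corpus-level aggregation: n-gram counts are pooled over all segments
# before the score is computed (a ratio of sums).
corpus_score = sacrebleu.corpus_bleu(hypotheses, [references]).score

# Sentence-level aggregation: each segment is scored independently and
# the scores are averaged (an average of ratios), which the paper finds
# correlates more strongly with human judgements.
sentence_scores = [
    sacrebleu.sentence_bleu(hyp, [ref]).score
    for hyp, ref in zip(hypotheses, references)
]
averaged_score = sum(sentence_scores) / len(sentence_scores)

print(f"corpus-level BLEU:            {corpus_score:.2f}")
print(f"averaged sentence-level BLEU: {averaged_score:.2f}")
```

The divergence between the two numbers is the average-of-ratios versus ratio-of-averages effect the abstract describes: pooling counts lets long segments dominate the corpus-level score, while averaging gives every segment equal weight.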
Related papers
- Towards Multiple References Era -- Addressing Data Leakage and Limited Reference Diversity in NLG Evaluation [55.92852268168816]
N-gram matching-based evaluation metrics, such as BLEU and chrF, are widely utilized across a range of natural language generation (NLG) tasks.
Recent studies have revealed a weak correlation between these matching-based metrics and human evaluations.
We propose to utilize multiple references to enhance the consistency between these metrics and human evaluations.
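As a brief illustration (not the paper's code), sacrebleu already supports scoring against multiple reference streams, which is the mechanism such a proposal builds on:

```python
# Scoring one hypothesis against two reference streams with sacrebleu.
# The sentences are invented for illustration.
import sacrebleu

hypotheses = ["the cat sat on the mat"]
refs_a = ["the cat sits on the mat"]       # first reference stream
refs_b = ["a cat was sitting on the mat"]  # second reference stream

# One inner list per reference stream, aligned with the hypotheses.
score = sacrebleu.corpus_bleu(hypotheses, [refs_a, refs_b])
print(score.score)
```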
arXiv Detail & Related papers (2023-08-06T14:49:26Z)
- BLEURT Has Universal Translations: An Analysis of Automatic Metrics by Minimum Risk Training [64.37683359609308]
In this study, we analyze various mainstream and cutting-edge automatic metrics from the perspective of their guidance for training machine translation systems.
We find that certain metrics exhibit robustness defects, such as the presence of universal adversarial translations in BLEURT and BARTScore.
In-depth analysis suggests two main causes of these robustness deficits: distribution biases in the training datasets, and the tendency of the metric paradigm.
arXiv Detail & Related papers (2023-07-06T16:59:30Z)
- BLEU Meets COMET: Combining Lexical and Neural Metrics Towards Robust Machine Translation Evaluation [12.407789866525079]
We show that by using additional information during training, such as sentence-level features and word-level tags, the trained metrics improve their capability to penalize translations with specific troublesome phenomena.
arXiv Detail & Related papers (2023-05-30T15:50:46Z)
- The Inside Story: Towards Better Understanding of Machine Translation Neural Evaluation Metrics [8.432864879027724]
We develop and compare several neural explainability methods and demonstrate their effectiveness for interpreting state-of-the-art fine-tuned neural metrics.
Our study reveals that these metrics leverage token-level information that can be directly attributed to translation errors.
arXiv Detail & Related papers (2023-05-19T16:42:17Z)
- Extrinsic Evaluation of Machine Translation Metrics [78.75776477562087]
It is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level.
We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks.
Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes.
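For context, segment-level agreement of this kind is usually quantified by correlating per-segment metric scores with the extrinsic signal; a minimal sketch with scipy, using placeholder numbers:

```python
# Segment-level correlation between an automatic metric and an extrinsic
# outcome. All numbers are placeholders, not data from the paper.
from scipy.stats import kendalltau, pearsonr

metric_scores = [0.71, 0.42, 0.88, 0.35, 0.60]  # e.g. per-segment chrF
task_outcomes = [1, 0, 1, 0, 1]                 # e.g. downstream success

tau, _ = kendalltau(metric_scores, task_outcomes)
r, _ = pearsonr(metric_scores, task_outcomes)
print(f"Kendall tau: {tau:.3f}, Pearson r: {r:.3f}")
```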
arXiv Detail & Related papers (2022-12-20T14:39:58Z)
- SMART: Sentences as Basic Units for Text Evaluation [48.5999587529085]
In this paper, we introduce a new metric called SMART to mitigate the limitations of token-based matching.
We treat sentences as basic units of matching instead of tokens, and use a sentence matching function to soft-match candidate and reference sentences.
Our results show that the system-level correlations of our proposed metric with a model-based matching function outperform those of all competing metrics.
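A rough, hypothetical sketch of the sentence-as-unit idea, not the actual SMART implementation: sentences are soft-matched with a pluggable similarity function, with chrF from sacrebleu standing in for the paper's matching functions:

```python
# Hypothetical sentence-level soft matching in the spirit of SMART:
# sentences (not tokens) are the matching units, scored by a pluggable
# similarity function. chrF is a stand-in similarity here.
import sacrebleu

def soft_match(cand_sents, ref_sents):
    # Greedy best-match: each candidate sentence takes the score of its
    # most similar reference sentence; precision-style average.
    def sim(a, b):
        return sacrebleu.sentence_chrf(a, [b]).score / 100.0
    return sum(max(sim(c, r) for r in ref_sents)
               for c in cand_sents) / len(cand_sents)

candidate = ["The cat sat on the mat.", "It was asleep."]
reference = ["A cat was sitting on the mat.", "The cat slept."]
print(f"soft sentence match: {soft_match(candidate, reference):.3f}")
```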
arXiv Detail & Related papers (2022-08-01T17:58:05Z)
- LCEval: Learned Composite Metric for Caption Evaluation [37.2313913156926]
We propose a neural network-based learned metric to improve caption-level evaluation.
This paper investigates the relationship between different linguistic features and the caption-level correlation of the learned metrics.
Our proposed metric not only outperforms the existing metrics in terms of caption-level correlation but it also shows a strong system-level correlation against human assessments.
arXiv Detail & Related papers (2020-12-24T06:38:24Z)
- Decoding and Diversity in Machine Translation [90.33636694717954]
We characterize the cost in diversity paid for the BLEU scores enjoyed by NMT.
Our study implicates search as a salient source of known bias when translating gender pronouns.
arXiv Detail & Related papers (2020-11-26T21:09:38Z)
- On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation [55.02832094101173]
Evaluation of cross-lingual encoders is usually performed either via zero-shot cross-lingual transfer in supervised downstream tasks or via unsupervised cross-lingual similarity.
This paper concerns itself with reference-free machine translation (MT) evaluation, where we directly compare source texts to (sometimes low-quality) system translations.
We systematically investigate a range of metrics based on state-of-the-art cross-lingual semantic representations obtained with pretrained M-BERT and LASER.
We find that they perform poorly as semantic encoders for reference-free MT evaluation and identify their two key limitations.
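A hedged sketch of this reference-free setup with a cross-lingual sentence encoder, using LaBSE via sentence-transformers as a stand-in for the M-BERT and LASER encoders studied in the paper:

```python
# Reference-free evaluation sketch: score a translation by the cosine
# similarity between source and hypothesis embeddings from a
# cross-lingual encoder. LaBSE stands in for M-BERT/LASER here.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/LaBSE")

source = "Der Hund schläft auf dem Sofa."
hypothesis = "The dog is sleeping on the sofa."

src_emb, hyp_emb = model.encode([source, hypothesis], convert_to_tensor=True)
score = util.cos_sim(src_emb, hyp_emb).item()
print(f"reference-free similarity: {score:.3f}")
```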
arXiv Detail & Related papers (2020-05-03T22:10:23Z)
- BLEU might be Guilty but References are not Innocent [34.817010352734]
We study different methods to collect references and compare their value in automated evaluation.
Motivated by the finding that typical references exhibit poor diversity, concentrating around translationese language, we develop a paraphrasing task.
Our method yields higher correlation with human judgment not only for the submissions of WMT 2019 English to German, but also for Back-translation and APE augmented MT output.
arXiv Detail & Related papers (2020-04-13T16:49:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.