BLEU Meets COMET: Combining Lexical and Neural Metrics Towards Robust
Machine Translation Evaluation
- URL: http://arxiv.org/abs/2305.19144v1
- Date: Tue, 30 May 2023 15:50:46 GMT
- Title: BLEU Meets COMET: Combining Lexical and Neural Metrics Towards Robust
Machine Translation Evaluation
- Authors: Taisiya Glushkova, Chrysoula Zerva, André F. T. Martins
- Abstract summary: We show that by using additional information during training, such as sentence-level features and word-level tags, the trained metrics improve their capability to penalize translations with specific troublesome phenomena.
- Score: 12.407789866525079
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although neural-based machine translation evaluation metrics, such as COMET
or BLEURT, have achieved strong correlations with human judgements, they are
sometimes unreliable in detecting certain phenomena that can be considered as
critical errors, such as deviations in entities and numbers. In contrast,
traditional evaluation metrics, such as BLEU or chrF, which measure lexical or
character overlap between translation hypotheses and human references, have
lower correlations with human judgements but are sensitive to such deviations.
In this paper, we investigate several ways of combining the two approaches in
order to increase robustness of state-of-the-art evaluation methods to
translations with critical errors. We show that by using additional information
during training, such as sentence-level features and word-level tags, the
trained metrics improve their capability to penalize translations with specific
troublesome phenomena, which leads to gains in correlation with human judgments
and on recent challenge sets on several language pairs.
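For readers who want to try the two metric families side by side, below is a minimal, score-level sketch in Python. It is not the paper's method (the paper injects sentence-level features and word-level tags during metric training); it simply computes sentence-level chrF with sacrebleu and segment-level COMET scores with the unbabel-comet package and interpolates them. The checkpoint name wmt22-comet-da and the 0.3/0.7 weights are illustrative assumptions, and the COMET prediction API may differ slightly across library versions.
```python
# Rough score-level illustration of combining a lexical metric (chrF) with a
# neural metric (COMET). NOTE: the paper combines the signals during training;
# this sketch only interpolates the two scores at inference time.
# Requires: pip install sacrebleu unbabel-comet
from sacrebleu.metrics import CHRF
from comet import download_model, load_from_checkpoint

sources    = ["Der Vertrag kostet 1200 Euro."]
hypotheses = ["The contract costs 2100 euros."]   # number deviation: a critical error
references = ["The contract costs 1200 euros."]

# Lexical signal: sentence-level chrF (0-100), sensitive to the wrong digits.
chrf = CHRF()
chrf_scores = [chrf.sentence_score(h, [r]).score / 100.0
               for h, r in zip(hypotheses, references)]

# Neural signal: COMET segment scores (roughly in [0, 1] for wmt22-comet-da).
model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
data = [{"src": s, "mt": h, "ref": r}
        for s, h, r in zip(sources, hypotheses, references)]
prediction = model.predict(data, batch_size=8, gpus=0)
comet_scores = prediction.scores  # segment-level scores (comet>=2.0 API)

# Naive interpolation; the 0.3/0.7 weights are an arbitrary illustrative choice.
combined = [0.3 * lex + 0.7 * neu for lex, neu in zip(chrf_scores, comet_scores)]
print(list(zip(chrf_scores, comet_scores, combined)))
```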
Related papers
- Sentence-level Aggregation of Lexical Metrics Correlate Stronger with Human Judgements than Corpus-level Aggregation [0.0]
Averaging individual segment-level scores can make metrics such as BLEU and chrF correlate much more strongly with human judgements.
This paper can help make the evaluation of machine translation systems for low-resource languages more trustworthy.
arXiv Detail & Related papers (2024-07-03T13:46:24Z)
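A minimal sketch of the two aggregation strategies compared in the paper above, assuming sacrebleu's chrF implementation and toy sentences rather than data from the paper:
```python
# Corpus-level aggregation pools statistics over all segments before scoring;
# sentence-level aggregation scores each segment and averages the scores.
# Requires: pip install sacrebleu
from statistics import mean
from sacrebleu.metrics import CHRF

hypotheses = ["the cat sat on the mat", "he go to school yesterday"]
references = ["the cat sat on the mat", "he went to school yesterday"]

chrf = CHRF()

# Corpus-level: one score from pooled n-gram statistics.
corpus_score = chrf.corpus_score(hypotheses, [references]).score

# Sentence-level: score each segment, then take the mean.
segment_scores = [chrf.sentence_score(h, [r]).score
                  for h, r in zip(hypotheses, references)]
averaged_score = mean(segment_scores)

print(f"corpus-level chrF:           {corpus_score:.2f}")
print(f"averaged segment-level chrF: {averaged_score:.2f}")
```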
- Exploring the Correlation between Human and Machine Evaluation of Simultaneous Speech Translation [0.9576327614980397]
This study aims to assess the reliability of automatic metrics in evaluating simultaneous interpretations by analyzing their correlation with human evaluations.
As a benchmark we use human assessments performed by language experts, and evaluate how well sentence embeddings and Large Language Models correlate with them.
The results suggest GPT models, particularly GPT-3.5 with direct prompting, demonstrate the strongest correlation with human judgment in terms of semantic similarity between source and target texts.
arXiv Detail & Related papers (2024-06-14T14:47:19Z)
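A hedged sketch of the embedding-based side of such a study: score each source/target pair by cosine similarity of multilingual sentence embeddings, then correlate those scores with human assessments. The encoder name (LaBSE via sentence-transformers) and the human ratings below are illustrative assumptions, not the paper's setup.
```python
# Requires: pip install sentence-transformers scipy
from sentence_transformers import SentenceTransformer, util
from scipy.stats import spearmanr

sources = [
    "Der Zug hat zwanzig Minuten Verspätung.",
    "Ich stimme dem Vorschlag zu.",
    "Das Treffen beginnt um neun Uhr.",
]
translations = [
    "The train is twenty minutes late.",
    "I reject the proposal.",            # meaning error
    "The meeting starts at nine o'clock.",
]
human_scores = [0.95, 0.20, 0.90]        # placeholder expert ratings in [0, 1]

model = SentenceTransformer("sentence-transformers/LaBSE")
src_emb = model.encode(sources, convert_to_tensor=True)
mt_emb  = model.encode(translations, convert_to_tensor=True)

# Cosine similarity between each source and its translation/interpretation.
sims = util.cos_sim(src_emb, mt_emb).diagonal().tolist()

rho, p_value = spearmanr(sims, human_scores)
print(f"similarities={sims}, Spearman rho={rho:.3f} (p={p_value:.3f})")
```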
- Machine Translation Meta Evaluation through Translation Accuracy Challenge Sets [92.38654521870444]
We introduce ACES, a contrastive challenge set spanning 146 language pairs.
This dataset aims to discover whether metrics can identify 68 types of translation accuracy errors.
We conduct a large-scale study by benchmarking ACES on 50 metrics submitted to the WMT 2022 and 2023 metrics shared tasks.
arXiv Detail & Related papers (2024-01-29T17:17:42Z)
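The generic recipe behind contrastive challenge sets can be sketched as follows: a metric passes an item if it ranks the good translation above the incorrect one. This is a simplified illustration with invented items and chrF as the metric, not the exact ACES scoring protocol.
```python
# Requires: pip install sacrebleu
from sacrebleu.metrics import CHRF

chrf = CHRF()

examples = [  # (reference, good translation, incorrect translation, phenomenon)
    ("He paid 300 dollars.", "He paid 300 dollars.", "He paid 500 dollars.", "number"),
    ("Anna met Maria in Paris.", "Anna met Maria in Paris.", "Anna met Laura in Paris.", "named-entity"),
    ("The drug is not approved.", "The drug is not approved.", "The drug is approved.", "negation"),
]

passes = 0
for ref, good, bad, phenomenon in examples:
    good_score = chrf.sentence_score(good, [ref]).score
    bad_score  = chrf.sentence_score(bad,  [ref]).score
    ok = good_score > bad_score   # metric "passes" if the good translation wins
    passes += ok
    print(f"{phenomenon:12s} good={good_score:5.1f} bad={bad_score:5.1f} pass={ok}")

print(f"challenge-set accuracy: {passes / len(examples):.2f}")
```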
- Towards Multiple References Era -- Addressing Data Leakage and Limited Reference Diversity in NLG Evaluation [55.92852268168816]
N-gram matching-based evaluation metrics, such as BLEU and chrF, are widely utilized across a range of natural language generation (NLG) tasks.
Recent studies have revealed a weak correlation between these matching-based metrics and human evaluations.
We propose to utilize multiple references to enhance the consistency between these metrics and human evaluations.
arXiv Detail & Related papers (2023-08-06T14:49:26Z)
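A minimal sketch of multi-reference scoring, assuming sacrebleu (which accepts several reference streams), so additional references can reward valid paraphrases; the sentences are illustrative and not taken from the paper.
```python
# Requires: pip install sacrebleu
from sacrebleu.metrics import BLEU, CHRF

hypotheses = ["the kids are playing in the garden"]
refs_a = ["the children are playing in the garden"]   # reference translator A
refs_b = ["the kids play in the garden"]              # reference translator B

# Each inner list is one reference stream; passing several streams enables
# multi-reference matching.
print("single-reference BLEU:", BLEU().corpus_score(hypotheses, [refs_a]).score)
print("multi-reference BLEU: ", BLEU().corpus_score(hypotheses, [refs_a, refs_b]).score)
print("multi-reference chrF: ", CHRF().corpus_score(hypotheses, [refs_a, refs_b]).score)
```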
- BLEURT Has Universal Translations: An Analysis of Automatic Metrics by Minimum Risk Training [64.37683359609308]
In this study, we analyze various mainstream and cutting-edge automatic metrics from the perspective of their guidance for training machine translation systems.
We find that certain metrics exhibit robustness defects, such as the presence of universal adversarial translations in BLEURT and BARTScore.
In-depth analysis suggests two main causes of these robustness deficits: distribution biases in the training datasets, and the tendency of the metric paradigm.
arXiv Detail & Related papers (2023-07-06T16:59:30Z)
- The Inside Story: Towards Better Understanding of Machine Translation Neural Evaluation Metrics [8.432864879027724]
We develop and compare several neural explainability methods and demonstrate their effectiveness for interpreting state-of-the-art fine-tuned neural metrics.
Our study reveals that these metrics leverage token-level information that can be directly attributed to translation errors.
arXiv Detail & Related papers (2023-05-19T16:42:17Z)
- DEMETR: Diagnosing Evaluation Metrics for Translation [21.25704103403547]
We release DEMETR, a diagnostic dataset with 31K English examples.
We find that learned metrics perform substantially better than string-based metrics on DEMETR.
arXiv Detail & Related papers (2022-10-25T03:25:44Z)
- Decoding and Diversity in Machine Translation [90.33636694717954]
We characterize the cost in diversity paid for the BLEU scores enjoyed by NMT.
Our study implicates search as a salient source of known bias when translating gender pronouns.
arXiv Detail & Related papers (2020-11-26T21:09:38Z)
- Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics [64.88815792555451]
We show that current methods for judging metrics are highly sensitive to the translations used for assessment.
We develop a method for thresholding performance improvement under an automatic metric against human judgements.
arXiv Detail & Related papers (2020-06-11T09:12:53Z)
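The thresholding idea above can be sketched roughly as follows: for a set of system pairs, check how often a metric improvement of at least a given size coincides with a human-judged improvement. The deltas below are invented placeholders, not WMT data.
```python
import numpy as np

# Pairwise system differences: metric deltas (e.g. BLEU) vs. human score deltas.
metric_delta = np.array([0.2, 0.8, 1.5, 3.1, 4.4, 0.4, 2.2, 5.0])
human_delta  = np.array([-0.1, 0.05, 0.3, 0.6, 0.9, -0.2, 0.1, 1.1])

for threshold in [0.5, 1.0, 2.0, 3.0]:
    selected = metric_delta >= threshold
    if selected.any():
        # Fraction of selected pairs where humans also judge an improvement.
        agreement = (human_delta[selected] > 0).mean()
        print(f"metric delta >= {threshold:.1f}: "
              f"{selected.sum()} pairs, human agrees {agreement:.0%}")
```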
- On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation [55.02832094101173]
Evaluation of cross-lingual encoders is usually performed either via zero-shot cross-lingual transfer in supervised downstream tasks or via unsupervised cross-lingual similarity.
This paper concerns itself with reference-free machine translation (MT) evaluation, where source texts are directly compared to (sometimes low-quality) system translations.
We systematically investigate a range of metrics based on state-of-the-art cross-lingual semantic representations obtained with pretrained M-BERT and LASER.
We find that they perform poorly as semantic encoders for reference-free MT evaluation and identify their two key limitations.
arXiv Detail & Related papers (2020-05-03T22:10:23Z)
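A hedged sketch of the reference-free setup the paper above analyzes: embed the source and the system translation with a cross-lingual encoder and use cosine similarity as a quality proxy, which tends to miss critical meaning errors. The sentence-transformers model named below stands in for the paper's M-BERT/LASER representations and is an assumption.
```python
# Requires: pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

source = "Das Medikament wurde nicht zugelassen."
translations = {
    "adequate":  "The drug was not approved.",
    "negated":   "The drug was approved.",      # critical meaning error
    "disfluent": "Drug the approved not was.",
}

src_emb = model.encode(source, convert_to_tensor=True)
for name, mt in translations.items():
    # Cosine similarity between source and translation embeddings.
    score = util.cos_sim(src_emb, model.encode(mt, convert_to_tensor=True)).item()
    print(f"{name:10s} cosine similarity = {score:.3f}")
```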