Breeding Machine Translations: Evolutionary approach to survive and thrive in the world of automated evaluation
- URL: http://arxiv.org/abs/2305.19330v1
- Date: Tue, 30 May 2023 18:00:25 GMT
- Title: Breeding Machine Translations: Evolutionary approach to survive and thrive in the world of automated evaluation
- Authors: Josef Jon and Ondřej Bojar
- Abstract summary: We propose a genetic algorithm (GA) based method for modifying n-best lists produced by a machine translation (MT) system.
Our method offers an innovative approach to improving MT quality and identifying weaknesses in evaluation metrics.
- Score: 1.90365714903665
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a genetic algorithm (GA) based method for modifying n-best lists
produced by a machine translation (MT) system. Our method offers an innovative
approach to improving MT quality and identifying weaknesses in evaluation
metrics. Using common GA operations (mutation and crossover) on a list of
hypotheses in combination with a fitness function (an arbitrary MT metric), we
obtain novel and diverse outputs with high metric scores. With a combination of
multiple MT metrics as the fitness function, the proposed method leads to an
increase in translation quality as measured by other held-out automatic
metrics. With a single metric (including popular ones such as COMET) as the
fitness function, we find blind spots and flaws in the metric. This allows for
an automated search for adversarial examples in an arbitrary metric, without
prior assumptions on the form of such examples. As a demonstration of the
method, we create datasets of adversarial examples and use them to show that
reference-free COMET is substantially less robust than the reference-based
version.
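To make the procedure described above concrete, here is a minimal, hypothetical Python sketch of a GA over an n-best list: tokenized hypotheses are recombined by crossover, perturbed by mutation, and selected by a fitness callable standing in for an arbitrary MT metric (e.g. a COMET wrapper). The operator definitions, selection scheme, and function names are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal GA sketch over an n-best list of tokenized MT hypotheses.
# `fitness` is a placeholder for an arbitrary MT metric (or metric combination);
# operators and selection are simplified assumptions, not the paper's exact setup.
import random
from typing import Callable, List

def crossover(a: List[str], b: List[str]) -> List[str]:
    """Splice two hypotheses at random cut points."""
    if len(a) < 2 or len(b) < 2:
        return list(a)
    return a[:random.randrange(1, len(a))] + b[random.randrange(1, len(b)):]

def mutate(hyp: List[str], vocab: List[str], rate: float = 0.1) -> List[str]:
    """Randomly replace tokens with tokens seen elsewhere in the n-best list."""
    if not vocab:
        return list(hyp)
    return [random.choice(vocab) if random.random() < rate else tok for tok in hyp]

def evolve(n_best: List[List[str]],
           fitness: Callable[[List[str]], float],
           generations: int = 50,
           population_size: int = 20) -> List[str]:
    """Return the highest-scoring hypothesis found by the GA."""
    vocab = sorted({tok for hyp in n_best for tok in hyp})
    population = list(n_best)
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: max(1, population_size // 2)]   # keep the fittest half
        children = [mutate(crossover(random.choice(parents), random.choice(parents)), vocab)
                    for _ in range(population_size - len(parents))]
        population = parents + children
    return max(population, key=fitness)
```

Passing a fitness that averages several normalized metric scores corresponds to the multi-metric setting that improves held-out quality; passing a single metric turns the same loop into a search for that metric's blind spots.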
Related papers
- Beyond Correlation: Interpretable Evaluation of Machine Translation Metrics [46.71836180414362]
We introduce an interpretable evaluation framework for Machine Translation (MT) metrics.
Within this framework, we evaluate metrics in two scenarios that serve as proxies for the data filtering and translation re-ranking use cases.
We also raise concerns regarding the reliability of manually curated data following the Direct Assessments+Scalar Quality Metrics (DA+SQM) guidelines.
arXiv Detail & Related papers (2024-10-07T16:42:10Z)
- We Need to Talk About Classification Evaluation Metrics in NLP [34.73017509294468]
In Natural Language Processing (NLP), model generalizability is generally measured with standard metrics such as Accuracy, F-Measure, or AUC-ROC.
The diversity of metrics and the arbitrariness of their application suggest that there is no agreement within NLP on a single best metric to use.
We demonstrate that a random-guess normalised Informedness metric is a parsimonious baseline for task performance (see the sketch after this entry).
arXiv Detail & Related papers (2024-01-08T11:40:48Z)
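For reference, Informedness in its standard binary form is Youden's J: sensitivity plus specificity minus one, which is zero under random or constant guessing. The paper's exact random-guess normalisation may differ; the snippet below only shows the textbook definition.

```python
# Standard binary Informedness (Youden's J): TPR + TNR - 1.
# 0 for random or constant guessing, 1 for a perfect classifier.
def informedness(tp: int, fp: int, tn: int, fn: int) -> float:
    tpr = tp / (tp + fn) if tp + fn else 0.0  # sensitivity (recall)
    tnr = tn / (tn + fp) if tn + fp else 0.0  # specificity
    return tpr + tnr - 1.0

# A majority-class predictor on a 90/10 split scores 0.9 accuracy but 0 Informedness.
print(informedness(tp=90, fp=10, tn=0, fn=0))  # 0.0
```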
- A Study of Unsupervised Evaluation Metrics for Practical and Automatic Domain Adaptation [15.728090002818963]
Unsupervised domain adaptation (UDA) methods facilitate the transfer of models to target domains without labels.
In this paper, we aim to find an evaluation metric capable of assessing the quality of a transferred model without access to target validation labels.
arXiv Detail & Related papers (2023-08-01T05:01:05Z)
- BLEURT Has Universal Translations: An Analysis of Automatic Metrics by Minimum Risk Training [64.37683359609308]
In this study, we analyze various mainstream and cutting-edge automatic metrics from the perspective of their guidance for training machine translation systems.
We find that certain metrics exhibit robustness defects, such as the presence of universal adversarial translations in BLEURT and BARTScore.
In-depth analysis suggests two main causes of these robustness deficits: distribution biases in the training datasets, and the tendency of the metric paradigm.
arXiv Detail & Related papers (2023-07-06T16:59:30Z)
- Extrinsic Evaluation of Machine Translation Metrics [78.75776477562087]
It is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level.
We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks.
Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes.
arXiv Detail & Related papers (2022-12-20T14:39:58Z)
- SMART: Sentences as Basic Units for Text Evaluation [48.5999587529085]
In this paper, we introduce a new metric called SMART to mitigate such limitations.
We treat sentences as basic units of matching instead of tokens, and use a sentence matching function to soft-match candidate and reference sentences.
Our results show that the system-level correlation of our proposed metric with a model-based matching function outperforms that of all competing metrics.
arXiv Detail & Related papers (2022-08-01T17:58:05Z)
- Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand [117.62186420147563]
We propose a generalization of leaderboards, bidimensional leaderboards (Billboards).
Unlike conventional unidimensional leaderboards that sort submitted systems by predetermined metrics, a Billboard accepts both generators and evaluation metrics as competing entries.
We demonstrate that a linear ensemble of a few diverse metrics sometimes substantially outperforms existing metrics in isolation (a toy ensemble is sketched after this entry).
arXiv Detail & Related papers (2021-12-08T06:34:58Z)
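As a toy illustration of such a linear ensemble (and of the multi-metric fitness used in the main paper), the hypothetical sketch below min-max normalises each metric's scores over the candidate set and takes a weighted sum; metric names and weights are made up for the example.

```python
# Hypothetical linear ensemble of MT metric scores: min-max normalise each
# metric over the candidate set, then combine with fixed weights.
from typing import Dict, List

def ensemble_scores(per_metric: Dict[str, List[float]],
                    weights: Dict[str, float]) -> List[float]:
    n = len(next(iter(per_metric.values())))
    combined = [0.0] * n
    for name, scores in per_metric.items():
        lo, hi = min(scores), max(scores)
        span = (hi - lo) or 1.0                 # avoid division by zero
        for i, s in enumerate(scores):
            combined[i] += weights[name] * (s - lo) / span
    return combined

# Two candidates scored by two (illustrative) metrics.
print(ensemble_scores({"comet": [0.82, 0.75], "chrf": [61.0, 64.0]},
                      {"comet": 0.5, "chrf": 0.5}))   # [0.5, 0.5]
```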
- Meta-Generating Deep Attentive Metric for Few-shot Classification [53.07108067253006]
We present a novel deep metric meta-generation method to generate a specific metric for a new few-shot learning task.
In this study, we structure the metric using a three-layer deep attentive network that is flexible enough to produce a discriminative metric for each task.
We obtain clear performance improvements over state-of-the-art competitors, especially in challenging cases.
arXiv Detail & Related papers (2020-12-03T02:07:43Z)
- Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics [64.88815792555451]
We show that current methods for judging metrics are highly sensitive to the translations used for assessment.
We develop a method for thresholding performance improvement under an automatic metric against human judgements.
arXiv Detail & Related papers (2020-06-11T09:12:53Z)
- BLEU might be Guilty but References are not Innocent [34.817010352734]
We study different methods to collect references and compare their value in automated evaluation.
Motivated by the finding that typical references exhibit poor diversity, concentrating around translationese language, we develop a paraphrasing task.
Our method yields higher correlation with human judgment not only for the submissions of WMT 2019 English to German, but also for Back-translation and APE augmented MT output.
arXiv Detail & Related papers (2020-04-13T16:49:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.