Online Learning Meets Machine Translation Evaluation: Finding the Best
Systems with the Least Human Effort
- URL: http://arxiv.org/abs/2105.13385v1
- Date: Thu, 27 May 2021 18:19:39 GMT
- Authors: Vânia Mendonça (1 and 2), Ricardo Rei (1 and 2 and 3), Luisa
Coheur (1 and 2), Alberto Sardinha (1 and 2), Ana Lúcia Santos (4 and 5)
((1) INESC-ID Lisboa, (2) Instituto Superior Técnico, (3) Unbabel AI, (4)
Centro de Linguística da Universidade de Lisboa, (5) Faculdade de Letras da
Universidade de Lisboa)
- Abstract summary: In Machine Translation, assessing the quality of a large number of automatic translations can be challenging.
We propose a novel application of online learning that, given an ensemble of Machine Translation systems, dynamically converges to the best systems.
Our experiments on WMT'19 datasets show that our online approach quickly converges to the top-3 ranked systems for the language pairs considered.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In Machine Translation, assessing the quality of a large number of
automatic translations can be challenging. Automatic metrics are not reliable
when it comes to high-performing systems. In addition, resorting to human evaluators
can be expensive, especially when evaluating multiple systems. To overcome the
latter challenge, we propose a novel application of online learning that, given
an ensemble of Machine Translation systems, dynamically converges to the best
systems, by taking advantage of the human feedback available. Our experiments
on WMT'19 datasets show that our online approach quickly converges to the top-3
ranked systems for the language pairs considered, despite the lack of human
feedback for many translations.
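The approach described in the abstract can be read as a multi-armed bandit over the ensemble: each MT system is an arm, and whatever human judgment arrives for a shown translation acts as the reward. Below is a minimal sketch under that reading, using an EXP3-style update; the class name, exploration rate, reward scaling, and the 30% feedback rate in the toy run are illustrative assumptions, not the paper's exact algorithm.
```python
import math
import random

class Exp3Ensemble:
    """EXP3-style bandit over an ensemble of MT systems (illustrative sketch).

    Each system is an arm; the reward is a human quality judgment in [0, 1]
    for the translation shown to the annotator. Rounds without human
    feedback simply leave the weights untouched.
    """

    def __init__(self, n_systems: int, gamma: float = 0.1):
        self.n = n_systems
        self.gamma = gamma              # exploration rate
        self.weights = [1.0] * n_systems

    def probabilities(self):
        total = sum(self.weights)
        return [(1 - self.gamma) * w / total + self.gamma / self.n
                for w in self.weights]

    def choose(self) -> int:
        """Sample the system whose translation is shown for this source."""
        probs = self.probabilities()
        return random.choices(range(self.n), weights=probs, k=1)[0]

    def update(self, arm: int, reward: float):
        """Importance-weighted EXP3 update from a human judgment in [0, 1]."""
        probs = self.probabilities()
        estimated = reward / probs[arm]    # unbiased estimate of the reward
        self.weights[arm] *= math.exp(self.gamma * estimated / self.n)


# Toy usage: system 2 is secretly the best; feedback is available only 30% of the time.
bandit = Exp3Ensemble(n_systems=4)
true_quality = [0.3, 0.5, 0.8, 0.4]
for _ in range(2000):
    arm = bandit.choose()
    if random.random() < 0.3:                      # sparse human feedback
        noisy = min(1.0, max(0.0, random.gauss(true_quality[arm], 0.1)))
        bandit.update(arm, noisy)
print(max(range(4), key=lambda i: bandit.weights[i]))  # usually prints 2
```
A bandit formulation fits the setting because a human judgment is only observed for the translation actually shown, not for every system's output on every source.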
Related papers
- Evaluating Automatic Metrics with Incremental Machine Translation Systems [55.78547133890403]
We introduce a dataset comprising commercial machine translations, gathered weekly over six years across 12 translation directions.
We assume commercial systems improve over time, which enables us to evaluate machine translation (MT) metrics based on their preference for more recent translations.
arXiv Detail & Related papers (2024-07-03T17:04:17Z)
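The evaluation idea in the entry above reduces to a pairwise check: across many (older, newer) translation pairs for the same source, a metric is judged by how often it prefers the more recent output. A minimal sketch of that check; `metric` stands for any scoring callable, and the tuple layout is a placeholder rather than the released dataset's actual schema.
```python
from typing import Callable, Iterable, Tuple

# Each item pairs an older and a newer commercial translation of the same source.
# The field layout here is a placeholder, not the released dataset's schema.
Pair = Tuple[str, str, str]  # (source, older_translation, newer_translation)

def recency_preference(metric: Callable[[str, str], float],
                       pairs: Iterable[Pair]) -> float:
    """Fraction of pairs where the metric prefers the more recent translation.

    `metric(source, hypothesis)` returns a quality score (higher = better);
    ties count as half a win, so a constant metric scores 0.5.
    """
    wins, total = 0.0, 0
    for source, older, newer in pairs:
        s_old, s_new = metric(source, older), metric(source, newer)
        if s_new > s_old:
            wins += 1.0
        elif s_new == s_old:
            wins += 0.5
        total += 1
    return wins / total if total else 0.0

# Toy usage with a deliberately naive length-based "metric".
toy_pairs = [("Guten Morgen", "Good morning to", "Good morning"),
             ("Wie geht es dir?", "How goes it you?", "How are you?")]
print(recency_preference(lambda src, hyp: -abs(len(hyp) - len(src)), toy_pairs))
```
With real data one would plug in a learned quality metric in place of the toy length heuristic.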
- Convergences and Divergences between Automatic Assessment and Human Evaluation: Insights from Comparing ChatGPT-Generated Translation and Neural Machine Translation [1.6982207802596105]
This study investigates the convergences and divergences between automated metrics and human evaluation.
To perform automatic assessment, four automated metrics are employed, while human evaluation incorporates the DQF-MQM error typology and six rubrics.
Results underscore the indispensable role of human judgment in evaluating the performance of advanced translation tools.
arXiv Detail & Related papers (2024-01-10T14:20:33Z)
- Quality Estimation of Machine Translated Texts based on Direct Evidence from Training Data [0.0]
We show that the parallel corpus used as training data for training the MT system holds direct clues for estimating the quality of translations produced by the MT system.
Our experiments show that this simple and direct method holds promise for quality estimation of translations produced by any purely data-driven machine translation system.
arXiv Detail & Related papers (2023-06-27T11:52:28Z)
- Why don't people use character-level machine translation? [69.53730499849023]
Despite evidence that character-level systems are comparable with subword systems, they are virtually never used in competitive setups in machine translation competitions.
Character-level MT systems show neither better domain robustness nor better morphological generalization, despite often being motivated on those grounds.
arXiv Detail & Related papers (2021-10-15T16:43:31Z)
- It is Not as Good as You Think! Evaluating Simultaneous Machine Translation on Interpretation Data [58.105938143865906]
We argue that SiMT systems should be trained and tested on real interpretation data.
Our results highlight a difference of up to 13.83 BLEU when SiMT models are evaluated on translation vs. interpretation data.
arXiv Detail & Related papers (2021-10-11T12:27:07Z)
- Non-Parametric Online Learning from Human Feedback for Neural Machine Translation [54.96594148572804]
We study the problem of online learning with human feedback in human-in-the-loop machine translation.
Previous methods require online model updating or additional translation memory networks to achieve high-quality performance.
We propose a novel non-parametric online learning method without changing the model structure.
arXiv Detail & Related papers (2021-09-23T04:26:15Z)
- Difficulty-Aware Machine Translation Evaluation [19.973201669851626]
We propose a novel difficulty-aware machine translation evaluation metric.
A translation that fails to be predicted by most MT systems will be treated as a difficult one and assigned a large weight in the final score function.
Our proposed method performs well even when all the MT systems are very competitive.
arXiv Detail & Related papers (2021-07-30T02:45:36Z)
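The difficulty-aware idea in the entry above amounts to re-weighting segment-level scores so that segments most systems translate poorly dominate the final score. A small sketch of one such weighting, assuming segment scores in [0, 1] and taking difficulty as one minus the cross-system mean; the paper's exact weighting function may differ.
```python
from typing import Dict, List

def difficulty_weighted_scores(seg_scores: Dict[str, List[float]]) -> Dict[str, float]:
    """Corpus score per system, up-weighting segments that most systems get wrong.

    `seg_scores[system][i]` is a segment-level quality score in [0, 1]
    (e.g. from a learned metric). Difficulty of segment i is taken as
    1 - mean score across systems, so easy segments contribute little.
    """
    systems = list(seg_scores)
    n_segs = len(seg_scores[systems[0]])

    # Per-segment difficulty: high when the average system scores low.
    difficulty = [1.0 - sum(seg_scores[s][i] for s in systems) / len(systems)
                  for i in range(n_segs)]
    z = sum(difficulty) or 1.0   # guard against all segments being trivially easy

    return {s: sum(w * x for w, x in zip(difficulty, seg_scores[s])) / z
            for s in systems}

# Toy usage: both systems ace segment 0, only sysA handles the hard segment 1.
print(difficulty_weighted_scores({"sysA": [0.95, 0.70],
                                  "sysB": [0.95, 0.20]}))
```
In the toy run the two systems are tied on the easy segment, so the hard segment decides the ranking, which matches the behaviour the entry describes for very competitive systems.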
- Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation [19.116396693370422]
We propose an evaluation methodology grounded in explicit error analysis, based on the Multidimensional Quality Metrics framework.
We carry out the largest MQM research study to date, scoring the outputs of top systems from the WMT 2020 shared task in two language pairs.
We analyze the resulting data extensively, finding among other results a substantially different ranking of evaluated systems from the one established by the WMT crowd workers.
arXiv Detail & Related papers (2021-04-29T16:42:09Z)
- Machine Translation of Novels in the Age of Transformer [1.6453685972661827]
We build a machine translation system tailored to the literary domain, specifically to novels, based on the state-of-the-art architecture in neural MT (NMT), the Transformer, for the translation direction English-to-Catalan.
We compare this MT system against three other systems (two domain-specific systems under the recurrent and phrase-based paradigms and a popular generic online system) on three evaluations.
As expected, the domain-specific Transformer-based system outperformed the other three systems in all three evaluations conducted, in all cases by a large margin.
arXiv Detail & Related papers (2020-11-30T16:51:08Z)
- Curious Case of Language Generation Evaluation Metrics: A Cautionary Tale [52.663117551150954]
A few popular metrics remain the de facto standard for evaluating tasks such as image captioning and machine translation.
This is partly due to ease of use, and partly because researchers expect to see them and know how to interpret them.
In this paper, we urge the community to consider more carefully how they automatically evaluate their models.
arXiv Detail & Related papers (2020-10-26T13:57:20Z)
- Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics [64.88815792555451]
We show that current methods for judging metrics are highly sensitive to the translations used for assessment.
We develop a method for thresholding performance improvement under an automatic metric against human judgements.
arXiv Detail & Related papers (2020-06-11T09:12:53Z)