BLEU, METEOR, BERTScore: Evaluation of Metrics Performance in Assessing
Critical Translation Errors in Sentiment-oriented Text
- URL: http://arxiv.org/abs/2109.14250v1
- Date: Wed, 29 Sep 2021 07:51:17 GMT
- Title: BLEU, METEOR, BERTScore: Evaluation of Metrics Performance in Assessing
Critical Translation Errors in Sentiment-oriented Text
- Authors: Hadeel Saadany, Constantin Orasan
- Abstract summary: Machine Translation (MT) of online content is commonly used to process posts written in several languages.
In this paper, we assess the ability of automatic quality metrics to detect critical machine translation errors.
We conclude that there is a need for fine-tuning of automatic metrics to make them more robust in detecting sentiment-critical errors.
- Score: 1.4213973379473654
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Social media companies as well as authorities make extensive use of
artificial intelligence (AI) tools to monitor postings of hate speech,
celebrations of violence or profanity. Since AI software requires massive
volumes of data to train computers, Machine Translation (MT) of online
content is commonly used to process posts written in several languages and
hence augment the data needed for training. However, MT mistakes are a regular
occurrence when translating sentiment-oriented user-generated content (UGC),
especially when a low-resource language is involved. The adequacy of the whole
process relies on the assumption that the evaluation metrics used give a
reliable indication of the quality of the translation. In this paper, we assess
the ability of automatic quality metrics to detect critical machine translation
errors which can cause serious misunderstanding of the affect message. We
compare the performance of three canonical metrics on meaningless translations
where the semantic content is seriously impaired as compared to meaningful
translations with a critical error which exclusively distorts the sentiment of
the source text. We conclude that there is a need for fine-tuning of automatic
metrics to make them more robust in detecting sentiment-critical errors.
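To make the comparison concrete, below is a minimal sketch of how the three canonical segment-level scores can be computed with the sacrebleu, nltk and bert-score Python packages. The example sentence pair and the package choices are illustrative assumptions, not the paper's actual data or evaluation setup.

```python
# Minimal sketch (not the paper's setup): scoring a fluent hypothesis whose
# sentiment is flipped relative to the reference, using three canonical metrics.
# Assumes: pip install sacrebleu nltk bert-score, plus nltk.download("wordnet").
import sacrebleu
from nltk.translate.meteor_score import meteor_score
from bert_score import score as bert_score

reference = "I absolutely love this phone, best purchase I have ever made."
hypothesis = "I absolutely hate this phone, best purchase I have ever made."  # sentiment-critical error

# BLEU: surface n-gram overlap; a single flipped word barely moves the score.
bleu = sacrebleu.sentence_bleu(hypothesis, [reference]).score

# METEOR: unigram matching with stemming/synonymy; expects tokenised input.
meteor = meteor_score([reference.split()], hypothesis.split())

# BERTScore: contextual embedding similarity; returns precision/recall/F1 tensors.
_, _, f1 = bert_score([hypothesis], [reference], lang="en")

print(f"BLEU: {bleu:.1f}  METEOR: {meteor:.3f}  BERTScore F1: {f1.item():.3f}")
```

On a pair like this, one would expect all three scores to remain high even though the affect of the message is inverted, which is exactly the failure mode the paper investigates.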
Related papers
- Is Context Helpful for Chat Translation Evaluation? [23.440392979857247]
We conduct a meta-evaluation of existing sentence-level automatic metrics to assess the quality of machine-translated chats.
We find that reference-free metrics lag behind reference-based ones, especially when evaluating translation quality in out-of-English settings.
We propose a new evaluation metric, Context-MQM, that utilizes bilingual context with a large language model.
arXiv Detail & Related papers (2024-03-13T07:49:50Z)
- The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation [93.01964988474755]
AutoMQM is a prompting technique which asks large language models to identify and categorize errors in translations.
We study the impact of labeled data through in-context learning and finetuning.
We then evaluate AutoMQM with PaLM-2 models, and we find that it improves performance compared to just prompting for scores.
arXiv Detail & Related papers (2023-08-14T17:17:21Z)
- BLEURT Has Universal Translations: An Analysis of Automatic Metrics by Minimum Risk Training [64.37683359609308]
In this study, we analyze various mainstream and cutting-edge automatic metrics from the perspective of their guidance for training machine translation systems.
We find that certain metrics exhibit robustness defects, such as the presence of universal adversarial translations in BLEURT and BARTScore.
In-depth analysis suggests two main causes of these robustness defects: distribution biases in the training datasets, and the tendency of the metric paradigm.
arXiv Detail & Related papers (2023-07-06T16:59:30Z)
- Extrinsic Evaluation of Machine Translation Metrics [78.75776477562087]
It is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level.
We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks.
Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes.
arXiv Detail & Related papers (2022-12-20T14:39:58Z)
- Competency-Aware Neural Machine Translation: Can Machine Translation Know its Own Translation Quality? [61.866103154161884]
Neural machine translation (NMT) is often criticized for failures of which the system itself is unaware.
We propose a novel competency-aware NMT by extending conventional NMT with a self-estimator.
We show that the proposed method delivers outstanding performance on quality estimation.
arXiv Detail & Related papers (2022-11-25T02:39:41Z)
- HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Professional Post-Editing Towards More Effective MT Evaluation [0.0]
In this work, we introduce HOPE, a task-oriented and human-centric evaluation framework for machine translation output.
It covers only a limited number of commonly occurring error types and uses a scoring model with a geometric progression of error penalty points (EPPs), reflecting error severity, applied to each translation unit.
The approach has several key advantages: the ability to measure and compare less-than-perfect MT output from different systems, the ability to indicate human perception of quality, immediate estimation of the labor effort required to bring MT output to premium quality, lower cost and faster application, and higher inter-rater reliability (IRR).
arXiv Detail & Related papers (2021-12-27T18:47:43Z)
- Measuring Uncertainty in Translation Quality Evaluation (TQE) [62.997667081978825]
This work estimates confidence intervals (Brown et al., 2001) as a function of the sample size of the translated text.
The methodology draws on Bernoulli Statistical Distribution Modelling (BSDM) and Monte Carlo Sampling Analysis (MCSA); a sketch of the general idea appears after this list.
arXiv Detail & Related papers (2021-11-15T12:09:08Z)
- Sentiment-Aware Measure (SAM) for Evaluating Sentiment Transfer by Machine Translation Systems [0.0]
In translating text where sentiment is the main message, human translators give particular attention to sentiment-carrying words.
We propose a numerical 'sentiment-closeness' measure appropriate for assessing the accuracy of a translated affect message in text by an MT system.
arXiv Detail & Related papers (2021-09-30T07:35:56Z)
- Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics [64.88815792555451]
We show that current methods for judging metrics are highly sensitive to the translations used for assessment.
We develop a method for thresholding performance improvement under an automatic metric against human judgements.
arXiv Detail & Related papers (2020-06-11T09:12:53Z)
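As noted in the Measuring Uncertainty in Translation Quality Evaluation (TQE) entry above, the idea of attaching confidence intervals to quality judgements can be illustrated with a short Monte Carlo sketch. This is a generic illustration that assumes a simple Bernoulli "segment passes the quality check" model; it is not the BSDM/MCSA procedure of that paper.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def mc_confidence_interval(p_hat: float, n: int, n_sims: int = 10_000, level: float = 0.95):
    """Monte Carlo confidence interval for a Bernoulli pass-rate estimated from n segments.

    Assumes each translated segment independently passes the quality check
    with probability p_hat (an illustrative assumption, not the paper's model).
    """
    # Simulate n_sims repeated evaluations of n segments each and look at the
    # spread of the resulting pass-rate estimates.
    simulated_rates = rng.binomial(n, p_hat, size=n_sims) / n
    alpha = 1.0 - level
    lo, hi = np.quantile(simulated_rates, [alpha / 2, 1.0 - alpha / 2])
    return lo, hi

# The interval tightens as the evaluated sample of segments grows.
for n in (50, 200, 1000):
    lo, hi = mc_confidence_interval(p_hat=0.85, n=n)
    print(f"n={n:5d}  95% CI for the quality rate: [{lo:.3f}, {hi:.3f}]")
```

The widening of the interval at small sample sizes is the sample-size effect that the TQE work sets out to quantify.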