Sentiment-Aware Measure (SAM) for Evaluating Sentiment Transfer by
Machine Translation Systems
- URL: http://arxiv.org/abs/2109.14895v1
- Date: Thu, 30 Sep 2021 07:35:56 GMT
- Title: Sentiment-Aware Measure (SAM) for Evaluating Sentiment Transfer by
Machine Translation Systems
- Authors: Hadeel Saadany, Emad Mohamed, Ashraf Tantavy
- Abstract summary: In translating text where sentiment is the main message, human translators give particular attention to sentiment-carrying words.
We propose a numerical 'sentiment-closeness' measure appropriate for assessing the accuracy of a translated affect message in text by an MT system.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In translating text where sentiment is the main message, human translators
give particular attention to sentiment-carrying words. The reason is that an
incorrect translation of such words would miss the fundamental aspect of the
source text, i.e. the author's sentiment. In the online world, MT systems are
extensively used to translate User-Generated Content (UGC) such as reviews,
tweets, and social media posts, where the main message is often the author's
positive or negative attitude towards the topic of the text. It is important in
such scenarios to accurately measure how far an MT system can be a reliable
real-life utility in transferring the correct affect message. This paper
tackles an under-recognised problem in the field of machine translation
evaluation which is judging to what extent automatic metrics concur with the
gold standard of human evaluation for a correct translation of sentiment. We
evaluate the efficacy of conventional quality metrics in spotting a
mistranslation of sentiment, especially when it is the sole error in the MT
output. We propose a numerical `sentiment-closeness' measure appropriate for
assessing the accuracy of a translated affect message in UGC text by an MT
system. We will show that incorporating this sentiment-aware measure can
significantly enhance the correlation of some available quality metrics with
the human judgement of an accurate translation of sentiment.
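The abstract leaves the measure's exact formulation to the paper itself; the sketch below only illustrates the general idea of a sentiment-aware adjustment, penalising a baseline quality score by the sentiment gap between reference and hypothesis. The toy lexicon, the polarity pooling, and the weight `alpha` are illustrative assumptions, not the paper's actual SAM.

```python
# Minimal illustrative sketch of a sentiment-aware adjustment to an MT
# quality score. This is NOT the paper's SAM formula; it only shows the
# idea: downweight a baseline metric score by the sentiment distance
# between reference and hypothesis. Lexicon and `alpha` are hypothetical.

TOY_LEXICON = {"good": 1.0, "great": 1.0, "love": 1.0,
               "bad": -1.0, "terrible": -1.0, "hate": -1.0}

def polarity(text: str) -> float:
    """Average polarity of sentiment-carrying words (toy lexicon)."""
    scores = [TOY_LEXICON[w] for w in text.lower().split() if w in TOY_LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

def sentiment_aware_score(base_score: float, reference: str,
                          hypothesis: str, alpha: float = 0.5) -> float:
    """Scale `base_score` down by the reference-hypothesis sentiment gap."""
    gap = abs(polarity(reference) - polarity(hypothesis))  # in [0, 2]
    return base_score * (1.0 - alpha * gap / 2.0)

# A near-perfect-looking hypothesis that flips sentiment is penalised:
print(sentiment_aware_score(0.9, "i love this phone", "i hate this phone"))
```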
Related papers
- BLEURT Has Universal Translations: An Analysis of Automatic Metrics by
Minimum Risk Training [64.37683359609308]
In this study, we analyze various mainstream and cutting-edge automatic metrics from the perspective of their guidance for training machine translation systems.
We find that certain metrics exhibit robustness defects, such as the presence of universal adversarial translations in BLEURT and BARTScore.
In-depth analysis suggests two main causes of these robustness deficits: distribution biases in the training datasets, and the tendency of the metric paradigm.
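As a rough illustration of what a "universal translation" defect means (the paper's own analysis via Minimum Risk Training is not reproduced here), the sketch below greedily mutates one hypothesis to maximise its average score against many references; `metric` is a token-overlap stand-in for a learned metric such as BLEURT.

```python
# Naive random search for a "universal translation": a single hypothesis
# that scores well under a (placeholder) metric against *any* reference.
import random

def metric(hypothesis: str, reference: str) -> float:
    """Placeholder for a learned metric such as BLEURT."""
    h, r = set(hypothesis.split()), set(reference.split())
    return len(h & r) / max(len(h | r), 1)

def find_universal(references: list[str], vocab: list[str],
                   length: int = 5, steps: int = 200, seed: int = 0) -> str:
    """Greedy mutation of one hypothesis to raise its mean metric score."""
    rng = random.Random(seed)
    hyp = [rng.choice(vocab) for _ in range(length)]
    best = sum(metric(" ".join(hyp), r) for r in references) / len(references)
    for _ in range(steps):
        cand = hyp[:]
        cand[rng.randrange(length)] = rng.choice(vocab)  # mutate one token
        score = sum(metric(" ".join(cand), r) for r in references) / len(references)
        if score > best:
            hyp, best = cand, score
    return " ".join(hyp)

refs = ["the movie was great", "the food was terrible", "the hotel was fine"]
vocab = sorted({w for r in refs for w in r.split()})
print(find_universal(refs, vocab))
```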
arXiv Detail & Related papers (2023-07-06T16:59:30Z)
- Evaluation of Chinese-English Machine Translation of Emotion-Loaded Microblog Texts: A Human Annotated Dataset for the Quality Assessment of Emotion Translation [7.858458986992082]
In this paper, we focus on how current Machine Translation (MT) tools perform on the translation of emotion-loaded texts.
We propose this evaluation framework based on the Multidimensional Quality Metrics (MQM) and perform a detailed error analysis of the MT outputs.
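For context, MQM-style evaluation annotates each error with a category and a severity and aggregates weighted penalties per segment; the sketch below uses one common severity weighting (minor = 1, major = 5, critical = 10), which may differ from this paper's exact typology.

```python
# Sketch of an MQM-style error annotation and a per-segment penalty.
# The severity weights are one common convention, not necessarily the
# paper's own scheme.
from dataclasses import dataclass

SEVERITY_WEIGHTS = {"minor": 1.0, "major": 5.0, "critical": 10.0}

@dataclass
class MQMError:
    category: str   # e.g. "accuracy/mistranslation", "fluency/grammar"
    severity: str   # "minor" | "major" | "critical"
    span: str       # the offending target-side text

def segment_penalty(errors: list[MQMError]) -> float:
    """Total weighted penalty for one segment."""
    return sum(SEVERITY_WEIGHTS[e.severity] for e in errors)

errors = [MQMError("accuracy/mistranslation (emotion)", "critical", "happy"),
          MQMError("fluency/grammar", "minor", "a informations")]
print(segment_penalty(errors))  # 11.0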
arXiv Detail & Related papers (2023-06-20T21:22:45Z)
- Extrinsic Evaluation of Machine Translation Metrics [78.75776477562087]
It is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level.
We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks.
Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes.
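A minimal sketch of this style of extrinsic check, assuming made-up per-segment metric scores and binary downstream outcomes; it measures their Kendall correlation with scipy.

```python
# Correlate a metric's per-segment scores with the success/failure of a
# downstream task that consumes the translations. Numbers are made up.
from scipy.stats import kendalltau

metric_scores = [0.91, 0.35, 0.78, 0.62, 0.88, 0.41]   # e.g. COMET per segment
task_success  = [1,    1,    0,    0,    1,    0   ]   # downstream outcome

tau, p_value = kendalltau(metric_scores, task_success)
print(f"Kendall tau = {tau:.3f} (p = {p_value:.3f})")
```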
arXiv Detail & Related papers (2022-12-20T14:39:58Z)
- Competency-Aware Neural Machine Translation: Can Machine Translation Know its Own Translation Quality? [61.866103154161884]
Neural machine translation (NMT) is often criticized for failures that occur without the system being aware of them.
We propose a novel competency-aware NMT by extending conventional NMT with a self-estimator.
We show that the proposed method delivers outstanding performance on quality estimation.
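The paper's exact architecture is not described here; the sketch below only shows the general idea of a self-estimator in PyTorch: a small head that pools decoder states into a scalar competency score. All dimensions and the pooling choice are illustrative assumptions.

```python
# Conceptual sketch (not the paper's architecture): a lightweight
# self-estimator head over NMT decoder states, emitting a confidence
# score alongside the translation.
import torch
import torch.nn as nn

class SelfEstimator(nn.Module):
    """Maps pooled decoder states to a competency score in (0, 1)."""
    def __init__(self, d_model: int = 512):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(d_model, 128), nn.ReLU(),
                                  nn.Linear(128, 1), nn.Sigmoid())

    def forward(self, decoder_states: torch.Tensor) -> torch.Tensor:
        # decoder_states: (batch, tgt_len, d_model) -> mean-pool over time
        return self.head(decoder_states.mean(dim=1)).squeeze(-1)

states = torch.randn(2, 7, 512)   # dummy decoder states for 2 segments
print(SelfEstimator()(states))    # two competency scores in (0, 1)
```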
arXiv Detail & Related papers (2022-11-25T02:39:41Z)
- A Semi-supervised Approach for a Better Translation of Sentiment in Dialectical Arabic UGT [2.6763498831034034]
We introduce a semi-supervised approach that exploits both monolingual and parallel data for training an NMT system.
We will show that our proposed system can significantly help with correcting sentiment errors detected in the online translation of dialectical Arabic UGT.
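The summary does not name the exact semi-supervised recipe; back-translation is one standard way to fold monolingual target-side data into NMT training, sketched below with a hypothetical `reverse_translate` placeholder for a target-to-source model.

```python
# Back-translation sketch: pair monolingual target sentences with
# synthetic sources, then train the forward NMT system on real plus
# synthetic parallel data. `reverse_translate` is a placeholder.

def reverse_translate(target_sentence: str) -> str:
    """Stand-in for a target->source (e.g. English->Arabic) model."""
    return "<synthetic source for: " + target_sentence + ">"

def build_synthetic_parallel(monolingual_target: list[str]) -> list[tuple[str, str]]:
    """Pair each monolingual target sentence with a synthetic source."""
    return [(reverse_translate(t), t) for t in monolingual_target]

parallel = [("src1", "tgt1")]                        # real parallel data
synthetic = build_synthetic_parallel(["tgt2", "tgt3"])
training_data = parallel + synthetic                 # train forward NMT on both
print(training_data)
```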
arXiv Detail & Related papers (2022-10-21T11:55:55Z)
- Rethink about the Word-level Quality Estimation for Machine Translation from Human Judgement [57.72846454929923]
We create a benchmark dataset, HJQE, where expert translators directly annotate poorly translated words.
We propose two tag-correcting strategies, namely a tag refinement strategy and a tree-based annotation strategy, to make the TER-based artificial QE corpus closer to HJQE.
The results show our proposed dataset is more consistent with human judgement and also confirm the effectiveness of the proposed tag correcting strategies.
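To make the contrast concrete, the sketch below derives TER-style artificial word tags: MT words that survive into a post-edit are tagged OK, edited ones BAD (difflib stands in for a TER alignment here). HJQE instead has experts tag genuinely mistranslated words directly.

```python
# TER-style artificial word tags: OK if an MT word survives into the
# post-edit, BAD otherwise. difflib is a stand-in for a TER alignment.
from difflib import SequenceMatcher

def word_tags(mt: list[str], post_edit: list[str]) -> list[str]:
    tags = ["BAD"] * len(mt)
    sm = SequenceMatcher(a=mt, b=post_edit)
    for block in sm.get_matching_blocks():
        for i in range(block.a, block.a + block.size):
            tags[i] = "OK"
    return tags

mt = "the film was very happy".split()
pe = "the film was very sad".split()
print(list(zip(mt, word_tags(mt, pe))))  # "happy" -> BAD
```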
arXiv Detail & Related papers (2022-09-13T02:37:12Z)
- HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Professional Post-Editing Towards More Effective MT Evaluation [0.0]
In this work, we introduce HOPE, a task-oriented and human-centric evaluation framework for machine translation output.
It covers only a limited number of commonly occurring error types and uses a scoring model that assigns error penalty points (EPPs) to each translation unit in a geometric progression reflecting error severity.
The approach has several key advantages: the ability to measure and compare less-than-perfect MT output from different systems, the ability to indicate human perception of quality, immediate estimation of the labour effort required to bring MT output to premium quality, low cost and fast application, and higher inter-rater reliability (IRR).
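A minimal sketch of a geometric EPP scheme of the kind described, assuming a base of 2 and a four-level severity ladder; the paper defines its own error types and penalty values.

```python
# Geometric error-penalty sketch: EPPs double with each severity level.
# The base of 2 and the severity ladder are illustrative assumptions.

SEVERITY_LEVELS = ["minor", "medium", "major", "critical"]

def penalty_points(severity: str, base: int = 2) -> int:
    """EPPs grow geometrically with severity: 1, 2, 4, 8."""
    return base ** SEVERITY_LEVELS.index(severity)

def unit_score(error_severities: list[str]) -> int:
    """Total EPPs for one translation unit."""
    return sum(penalty_points(s) for s in error_severities)

print(unit_score(["minor", "major"]))  # 1 + 4 = 5
```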
arXiv Detail & Related papers (2021-12-27T18:47:43Z)
- BLEU, METEOR, BERTScore: Evaluation of Metrics Performance in Assessing Critical Translation Errors in Sentiment-oriented Text [1.4213973379473654]
Machine translation (MT) of online content is commonly used to process posts written in several languages.
In this paper, we assess the ability of automatic quality metrics to detect critical machine translation errors.
We conclude that there is a need for fine-tuning of automatic metrics to make them more robust in detecting sentiment critical errors.
arXiv Detail & Related papers (2021-09-29T07:51:17Z)
- Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics [64.88815792555451]
We show that current methods for judging metrics are highly sensitive to the translations used for assessment.
We develop a method for thresholding performance improvement under an automatic metric against human judgements.
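A toy version of this kind of thresholding analysis, using made-up system pairs: check how often the metric's ranking agrees with human preference once the metric difference exceeds a given threshold.

```python
# For pairs of systems, measure metric-human ranking agreement as a
# function of the minimum metric difference. Scores are made up.

pairs = [  # (metric_a, metric_b, human_prefers_a)
    (27.1, 26.9, False), (31.4, 28.0, True),
    (24.6, 24.5, True),  (33.0, 29.8, True),
]

def agreement_at(threshold: float) -> float:
    kept = [(a, b, h) for a, b, h in pairs if abs(a - b) >= threshold]
    if not kept:
        return float("nan")
    agree = sum((a > b) == h for a, b, h in kept)
    return agree / len(kept)

for t in (0.0, 1.0, 3.0):
    print(f"delta >= {t}: agreement = {agreement_at(t):.2f}")
```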
arXiv Detail & Related papers (2020-06-11T09:12:53Z)
- On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation [55.02832094101173]
Evaluation of cross-lingual encoders is usually performed either via zero-shot cross-lingual transfer in supervised downstream tasks or via unsupervised cross-lingual similarity.
This paper concerns itself with reference-free machine translation (MT) evaluation, where we directly compare source texts to (sometimes low-quality) system translations.
We systematically investigate a range of metrics based on state-of-the-art cross-lingual semantic representations obtained with pretrained M-BERT and LASER.
We find that they perform poorly as semantic encoders for reference-free MT evaluation and identify their two key limitations.
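A minimal sketch of the reference-free setup: score a hypothesis by the cosine similarity between source and hypothesis vectors from a shared cross-lingual encoder. Here `encode` is a deterministic stand-in; the paper evaluates real pretrained encoders (M-BERT, LASER).

```python
# Reference-free scoring sketch: cosine similarity between source and
# hypothesis embeddings. `encode` is a placeholder, not a real encoder.
import numpy as np

def encode(text: str) -> np.ndarray:
    """Stand-in for a cross-lingual sentence encoder (e.g. LASER)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(512)

def reference_free_score(source: str, hypothesis: str) -> float:
    u, v = encode(source), encode(hypothesis)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(reference_free_score("das essen war schrecklich", "the food was terrible"))
```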
arXiv Detail & Related papers (2020-05-03T22:10:23Z)
- Can Your Context-Aware MT System Pass the DiP Benchmark Tests?: Evaluation Benchmarks for Discourse Phenomena in Machine Translation [7.993547048820065]
We introduce first-of-their-kind MT benchmark datasets that aim to track and hail improvements across four main discourse phenomena.
Surprisingly, we find that existing context-aware models do not improve discourse-related translations consistently across languages and phenomena.
arXiv Detail & Related papers (2020-04-30T07:15:36Z)