Evaluation of Chinese-English Machine Translation of Emotion-Loaded
Microblog Texts: A Human Annotated Dataset for the Quality Assessment of
Emotion Translation
- URL: http://arxiv.org/abs/2306.11900v1
- Date: Tue, 20 Jun 2023 21:22:45 GMT
- Title: Evaluation of Chinese-English Machine Translation of Emotion-Loaded
Microblog Texts: A Human Annotated Dataset for the Quality Assessment of
Emotion Translation
- Authors: Shenbin Qian, Constantin Orasan, Felix do Carmo, Qiuliang Li, Diptesh
Kanojia
- Abstract summary: In this paper, we focus on how current Machine Translation (MT) tools perform on the translation of emotion-loaded texts.
We propose this evaluation framework based on the Multidimensional Quality Metrics (MQM) and perform a detailed error analysis of the MT outputs.
- Score: 7.858458986992082
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we focus on how current Machine Translation (MT) tools perform
on the translation of emotion-loaded texts by evaluating outputs from Google
Translate according to a framework proposed in this paper. We propose this
evaluation framework based on the Multidimensional Quality Metrics (MQM) and
perform a detailed error analysis of the MT outputs. From our analysis, we
observe that about 50% of the MT outputs fail to preserve the original emotion.
After further analysis of the errors, we find that emotion-carrying words and
linguistic phenomena such as polysemous words, negation and abbreviations are
common causes of these translation errors.
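For illustration, an MQM-style segment score can be derived from annotated errors by weighting them by severity and normalizing by segment length. The sketch below is a minimal example of this idea; the severity weights and error categories are illustrative, not the exact values of the framework proposed in the paper.

```python
# Minimal sketch of an MQM-style penalty score. Severity weights and
# error categories are illustrative, not the paper's actual values.
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}

def mqm_score(errors, num_words):
    """errors: list of (category, severity) pairs annotated by evaluators."""
    penalty = sum(SEVERITY_WEIGHTS[severity] for _, severity in errors)
    # Normalize by segment length so longer segments are not over-penalized.
    return max(0.0, 100.0 * (1.0 - penalty / num_words))

errors = [("mistranslation/emotion", "critical"), ("terminology", "minor")]
print(mqm_score(errors, num_words=25))  # 56.0
```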
Related papers
- Are Large Language Models State-of-the-art Quality Estimators for Machine Translation of User-generated Content? [6.213698466889738]
This paper investigates whether large language models (LLMs) are state-of-the-art quality estimators for machine translation of user-generated content (UGC).
We employ an existing emotion-related dataset with human-annotated errors and calculate quality evaluation scores based on the Multi-dimensional Quality Metrics.
arXiv Detail & Related papers (2024-10-08T20:16:59Z)
- A Multi-task Learning Framework for Evaluating Machine Translation of Emotion-loaded User-generated Content [6.213698466889738]
Machine translation of user-generated content (UGC) poses unique challenges, including handling slang, emotion, and literary devices like irony and sarcasm.
We utilize an existing emotion-related dataset that includes emotion labels and human-annotated translation errors.
We extend it with sentence-level evaluation scores and word-level labels, leading to a dataset suitable for sentence- and word-level translation evaluation and emotion classification.
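A record in such an extended dataset might plausibly look like the following; the field names and values are hypothetical, chosen only to show how sentence-level scores, word-level labels and emotion labels can coexist in one example.

```python
# Hypothetical shape of one record combining the three supervision signals.
record = {
    "src": "今天气死我了！",                         # source microblog text
    "mt": "I am so angry today!",                    # machine translation
    "sentence_score": 0.85,                          # sentence-level quality
    "word_labels": ["OK", "OK", "OK", "OK", "OK"],   # word-level QE tags
    "emotion": "anger",                              # emotion classification label
}
```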
arXiv Detail & Related papers (2024-10-04T09:49:57Z)
- Understanding and Addressing the Under-Translation Problem from the Perspective of Decoding Objective [72.83966378613238]
Under-translation and over-translation remain two challenging problems in state-of-the-art Neural Machine Translation (NMT) systems.
We conduct an in-depth analysis on the underlying cause of under-translation in NMT, providing an explanation from the perspective of decoding objective.
We propose employing the confidence of predicting End Of Sentence (EOS) as a detector for under-translation, and strengthening the confidence-based penalty to penalize candidates with a high risk of under-translation.
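A toy version of the EOS-confidence detector described here might look as follows; the threshold, length ratio and penalty form are all hypothetical.

```python
import math

def under_translation_penalty(eos_logprobs, src_len,
                              alpha=1.0, min_len_ratio=0.6, eos_threshold=0.5):
    """Flag a beam candidate that is confident about EOS 'too early'.

    eos_logprobs: log P(EOS) at each decoding step of one candidate.
    """
    for step, logprob in enumerate(eos_logprobs, start=1):
        premature = step < min_len_ratio * src_len
        if premature and math.exp(logprob) > eos_threshold:
            # Strengthen the penalty with the EOS confidence itself.
            return alpha * math.exp(logprob)
    return 0.0  # no sign of under-translation
```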
arXiv Detail & Related papers (2024-05-29T09:25:49Z)
- The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation [93.01964988474755]
AutoMQM is a prompting technique which asks large language models to identify and categorize errors in translations.
We study the impact of labeled data through in-context learning and finetuning.
We then evaluate AutoMQM with PaLM-2 models, and we find that it improves performance compared to just prompting for scores.
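In spirit, an AutoMQM-style prompt asks the model for errors rather than a single score; a paraphrased template (not the paper's exact wording) could look like this:

```python
# Paraphrased AutoMQM-style prompt: elicit spans, categories and severities
# instead of a single scalar score. Wording is hypothetical.
TEMPLATE = """You are an expert translation evaluator.
Source (Chinese): {src}
Translation (English): {hyp}
List every translation error as: <span> | <category> | <minor/major/critical>.
If the translation is perfect, answer: no-error."""

def build_automqm_prompt(src: str, hyp: str) -> str:
    return TEMPLATE.format(src=src, hyp=hyp)
```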
arXiv Detail & Related papers (2023-08-14T17:17:21Z)
- Discourse Centric Evaluation of Machine Translation with a Densely Annotated Parallel Corpus [82.07304301996562]
This paper presents a new dataset with rich discourse annotations, built upon the large-scale parallel corpus BWB introduced in Jiang et al.
We investigate the similarities and differences between the discourse structures of source and target languages.
We discover that MT outputs differ fundamentally from human translations in terms of their latent discourse structures.
arXiv Detail & Related papers (2023-05-18T17:36:41Z)
- Extrinsic Evaluation of Machine Translation Metrics [78.75776477562087]
It is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level.
We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks.
Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes.
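Segment-level correlation of this kind can be checked with a rank correlation between metric scores and downstream outcomes; the numbers below are invented for illustration.

```python
from scipy.stats import kendalltau

metric_scores = [0.71, 0.43, 0.88, 0.52, 0.64]  # e.g. per-segment COMET
task_outcomes = [1, 0, 1, 1, 0]                 # downstream task success

tau, p_value = kendalltau(metric_scores, task_outcomes)
print(f"Kendall tau = {tau:.2f} (p = {p_value:.2f})")
```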
arXiv Detail & Related papers (2022-12-20T14:39:58Z)
- Rethink about the Word-level Quality Estimation for Machine Translation from Human Judgement [57.72846454929923]
We create a benchmark dataset, HJQE, in which expert translators directly annotate poorly translated words.
We propose two tag-correcting strategies, namely a tag refinement strategy and a tree-based annotation strategy, to make the TER-based artificial QE corpus closer to HJQE.
The results show our proposed dataset is more consistent with human judgement and also confirm the effectiveness of the proposed tag correcting strategies.
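The gap between TER-derived and human word-level tags can be seen in a toy example like the one below (sentence and tags invented): automatic alignment marks any edited word as BAD, while human annotators only mark genuine mistranslations.

```python
# Toy word-level QE example; all data is invented.
mt_output  = ["he", "feels", "very", "exciting", "today"]
ter_tags   = ["OK", "BAD", "OK", "BAD", "OK"]  # from automatic TER alignment
human_tags = ["OK", "OK",  "OK", "BAD", "OK"]  # expert judgement (HJQE-style)

noise = [w for w, t, h in zip(mt_output, ter_tags, human_tags)
         if t == "BAD" and h == "OK"]
print(noise)  # ['feels'] -- an edit that does not reflect a real error
```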
arXiv Detail & Related papers (2022-09-13T02:37:12Z)
- Original or Translated? A Causal Analysis of the Impact of Translationese on Machine Translation Performance [31.47795931399995]
Human-translated text displays distinct features from naturally written text in the same language.
We find that existing work on translationese neglects some important factors and that its conclusions are mostly correlational rather than causal.
We show that two of these factors have a large causal effect on MT performance.
arXiv Detail & Related papers (2022-05-04T19:17:55Z)
- Measuring Uncertainty in Translation Quality Evaluation (TQE) [62.997667081978825]
This work carries out motivated research to correctly estimate confidence intervals (Brown et al., 2001) depending on the sample size of the translated text.
The methodology we applied for this work is from Bernoulli Statistical Distribution Modelling (BSDM) and Monte Carlo Sampling Analysis (MCSA).
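As a back-of-the-envelope version of the idea: treat "segment translated acceptably" as a Bernoulli variable and look at how the interval width depends on sample size. The snippet uses a plain normal-approximation interval (Brown et al. discuss better choices, e.g. the Wilson interval) and invented numbers.

```python
import random

def wald_ci(p_hat, n, z=1.96):
    """Normal-approximation 95% CI for a Bernoulli proportion."""
    half = z * (p_hat * (1.0 - p_hat) / n) ** 0.5
    return max(0.0, p_hat - half), min(1.0, p_hat + half)

print(wald_ci(0.5, 100))    # ~(0.40, 0.60): 100 segments is still coarse
print(wald_ci(0.5, 10000))  # ~(0.49, 0.51): narrows with sample size

# Monte Carlo check of the interval width at n = 100.
sims = sorted(sum(random.random() < 0.5 for _ in range(100)) / 100
              for _ in range(10_000))
print(sims[250], sims[9750])  # empirical 95% interval, close to the formula
```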
arXiv Detail & Related papers (2021-11-15T12:09:08Z)
- Sentiment-Aware Measure (SAM) for Evaluating Sentiment Transfer by Machine Translation Systems [0.0]
In translating text where sentiment is the main message, human translators give particular attention to sentiment-carrying words.
We propose a numerical 'sentiment-closeness' measure appropriate for assessing how accurately an MT system translates the affective message of a text.
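One simple way to operationalize sentiment closeness (a stand-in, not SAM's actual formula) is to compare polarity scores of the source and its translation:

```python
def sentiment_closeness(src_polarity: float, mt_polarity: float) -> float:
    """Inputs are polarities in [-1, 1] from any sentiment analyzer.
    Returns 1.0 when the affect is preserved, 0.0 when fully inverted."""
    return 1.0 - abs(src_polarity - mt_polarity) / 2.0

print(sentiment_closeness(0.8, -0.4))  # 0.4: much of the emotion was lost
```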
arXiv Detail & Related papers (2021-09-30T07:35:56Z)
- BLEU, METEOR, BERTScore: Evaluation of Metrics Performance in Assessing Critical Translation Errors in Sentiment-oriented Text [1.4213973379473654]
Machine Translation (MT) of online content is commonly used to process posts written in several languages.
In this paper, we assess the ability of automatic quality metrics to detect critical machine translation errors.
We conclude that there is a need for fine-tuning of automatic metrics to make them more robust in detecting sentiment critical errors.
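This failure mode is easy to reproduce with, e.g., sacrebleu: a flipped negation is a critical sentiment error, yet the n-gram overlap stays high. The example sentences are invented.

```python
from sacrebleu.metrics import BLEU

ref = ["I am not happy with this service"]
hyp = "I am happy with this service"  # critical negation error

bleu = BLEU(effective_order=True)
print(bleu.sentence_score(hyp, ref).score)  # high despite the flipped meaning
```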
arXiv Detail & Related papers (2021-09-29T07:51:17Z)