Understanding the Impact of UGC Specificities on Translation Quality
- URL: http://arxiv.org/abs/2110.12551v1
- Date: Sun, 24 Oct 2021 23:25:29 GMT
- Title: Understanding the Impact of UGC Specificities on Translation Quality
- Authors: José Carlos Rosales Núñez, Djamé Seddah, Guillaume Wisniewski
- Abstract summary: This work takes a critical look at the evaluation of the automatic translation of user-generated content.
Measuring average-case performance with a standard metric on a test set falls far short of giving a reliable picture of translation quality.
- Score: 6.123324869194193
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This work takes a critical look at the evaluation of the automatic
translation of user-generated content (UGC), whose well-known specificities
raise many challenges for MT. Our analyses show that measuring average-case
performance with a standard metric on a UGC test set falls far short of giving
a reliable picture of UGC translation quality. We therefore introduce a new
data set for the evaluation of UGC translation in which UGC specificities have
been manually annotated using a fine-grained typology. Using this data set, we
conduct several experiments to measure the impact of different kinds of UGC
specificities on translation quality, more precisely than was previously
possible.
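As a concrete illustration of the breakdown the paper argues for, the sketch below scores a system per annotated UGC phenomenon instead of reporting a single average-case number. It is a minimal sketch only: the JSONL layout and the field names `hypothesis`, `reference`, and `phenomena` are hypothetical placeholders rather than the paper's actual annotation schema, and chrF (via sacrebleu) stands in for whatever metric is being studied.

```python
"""Minimal sketch: per-phenomenon MT scoring instead of one average-case score.

Assumes a hypothetical JSONL file in which every line carries a hypothesis,
a reference, and the manually annotated UGC phenomena for that sentence.
"""
import json
from collections import defaultdict

from sacrebleu.metrics import CHRF  # pip install sacrebleu


def per_phenomenon_chrf(jsonl_path: str) -> dict[str, float]:
    # Bucket (hypothesis, reference) pairs by annotated phenomenon.
    buckets: dict[str, list[tuple[str, str]]] = defaultdict(list)
    with open(jsonl_path, encoding="utf-8") as fh:
        for line in fh:
            example = json.loads(line)
            for tag in example.get("phenomena", ["none"]):
                buckets[tag].append((example["hypothesis"], example["reference"]))

    # One corpus-level chrF score per phenomenon bucket.
    chrf = CHRF()
    scores = {}
    for tag, pairs in buckets.items():
        hypotheses = [hyp for hyp, _ in pairs]
        references = [[ref for _, ref in pairs]]  # a single reference stream
        scores[tag] = chrf.corpus_score(hypotheses, references).score
    return scores


if __name__ == "__main__":
    for tag, score in sorted(per_phenomenon_chrf("annotated_ugc.jsonl").items()):
        print(f"{tag:>25s}  chrF = {score:.1f}")
```

Comparing each bucket against the whole-corpus score is one simple way to surface which UGC specificities hurt translation quality most, which is the kind of fine-grained measurement the annotated data set is meant to enable.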
Related papers
- Are Large Language Models State-of-the-art Quality Estimators for Machine Translation of User-generated Content? [6.213698466889738]
This paper investigates whether large language models (LLMs) are state-of-the-art quality estimators for machine translation of user-generated content (UGC).
We employ an existing emotion-related dataset with human-annotated errors and calculate quality evaluation scores based on the Multidimensional Quality Metrics (MQM) framework.
arXiv Detail & Related papers (2024-10-08T20:16:59Z)
- Beyond Correlation: Interpretable Evaluation of Machine Translation Metrics [46.71836180414362]
We introduce an interpretable evaluation framework for Machine Translation (MT) metrics.
Within this framework, we evaluate metrics in two scenarios that serve as proxies for the data filtering and translation re-ranking use cases.
We also raise concerns regarding the reliability of manually curated data following the Direct Assessments+Scalar Quality Metrics (DA+SQM) guidelines.
arXiv Detail & Related papers (2024-10-07T16:42:10Z)
- A Multi-task Learning Framework for Evaluating Machine Translation of Emotion-loaded User-generated Content [6.213698466889738]
Machine translation of user-generated content (UGC) poses unique challenges, including handling slang, emotion, and literary devices like irony and sarcasm.
We utilize an existing emotion-related dataset that includes emotion labels and human-annotated translation errors.
We extend it with sentence-level evaluation scores and word-level labels, leading to a dataset suitable for sentence- and word-level translation evaluation and emotion classification.
arXiv Detail & Related papers (2024-10-04T09:49:57Z)
- Evaluating Automatic Metrics with Incremental Machine Translation Systems [55.78547133890403]
We introduce a dataset comprising commercial machine translations, gathered weekly over six years across 12 translation directions.
We assume commercial systems improve over time, which enables us to evaluate machine translation (MT) metrics based on their preference for more recent translations.
arXiv Detail & Related papers (2024-07-03T17:04:17Z)
- Guiding In-Context Learning of LLMs through Quality Estimation for Machine Translation [0.846600473226587]
This paper presents a novel methodology for in-context learning (ICL) that relies on a search algorithm guided by domain-specific quality estimation (QE).
Our results demonstrate significant improvements over existing ICL methods and higher translation performance compared to fine-tuning a pre-trained language model (PLM).
arXiv Detail & Related papers (2024-06-12T07:49:36Z)
- Machine Translation Meta Evaluation through Translation Accuracy Challenge Sets [92.38654521870444]
We introduce ACES, a contrastive challenge set spanning 146 language pairs.
This dataset aims to discover whether metrics can identify 68 types of translation accuracy errors.
We conduct a large-scale study by benchmarking ACES on 50 metrics submitted to the WMT 2022 and 2023 metrics shared tasks.
arXiv Detail & Related papers (2024-01-29T17:17:42Z)
- Extrinsic Evaluation of Machine Translation Metrics [78.75776477562087]
It is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level.
We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks.
Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of downstream outcomes (see the correlation sketch after this list).
arXiv Detail & Related papers (2022-12-20T14:39:58Z)
- Rethink about the Word-level Quality Estimation for Machine Translation from Human Judgement [57.72846454929923]
We create a benchmark dataset, HJQE, in which expert translators directly annotate poorly translated words.
We propose two tag correcting strategies, namely a tag refinement strategy and a tree-based annotation strategy, to make the TER-based artificial QE corpus closer to HJQE.
The results show our proposed dataset is more consistent with human judgement and also confirm the effectiveness of the proposed tag correcting strategies.
arXiv Detail & Related papers (2022-09-13T02:37:12Z)
- BLEU, METEOR, BERTScore: Evaluation of Metrics Performance in Assessing Critical Translation Errors in Sentiment-oriented Text [1.4213973379473654]
Machine Translation (MT) of online content is commonly used to process posts written in several languages.
In this paper, we assess the ability of automatic quality metrics to detect critical machine translation errors.
We conclude that there is a need to fine-tune automatic metrics to make them more robust in detecting sentiment-critical errors.
arXiv Detail & Related papers (2021-09-29T07:51:17Z)
- Towards Question-Answering as an Automatic Metric for Evaluating the Content Quality of a Summary [65.37544133256499]
We propose a metric to evaluate the content quality of a summary using question-answering (QA).
We demonstrate the experimental benefits of QA-based metrics through an analysis of our proposed metric, QAEval.
arXiv Detail & Related papers (2020-10-01T15:33:09Z)
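Several of the entries above (notably the extrinsic-evaluation and word-level quality-estimation papers) revolve around segment-level meta-evaluation: how well a metric's per-sentence scores track human judgements or downstream outcomes. The sketch below shows one common way to compute this as a Kendall correlation; it is an assumption-laden illustration, with a hypothetical input file, made-up field names (`hypothesis`, `reference`, `human_score`), and sentence-level chrF standing in for the metric under study.

```python
"""Minimal sketch: segment-level meta-evaluation of an automatic MT metric.

Correlates per-sentence chrF scores with human (or downstream-task) quality
judgements; the data file and its field names are hypothetical placeholders.
"""
import json

from sacrebleu.metrics import CHRF   # pip install sacrebleu
from scipy.stats import kendalltau   # pip install scipy


def segment_level_kendall(jsonl_path: str) -> float:
    chrf = CHRF()
    metric_scores, human_scores = [], []
    with open(jsonl_path, encoding="utf-8") as fh:
        for line in fh:
            example = json.loads(line)
            # Sentence-level metric score for one hypothesis/reference pair.
            metric_scores.append(
                chrf.sentence_score(example["hypothesis"], [example["reference"]]).score
            )
            # Human judgement (e.g., a DA score) or a downstream-task outcome.
            human_scores.append(example["human_score"])
    tau, _p_value = kendalltau(metric_scores, human_scores)
    return tau


if __name__ == "__main__":
    print(f"Segment-level Kendall tau: {segment_level_kendall('judged_segments.jsonl'):.3f}")
```

Swapping `human_score` for a downstream-task success indicator turns the same computation into the extrinsic, task-based variant discussed above; a tau near zero would correspond to the negligible segment-level correlations that paper reports.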
This list is automatically generated from the titles and abstracts of the papers on this site.