A comparison of translation performance between DeepL and Supertext
- URL: http://arxiv.org/abs/2502.02577v2
- Date: Tue, 11 Feb 2025 10:06:14 GMT
- Title: A comparison of translation performance between DeepL and Supertext
- Authors: Alex Flückiger, Chantal Amrhein, Tim Graf, Frédéric Odermatt, Martin Pömsl, Philippe Schläpfer, Florian Schottmann, Samuel Läubli
- Abstract summary: This study compares two commercial machine translation systems -- DeepL and Supertext.
We evaluate translation quality across four language directions with professional translators assessing segments with full document-level context.
While segment-level assessments indicate no strong preference between the systems in most cases, document-level analysis reveals a preference for Supertext in three out of four language directions.
- Score: 3.858812369171884
- Abstract: As strong machine translation (MT) systems are increasingly based on large language models (LLMs), reliable quality benchmarking requires methods that capture their ability to leverage extended context. This study compares two commercial MT systems -- DeepL and Supertext -- by assessing their performance on unsegmented texts. We evaluate translation quality across four language directions with professional translators assessing segments with full document-level context. While segment-level assessments indicate no strong preference between the systems in most cases, document-level analysis reveals a preference for Supertext in three out of four language directions, suggesting superior consistency across longer texts. We advocate for more context-sensitive evaluation methodologies to ensure that MT quality assessments reflect real-world usability. We release all evaluation data and scripts for further analysis and reproduction at https://github.com/supertext/evaluation_deepl_supertext.
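As a minimal sketch of how the segment-level judgments described above could be rolled up into the paper's document-level comparison: the field names, data layout, and majority-vote tie handling below are illustrative assumptions, not the procedure of the released scripts.

```python
from collections import Counter

# Hypothetical segment-level judgments: one dict per evaluated segment.
# "doc" groups segments into their source document; "preference" is the
# translator's verdict for that segment ("A", "B", or "tie").
judgments = [
    {"doc": "doc1", "segment": 1, "preference": "A"},
    {"doc": "doc1", "segment": 2, "preference": "tie"},
    {"doc": "doc1", "segment": 3, "preference": "B"},
    {"doc": "doc2", "segment": 1, "preference": "B"},
    {"doc": "doc2", "segment": 2, "preference": "B"},
]

def document_level_winners(judgments):
    """Aggregate segment preferences into one verdict per document
    (simple majority vote; equal counts stay a tie)."""
    per_doc = {}
    for j in judgments:
        per_doc.setdefault(j["doc"], Counter())[j["preference"]] += 1
    winners = {}
    for doc, counts in per_doc.items():
        a, b = counts.get("A", 0), counts.get("B", 0)
        winners[doc] = "A" if a > b else "B" if b > a else "tie"
    return winners

print(document_level_winners(judgments))
# e.g. {'doc1': 'tie', 'doc2': 'B'}
```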
Related papers
- Benchmarking GPT-4 against Human Translators: A Comprehensive Evaluation Across Languages, Domains, and Expertise Levels [20.05501751993599]
GPT-4 achieves performance comparable to junior-level translators in terms of total errors.
Unlike traditional Neural Machine Translation systems, GPT-4 maintains consistent translation quality across all evaluated language pairs.
arXiv Detail & Related papers (2024-11-21T01:12:46Z)
- Is Context Helpful for Chat Translation Evaluation? [23.440392979857247]
We conduct a meta-evaluation of existing sentence-level automatic metrics to assess the quality of machine-translated chats.
We find that reference-free metrics lag behind reference-based ones, especially when evaluating translation quality in out-of-English settings.
We propose a new evaluation metric, Context-MQM, that utilizes bilingual context with a large language model.
arXiv Detail & Related papers (2024-03-13T07:49:50Z)
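To illustrate the general idea behind a context-aware LLM metric such as Context-MQM from the entry above, here is a hypothetical sketch that feeds preceding bilingual context to a chat model before asking for a quality judgment. The prompt wording, model name, and 0-100 scale are illustrative assumptions, not the paper's actual prompt or implementation.

```python
from openai import OpenAI  # assumes the openai client package is installed and configured

client = OpenAI()

def context_aware_quality_score(src_context, tgt_context, src_segment, hypothesis):
    """Hypothetical context-aware quality prompt: the model sees the preceding
    bilingual chat turns before judging the current segment on a 0-100 scale."""
    prompt = (
        "You are a professional translation quality rater.\n"
        f"Previous source turns:\n{src_context}\n"
        f"Previous translated turns:\n{tgt_context}\n\n"
        f"Current source segment: {src_segment}\n"
        f"Candidate translation: {hypothesis}\n\n"
        "List any translation errors, then output a single quality score "
        "from 0 (unusable) to 100 (perfect) on the last line."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice, not the paper's
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content
```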
- Evaluating Optimal Reference Translations [4.956416618428049]
We propose a methodology for creating more reliable document-level human reference translations.
We evaluate the obtained document-level optimal reference translations in comparison with "standard" ones.
arXiv Detail & Related papers (2023-11-28T13:50:50Z)
- Discourse Centric Evaluation of Machine Translation with a Densely Annotated Parallel Corpus [82.07304301996562]
This paper presents a new dataset with rich discourse annotations, built upon the large-scale parallel corpus BWB introduced in Jiang et al.
We investigate the similarities and differences between the discourse structures of source and target languages.
We discover that MT outputs differ fundamentally from human translations in terms of their latent discourse structures.
arXiv Detail & Related papers (2023-05-18T17:36:41Z)
- Exploring the Use of Large Language Models for Reference-Free Text Quality Evaluation: An Empirical Study [63.27346930921658]
ChatGPT is capable of evaluating text quality effectively from various perspectives without reference.
The Explicit Score, which uses ChatGPT to generate a numeric score measuring text quality, is the most effective and reliable of the three approaches explored.
arXiv Detail & Related papers (2023-04-03T05:29:58Z)
- Large Language Models are Diverse Role-Players for Summarization Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most automatic evaluation methods like BLEU/ROUGE may not be able to adequately capture the above dimensions.
We propose a new LLM-based evaluation framework that compares generated text and reference text from both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z)
- Measuring Uncertainty in Translation Quality Evaluation (TQE) [62.997667081978825]
This work investigates how to correctly estimate confidence intervals (Brown et al., 2001) depending on the sample size of the translated text.
The methodology applied in this work draws on Bernoulli Statistical Distribution Modelling (BSDM) and Monte Carlo Sampling Analysis (MCSA).
arXiv Detail & Related papers (2021-11-15T12:09:08Z)
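The uncertainty-estimation idea in the TQE entry above can be illustrated with a small numerical sketch: treat each translated segment as a Bernoulli trial (acceptable / not acceptable) and watch the estimated error-rate interval narrow as the sample grows. The numbers and the normal-approximation interval are illustrative simplifications; the paper's own BSDM/MCSA modelling is more involved.

```python
import math
import random

def mean_interval_half_width(p_true, n, n_sims=10_000, z=1.96, seed=0):
    """Monte Carlo sketch: simulate n Bernoulli-distributed segment judgments
    with true error rate p_true, then report the average half-width of the
    normal-approximation 95% confidence interval for the estimated rate."""
    rng = random.Random(seed)
    half_widths = []
    for _ in range(n_sims):
        errors = sum(rng.random() < p_true for _ in range(n))
        p_hat = errors / n
        half_widths.append(z * math.sqrt(p_hat * (1 - p_hat) / n))
    return sum(half_widths) / n_sims

for n in (50, 200, 1000):
    print(n, round(mean_interval_half_width(p_true=0.1, n=n), 3))
# The half-width shrinks roughly with 1/sqrt(n): about 0.08 at n=50
# versus about 0.02 at n=1000 (illustrative values).
```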
- It is Not as Good as You Think! Evaluating Simultaneous Machine Translation on Interpretation Data [58.105938143865906]
We argue that SiMT systems should be trained and tested on real interpretation data.
Our results highlight a difference of up to 13.83 BLEU points when SiMT models are evaluated on translation vs. interpretation data.
arXiv Detail & Related papers (2021-10-11T12:27:07Z)
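A BLEU gap of the kind reported in the entry above comes from scoring the same system outputs against different reference sets. Below is a minimal sketch of that comparison with sacrebleu; the toy sentences and reference pairs are invented for illustration only.

```python
import sacrebleu  # assumes the sacrebleu package is installed

# Toy example: the same system outputs scored against written-translation
# references vs. transcribed-interpretation references.
system_outputs = [
    "The committee approved the proposal yesterday.",
    "We will discuss the budget next week.",
]
translation_refs = [
    "The committee approved the proposal yesterday.",
    "We will discuss the budget next week.",
]
interpretation_refs = [
    "So the committee, they approved it yesterday.",
    "Budget discussion is planned for next week.",
]

bleu_translation = sacrebleu.corpus_bleu(system_outputs, [translation_refs])
bleu_interpretation = sacrebleu.corpus_bleu(system_outputs, [interpretation_refs])
print(f"vs. translations:    {bleu_translation.score:.2f}")
print(f"vs. interpretations: {bleu_interpretation.score:.2f}")
# The gap between the two scores is the kind of difference the paper quantifies.
```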
- Diving Deep into Context-Aware Neural Machine Translation [36.17847243492193]
This paper analyzes the performance of document-level NMT models on four diverse domains.
We find that there is no single best approach to document-level NMT, but rather that different architectures come out on top on different tasks.
arXiv Detail & Related papers (2020-10-19T13:23:12Z)
- On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation [55.02832094101173]
Evaluation of cross-lingual encoders is usually performed either via zero-shot cross-lingual transfer in supervised downstream tasks or via unsupervised cross-lingual similarity.
This paper concerns itself with reference-free machine translation (MT) evaluation, where we directly compare source texts to (sometimes low-quality) system translations.
We systematically investigate a range of metrics based on state-of-the-art cross-lingual semantic representations obtained with pretrained M-BERT and LASER.
We find that they perform poorly as semantic encoders for reference-free MT evaluation and identify their two key limitations.
arXiv Detail & Related papers (2020-05-03T22:10:23Z)
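As a rough sketch of the reference-free setup described in the entry above: embed the source and the system output with a multilingual encoder and use their similarity as a quality proxy. The encoder choice (LaBSE via sentence-transformers rather than M-BERT or LASER) and the cosine-similarity scoring are illustrative assumptions; the paper's finding is precisely that such metrics have clear limitations.

```python
from sentence_transformers import SentenceTransformer, util  # assumed installed

# Multilingual sentence encoder used here as a stand-in for the
# cross-lingual representations studied in the paper.
model = SentenceTransformer("sentence-transformers/LaBSE")

source = ["Der Vertrag wurde gestern unterzeichnet."]
translation = ["The contract was signed yesterday."]

src_emb = model.encode(source, convert_to_tensor=True)
hyp_emb = model.encode(translation, convert_to_tensor=True)

# Cosine similarity between source and translation embeddings,
# used directly as a reference-free quality proxy.
score = util.cos_sim(src_emb, hyp_emb).item()
print(f"reference-free similarity score: {score:.3f}")
```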
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.