Benchmarking GPT-4 against Human Translators: A Comprehensive Evaluation Across Languages, Domains, and Expertise Levels
- URL: http://arxiv.org/abs/2411.13775v1
- Date: Thu, 21 Nov 2024 01:12:46 GMT
- Title: Benchmarking GPT-4 against Human Translators: A Comprehensive Evaluation Across Languages, Domains, and Expertise Levels
- Authors: Jianhao Yan, Pingchuan Yan, Yulong Chen, Jing Li, Xianchao Zhu, Yue Zhang
- Abstract summary: GPT-4 achieves performance comparable to junior-level translators in terms of total errors.
Unlike traditional Neural Machine Translation systems, GPT-4 maintains consistent translation quality across all evaluated language pairs.
- Score: 20.05501751993599
- Abstract: This study presents a comprehensive evaluation of GPT-4's translation capabilities compared to human translators of varying expertise levels. Through systematic human evaluation using the MQM schema, we assess translations across three language pairs (Chinese↔English, Russian↔English, and Chinese↔Hindi) and three domains (News, Technology, and Biomedical). Our findings reveal that GPT-4 achieves performance comparable to junior-level translators in terms of total errors, while still lagging behind senior translators. Unlike traditional Neural Machine Translation systems, which show significant performance degradation in resource-poor language directions, GPT-4 maintains consistent translation quality across all evaluated language pairs. Through qualitative analysis, we identify distinctive patterns in translation approaches: GPT-4 tends toward overly literal translations and exhibits lexical inconsistency, while human translators sometimes over-interpret context and introduce hallucinations. This study represents the first systematic comparison between LLM and human translators across different proficiency levels, providing valuable insights into the current capabilities and limitations of LLM-based translation systems.
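For context on the evaluation protocol: MQM-style evaluation annotates each translation error with a category and a severity, then aggregates a weighted error count. Below is a minimal sketch in Python, assuming the severity weights commonly used in WMT-style MQM studies (minor = 1, major = 5, critical = 25); the abstract does not state this paper's exact weighting, so the weights and the per-100-words normalization here are assumptions, not the authors' setup.

```python
# Minimal MQM-style scoring sketch. Severity weights follow common
# WMT MQM practice (minor=1, major=5, critical=25); the paper's exact
# weighting is not given in the abstract, so these are assumptions.
from dataclasses import dataclass

SEVERITY_WEIGHTS = {"neutral": 0, "minor": 1, "major": 5, "critical": 25}

@dataclass
class Error:
    category: str   # e.g. "accuracy/mistranslation", "fluency/grammar"
    severity: str   # "neutral" | "minor" | "major" | "critical"

def mqm_score(errors: list[Error], num_words: int) -> float:
    """Weighted error count normalized per 100 source words (lower is better)."""
    total = sum(SEVERITY_WEIGHTS[e.severity] for e in errors)
    return 100 * total / max(num_words, 1)

# Example: two minor errors and one major error in a 50-word segment.
segment_errors = [
    Error("fluency/grammar", "minor"),
    Error("accuracy/mistranslation", "major"),
    Error("style/awkward", "minor"),
]
print(mqm_score(segment_errors, num_words=50))  # -> 14.0
```

Comparing such weighted error scores across translator cohorts (junior, senior, GPT-4) is what underlies the "total errors" comparison reported in the abstract.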
Related papers
- How Good Are LLMs for Literary Translation, Really? Literary Translation Evaluation with Humans and LLMs [23.247387152595067]
LITEVAL-CORPUS is a parallel corpus comprising multiple verified human translations and outputs from 9 machine translation systems.
We find that Multidimensional Quality Metrics (MQM), as the de facto standard in non-literary human MT evaluation, is inadequate for literary translation.
arXiv Detail & Related papers (2024-10-24T12:48:03Z) - GPT-4 vs. Human Translators: A Comprehensive Evaluation of Translation Quality Across Languages, Domains, and Expertise Levels [18.835573312027265]
This study comprehensively evaluates the translation quality of Large Language Models (LLMs) against human translators.
We find that GPT-4 performs comparably to junior translators in terms of total errors made but lags behind medium and senior translators.
arXiv Detail & Related papers (2024-07-04T05:58:04Z) - (Perhaps) Beyond Human Translation: Harnessing Multi-Agent Collaboration for Translating Ultra-Long Literary Texts [52.18246881218829]
We introduce a novel multi-agent framework based on large language models (LLMs) for literary translation, implemented as a company called TransAgents.
To evaluate the effectiveness of our system, we propose two innovative evaluation strategies: Monolingual Human Preference (MHP) and Bilingual LLM Preference (BLP).
arXiv Detail & Related papers (2024-05-20T05:55:08Z) - Large Language Models "Ad Referendum": How Good Are They at Machine Translation in the Legal Domain? [0.0]
This study evaluates the machine translation (MT) quality of two state-of-the-art large language models (LLMs) against a traditional neural machine translation (NMT) system across four language pairs in the legal domain.
It combines automatic evaluation metrics (AEMs) and human evaluation (HE) by professional translators to assess translation ranking, fluency and adequacy.
arXiv Detail & Related papers (2024-02-12T14:40:54Z) - On the Evaluation of Neural Code Translation: Taxonomy and Benchmark [12.431884660186281]
We develop a taxonomy that categorizes code translation tasks into four primary types according to their complexity and knowledge dependence.
We then conduct a thorough analysis of how existing approaches perform across these four categories.
Our findings indicate that while state-of-the-art code translation models excel in type-1 and type-2 translations, they struggle with knowledge-dependent ones such as type-3 and type-4.
arXiv Detail & Related papers (2023-08-17T13:05:27Z) - Discourse Centric Evaluation of Machine Translation with a Densely Annotated Parallel Corpus [82.07304301996562]
This paper presents a new dataset with rich discourse annotations, built upon the large-scale parallel corpus BWB introduced in Jiang et al.
We investigate the similarities and differences between the discourse structures of source and target languages.
We discover that MT outputs differ fundamentally from human translations in terms of their latent discourse structures.
arXiv Detail & Related papers (2023-05-18T17:36:41Z) - Multilingual Machine Translation with Large Language Models: Empirical Results and Analysis [103.89753784762445]
Large language models (LLMs) have demonstrated remarkable potential in handling multilingual machine translation (MMT).
This paper systematically investigates the advantages and challenges of LLMs for MMT.
We thoroughly evaluate eight popular LLMs, including ChatGPT and GPT-4.
arXiv Detail & Related papers (2023-04-10T15:51:30Z) - Large language models effectively leverage document-level context for literary translation, but critical errors persist [32.54546652197316]
Large language models (LLMs) are competitive with the state of the art on a wide range of sentence-level translation datasets.
We show through a rigorous human evaluation that asking the GPT-3.5 (text-davinci-003) LLM to translate an entire literary paragraph at once results in higher-quality translations than sentence-by-sentence translation.
arXiv Detail & Related papers (2023-04-06T17:27:45Z) - Document-Level Machine Translation with Large Language Models [91.03359121149595]
Large language models (LLMs) can produce coherent, cohesive, relevant, and fluent answers for various natural language processing (NLP) tasks.
This paper provides an in-depth evaluation of LLMs' ability in discourse modeling.
arXiv Detail & Related papers (2023-04-05T03:49:06Z) - Decoding and Diversity in Machine Translation [90.33636694717954]
We characterize the cost in diversity paid for the BLEU scores enjoyed by NMT.
Our study implicates search as a salient source of known bias when translating gender pronouns.
arXiv Detail & Related papers (2020-11-26T21:09:38Z) - On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation [55.02832094101173]
Evaluation of cross-lingual encoders is usually performed either via zero-shot cross-lingual transfer in supervised downstream tasks or via unsupervised cross-lingual similarity.
This paper concerns itself with reference-free machine translation (MT) evaluation, where we directly compare source texts to (sometimes low-quality) system translations.
We systematically investigate a range of metrics based on state-of-the-art cross-lingual semantic representations obtained with pretrained M-BERT and LASER.
We find that they perform poorly as semantic encoders for reference-free MT evaluation and identify their two key limitations (a minimal illustration of this kind of scoring follows after this list).
arXiv Detail & Related papers (2020-05-03T22:10:23Z)
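As referenced in the last entry above, reference-free scoring with a cross-lingual encoder embeds the source text and the system translation in a shared space and takes their cosine similarity as the quality score, with no reference translation needed. The sketch below uses LaBSE via the sentence-transformers library as an assumed stand-in for the M-BERT and LASER encoders studied in that paper; the paper's actual metrics differ in detail.

```python
# Reference-free MT scoring sketch: cosine similarity between source
# and hypothesis embeddings from a multilingual encoder. LaBSE is an
# assumed stand-in for the M-BERT/LASER encoders studied in the paper.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/LaBSE")

def reference_free_score(source: str, hypothesis: str) -> float:
    """Higher cosine similarity is taken as higher translation quality."""
    src_emb, hyp_emb = model.encode([source, hypothesis], convert_to_tensor=True)
    return util.cos_sim(src_emb, hyp_emb).item()

# German source vs. an English system translation.
print(reference_free_score("Das Wetter ist heute schön.",
                           "The weather is nice today."))
```

As that entry summarizes, the paper finds such encoder-similarity scores perform poorly for MT evaluation; the sketch only illustrates the mechanism being evaluated, not an endorsed metric.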