How Good Are GPT Models at Machine Translation? A Comprehensive Evaluation
- URL: http://arxiv.org/abs/2302.09210v1
- Date: Sat, 18 Feb 2023 02:11:36 GMT
- Authors: Amr Hendy, Mohamed Abdelrehim, Amr Sharaf, Vikas Raunak, Mohamed Gabr,
Hitokazu Matsushita, Young Jin Kim, Mohamed Afify, Hany Hassan Awadalla
- Abstract summary: We show that GPT models achieve very competitive translation quality for high resource languages.
We also show that hybrid approaches, which combine GPT models with other translation systems, can further enhance the translation quality.
- Score: 16.90012234231392
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generative Pre-trained Transformer (GPT) models have shown remarkable
capabilities for natural language generation, but their performance for machine
translation has not been thoroughly investigated. In this paper, we present a
comprehensive evaluation of GPT models for machine translation, covering
various aspects such as quality of different GPT models in comparison with
state-of-the-art research and commercial systems, effect of prompting
strategies, robustness towards domain shifts and document-level translation. We
experiment with eighteen different translation directions involving high and
low resource languages, as well as non-English-centric translations, and
evaluate the performance of three GPT models: ChatGPT, GPT3.5
(text-davinci-003), and text-davinci-002. Our results show that GPT models
achieve very competitive translation quality for high resource languages, while
having limited capabilities for low resource languages. We also show that
hybrid approaches, which combine GPT models with other translation systems, can
further enhance the translation quality. We perform comprehensive analysis and
human evaluation to further understand the characteristics of GPT translations.
We hope that our paper provides valuable insights for researchers and
practitioners in the field and helps to better understand the potential and
limitations of GPT models for translation.
Related papers
- Benchmarking GPT-4 against Human Translators: A Comprehensive Evaluation Across Languages, Domains, and Expertise Levels [20.05501751993599]
GPT-4 achieves performance comparable to junior-level translators in terms of total errors.
Unlike traditional Neural Machine Translation systems, GPT-4 maintains consistent translation quality across all evaluated language pairs.
arXiv Detail & Related papers (2024-11-21T01:12:46Z)
- Lost in the Source Language: How Large Language Models Evaluate the Quality of Machine Translation [64.5862977630713]
This study investigates how Large Language Models (LLMs) leverage source and reference data in the machine translation evaluation task.
We find that reference information significantly enhances evaluation accuracy, while, surprisingly, source information is sometimes counterproductive.
arXiv Detail & Related papers (2024-01-12T13:23:21Z)
- Do GPTs Produce Less Literal Translations? [20.095646048167612]
Large Language Models (LLMs) have emerged as general-purpose language models capable of addressing many natural language generation or understanding tasks.
We find that translations out of English (E-X) from GPTs tend to be less literal, while exhibiting similar or better scores on Machine Translation quality metrics.
arXiv Detail & Related papers (2023-05-26T10:38:31Z)
- Document-Level Machine Translation with Large Language Models [91.03359121149595]
Large language models (LLMs) can produce coherent, cohesive, relevant, and fluent answers for various natural language processing (NLP) tasks.
This paper provides an in-depth evaluation of LLMs' ability in discourse modeling.
arXiv Detail & Related papers (2023-04-05T03:49:06Z)
- How to Design Translation Prompts for ChatGPT: An Empirical Study [18.678893287863033]
ChatGPT has demonstrated surprising abilities in natural language understanding and natural language generation.
We adopt several translation prompts on a wide range of translations.
Our work provides empirical evidence that ChatGPT still has great potential in translations.
arXiv Detail & Related papers (2023-04-05T01:17:59Z)
- Analyzing the Performance of GPT-3.5 and GPT-4 in Grammatical Error Correction [28.58384091374763]
GPT-3 and GPT-4 models are powerful, achieving high performance on a variety of Natural Language Processing tasks.
We perform experiments testing the capabilities of a GPT-3.5 model (text-davinci-003) and a GPT-4 model (gpt-4-0314) on major GEC benchmarks.
We report the performance of our best prompt on the BEA-2019 and JFLEG datasets, finding that the GPT models can perform well in a sentence-level revision setting.
arXiv Detail & Related papers (2023-03-25T03:08:49Z)
- A Comprehensive Capability Analysis of GPT-3 and GPT-3.5 Series Models [71.42197262495056]
GPT series models have gained considerable attention due to their exceptional natural language processing capabilities.
We select six representative models, comprising two GPT-3 series models and four GPT-3.5 series models.
We evaluate their performance on nine natural language understanding (NLU) tasks using 21 datasets.
Our experiments reveal that the overall ability of GPT series models on NLU tasks does not increase gradually as the models evolve.
arXiv Detail & Related papers (2023-03-18T14:02:04Z)
- Large Language Models Are State-of-the-Art Evaluators of Translation Quality [7.818228526742237]
GEMBA is a GPT-based metric for assessment of translation quality.
We investigate nine versions of GPT models, including ChatGPT and GPT-4.
Our method achieves state-of-the-art accuracy in both modes when compared to MQM-based human labels.
arXiv Detail & Related papers (2023-02-28T12:23:48Z)
- News Summarization and Evaluation in the Era of GPT-3 [73.48220043216087]
We study how GPT-3 compares against fine-tuned models trained on large summarization datasets.
We show that not only do humans overwhelmingly prefer GPT-3 summaries, prompted using only a task description, but these also do not suffer from common dataset-specific issues such as poor factuality.
arXiv Detail & Related papers (2022-09-26T01:04:52Z)
- Elaboration-Generating Commonsense Question Answering at Scale [77.96137534751445]
In question answering requiring common sense, language models (e.g., GPT-3) have been used to generate text expressing background knowledge.
We finetune smaller language models to generate useful intermediate context, referred to here as elaborations.
Our framework alternates between updating two language models -- an elaboration generator and an answer predictor -- allowing each to influence the other.
arXiv Detail & Related papers (2022-09-02T18:32:09Z)
- Improving Multilingual Translation by Representation and Gradient Regularization [82.42760103045083]
We propose a joint approach to regularize NMT models at both representation-level and gradient-level.
Our results demonstrate that our approach is highly effective in both reducing off-target translation occurrences and improving zero-shot translation performance.
arXiv Detail & Related papers (2021-09-10T10:52:21Z)