Benchmarking LLMs for Translating Classical Chinese Poetry: Evaluating Adequacy, Fluency, and Elegance
- URL: http://arxiv.org/abs/2408.09945v3
- Date: Thu, 17 Oct 2024 02:17:57 GMT
- Title: Benchmarking LLMs for Translating Classical Chinese Poetry: Evaluating Adequacy, Fluency, and Elegance
- Authors: Andong Chen, Lianzhang Lou, Kehai Chen, Xuefeng Bai, Yang Xiang, Muyun Yang, Tiejun Zhao, Min Zhang
- Abstract summary: We introduce a suitable benchmark (PoetMT) for translating classical Chinese poetry into English.
This task requires not only adequacy in translating culturally and historically significant content but also a strict adherence to linguistic fluency and poetic elegance.
We propose RAT, a Retrieval-Augmented machine Translation method that enhances the translation process by incorporating knowledge related to classical poetry.
- Score: 43.148203559785095
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have shown remarkable performance in translation tasks. However, there is increasing demand for high-quality translations that are not only adequate but also fluent and elegant. To evaluate the extent to which current LLMs can meet these demands, we introduce a suitable benchmark (PoetMT) for translating classical Chinese poetry into English. This task requires not only adequacy in translating culturally and historically significant content but also strict adherence to linguistic fluency and poetic elegance. To overcome the limitations of traditional evaluation metrics, we propose an automatic evaluation metric based on GPT-4, which better evaluates translation quality in terms of adequacy, fluency, and elegance. Our evaluation study reveals that existing large language models fall short in this task. To address these issues, we propose RAT, a Retrieval-Augmented machine Translation method that enhances the translation process by incorporating knowledge related to classical poetry. Our dataset and code will be made available.
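The abstract does not spell out RAT's pipeline, but the general retrieval-augmented pattern can be sketched as follows. This is a minimal illustration assuming a toy exact-match knowledge base and the OpenAI chat API; KNOWLEDGE_BASE, retrieve_notes, and translate_with_retrieval are hypothetical names, not the paper's code.
```python
# Minimal retrieval-augmented translation sketch (hypothetical; the paper's
# actual RAT pipeline may differ). Retrieves background notes on a poem line
# and injects them into the translation prompt.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Toy knowledge base mapping poem lines to cultural/historical notes.
KNOWLEDGE_BASE = {
    "床前明月光": "Opening line of Li Bai's 'Quiet Night Thought'; "
                 "the bright moonlight evokes homesickness.",
}

def retrieve_notes(source_line: str) -> str:
    """Naive exact-match retrieval; a real system would use embedding search."""
    return KNOWLEDGE_BASE.get(source_line, "No notes found.")

def translate_with_retrieval(source_line: str) -> str:
    prompt = (
        "Translate this line of classical Chinese poetry into English, "
        "aiming for adequacy, fluency, and elegance.\n"
        f"Background notes: {retrieve_notes(source_line)}\n"
        f"Line: {source_line}"
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(translate_with_retrieval("床前明月光"))
```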
Related papers
- The Paradox of Poetic Intent in Back-Translation: Evaluating the Quality of Large Language Models in Chinese Translation [2.685668802278156]
This study constructs a diverse corpus encompassing Chinese scientific terminology, historical translation paradoxes, and literary metaphors.
Using BLEU, CHRF, TER, and semantic similarity metrics, we evaluate six major large language models (LLMs) and three traditional translation tools.
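These surface metrics are straightforward to reproduce; below is a small sketch using the sacrebleu library with toy hypothesis and reference strings (the paper's exact metric configuration is not specified here).
```python
# Scoring a toy hypothesis against a reference with BLEU, chrF, and TER
# using sacrebleu; refs is a list of reference streams (one stream here).
from sacrebleu.metrics import BLEU, CHRF, TER

hyps = ["Before my bed the moonlight glows."]
refs = [["Before my bed, the moon shines bright."]]

for metric in (BLEU(), CHRF(), TER()):
    print(metric.corpus_score(hyps, refs))
```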
arXiv Detail & Related papers (2025-04-22T21:48:05Z) - Lost in Literalism: How Supervised Training Shapes Translationese in LLMs [51.04435855143767]
Large language models (LLMs) have achieved remarkable success in machine translation.
However, translationese, characterized by overly literal and unnatural translations, remains a persistent challenge.
We introduce methods to mitigate these biases, including polishing golden references and filtering unnatural training instances.
arXiv Detail & Related papers (2025-03-06T12:14:45Z) - A 2-step Framework for Automated Literary Translation Evaluation: Its Promises and Pitfalls [15.50296318831118]
We propose and evaluate the feasibility of a two-stage pipeline to evaluate literary machine translation.
Our framework provides fine-grained, interpretable metrics suited for literary translation.
arXiv Detail & Related papers (2024-12-02T10:07:01Z) - Language Models and Cycle Consistency for Self-Reflective Machine Translation [1.79487674052027]
We generate multiple translation candidates from a source language A to a target language B, and subsequently translate these candidates back to the original language A.
By evaluating the cycle consistency between the original and back-translated sentences using metrics such as token-level precision and accuracy, we implicitly estimate the translation quality in language B.
For each source sentence, we identify the translation candidate with optimal cycle consistency with the original sentence as the final answer.
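In outline, the selection rule can be sketched as follows; this is a hedged illustration in which translate stands for any MT system (assumed to use sampling so candidates differ), and token-level precision is one of the consistency metrics the summary mentions.
```python
# Sketch of cycle-consistency selection: translate A -> B several times,
# back-translate each candidate B -> A, and keep the candidate whose
# back-translation best matches the source. `translate` is a hypothetical
# MT function, not from the paper's code.
from collections import Counter

def token_precision(source: str, back: str) -> float:
    """Fraction of back-translated tokens that also occur in the source."""
    src, bt = Counter(source.split()), Counter(back.split())
    overlap = sum(min(count, src[tok]) for tok, count in bt.items())
    return overlap / max(sum(bt.values()), 1)

def best_by_cycle_consistency(source: str, translate, n_candidates: int = 5) -> str:
    candidates = [translate(source, src_lang="A", tgt_lang="B") for _ in range(n_candidates)]
    backs = [translate(c, src_lang="B", tgt_lang="A") for c in candidates]
    scores = [token_precision(source, b) for b in backs]
    return candidates[max(range(len(scores)), key=scores.__getitem__)]
```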
arXiv Detail & Related papers (2024-11-05T04:01:41Z) - LLM-based Translation Inference with Iterative Bilingual Understanding [52.46978502902928]
We propose a novel Iterative Bilingual Understanding Translation (IBUT) method based on the cross-lingual capabilities of large language models (LLMs).
The cross-lingual capability of LLMs enables the generation of contextual understanding for both the source and target languages separately.
The proposed IBUT outperforms several strong comparison methods.
arXiv Detail & Related papers (2024-10-16T13:21:46Z) - What is the Best Way for ChatGPT to Translate Poetry? [38.47691441569612]
This study examines ChatGPT's capabilities in English-Chinese poetry translation tasks, utilizing targeted prompts and small sample scenarios to ascertain optimal performance.
We propose an Explanation-Assisted Poetry Machine Translation (EAPMT) method, which leverages monolingual poetry explanation as guiding information for the translation process.
The results from both human and machine evaluations demonstrate that EAPMT outperforms both ChatGPT's conventional translation methods and existing online systems.
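The two-step idea, explain first and then translate with the explanation as guidance, can be sketched as a simple prompt pipeline (the prompts below are hypothetical, not the paper's).
```python
# Sketch of Explanation-Assisted Poetry Machine Translation: first obtain a
# monolingual explanation of the poem, then translate with that explanation
# as guidance. `chat` is a placeholder for any chat-style LLM call; the
# prompts are illustrative, not the paper's.
def eapmt(poem_zh: str, chat) -> str:
    explanation = chat(
        "Explain the meaning and imagery of this classical Chinese poem "
        f"in modern Chinese:\n{poem_zh}"
    )
    return chat(
        "Translate the classical Chinese poem below into English poetry, "
        f"using this explanation as guidance:\n{explanation}\n\nPoem:\n{poem_zh}"
    )
```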
arXiv Detail & Related papers (2024-06-05T16:48:26Z) - (Perhaps) Beyond Human Translation: Harnessing Multi-Agent Collaboration for Translating Ultra-Long Literary Texts [52.18246881218829]
We introduce a novel multi-agent framework based on large language models (LLMs) for literary translation, implemented as a company called TransAgents.
To evaluate the effectiveness of our system, we propose two innovative evaluation strategies: Monolingual Human Preference (MHP) and Bilingual LLM Preference (BLP).
arXiv Detail & Related papers (2024-05-20T05:55:08Z) - Is Context Helpful for Chat Translation Evaluation? [23.440392979857247]
We conduct a meta-evaluation of existing sentence-level automatic metrics to assess the quality of machine-translated chats.
We find that reference-free metrics lag behind reference-based ones, especially when evaluating translation quality in out-of-English settings.
We propose a new evaluation metric, Context-MQM, that utilizes bilingual context with a large language model.
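A rough illustration of the idea, prepending preceding bilingual chat turns to an MQM-style quality prompt, is sketched below (the actual Context-MQM prompt is not given here and likely differs).
```python
# Hypothetical sketch of an MQM-style evaluation prompt that includes the
# preceding bilingual chat turns; the paper's actual Context-MQM prompt is
# not reproduced here and likely differs.
def context_mqm_prompt(context_pairs, src: str, tgt: str) -> str:
    ctx = "\n".join(f"SRC: {s}\nTGT: {t}" for s, t in context_pairs)
    return (
        "You are evaluating a chat translation. Preceding turns:\n"
        f"{ctx}\n\n"
        f"Now assess this turn:\nSRC: {src}\nTGT: {tgt}\n"
        "List MQM errors (category, severity) and give a score from 0 to 100."
    )

print(context_mqm_prompt([("Bonjour !", "Hello!")], "Ça marche ?", "Does that work?"))
```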
arXiv Detail & Related papers (2024-03-13T07:49:50Z) - Large Language Models "Ad Referendum": How Good Are They at Machine Translation in the Legal Domain? [0.0]
This study evaluates the machine translation (MT) quality of two state-of-the-art large language models (LLMs) against a traditional neural machine translation (NMT) system across four language pairs in the legal domain.
It combines automatic evaluation metrics (AEMs) and human evaluation (HE) by professional translators to assess translation ranking, fluency and adequacy.
arXiv Detail & Related papers (2024-02-12T14:40:54Z) - Evaluating Optimal Reference Translations [4.956416618428049]
We propose a methodology for creating more reliable document-level human reference translations.
We evaluate the obtained document-level optimal reference translations in comparison with "standard" ones.
arXiv Detail & Related papers (2023-11-28T13:50:50Z) - Contextual Refinement of Translations: Large Language Models for Sentence and Document-Level Post-Editing [12.843274390224853]
Large Language Models (LLMs) have demonstrated considerable success in various Natural Language Processing tasks.
We show that they have yet to attain state-of-the-art performance in Neural Machine Translation.
We propose adapting LLMs as Automatic Post-Editors (APE) rather than direct translators.
arXiv Detail & Related papers (2023-10-23T12:22:15Z) - Translate to Disambiguate: Zero-shot Multilingual Word Sense Disambiguation with Pretrained Language Models [67.19567060894563]
Pretrained Language Models (PLMs) learn rich cross-lingual knowledge and can be finetuned to perform well on diverse tasks.
We present a new study investigating how well PLMs capture cross-lingual word sense with Contextual Word-Level Translation (C-WLT).
We find that as the model size increases, PLMs encode more cross-lingual word sense knowledge and better use context to improve WLT performance.
arXiv Detail & Related papers (2023-04-26T19:55:52Z) - Large language models effectively leverage document-level context for literary translation, but critical errors persist [32.54546652197316]
Large language models (LLMs) are competitive with the state of the art on a wide range of sentence-level translation datasets.
We show through a rigorous human evaluation that asking the GPT-3.5 (text-davinci-003) LLM to translate an entire literary paragraph at once results in higher-quality translations than sentence-by-sentence translation.
arXiv Detail & Related papers (2023-04-06T17:27:45Z) - Understanding Translationese in Cross-Lingual Summarization [106.69566000567598]
Cross-lingual summarization (CLS) aims at generating a concise summary in a different target language.
To collect large-scale CLS data, existing datasets typically involve translation in their creation.
In this paper, we first confirm that different approaches of constructing CLS datasets will lead to different degrees of translationese.
arXiv Detail & Related papers (2022-12-14T13:41:49Z) - Rethink about the Word-level Quality Estimation for Machine Translation from Human Judgement [57.72846454929923]
We create a benchmark dataset, HJQE, in which expert translators directly annotate poorly translated words.
We propose two tag-correcting strategies, namely a tag refinement strategy and a tree-based annotation strategy, to make the TER-based artificial QE corpus closer to HJQE.
The results show our proposed dataset is more consistent with human judgement and also confirm the effectiveness of the proposed tag correcting strategies.
arXiv Detail & Related papers (2022-09-13T02:37:12Z) - Exposing Cross-Lingual Lexical Knowledge from Multilingual Sentence Encoders [85.80950708769923]
We probe multilingual sentence encoders for the amount of cross-lingual lexical knowledge stored in their parameters and compare them against the original multilingual LMs.
We also devise a novel method to expose this knowledge by additionally fine-tuning multilingual models.
We report substantial gains on standard benchmarks.
arXiv Detail & Related papers (2022-04-30T13:23:16Z) - Does Transliteration Help Multilingual Language Modeling? [0.0]
We empirically measure the effect of transliteration on Multilingual Language Models.
We focus on the Indic languages, which have the highest script diversity in the world.
We find that transliteration benefits the low-resource languages without negatively affecting the comparatively high-resource languages.
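For illustration, romanizing Devanagari text with the indic-transliteration package is sketched below (one possible tool and scheme; the paper's exact setup is not stated here).
```python
# Romanizing Devanagari text before language-model training, using the
# indic-transliteration package (one possible tool; scheme choice varies).
from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate

devanagari = "नमस्ते दुनिया"
latin = transliterate(devanagari, sanscript.DEVANAGARI, sanscript.IAST)
print(latin)  # roughly "namaste duniyā"
```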
arXiv Detail & Related papers (2022-01-29T05:48:42Z) - Measuring Uncertainty in Translation Quality Evaluation (TQE) [62.997667081978825]
This work carries out motivated research to correctly estimate confidence intervals (Brown et al., 2001) depending on the sample size of the translated text.
The methodology we applied for this work is from Bernoulli Statistical Distribution Modelling (BSDM) and Monte Carlo Sampling Analysis (MCSA).
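As a concrete instance of how interval width depends on sample size, here is the Wilson score interval for a binomial proportion, one of the interval variants discussed by Brown et al.
```python
# Wilson score interval for a binomial proportion: at a fixed accuracy rate,
# the interval tightens as the sample size n grows, which is the dependence
# on translated-text sample size that the paper estimates.
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

print(wilson_interval(90, 100))    # ~(0.826, 0.945): wide at n=100
print(wilson_interval(900, 1000))  # ~(0.880, 0.917): narrow at n=1000
```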
arXiv Detail & Related papers (2021-11-15T12:09:08Z) - When Does Translation Require Context? A Data-driven, Multilingual Exploration [71.43817945875433]
Proper handling of discourse significantly contributes to the quality of machine translation (MT).
Recent works in context-aware MT attempt to target a small set of discourse phenomena during evaluation.
We develop the Multilingual Discourse-Aware benchmark, a series of taggers that identify and evaluate model performance on discourse phenomena.
arXiv Detail & Related papers (2021-09-15T17:29:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.