Domain Terminology Integration into Machine Translation: Leveraging
Large Language Models
- URL: http://arxiv.org/abs/2310.14451v1
- Date: Sun, 22 Oct 2023 23:25:28 GMT
- Title: Domain Terminology Integration into Machine Translation: Leveraging
Large Language Models
- Authors: Yasmin Moslem, Gianfranco Romani, Mahdi Molaei, Rejwanul Haque, John
D. Kelleher, Andy Way
- Abstract summary: This paper discusses the methods that we used for our submissions to the WMT 2023 Terminology Shared Task for German-to-English (DE-EN), English-to-Czech (EN-CS), and Chinese-to-English (ZH-EN) language pairs.
The task aims to advance machine translation (MT) by challenging participants to develop systems that accurately translate technical terms.
- Score: 3.178046741931973
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper discusses the methods that we used for our submissions to the WMT
2023 Terminology Shared Task for German-to-English (DE-EN), English-to-Czech
(EN-CS), and Chinese-to-English (ZH-EN) language pairs. The task aims to
advance machine translation (MT) by challenging participants to develop systems
that accurately translate technical terms, ultimately enhancing communication
and understanding in specialised domains. To this end, we conduct experiments
that utilise large language models (LLMs) for two purposes: generating
synthetic bilingual terminology-based data, and post-editing translations
generated by an MT model through incorporating pre-approved terms. Our system
employs a four-step process: (i) using an LLM to generate bilingual synthetic
data based on the provided terminology, (ii) fine-tuning a generic
encoder-decoder MT model, with a mix of the terminology-based synthetic data
generated in the first step and a randomly sampled portion of the original
generic training data, (iii) generating translations with the fine-tuned MT
model, and (iv) finally, leveraging an LLM for terminology-constrained
automatic post-editing of the translations that do not include the required
terms. The results demonstrate the effectiveness of our proposed approach in
improving the integration of pre-approved terms into translations. The number
of terms incorporated into the translations of the blind dataset increases from
an average of 36.67% with the generic model to an average of 72.88% by the end
of the process. In other words, successful utilisation of terms nearly doubles
across the three language pairs.
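Concretely, steps (i), (iii), and (iv) lend themselves to a compact illustration. The following is a minimal, hypothetical sketch, assuming placeholder helpers `llm_complete()` (an LLM API call) and `translate()` (the fine-tuned MT model); the prompts and the case-insensitive surface-form term check are illustrative assumptions, not the authors' exact setup. Note that 72.88 / 36.67 ≈ 1.99, so the reported figures are consistent with the claim that term usage nearly doubles.

```python
# Illustrative sketch of the four-step pipeline (assumptions noted inline).

def llm_complete(prompt: str) -> str:
    """Placeholder for an LLM API call (assumed helper, not from the paper)."""
    raise NotImplementedError

def translate(source: str) -> str:
    """Placeholder for the fine-tuned encoder-decoder MT model (steps ii-iii)."""
    raise NotImplementedError

# Step (i): generate a synthetic bilingual sentence pair from a glossary entry.
def make_synthetic_pair(src_term, tgt_term, src_lang="German", tgt_lang="English"):
    prompt = (
        f"Write a {src_lang} sentence containing the term '{src_term}', then its "
        f"{tgt_lang} translation using the term '{tgt_term}'.\n"
        "Format:\nSRC: <sentence>\nTGT: <sentence>"
    )
    return llm_complete(prompt)

# Step (iv): post-edit only those translations that miss required terms.
def missing_terms(translation, required):
    # Naive case-insensitive surface match; real matching may need lemmatization.
    return {s: t for s, t in required.items() if t.lower() not in translation.lower()}

def terminology_postedit(source, translation, required):
    absent = missing_terms(translation, required)
    if not absent:
        return translation  # all pre-approved terms already present
    constraints = "; ".join(f"'{s}' -> '{t}'" for s, t in absent.items())
    prompt = (
        "Post-edit the translation so that it uses these approved term "
        f"translations: {constraints}\nSource: {source}\n"
        f"Translation: {translation}\nPost-edited translation:"
    )
    return llm_complete(prompt)

# Term-coverage metric in the spirit of the reported 36.67% -> 72.88% figures.
def term_coverage(pairs):
    """pairs: iterable of (translation, {src_term: tgt_term}) tuples."""
    hits = total = 0
    for translation, required in pairs:
        total += len(required)
        hits += len(required) - len(missing_terms(translation, required))
    return 100.0 * hits / max(total, 1)
```

Step (ii), the mixed fine-tuning on synthetic plus sampled generic data, would sit between the first and second helpers; a data-mixing sketch appears under the Domain-Specific Text Generation entry below.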
Related papers
- Efficient Terminology Integration for LLM-based Translation in Specialized Domains [0.0]
In specialized fields such as the patent, finance, and biomedical domains, terminology is crucial for accurate translation.
We introduce a methodology that efficiently trains models with a smaller amount of data while preserving the accuracy of terminology translation.
This methodology enhances the model's ability to handle specialized terminology and ensures high-quality translations.
arXiv Detail & Related papers (2024-10-21T07:01:25Z)
- Efficient Technical Term Translation: A Knowledge Distillation Approach for Parenthetical Terminology Translation [0.0]
This paper addresses the challenge of accurately translating technical terms, which are crucial for clear communication in specialized fields.
We introduce the Parenthetical Terminology Translation (PTT) task, designed to mitigate potential inaccuracies by displaying the original term in parentheses alongside its translation.
We developed a novel evaluation metric to assess both overall translation accuracy and the correct parenthetical presentation of terms.
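As a rough illustration (the paper's actual metric is more elaborate and also scores overall translation quality), the parenthetical pattern itself can be checked with a simple regular expression; the function below is an assumption, not the authors' metric.

```python
import re

def has_parenthetical(translation: str, src_term: str, tgt_term: str) -> bool:
    """True if the translation renders the term as 'tgt_term (src_term)'.

    Illustrative check only; the PTT paper defines a fuller evaluation metric.
    """
    pattern = rf"{re.escape(tgt_term)}\s*\(\s*{re.escape(src_term)}\s*\)"
    return re.search(pattern, translation, flags=re.IGNORECASE) is not None

# e.g. has_parenthetical("... Wissensdestillation (knowledge distillation) ...",
#                        "knowledge distillation", "Wissensdestillation")  # True
```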
arXiv Detail & Related papers (2024-10-01T13:40:28Z)
- Towards Zero-Shot Multimodal Machine Translation [64.9141931372384]
We propose a method to bypass the need for fully supervised data to train multimodal machine translation systems.
Our method, called ZeroMMT, consists in adapting a strong text-only machine translation (MT) model by training it on a mixture of two objectives.
To prove that our method generalizes to languages with no fully supervised training data available, we extend the CoMMuTE evaluation dataset to three new languages: Arabic, Russian and Chinese.
arXiv Detail & Related papers (2024-07-18T15:20:31Z)
- TasTe: Teaching Large Language Models to Translate through Self-Reflection [82.83958470745381]
Large language models (LLMs) have exhibited remarkable performance in various natural language processing tasks.
We propose the TasTe framework, which stands for translating through self-reflection.
The evaluation results in four language directions on the WMT22 benchmark reveal the effectiveness of our approach compared to existing methods.
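A minimal sketch of a draft-critique-refine loop in the spirit of self-reflection, assuming a generic `llm(prompt)` callable; the actual TasTe prompts and stages are defined in the paper, not here.

```python
def self_reflective_translate(llm, source: str, src_lang: str, tgt_lang: str) -> str:
    # Draft: produce an initial translation.
    draft = llm(f"Translate the following {src_lang} text into {tgt_lang}:\n{source}")
    # Reflect: ask the model to critique its own draft.
    critique = llm(
        f"Source: {source}\nDraft translation: {draft}\n"
        "List any mistranslations, omissions, or fluency problems."
    )
    # Refine: revise the draft in light of the critique.
    return llm(
        f"Source: {source}\nDraft: {draft}\nIssues found: {critique}\n"
        "Write an improved final translation:"
    )
```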
arXiv Detail & Related papers (2024-06-12T17:21:21Z)
- Towards Effective Disambiguation for Machine Translation with Large Language Models [65.80775710657672]
We study the capabilities of large language models to translate "ambiguous sentences".
Experiments show that our methods can match or outperform state-of-the-art systems such as DeepL and NLLB in four out of five language directions.
arXiv Detail & Related papers (2023-09-20T22:22:52Z)
- Unified Model Learning for Various Neural Machine Translation [63.320005222549646]
Existing neural machine translation (NMT) studies mainly focus on developing dataset-specific models.
We propose a versatile model, i.e., Unified Model Learning for NMT (UMLNMT), that works with data from different tasks.
UMLNMT yields substantial improvements over dataset-specific models with significantly reduced model deployment costs.
arXiv Detail & Related papers (2023-05-04T12:21:52Z)
- Neural Machine Translation with Contrastive Translation Memories [71.86990102704311]
Retrieval-augmented Neural Machine Translation models have been successful in many translation scenarios.
We propose a new retrieval-augmented NMT to model contrastively retrieved translation memories that are holistically similar to the source sentence.
In the training phase, a Multi-TM contrastive learning objective is introduced to learn the salient features of each TM with respect to the target sentence.
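The abstract does not give the loss formula; one common way to realize such a Multi-TM contrastive objective is an InfoNCE-style loss that pulls the target-sentence representation toward the positive TM and away from the others, as in this hedged PyTorch sketch (a stand-in, not the paper's objective).

```python
import torch
import torch.nn.functional as F

def multi_tm_contrastive_loss(target_vec: torch.Tensor,
                              tm_vecs: torch.Tensor,
                              positive_idx: int,
                              temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE-style loss over retrieved translation memories (TMs).

    target_vec:   (d,) encoding of the target sentence.
    tm_vecs:      (k, d) encodings of k retrieved TMs.
    positive_idx: index of the TM treated as the positive example.
    """
    sims = F.cosine_similarity(target_vec.unsqueeze(0), tm_vecs) / temperature  # (k,)
    labels = torch.tensor([positive_idx], device=tm_vecs.device)
    return F.cross_entropy(sims.unsqueeze(0), labels)
```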
arXiv Detail & Related papers (2022-12-06T17:10:17Z)
- Domain-Specific Text Generation for Machine Translation [7.803471587734353]
We propose a novel approach to domain adaptation leveraging state-of-the-art pretrained language models (LMs) for domain-specific data augmentation.
We employ mixed fine-tuning to train models that significantly improve translation of in-domain texts.
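Mixed fine-tuning here means training on in-domain synthetic pairs blended with generic parallel data; below is a minimal sketch of building such a corpus, where the sampling ratio is an assumed hyperparameter rather than a value from the paper.

```python
import random

def build_mixed_corpus(synthetic_pairs, generic_pairs, generic_ratio=1.0, seed=0):
    """Blend synthetic in-domain pairs with a random sample of generic data.

    generic_ratio: size of the generic sample relative to the synthetic set
    (an assumed knob; such mixtures are typically tuned empirically).
    """
    rng = random.Random(seed)
    n_generic = min(len(generic_pairs), int(len(synthetic_pairs) * generic_ratio))
    mixed = list(synthetic_pairs) + rng.sample(generic_pairs, n_generic)
    rng.shuffle(mixed)
    return mixed
```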
arXiv Detail & Related papers (2022-08-11T16:22:16Z)
- Lingua Custodia's participation at the WMT 2021 Machine Translation using Terminologies shared task [3.3108924994485096]
We consider three directions, namely English to French, Russian, and Chinese.
We introduce two main changes to the standard procedure to handle terminologies.
Our method satisfies most terminology constraints while maintaining high translation quality.
arXiv Detail & Related papers (2021-11-03T10:36:32Z)
- Multilingual Machine Translation Systems from Microsoft for WMT21 Shared Task [95.06453182273027]
This report describes Microsoft's machine translation systems for the WMT21 shared task on large-scale multilingual machine translation.
Our model submissions to the shared task were based on DeltaLM (https://aka.ms/deltalm), a generic pre-trained multilingual encoder-decoder model.
Our final submissions ranked first on three tracks in terms of the automatic evaluation metric.
arXiv Detail & Related papers (2021-11-03T09:16:17Z)
- CUNI systems for WMT21: Terminology translation Shared Task [0.0]
The objective of this task is to design a system which translates certain terms based on a provided terminology database.
Our approach is based on providing the desired translations alongside the input sentence and training the model to use these provided terms.
We lemmatize the terms both during the training and inference, to allow the model to learn how to produce correct surface forms of the words.
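A rough sketch of the input-annotation idea: append the desired (lemmatized) term translations to the source sentence so the model learns to copy and correctly inflect them. The tag format and the toy lemmatizer below are assumptions, not the CUNI system's actual scheme.

```python
def lemmatize(word: str) -> str:
    # Placeholder: a real system would use a morphological analyzer
    # (e.g. for Czech); lowercasing is just a stand-in here.
    return word.lower()

def annotate_source(source: str, term_pairs: list[tuple[str, str]]) -> str:
    """Append lemmatized term hints to the source, e.g.
    'src sentence <term> srclemma = tgtlemma </term>' (format assumed)."""
    hints = " ".join(
        f"<term> {lemmatize(s)} = {lemmatize(t)} </term>" for s, t in term_pairs
    )
    return f"{source} {hints}" if hints else source
```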
arXiv Detail & Related papers (2021-09-20T08:05:39Z)