Domain-Specific Text Generation for Machine Translation
- URL: http://arxiv.org/abs/2208.05909v1
- Date: Thu, 11 Aug 2022 16:22:16 GMT
- Title: Domain-Specific Text Generation for Machine Translation
- Authors: Yasmin Moslem, Rejwanul Haque, John D. Kelleher, Andy Way
- Abstract summary: We propose a novel approach to domain adaptation leveraging state-of-the-art pretrained language models (LMs) for domain-specific data augmentation.
We employ mixed fine-tuning to train models that significantly improve translation of in-domain texts.
- Score: 7.803471587734353
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Preservation of domain knowledge from the source to target is crucial in any
translation workflow. It is common in the translation industry to receive
highly specialized projects, where there is hardly any parallel in-domain data.
In such scenarios where there is insufficient in-domain data to fine-tune
Machine Translation (MT) models, producing translations that are consistent
with the relevant context is challenging. In this work, we propose a novel
approach to domain adaptation leveraging state-of-the-art pretrained language
models (LMs) for domain-specific data augmentation for MT, simulating the
domain characteristics of either (a) a small bilingual dataset, or (b) the
monolingual source text to be translated. Combining this idea with
back-translation, we can generate huge amounts of synthetic bilingual in-domain
data for both use cases. For our investigation, we use the state-of-the-art
Transformer architecture. We employ mixed fine-tuning to train models that
significantly improve translation of in-domain texts. More specifically, in
both scenarios, our proposed methods achieve improvements of approximately 5-6
BLEU and 2-3 BLEU, respectively, on the Arabic-to-English and English-to-Arabic
language pairs. Furthermore, the outcome of human evaluation corroborates the
automatic evaluation results.
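The data-augmentation step at the heart of the abstract, prompting a pretrained LM with in-domain sentences to produce additional synthetic in-domain text that is then back-translated and used for mixed fine-tuning, can be illustrated with a short sketch. This is only a minimal illustration, assuming the Hugging Face transformers library and the generic gpt2 checkpoint as stand-ins; the paper's actual models, prompts, and filtering are not reproduced here.

```python
# Minimal sketch: prompt a pretrained LM with in-domain sentences to
# generate additional synthetic in-domain text (assumption: Hugging Face
# `transformers` with the generic "gpt2" checkpoint as a stand-in).
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# illustrative in-domain seed sentences (hypothetical medical domain)
in_domain_seeds = [
    "The patient was administered 5 mg of the compound twice daily.",
    "Adverse reactions were recorded during the follow-up period.",
]

synthetic_sentences = []
for seed in in_domain_seeds:
    outputs = generator(seed, max_new_tokens=40, num_return_sequences=3,
                        do_sample=True, top_p=0.95)
    for out in outputs:
        # keep only the newly generated continuation as a candidate sentence
        synthetic_sentences.append(out["generated_text"][len(seed):].strip())

# These synthetic target-side sentences would then be back-translated to
# obtain synthetic source sentences, and the resulting bilingual in-domain
# data mixed with generic data for mixed fine-tuning of the NMT model.
print(len(synthetic_sentences), "synthetic in-domain candidates")
```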
Related papers
- Large Language Model for Multi-Domain Translation: Benchmarking and Domain CoT Fine-tuning [55.107329995417786]
Large language models (LLMs) have demonstrated impressive general understanding and generation abilities.
We establish a benchmark for multi-domain translation, featuring 25 German$\Leftrightarrow$English and 22 Chinese$\Leftrightarrow$English test sets.
We propose a domain Chain of Thought (CoT) fine-tuning technique that utilizes the intrinsic multi-domain intelligence of LLMs to improve translation performance.
arXiv Detail & Related papers (2024-10-03T16:15:04Z) - Language Modelling Approaches to Adaptive Machine Translation [0.0]
Consistency is a key requirement of high-quality translation.
In-domain data scarcity is common in translation settings.
Can we employ language models to improve the quality of adaptive MT at inference time?
arXiv Detail & Related papers (2024-01-25T23:02:54Z) - Exploiting Language Relatedness in Machine Translation Through Domain
Adaptation Techniques [3.257358540764261]
We present a novel approach that uses a scaled similarity score for sentences, especially for related languages, based on a 5-gram KenLM language model (a minimal scoring sketch appears after this list).
Our approach yields gains of 2 BLEU points with the multi-domain approach, 3 BLEU points with fine-tuning for NMT, and 2 BLEU points with the iterative back-translation approach.
arXiv Detail & Related papers (2023-03-03T09:07:30Z) - Non-Parametric Domain Adaptation for End-to-End Speech Translation [72.37869362559212]
End-to-End Speech Translation (E2E-ST) has received increasing attention due to its potential for less error propagation, lower latency, and fewer parameters.
We propose a novel non-parametric method that leverages a domain-specific text translation corpus to achieve domain adaptation for the E2E-ST system.
arXiv Detail & Related papers (2022-05-23T11:41:02Z) - Non-Parametric Unsupervised Domain Adaptation for Neural Machine
Translation [61.27321597981737]
$k$NN-MT has shown promising capability in directly combining a pre-trained neural machine translation (NMT) model with domain-specific token-level $k$-nearest-neighbor retrieval.
We propose a novel framework that directly uses in-domain monolingual sentences in the target language to construct an effective datastore for $k$-nearest-neighbor retrieval (a simplified retrieval sketch appears after this list).
arXiv Detail & Related papers (2021-09-14T11:50:01Z) - FDMT: A Benchmark Dataset for Fine-grained Domain Adaptation in Machine
Translation [53.87731008029645]
We present a real-world fine-grained domain adaptation task in machine translation (FDMT).
The FDMT dataset consists of four sub-domains of information technology: autonomous vehicles, AI education, real-time networks, and smartphones.
We conduct quantitative experiments and in-depth analyses in this new setting, which benchmarks the fine-grained domain adaptation task.
arXiv Detail & Related papers (2020-12-31T17:15:09Z) - Domain Adaptation of NMT models for English-Hindi Machine Translation
Task at AdapMT ICON 2020 [2.572404739180802]
This paper describes the neural machine translation systems for the English-Hindi language presented in AdapMT Shared Task ICON 2020.
Our team was ranked first in the chemistry and general domain En-Hi translation task and second in the AI domain En-Hi translation task.
arXiv Detail & Related papers (2020-12-22T15:46:40Z) - Iterative Domain-Repaired Back-Translation [50.32925322697343]
In this paper, we focus on the domain-specific translation with low resources, where in-domain parallel corpora are scarce or nonexistent.
We propose a novel iterative domain-repaired back-translation framework, which introduces the Domain-Repair model to refine translations in synthetic bilingual data.
Experiments on adapting NMT models between specific domains and from the general domain to specific domains demonstrate the effectiveness of our proposed approach.
arXiv Detail & Related papers (2020-10-06T04:38:09Z) - A Simple Baseline to Semi-Supervised Domain Adaptation for Machine
Translation [73.3550140511458]
State-of-the-art neural machine translation (NMT) systems are data-hungry and perform poorly on new domains with no supervised data.
We propose a simple but effective approach to the semi-supervised domain adaptation scenario of NMT.
This approach iteratively trains a Transformer-based NMT model via three training objectives: language modeling, back-translation, and supervised translation.
arXiv Detail & Related papers (2020-01-22T16:42:06Z)
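As referenced above for "Exploiting Language Relatedness in Machine Translation Through Domain Adaptation Techniques", scoring sentences with a 5-gram KenLM model could look roughly like the following sketch. It assumes the `kenlm` Python bindings and a pre-built in-domain ARPA model ("in_domain.arpa" is a placeholder path); the exact scaled similarity score used in that paper is not reproduced, a simple length-normalized log-probability is shown instead.

```python
# Sketch: rank candidate sentences by how probable an in-domain 5-gram
# KenLM model finds them (assumption: the `kenlm` Python bindings;
# "in_domain.arpa" is a placeholder for a pre-built model).
import kenlm

lm = kenlm.Model("in_domain.arpa")

def similarity_score(sentence: str) -> float:
    # length-normalised log10 probability, so longer sentences are not
    # penalised simply for being long
    tokens = sentence.split()
    return lm.score(sentence, bos=True, eos=True) / max(len(tokens), 1)

candidates = [
    "this is a generic sentence from the out-of-domain pool",
    "the mixture was centrifuged for ten minutes at high speed",
]

# keep the sentences the in-domain LM finds most probable
ranked = sorted(candidates, key=similarity_score, reverse=True)
selected = ranked[: len(ranked) // 2 or 1]
print(selected)
```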
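Similarly, the token-level $k$-nearest-neighbor retrieval behind $k$NN-MT, referenced above for "Non-Parametric Unsupervised Domain Adaptation for Neural Machine Translation", can be sketched in a few lines. This is a toy NumPy illustration of the retrieval and interpolation steps only, assuming decoder hidden states and gold next tokens are already available; it is not the authors' implementation.

```python
# Toy sketch of kNN-MT-style retrieval: a datastore maps decoder hidden
# states (keys) to the gold next target tokens (values); at decoding time
# the retrieval distribution is interpolated with the NMT distribution.
import numpy as np

def knn_distribution(query, keys, values, vocab_size, k=8, temperature=10.0):
    # squared Euclidean distance from the query state to every datastore key
    dists = ((keys - query) ** 2).sum(axis=1)
    nearest = np.argsort(dists)[:k]
    weights = np.exp(-dists[nearest] / temperature)
    weights /= weights.sum()
    probs = np.zeros(vocab_size)
    for w, token in zip(weights, values[nearest]):
        probs[token] += w                      # aggregate weight per token
    return probs

def interpolate(p_nmt, p_knn, lam=0.5):
    # final next-token distribution mixes model and retrieval estimates
    return (1.0 - lam) * p_nmt + lam * p_knn

# tiny random example: 100 datastore entries, 64-dim states, 50-word vocab
rng = np.random.default_rng(0)
keys = rng.normal(size=(100, 64)).astype(np.float32)
values = rng.integers(0, 50, size=100)
query = rng.normal(size=64).astype(np.float32)
p_knn = knn_distribution(query, keys, values, vocab_size=50)
p_nmt = np.full(50, 1.0 / 50)                  # placeholder NMT distribution
print(interpolate(p_nmt, p_knn).argmax())
```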