Domain Adaptation of Machine Translation with Crowdworkers
- URL: http://arxiv.org/abs/2210.15861v1
- Date: Fri, 28 Oct 2022 03:11:17 GMT
- Title: Domain Adaptation of Machine Translation with Crowdworkers
- Authors: Makoto Morishita, Jun Suzuki, Masaaki Nagata
- Abstract summary: We propose a framework that efficiently collects parallel sentences in a target domain from the web with the help of crowdworkers.
With the collected parallel data, we can quickly adapt a machine translation model to the target domain.
Our experiments show that the proposed method can collect target-domain parallel data over a few days at a reasonable cost.
- Score: 34.29644521425858
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although a machine translation model trained with a large in-domain parallel
corpus achieves remarkable results, it still works poorly when no in-domain
data are available. This situation restricts the applicability of machine
translation when the target domain's data are limited. However, there is great
demand for high-quality domain-specific machine translation models for many
domains. We propose a framework that efficiently and effectively collects
parallel sentences in a target domain from the web with the help of
crowdworkers. With the collected parallel data, we can quickly adapt a machine
translation model to the target domain. Our experiments show that the proposed
method can collect target-domain parallel data over a few days at a reasonable
cost. We tested it with five domains, and the domain-adapted model improved
BLEU scores by an average of +7.8 points, and by up to +19.7 points, compared
to a general-purpose translation model.
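The paper describes a two-step pipeline: crowdworkers collect in-domain parallel sentences from the web, and the translation model is then fine-tuned on them. As a hedged illustration of that second step only, the sketch below fine-tunes a general-purpose Hugging Face seq2seq translation model on collected sentence pairs; the model name, file format, and hyperparameters are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch (not the authors' code): adapt a general-purpose MT model to a
# target domain by fine-tuning on crowd-collected parallel sentences.
# Assumptions: a Hugging Face seq2seq model and a "source<TAB>target" text file.
from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_name = "Helsinki-NLP/opus-mt-ja-en"  # any general-purpose translation model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Crowd-collected target-domain parallel data (hypothetical file name).
pairs = [line.rstrip("\n").split("\t")
         for line in open("collected_parallel.tsv", encoding="utf-8")]
data = Dataset.from_dict({"src": [s for s, t in pairs],
                          "tgt": [t for s, t in pairs]})

def preprocess(batch):
    # Tokenize source sentences and target references for seq2seq training.
    return tokenizer(batch["src"], text_target=batch["tgt"],
                     truncation=True, max_length=128)

data = data.map(preprocess, batched=True, remove_columns=["src", "tgt"])

args = Seq2SeqTrainingArguments(
    output_dir="domain-adapted-mt",
    per_device_train_batch_size=16,
    learning_rate=5e-5,
    num_train_epochs=3,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=data,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
trainer.save_model("domain-adapted-mt")
```

Because the collected in-domain corpus may contain only a few thousand pairs, fine-tuning the general-domain model with a small learning rate for a few epochs is usually sufficient; the adapted model is what the reported BLEU gains are measured against.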
Related papers
- Large Language Model for Multi-Domain Translation: Benchmarking and Domain CoT Fine-tuning [55.107329995417786]
Large language models (LLMs) have demonstrated impressive general understanding and generation abilities.
We establish a benchmark for multi-domain translation, featuring 25 German$\Leftrightarrow$English and 22 Chinese$\Leftrightarrow$English test sets.
We propose a domain Chain of Thought (CoT) fine-tuning technique that utilizes the intrinsic multi-domain intelligence of LLMs to improve translation performance.
arXiv Detail & Related papers (2024-10-03T16:15:04Z)
- Domain-Specific Text Generation for Machine Translation [7.803471587734353]
We propose a novel approach to domain adaptation leveraging state-of-the-art pretrained language models (LMs) for domain-specific data augmentation.
We employ mixed fine-tuning to train models that significantly improve translation of in-domain texts.
arXiv Detail & Related papers (2022-08-11T16:22:16Z)
- Efficient Hierarchical Domain Adaptation for Pretrained Language Models [77.02962815423658]
Generative language models are trained on diverse, general domain corpora.
We introduce a method to scale domain adaptation to many diverse domains using a computationally efficient adapter approach.
arXiv Detail & Related papers (2021-12-16T11:09:29Z)
- Selecting Parallel In-domain Sentences for Neural Machine Translation Using Monolingual Texts [0.0]
We propose a method for selecting in-domain data from generic-domain (parallel text) corpora for the task of machine translation.
The proposed method ranks sentences in parallel general-domain data according to their cosine similarity with a monolingual domain-specific data set.
We then select the top-K sentences with the highest similarity scores to train a new machine translation system tuned to the specific in-domain data (a rough sketch of this selection step appears after this list).
arXiv Detail & Related papers (2021-12-11T23:29:26Z)
- Non-Parametric Unsupervised Domain Adaptation for Neural Machine Translation [61.27321597981737]
$k$NN-MT has shown promising results by augmenting a pre-trained neural machine translation (NMT) model with domain-specific token-level $k$-nearest-neighbor retrieval.
We propose a novel framework that directly uses in-domain monolingual sentences in the target language to construct an effective datastore for $k$-nearest-neighbor retrieval.
arXiv Detail & Related papers (2021-09-14T11:50:01Z)
- Rapid Domain Adaptation for Machine Translation with Monolingual Data [31.70276147485463]
One challenge of machine translation is how to quickly adapt to unseen domains in the face of emerging events such as COVID-19.
In this paper, we propose an approach that enables rapid domain adaptation from the perspective of unsupervised translation.
arXiv Detail & Related papers (2020-10-23T20:31:37Z)
- Iterative Domain-Repaired Back-Translation [50.32925322697343]
In this paper, we focus on the domain-specific translation with low resources, where in-domain parallel corpora are scarce or nonexistent.
We propose a novel iterative domain-repaired back-translation framework, which introduces the Domain-Repair model to refine translations in synthetic bilingual data.
Experiments on adapting NMT models between specific domains and from the general domain to specific domains demonstrate the effectiveness of our proposed approach.
arXiv Detail & Related papers (2020-10-06T04:38:09Z)
- Addressing Zero-Resource Domains Using Document-Level Context in Neural Machine Translation [80.40677540516616]
We show that when in-domain parallel data is not available, access to document-level context enables better capturing of domain generalities.
We present two document-level Transformer models which are capable of using large context sizes.
arXiv Detail & Related papers (2020-04-30T16:28:19Z)
- A Simple Baseline to Semi-Supervised Domain Adaptation for Machine Translation [73.3550140511458]
State-of-the-art neural machine translation (NMT) systems are data-hungry and perform poorly on new domains with no supervised data.
We propose a simple but effective approach to the semi-supervised domain adaptation scenario for NMT.
This approach iteratively trains a Transformer-based NMT model via three training objectives: language modeling, back-translation, and supervised translation.
arXiv Detail & Related papers (2020-01-22T16:42:06Z)
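As referenced in the "Selecting Parallel In-domain Sentences" entry above, here is a rough sketch of similarity-based data selection: generic-domain parallel pairs are ranked by the cosine similarity of their source side to a monolingual in-domain corpus, and the top-K pairs are kept for fine-tuning. The encoder model, the use of a mean in-domain embedding, and all file names are assumptions for illustration, not details from that paper.

```python
# Rough sketch of similarity-based in-domain data selection.
# Assumptions: a sentence-transformers encoder, the mean in-domain embedding as
# the domain representation, and tab-separated / plain-text input files.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# Monolingual in-domain text and generic-domain parallel data (hypothetical files).
domain_sents = [l.strip() for l in open("in_domain_monolingual.txt", encoding="utf-8")]
generic_pairs = [l.rstrip("\n").split("\t")
                 for l in open("generic_parallel.tsv", encoding="utf-8")]

# Represent the domain by the mean of its normalized sentence embeddings.
domain_emb = encoder.encode(domain_sents, normalize_embeddings=True)
domain_centroid = domain_emb.mean(axis=0)
domain_centroid /= np.linalg.norm(domain_centroid)

# Score each generic source sentence by cosine similarity to the domain centroid.
src_emb = encoder.encode([s for s, t in generic_pairs], normalize_embeddings=True)
scores = src_emb @ domain_centroid

# Keep the top-K most in-domain-like pairs for fine-tuning.
K = 100_000
top_idx = np.argsort(-scores)[:K]
selected = [generic_pairs[i] for i in top_idx]
```

The selected pairs can then be used to fine-tune a general-purpose model in the same way as the crowd-collected data in the sketch after the abstract above.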
This list is automatically generated from the titles and abstracts of the papers listed on this site.