Selecting Parallel In-domain Sentences for Neural Machine Translation Using Monolingual Texts
- URL: http://arxiv.org/abs/2112.06096v1
- Date: Sat, 11 Dec 2021 23:29:26 GMT
- Title: Selecting Parallel In-domain Sentences for Neural Machine Translation Using Monolingual Texts
- Authors: Javad Pourmostafa Roshan Sharami, Dimitar Shterionov, Pieter Spronck
- Abstract summary: We propose a method for selecting in-domain data from generic-domain (parallel text) corpora for the task of machine translation.
The proposed method ranks sentences in parallel general-domain data according to their cosine similarity with a monolingual domain-specific data set.
We then select the top K sentences with the highest similarity score to train a new machine translation system tuned to the specific in-domain data.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Continuously-growing data volumes lead to larger generic models. Specific
use-cases are usually left out, since generic models tend to perform poorly in
domain-specific cases. Our work addresses this gap with a method for selecting
in-domain data from generic-domain (parallel text) corpora, for the task of
machine translation. The proposed method ranks sentences in parallel
general-domain data according to their cosine similarity with a monolingual
domain-specific data set. We then select the top K sentences with the highest
similarity score to train a new machine translation system tuned to the
specific in-domain data. Our experimental results show that models trained on
this in-domain data outperform models trained on generic data or on a mixture of
generic and in-domain data. That is, our method selects high-quality
domain-specific training instances at low computational cost and with a small amount of training data.
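The core selection step can be illustrated with a short sketch. The snippet below is a minimal, assumption-laden illustration: it uses the sentence-transformers library with an off-the-shelf encoder ("all-MiniLM-L6-v2") and represents the in-domain monolingual corpus by its mean embedding; the paper's actual embedding model and scoring details may differ.

```python
# A minimal sketch of the selection step, under the assumptions named in the
# lead-in: sentence-transformers as the encoder and a mean-embedding centroid
# for the in-domain corpus. Not the paper's exact implementation.
import numpy as np
from sentence_transformers import SentenceTransformer


def select_in_domain(generic_src, generic_tgt, in_domain_mono, k,
                     model_name="all-MiniLM-L6-v2"):
    """Rank generic-domain sentence pairs by cosine similarity between their
    source side and a monolingual in-domain corpus, then keep the top K pairs."""
    model = SentenceTransformer(model_name)
    # Embed the generic source sentences and the in-domain monolingual sentences.
    gen_emb = model.encode(generic_src, normalize_embeddings=True)
    dom_emb = model.encode(in_domain_mono, normalize_embeddings=True)
    # Represent the in-domain corpus by its (re-normalized) mean embedding.
    centroid = dom_emb.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    # With unit-length vectors, cosine similarity is a plain dot product.
    scores = gen_emb @ centroid
    top_k = np.argsort(-scores)[:k]
    return [(generic_src[i], generic_tgt[i], float(scores[i])) for i in top_k]
```

The selected pairs would then be used to train or fine-tune an NMT system on the in-domain subset, as the abstract describes.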
Related papers
- Regex-augmented Domain Transfer Topic Classification based on a Pre-trained Language Model: An application in Financial Domain [42.5087655999509]
We discuss the use of regular expression patterns as features for injecting domain knowledge during fine-tuning.
Our experiments on real-world production data show that this fine-tuning method improves downstream text classification tasks.
arXiv Detail & Related papers (2023-05-23T03:26:32Z)
- Domain Adaptation of Machine Translation with Crowdworkers [34.29644521425858]
We propose a framework that efficiently collects parallel sentences in a target domain from the web with the help of crowdworkers.
With the collected parallel data, we can quickly adapt a machine translation model to the target domain.
Our experiments show that the proposed method can collect target-domain parallel data over a few days at a reasonable cost.
arXiv Detail & Related papers (2022-10-28T03:11:17Z)
- Domain-Specific Text Generation for Machine Translation [7.803471587734353]
We propose a novel approach to domain adaptation leveraging state-of-the-art pretrained language models (LMs) for domain-specific data augmentation.
We employ mixed fine-tuning to train models that significantly improve translation of in-domain texts.
arXiv Detail & Related papers (2022-08-11T16:22:16Z)
- Efficient Hierarchical Domain Adaptation for Pretrained Language Models [77.02962815423658]
Generative language models are trained on diverse, general domain corpora.
We introduce a method to scale domain adaptation to many diverse domains using a computationally efficient adapter approach.
arXiv Detail & Related papers (2021-12-16T11:09:29Z)
- Non-Parametric Unsupervised Domain Adaptation for Neural Machine Translation [61.27321597981737]
$k$NN-MT has shown promising results by directly augmenting a pre-trained neural machine translation (NMT) model with domain-specific token-level $k$-nearest-neighbor retrieval.
We propose a novel framework that directly uses in-domain monolingual sentences in the target language to construct an effective datastore for $k$-nearest-neighbor retrieval.
arXiv Detail & Related papers (2021-09-14T11:50:01Z)
- Machine Translation Customization via Automatic Training Data Selection from the Web [97.98885151955467]
We describe an approach for customizing machine translation systems on specific domains.
We select data similar to the target customer data to train neural translation models.
Finally, we train MT models on our automatically selected data, obtaining a system specialized to the target domain.
arXiv Detail & Related papers (2021-02-20T03:29:41Z)
- Batch Normalization Embeddings for Deep Domain Generalization [50.51405390150066]
Domain generalization aims at training machine learning models to perform robustly across different and unseen domains.
We show a significant increase in classification accuracy over current state-of-the-art techniques on popular domain generalization benchmarks.
arXiv Detail & Related papers (2020-11-25T12:02:57Z)
- Iterative Domain-Repaired Back-Translation [50.32925322697343]
In this paper, we focus on the domain-specific translation with low resources, where in-domain parallel corpora are scarce or nonexistent.
We propose a novel iterative domain-repaired back-translation framework, which introduces the Domain-Repair model to refine translations in synthetic bilingual data.
Experiments on adapting NMT models between specific domains and from the general domain to specific domains demonstrate the effectiveness of our proposed approach.
arXiv Detail & Related papers (2020-10-06T04:38:09Z)
- Unsupervised Domain Clusters in Pretrained Language Models [61.832234606157286]
We show that massive pre-trained language models implicitly learn sentence representations that cluster by domains without supervision.
We propose domain data selection methods based on such models.
We evaluate our data selection methods for neural machine translation across five diverse domains.
arXiv Detail & Related papers (2020-04-05T06:22:16Z)
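As a rough companion to the "Unsupervised Domain Clusters in Pretrained Language Models" entry above, the following hedged sketch shows one way domain-cluster-based data selection could look. The encoder, the number of clusters, and the use of a Gaussian mixture model are illustrative assumptions, not the cited paper's exact configuration.

```python
# Hedged sketch only: encoder, cluster count, and the GMM choice are
# illustrative assumptions rather than the cited paper's exact setup.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.mixture import GaussianMixture


def cluster_based_selection(candidate_sents, in_domain_seed_sents, n_clusters=5):
    """Cluster candidate sentences in embedding space and keep those that fall
    into the cluster most populated by a small in-domain seed set."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder
    cand_emb = model.encode(candidate_sents, normalize_embeddings=True)
    seed_emb = model.encode(in_domain_seed_sents, normalize_embeddings=True)
    # Fit unsupervised clusters on the candidate pool.
    gmm = GaussianMixture(n_components=n_clusters, random_state=0).fit(cand_emb)
    # Find the cluster the in-domain seed sentences most often map to.
    target_cluster = np.bincount(gmm.predict(seed_emb)).argmax()
    # Keep candidates assigned to that cluster.
    keep = gmm.predict(cand_emb) == target_cluster
    return [s for s, flag in zip(candidate_sents, keep) if flag]
```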
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.