Translation Transformers Rediscover Inherent Data Domains
- URL: http://arxiv.org/abs/2109.07864v1
- Date: Thu, 16 Sep 2021 10:58:13 GMT
- Title: Translation Transformers Rediscover Inherent Data Domains
- Authors: Maksym Del, Elizaveta Korotkova, Mark Fishel
- Abstract summary: We analyze the sentence representations learned by NMT Transformers and show that these explicitly include the information on text domains.
We show that this internal information is enough to cluster sentences by their underlying domains without supervision.
We show that NMT models produce clusters better aligned to the actual domains compared to pre-trained language models (LMs).
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Many works proposed methods to improve the performance of Neural Machine
Translation (NMT) models in a domain/multi-domain adaptation scenario. However,
an understanding of how NMT baselines represent text domain information
internally is still lacking. Here we analyze the sentence representations
learned by NMT Transformers and show that these explicitly include the
information on text domains, even after only seeing the input sentences without
domain labels. Furthermore, we show that this internal information is enough
to cluster sentences by their underlying domains without supervision. We show
that NMT models produce clusters better aligned to the actual domains compared
to pre-trained language models (LMs). Notably, when computed at the document level,
NMT cluster-to-domain correspondence nears 100%. We use these findings together
with an approach to NMT domain adaptation using automatically extracted
domains. Whereas previous work relied on external LMs for text clustering, we
propose re-using the NMT model as a source of unsupervised clusters. We perform
an extensive experimental study comparing two approaches across two data
scenarios, three language pairs, and both sentence-level and document-level
clustering, showing equal or significantly superior performance compared to
LMs.
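The core finding lends itself to a simple probe: embed sentences with the NMT encoder, cluster the embeddings, and check how well the clusters match the known domains. Below is a minimal sketch of such a probe, assuming a Hugging Face MarianMT checkpoint (`Helsinki-NLP/opus-mt-de-en`), mean pooling, and k-means; the paper does not prescribe this exact model, pooling, or clustering setup, so treat all of these choices as illustrative assumptions rather than the authors' protocol.

```python
# Illustrative probe: cluster NMT encoder sentence representations and see
# whether the clusters line up with known text domains. The model choice,
# mean pooling, and k-means are assumptions of this sketch, not the paper's recipe.
import torch
from transformers import MarianMTModel, MarianTokenizer
from sklearn.cluster import KMeans

# Toy source-side (German) sentences from three hypothetical domains.
sentences = [
    "Dem Patienten wurden zweimal täglich 50 mg des Medikaments verabreicht.",  # medical
    "Der Ausschuss nahm die Entschließung ohne Änderungen an.",                 # legal
    "Die Heimmannschaft gewann das Spiel mit 3:0.",                             # sports
]
true_domains = ["med", "law", "sport"]  # known only for evaluation

name = "Helsinki-NLP/opus-mt-de-en"  # any translation checkpoint works in principle
tokenizer = MarianTokenizer.from_pretrained(name)
model = MarianMTModel.from_pretrained(name).eval()

with torch.no_grad():
    batch = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
    enc = model.get_encoder()(**batch)                          # encoder hidden states
    mask = batch["attention_mask"].unsqueeze(-1)                # ignore padding positions
    emb = (enc.last_hidden_state * mask).sum(1) / mask.sum(1)   # mean-pool per sentence

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(emb.numpy())
print(list(zip(true_domains, labels)))  # inspect cluster-to-domain correspondence
```

With enough sentences per domain, cluster purity against the gold domain labels quantifies the correspondence; averaging sentence embeddings per document before clustering mirrors the document-level setting in which the paper reports near-100% correspondence, and the resulting cluster IDs could then serve as the automatically extracted domains used for adaptation.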
Related papers
- Large Language Model for Multi-Domain Translation: Benchmarking and Domain CoT Fine-tuning [55.107329995417786]
Large language models (LLMs) have demonstrated impressive general understanding and generation abilities.
We establish a benchmark for multi-domain translation, featuring 25 German$\Leftrightarrow$English and 22 Chinese$\Leftrightarrow$English test sets.
We propose a domain Chain of Thought (CoT) fine-tuning technique that utilizes the intrinsic multi-domain intelligence of LLMs to improve translation performance.
arXiv Detail & Related papers (2024-10-03T16:15:04Z)
- Bridging the Domain Gaps in Context Representations for k-Nearest Neighbor Neural Machine Translation [57.49095610777317]
$k$-Nearest neighbor machine translation ($k$NN-MT) has attracted increasing attention due to its ability to non-parametrically adapt to new translation domains.
We propose a novel approach to boost the datastore retrieval of $k$NN-MT by reconstructing the original datastore.
Our method can effectively boost the datastore retrieval and translation quality of $k$NN-MT.
arXiv Detail & Related papers (2023-05-26T03:04:42Z)
- Exploiting Language Relatedness in Machine Translation Through Domain Adaptation Techniques [3.257358540764261]
We present a novel approach of using a scaled similarity score of sentences, especially for related languages based on a 5-gram KenLM language model.
Our approach yields gains of 2 BLEU points with the multi-domain approach, 3 BLEU points with fine-tuning for NMT, and 2 BLEU points with the iterative back-translation approach.
arXiv Detail & Related papers (2023-03-03T09:07:30Z)
- Domain-Specific Text Generation for Machine Translation [7.803471587734353]
We propose a novel approach to domain adaptation leveraging state-of-the-art pretrained language models (LMs) for domain-specific data augmentation.
We employ mixed fine-tuning to train models that significantly improve translation of in-domain texts.
arXiv Detail & Related papers (2022-08-11T16:22:16Z)
- Non-Parametric Unsupervised Domain Adaptation for Neural Machine Translation [61.27321597981737]
$k$NN-MT has shown promise by directly augmenting a pre-trained neural machine translation (NMT) model with domain-specific token-level $k$-nearest-neighbor retrieval (a generic sketch of this mechanism follows the related-papers list).
We propose a novel framework that directly uses in-domain monolingual sentences in the target language to construct an effective datastore for $k$-nearest-neighbor retrieval.
arXiv Detail & Related papers (2021-09-14T11:50:01Z)
- Iterative Domain-Repaired Back-Translation [50.32925322697343]
In this paper, we focus on the domain-specific translation with low resources, where in-domain parallel corpora are scarce or nonexistent.
We propose a novel iterative domain-repaired back-translation framework, which introduces the Domain-Repair model to refine translations in synthetic bilingual data.
Experiments on adapting NMT models between specific domains and from the general domain to specific domains demonstrate the effectiveness of our proposed approach.
arXiv Detail & Related papers (2020-10-06T04:38:09Z)
- Addressing Zero-Resource Domains Using Document-Level Context in Neural Machine Translation [80.40677540516616]
We show that when in-domain parallel data is not available, access to document-level context enables better capturing of domain generalities.
We present two document-level Transformer models which are capable of using large context sizes.
arXiv Detail & Related papers (2020-04-30T16:28:19Z)
- A Simple Baseline to Semi-Supervised Domain Adaptation for Machine Translation [73.3550140511458]
State-of-the-art neural machine translation (NMT) systems are data-hungry and perform poorly on new domains with no supervised data.
We propose a simple but effective approach to the semi-supervised domain adaptation scenario of NMT.
This approach iteratively trains a Transformer-based NMT model via three training objectives: language modeling, back-translation, and supervised translation.
arXiv Detail & Related papers (2020-01-22T16:42:06Z)
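Two of the entries above build on $k$NN-MT, which augments a frozen NMT model with a datastore of (decoder hidden state, target token) pairs collected from in-domain data and, at every decoding step, interpolates the distribution induced by the retrieved neighbors with the model's own next-token distribution. The sketch below shows that interpolation on a toy in-memory datastore; the dimensions, temperature, interpolation weight, and brute-force search are illustrative assumptions, not the setup of any specific paper above.

```python
# Toy kNN-MT decoding step: interpolate the NMT model's next-token
# distribution with a distribution induced by k-nearest-neighbor retrieval
# over a datastore of (decoder hidden state, target token) pairs.
# All sizes and hyperparameters below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_dim, datastore_size = 32, 16, 1000

# Datastore built offline from in-domain data: keys are decoder states,
# values are the target tokens that followed those states.
keys = rng.normal(size=(datastore_size, hidden_dim)).astype(np.float32)
values = rng.integers(0, vocab_size, size=datastore_size)

def knn_distribution(query, k=8, temperature=10.0):
    """Softmax over negative L2 distances of the k nearest datastore entries,
    aggregated per target token."""
    dists = np.linalg.norm(keys - query, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = np.exp(-dists[nearest] / temperature)
    weights /= weights.sum()
    p = np.zeros(vocab_size)
    for idx, w in zip(nearest, weights):
        p[values[idx]] += w
    return p

def interpolate(p_nmt, p_knn, lam=0.5):
    """Final next-token distribution: lam * p_kNN + (1 - lam) * p_NMT."""
    return lam * p_knn + (1.0 - lam) * p_nmt

# One decoding step with a dummy decoder state and a dummy NMT distribution.
query = rng.normal(size=hidden_dim).astype(np.float32)
p_nmt = rng.dirichlet(np.ones(vocab_size))
p_final = interpolate(p_nmt, knn_distribution(query))
print(p_final.argmax(), p_final.sum())  # chosen token id; distribution sums to ~1
```

The papers above focus on how the datastore is constructed or reconstructed so that retrieval stays reliable in new domains; the interpolation step itself remains essentially this simple.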