FDMT: A Benchmark Dataset for Fine-grained Domain Adaptation in Machine
Translation
- URL: http://arxiv.org/abs/2012.15717v1
- Date: Thu, 31 Dec 2020 17:15:09 GMT
- Title: FDMT: A Benchmark Dataset for Fine-grained Domain Adaptation in Machine
Translation
- Authors: Wenhao Zhu, Shujian Huang, Tong Pu, Xu Zhang, Jian Yu, Wei Chen,
Yanfeng Wang and Jiajun Chen
- Abstract summary: We present a real-world fine-grained domain adaptation task in machine translation (FDMT).
The FDMT dataset consists of four sub-domains of information technology: autonomous vehicles, AI education, real-time networks, and smartphones.
We conduct quantitative experiments and in-depth analyses in this new setting, which benchmark the fine-grained domain adaptation task.
- Score: 53.87731008029645
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Previous domain adaptation research usually neglects the diversity of
translation within the same domain, which is a core problem for adapting a
general neural machine translation (NMT) model to a specific domain in
real-world scenarios. One representative of such challenging scenarios is
deploying a translation system for a conference on a specific topic, e.g.,
computer networks or natural language processing, where resources are usually
extremely scarce due to the limited time schedule. To motivate wide
investigation of such settings, we present a real-world fine-grained domain
adaptation task in machine translation (FDMT). The FDMT dataset (Zh-En)
consists of four sub-domains of information technology: autonomous vehicles, AI
education, real-time networks, and smartphones. To be closer to reality, FDMT
does not employ any in-domain bilingual training data. Instead, each sub-domain
is equipped with monolingual data, a bilingual dictionary, and a knowledge base,
to encourage in-depth exploration of these available resources. Corresponding
development and test sets are provided for evaluation purposes. We conduct
quantitative experiments and in-depth analyses in this new setting, which
benchmark the fine-grained domain adaptation task and reveal several
challenging problems that need to be addressed.
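Since each sub-domain ships the same resource types (monolingual data, a bilingual dictionary, a knowledge base, and dev/test sets), a thin loader makes the setup concrete. The directory layout and file names below are hypothetical; the abstract specifies only the resource types, not the distribution format.

```python
# Hypothetical loader for one FDMT sub-domain's resources. File names and
# layout are assumptions; the paper only lists the resource types.
from dataclasses import dataclass, field
from pathlib import Path

SUB_DOMAINS = ["autonomous_vehicles", "ai_education", "real_time_networks", "smartphone"]

@dataclass
class SubDomainResources:
    monolingual: list[str] = field(default_factory=list)       # in-domain Zh sentences
    dictionary: dict[str, str] = field(default_factory=dict)   # Zh term -> En term
    knowledge_base: list[str] = field(default_factory=list)    # entity/term descriptions
    dev: list[tuple[str, str]] = field(default_factory=list)   # (Zh, En) pairs
    test: list[tuple[str, str]] = field(default_factory=list)

def load_sub_domain(root: Path, name: str) -> SubDomainResources:
    """Load one sub-domain from a hypothetical <root>/<name>/ directory."""
    d = root / name
    res = SubDomainResources()
    res.monolingual = (d / "mono.zh").read_text(encoding="utf-8").splitlines()
    for line in (d / "dict.tsv").read_text(encoding="utf-8").splitlines():
        zh, en = line.split("\t")
        res.dictionary[zh] = en
    res.knowledge_base = (d / "kb.txt").read_text(encoding="utf-8").splitlines()
    for split in ("dev", "test"):
        zh_lines = (d / f"{split}.zh").read_text(encoding="utf-8").splitlines()
        en_lines = (d / f"{split}.en").read_text(encoding="utf-8").splitlines()
        setattr(res, split, list(zip(zh_lines, en_lines)))
    return res

# Usage (assuming the files exist on disk):
# resources = {name: load_sub_domain(Path("fdmt"), name) for name in SUB_DOMAINS}
```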
Related papers
- DG-PIC: Domain Generalized Point-In-Context Learning for Point Cloud Understanding [41.49771026674969]
We introduce a novel, practical multi-domain, multi-task setting for domain-generalized point cloud understanding, handling multiple domains and multiple tasks within one unified model.
Our DG-PIC requires no model updates during testing and can handle unseen domains and multiple tasks, i.e., point cloud reconstruction, denoising, and registration, within one unified model.
arXiv Detail & Related papers (2024-07-11T18:21:40Z) - A Unified Data Augmentation Framework for Low-Resource Multi-Domain Dialogue Generation [52.0964459842176]
Current state-of-the-art dialogue systems heavily rely on extensive training datasets.
We propose a novel data Augmentation framework for Multi-Domain Dialogue Generation, referred to as AMD$^2$G.
The AMD$^2$G framework consists of a data augmentation process and a two-stage training approach: domain-agnostic training and domain adaptation training.
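The two-stage schedule is the core of the recipe, so here is a minimal PyTorch sketch of it: train first on pooled, augmented domain-agnostic data, then adapt on the small target-domain set. The toy model, random tensors, and hyperparameters are stand-ins, not the AMD$^2$G architecture.

```python
# Minimal sketch of two-stage training: domain-agnostic first, then adaptation.
import torch
from torch import nn

def train_stage(model, batches, epochs, lr):
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for x, y in batches:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 16))

# Stage 1: domain-agnostic training on pooled, augmented data (toy tensors here).
agnostic = [(torch.randn(8, 16), torch.randn(8, 16)) for _ in range(50)]
train_stage(model, agnostic, epochs=3, lr=1e-3)

# Stage 2: domain adaptation on the (much smaller) target-domain data,
# with a lower learning rate to limit catastrophic forgetting.
in_domain = [(torch.randn(8, 16), torch.randn(8, 16)) for _ in range(5)]
train_stage(model, in_domain, epochs=10, lr=1e-4)
```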
arXiv Detail & Related papers (2024-06-14T09:52:27Z) - Language Modelling Approaches to Adaptive Machine Translation [0.0]
Consistency is a key requirement of high-quality translation.
In-domain data scarcity is common in translation settings.
Can we employ language models to improve the quality of adaptive MT at inference time?
arXiv Detail & Related papers (2024-01-25T23:02:54Z) - Harnessing the Power of Multi-Lingual Datasets for Pre-training: Towards
Enhancing Text Spotting Performance [15.513912470752041]
The ability to adapt to a wide range of domains is crucial for scene text spotting models deployed in real-world conditions.
Here, we investigate the problem of domain-adaptive scene text spotting, i.e., training a model on multi-domain source data.
The results clearly demonstrate the potential of intermediate representations to achieve significant performance on text spotting benchmarks across multiple domains.
arXiv Detail & Related papers (2023-10-02T06:08:01Z) - $m^4Adapter$: Multilingual Multi-Domain Adaptation for Machine
Translation with a Meta-Adapter [128.69723410769586]
Multilingual neural machine translation (MNMT) models yield state-of-the-art performance when evaluated on data from the domains and language pairs seen in training.
When an MNMT model is used to translate under domain shift or into a new language pair, performance drops dramatically.
We propose $m^4Adapter$, which combines domain and language knowledge using meta-learning with adapters.
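As background, a bottleneck adapter of the kind $m^4Adapter$ builds on is a small residual module inserted into a frozen backbone; a minimal PyTorch sketch follows. The meta-learning procedure that combines domain and language adapters is not shown, and the dimensions are illustrative.

```python
# A generic bottleneck adapter: down-project, nonlinearity, up-project, residual.
import torch
from torch import nn

class Adapter(nn.Module):
    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.ReLU()

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # The residual keeps the frozen backbone's representation and adds a
        # small, trainable domain/language-specific correction on top.
        return h + self.up(self.act(self.down(h)))

h = torch.randn(2, 10, 512)   # (batch, seq, hidden) from a frozen NMT layer
print(Adapter(512)(h).shape)  # torch.Size([2, 10, 512])
```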
arXiv Detail & Related papers (2022-10-21T12:25:05Z) - Domain-Specific Text Generation for Machine Translation [7.803471587734353]
We propose a novel approach to domain adaptation leveraging state-of-the-art pretrained language models (LMs) for domain-specific data augmentation.
We employ mixed fine-tuning to train models that significantly improve translation of in-domain texts.
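Mixed fine-tuning continues training on a blend of general-domain parallel data and (here, LM-generated synthetic) in-domain data, typically oversampling the smaller in-domain side. A minimal sketch of that data preparation, with a hypothetical oversampling ratio and toy corpora:

```python
# Build a mixed fine-tuning corpus: oversample in-domain pairs, shuffle with
# general-domain pairs. The ratio of 4 is an illustrative assumption.
import random

def mixed_corpus(general, in_domain, oversample: int = 4, seed: int = 0):
    """Return a shuffled mix with in-domain examples repeated `oversample` times."""
    mix = list(general) + list(in_domain) * oversample
    random.Random(seed).shuffle(mix)
    return mix

general = [("gen-src-%d" % i, "gen-tgt-%d" % i) for i in range(1000)]
synthetic = [("dom-src-%d" % i, "dom-tgt-%d" % i) for i in range(50)]
train_data = mixed_corpus(general, synthetic)
print(len(train_data))  # 1200
```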
arXiv Detail & Related papers (2022-08-11T16:22:16Z) - Non-Parametric Unsupervised Domain Adaptation for Neural Machine
Translation [61.27321597981737]
$k$NN-MT has shown promising capability by directly combining a pre-trained neural machine translation (NMT) model with domain-specific token-level $k$-nearest-neighbor retrieval.
We propose a novel framework that directly uses in-domain monolingual sentences in the target language to construct an effective datastore for $k$-nearest-neighbor retrieval.
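The kNN-MT decoding rule itself is compact: retrieve the $k$ datastore entries (decoder hidden state, next target token) nearest to the current decoder state, turn their negative distances into a token distribution, and interpolate with the NMT model's distribution. A minimal NumPy sketch with toy sizes and a uniform stand-in NMT distribution:

```python
# kNN-MT interpolation: p(y) = lam * p_kNN(y) + (1 - lam) * p_NMT(y).
import numpy as np

def knn_mt_probs(h, keys, values, p_nmt, k=4, temperature=10.0, lam=0.5):
    dists = np.linalg.norm(keys - h, axis=1)   # L2 distance to each datastore key
    nn_idx = np.argsort(dists)[:k]             # k nearest datastore entries
    logits = -dists[nn_idx] / temperature      # closer neighbors score higher
    weights = np.exp(logits - logits.max())    # stable softmax over neighbors
    weights /= weights.sum()
    p_knn = np.zeros_like(p_nmt)
    for w, idx in zip(weights, nn_idx):
        p_knn[values[idx]] += w                # aggregate mass by target token id
    return lam * p_knn + (1.0 - lam) * p_nmt

rng = np.random.default_rng(0)
vocab, d, n = 100, 16, 500
keys = rng.normal(size=(n, d))                 # decoder states from in-domain text
values = rng.integers(0, vocab, size=n)        # the tokens that followed them
h = rng.normal(size=d)                         # current decoder state
p_nmt = np.full(vocab, 1.0 / vocab)            # toy uniform NMT distribution
print(knn_mt_probs(h, keys, values, p_nmt).sum())  # ~1.0
```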
arXiv Detail & Related papers (2021-09-14T11:50:01Z) - Rapid Domain Adaptation for Machine Translation with Monolingual Data [31.70276147485463]
One challenge in machine translation is how to quickly adapt to unseen domains in the face of surging events such as COVID-19.
In this paper, we propose an approach that enables rapid domain adaptation from the perspective of unsupervised translation.
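The abstract does not spell out the exact method, but the standard way to turn in-domain monolingual target data into training signal in an unsupervised setting is back-translation; the sketch below shows only that data flow, with `reverse_translate` as a hypothetical stand-in for a real target-to-source model.

```python
# Back-translation data flow: a reverse (target->source) model synthesizes
# source sides for monolingual target text; the synthetic pairs then
# fine-tune the forward (source->target) model.
def reverse_translate(tgt_sentence: str) -> str:
    """Placeholder for a real target->source NMT model."""
    return "<synthetic-source for: %s>" % tgt_sentence

def back_translate(mono_tgt):
    """Build synthetic (source, target) pairs from monolingual target text."""
    return [(reverse_translate(t), t) for t in mono_tgt]

mono = ["COVID-19 vaccines are being distributed.", "Masks reduce transmission."]
synthetic_parallel = back_translate(mono)
# synthetic_parallel now serves as training data for the forward model.
```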
arXiv Detail & Related papers (2020-10-23T20:31:37Z) - Unsupervised Domain Clusters in Pretrained Language Models [61.832234606157286]
We show that massive pre-trained language models implicitly learn sentence representations that cluster by domains without supervision.
We propose domain data selection methods based on such models.
We evaluate our data selection methods for neural machine translation across five diverse domains.
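The data selection idea can be sketched end to end: fit an unsupervised clustering (a Gaussian mixture, as in the paper's setup) over sentence embeddings, find the cluster where target-domain dev sentences land, and keep the general-corpus sentences from that cluster. Random vectors below stand in for real pretrained-LM embeddings; the component count and sizes are toy choices.

```python
# Embedding-based domain data selection via unsupervised clustering.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
corpus_emb = rng.normal(size=(1000, 32))       # embeddings of candidate sentences
dev_emb = rng.normal(loc=0.5, size=(20, 32))   # embeddings of in-domain dev set

gmm = GaussianMixture(n_components=5, random_state=0).fit(corpus_emb)
corpus_clusters = gmm.predict(corpus_emb)

# The target domain's cluster: where the dev sentences most often land.
target = np.bincount(gmm.predict(dev_emb)).argmax()

selected = np.flatnonzero(corpus_clusters == target)  # indices kept for fine-tuning
print(len(selected), "sentences selected")
```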
arXiv Detail & Related papers (2020-04-05T06:22:16Z) - A Simple Baseline to Semi-Supervised Domain Adaptation for Machine
Translation [73.3550140511458]
State-of-the-art neural machine translation (NMT) systems are data-hungry and perform poorly on new domains with no supervised data.
We propose a simple but effective approach to the semi-supervised domain adaptation scenario of NMT.
This approach iteratively trains a Transformer-based NMT model via three training objectives: language modeling, back-translation, and supervised translation.
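A minimal sketch of that iterative schedule: each round cycles through the three objectives on the same model. The toy losses below stand in for real language-modeling, back-translation, and supervised sequence losses; only the alternation is the point, not the architecture.

```python
# Iterative training that alternates three objectives on one shared model.
import torch
from torch import nn

model = nn.Linear(16, 16)        # toy stand-in for a Transformer NMT model
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def batch():                     # toy (input, target) batch
    return torch.randn(8, 16), torch.randn(8, 16)

def lm_loss():                   # stand-in: monolingual language modeling
    x, y = batch()
    return loss_fn(model(x), y)

def bt_loss():                   # stand-in: back-translated synthetic pairs
    x, y = batch()
    return loss_fn(model(x), y)

def sup_loss():                  # stand-in: small supervised in-domain set
    x, y = batch()
    return loss_fn(model(x), y)

for _round in range(3):          # iterative rounds over all three objectives
    for objective in (lm_loss, bt_loss, sup_loss):
        opt.zero_grad()
        objective().backward()
        opt.step()
```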
arXiv Detail & Related papers (2020-01-22T16:42:06Z)