Multilingual Document-Level Translation Enables Zero-Shot Transfer From Sentences to Documents
- URL: http://arxiv.org/abs/2109.10341v1
- Date: Tue, 21 Sep 2021 17:49:34 GMT
- Title: Multilingual Document-Level Translation Enables Zero-Shot Transfer From Sentences to Documents
- Authors: Biao Zhang, Ankur Bapna, Melvin Johnson, Ali Dabirmoghaddam, Naveen Arivazhagan, Orhan Firat
- Abstract summary: Document-level neural machine translation (DocNMT) delivers coherent translations by incorporating cross-sentence context.
We study whether and how contextual modeling in DocNMT is transferable from sentences to documents in a zero-shot fashion.
- Score: 19.59133362105703
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Document-level neural machine translation (DocNMT) delivers coherent
translations by incorporating cross-sentence context. However, parallel documents are scarce for most language pairs, whereas parallel sentences are readily available. In this paper, we study whether and how
contextual modeling in DocNMT is transferable from sentences to documents in a
zero-shot fashion (i.e. no parallel documents for student languages) through
multilingual modeling. Using simple concatenation-based DocNMT, we explore the
effect of 3 factors on multilingual transfer: the number of document-supervised
teacher languages, the data schedule for parallel documents at training, and
the data condition of parallel documents (genuine vs. back-translated). Our
experiments on Europarl-7 and IWSLT-10 datasets show the feasibility of
multilingual transfer for DocNMT, particularly on document-specific metrics. We
observe that more teacher languages and an adequate data schedule both contribute to better transfer quality. Surprisingly, the transfer is less sensitive to the data condition, and multilingual DocNMT achieves comparable performance with back-translated and genuine document pairs.
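To make the training setup concrete, here is a minimal sketch of how concatenation-based DocNMT examples might be assembled: document-level pairs for teacher languages, sentence-level pairs only for student languages. The separator token, window size, and example data are illustrative assumptions, not the paper's exact preprocessing.

```python
# Sketch: building training examples for concatenation-based DocNMT.
# Teacher languages contribute document-level pairs (consecutive
# sentences joined with a separator token); student languages
# contribute only sentence-level pairs.

SEP = " <sep> "  # assumed sentence-boundary marker

def doc_pairs(src_sents, tgt_sents, window=3):
    """Concatenate up to `window` consecutive sentence pairs into
    one document-level source/target pair."""
    pairs = []
    for i in range(0, len(src_sents), window):
        src = SEP.join(src_sents[i:i + window])
        tgt = SEP.join(tgt_sents[i:i + window])
        pairs.append((src, tgt))
    return pairs

def sent_pairs(src_sents, tgt_sents):
    """Plain sentence-level pairs (the only supervision available
    for student languages)."""
    return list(zip(src_sents, tgt_sents))

# Teacher language: document supervision is available.
de_src = ["Er kam zu spät.", "Der Zug fiel aus.", "Also lief er."]
de_tgt = ["He was late.", "The train was cancelled.", "So he walked."]
train_data = doc_pairs(de_src, de_tgt)

# Student language: only sentence pairs; document-level behaviour
# must transfer zero-shot from the teacher languages.
fr_src = ["Il est arrivé en retard.", "Le train a été annulé."]
fr_tgt = ["He arrived late.", "The train was cancelled."]
train_data += sent_pairs(fr_src, fr_tgt)
```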
Related papers
- A General-Purpose Multilingual Document Encoder [9.868221447090855]
We pretrain a massively multilingual document encoder as a hierarchical transformer model (HMDE).
We leverage Wikipedia as a readily available source of comparable documents for creating training data.
We evaluate the effectiveness of HMDE on two of the most common and prominent cross-lingual document-level tasks.
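As a rough illustration of the hierarchical design, the sketch below encodes each sentence's tokens with a lower transformer and then contextualizes the pooled sentence vectors with an upper transformer; all dimensions, layer counts, and the mean-pooling choice are our assumptions, not HMDE's actual configuration.

```python
# Sketch of a hierarchical document encoder in the spirit of HMDE:
# a sentence-level transformer over tokens, then a document-level
# transformer over pooled sentence vectors.
import torch
import torch.nn as nn

class HierarchicalDocEncoder(nn.Module):
    def __init__(self, vocab_size=32000, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.sent_encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.doc_encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, token_ids):
        # token_ids: (num_sents, sent_len) for one document
        tok = self.sent_encoder(self.embed(token_ids))   # (S, L, D)
        sent_vecs = tok.mean(dim=1)                      # (S, D)
        doc = self.doc_encoder(sent_vecs.unsqueeze(0))   # (1, S, D)
        return doc.squeeze(0).mean(dim=0)                # (D,) doc vector

doc = torch.randint(0, 32000, (5, 12))  # 5 sentences, 12 tokens each
vec = HierarchicalDocEncoder()(doc)
```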
arXiv Detail & Related papers (2023-05-11T17:55:45Z)
- Exploring Paracrawl for Document-level Neural Machine Translation [21.923881766940088]
Document-level neural machine translation (NMT) has outperformed sentence-level NMT on a number of datasets.
We show that document-level NMT models trained with only parallel paragraphs from Paracrawl can be used to translate real documents.
arXiv Detail & Related papers (2023-04-20T11:21:34Z)
- Beyond Triplet: Leveraging the Most Data for Multimodal Machine Translation [53.342921374639346]
Multimodal machine translation aims to improve translation quality by incorporating information from other modalities, such as vision.
Previous MMT systems mainly focus on better access and use of visual information and tend to validate their methods on image-related datasets.
This paper establishes new methods and new datasets for MMT.
arXiv Detail & Related papers (2022-12-20T15:02:38Z)
- Realistic Zero-Shot Cross-Lingual Transfer in Legal Topic Classification [21.44895570621707]
We consider zero-shot cross-lingual transfer in legal topic classification using the recent MultiEURLEX dataset.
Since the original dataset contains parallel documents, which is unrealistic for zero-shot cross-lingual transfer, we develop a new version of the dataset without parallel documents.
We show that translation-based methods vastly outperform cross-lingual fine-tuning of multilingually pre-trained models.
arXiv Detail & Related papers (2022-06-08T10:02:11Z)
- Parameter-Efficient Neural Reranking for Cross-Lingual and Multilingual Retrieval [66.69799641522133]
State-of-the-art neural (re)rankers are notoriously data hungry.
Current approaches typically transfer rankers trained on English data to other languages and cross-lingual setups by means of multilingual encoders.
We show that two parameter-efficient approaches to cross-lingual transfer, namely Sparse Fine-Tuning Masks (SFTMs) and Adapters, allow for a more lightweight and more effective zero-shot transfer.
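For intuition, a bottleneck adapter of the kind this line of work inserts into frozen transformer layers can be sketched as follows; the hidden size, bottleneck width, and activation are illustrative assumptions, not the paper's exact setup.

```python
# Sketch of a bottleneck adapter for parameter-efficient transfer:
# a small down/up projection with a residual connection, trained
# while the large encoder stays frozen.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, d_model=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.ReLU()

    def forward(self, hidden):
        # Residual: the adapter only learns a small correction.
        return hidden + self.up(self.act(self.down(hidden)))

# Only ~2 * d_model * bottleneck parameters are updated per layer.
adapter = Adapter()
out = adapter(torch.randn(2, 16, 768))  # (batch, seq, hidden)
```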
arXiv Detail & Related papers (2022-04-05T15:44:27Z)
- DOCmT5: Document-Level Pretraining of Multilingual Language Models [9.072507490639218]
We introduce DOCmT5, a multilingual sequence-to-sequence language model pre-trained with large scale parallel documents.
We propose a simple and effective pre-training objective, Document Reordering Machine Translation (DrMT).
DrMT brings consistent improvements over strong baselines on a variety of document-level generation tasks.
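As we read the objective, a DrMT-style training pair could be built roughly as below: the source document's sentences are shuffled, and the model must emit the translation in the original order. This is a hedged sketch of our reading, not the paper's exact recipe.

```python
# Sketch of a DrMT-style training pair (our assumption of the
# objective): shuffled source document in, ordered translation out.
import random

def drmt_pair(src_sents, tgt_sents, seed=0):
    order = list(range(len(src_sents)))
    random.Random(seed).shuffle(order)
    shuffled_src = " ".join(src_sents[i] for i in order)
    ordered_tgt = " ".join(tgt_sents)  # target keeps document order
    return shuffled_src, ordered_tgt

src = ["Zuerst kam er an.", "Dann begann er.", "Schließlich ging er."]
tgt = ["First he arrived.", "Then he started.", "Finally he left."]
x, y = drmt_pair(src, tgt)
```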
arXiv Detail & Related papers (2021-12-16T08:58:52Z)
- Multilingual Machine Translation Systems from Microsoft for WMT21 Shared Task [95.06453182273027]
This report describes Microsoft's machine translation systems for the WMT21 shared task on large-scale multilingual machine translation.
Our submissions to the shared task were built on DeltaLM (https://aka.ms/deltalm), a generic pre-trained multilingual encoder-decoder model.
Our final submissions ranked first on three tracks in terms of the automatic evaluation metric.
arXiv Detail & Related papers (2021-11-03T09:16:17Z)
- Cross-lingual Intermediate Fine-tuning improves Dialogue State Tracking [84.50302759362698]
We enhance the transfer learning process by intermediate fine-tuning of pretrained multilingual models.
We use parallel and conversational movie subtitles datasets to design cross-lingual intermediate tasks.
We achieve impressive improvements (> 20% on goal accuracy) on the parallel MultiWoZ dataset and Multilingual WoZ dataset.
arXiv Detail & Related papers (2021-09-28T11:22:38Z)
- MultiEURLEX -- A multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer [13.24356999779404]
We introduce MULTI-EURLEX, a new multilingual dataset for topic classification of legal documents.
The dataset comprises 65k European Union (EU) laws, officially translated in 23 languages, annotated with multiple labels from the EUROVOC taxonomy.
We use the dataset as a testbed for zero-shot cross-lingual transfer, where we exploit annotated training documents in one language (source) to classify documents in another language (target).
arXiv Detail & Related papers (2021-09-02T12:52:55Z)
- Word Alignment by Fine-tuning Embeddings on Parallel Corpora [96.28608163701055]
Word alignment over parallel corpora has a wide variety of applications, including learning translation lexicons, cross-lingual transfer of language processing tools, and automatic evaluation or analysis of translation outputs.
Recently, other work has demonstrated that pre-trained contextualized word embeddings derived from multilingually trained language models (LMs) prove an attractive alternative, achieving competitive results on the word alignment task even in the absence of explicit training on parallel data.
In this paper, we examine methods to marry the two approaches: leveraging pre-trained LMs and fine-tuning them on parallel text with objectives designed to improve alignment quality.
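As a toy illustration of alignment extraction from such embeddings, one common recipe keeps token pairs that are mutual nearest neighbors under cosine similarity; the random embeddings below stand in for the fine-tuned LM states, and the mutual-argmax rule is a standard heuristic rather than this paper's specific method.

```python
# Sketch: extract word alignments from contextual embeddings by
# keeping (i, j) pairs that are each other's best cosine match.
import numpy as np

def align(src_emb, tgt_emb):
    # Cosine similarity matrix between source and target tokens.
    s = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    t = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = s @ t.T
    # Keep (i, j) only if each is the other's argmax.
    fwd = sim.argmax(axis=1)
    bwd = sim.argmax(axis=0)
    return [(i, j) for i, j in enumerate(fwd) if bwd[j] == i]

src_emb = np.random.randn(4, 768)   # 4 source tokens (stand-ins)
tgt_emb = np.random.randn(5, 768)   # 5 target tokens (stand-ins)
print(align(src_emb, tgt_emb))
```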
arXiv Detail & Related papers (2021-01-20T17:54:47Z)
- What makes multilingual BERT multilingual? [60.9051207862378]
In this work, we provide an in-depth experimental study to supplement the existing literature of cross-lingual ability.
We compare the cross-lingual ability of non-contextualized and contextualized representation models trained on the same data.
We find that data size and context-window size are crucial factors in transferability.
arXiv Detail & Related papers (2020-10-20T05:41:56Z)