Exploring Paracrawl for Document-level Neural Machine Translation
- URL: http://arxiv.org/abs/2304.10216v1
- Date: Thu, 20 Apr 2023 11:21:34 GMT
- Title: Exploring Paracrawl for Document-level Neural Machine Translation
- Authors: Yusser Al Ghussin, Jingyi Zhang, Josef van Genabith
- Abstract summary: Document-level neural machine translation (NMT) has outperformed sentence-level NMT on a number of datasets.
We show that document-level NMT models trained with only parallel paragraphs from Paracrawl can be used to translate real documents.
- Score: 21.923881766940088
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Document-level neural machine translation (NMT) has outperformed
sentence-level NMT on a number of datasets. However, document-level NMT is
still not widely adopted in real-world translation systems mainly due to the
lack of large-scale general-domain training data for document-level NMT. We
examine the effectiveness of using Paracrawl for learning document-level
translation. Paracrawl is a large-scale parallel corpus crawled from the
Internet and contains data from various domains. The official Paracrawl corpus
was released as parallel sentences (extracted from parallel webpages) and
therefore previous works only used Paracrawl for learning sentence-level
translation. In this work, we extract parallel paragraphs from Paracrawl
parallel webpages using automatic sentence alignments and we use the extracted
parallel paragraphs as parallel documents for training document-level
translation models. We show that document-level NMT models trained with only
parallel paragraphs from Paracrawl can be used to translate real documents from
TED, News and Europarl, outperforming sentence-level NMT models. We also
perform a targeted pronoun evaluation and show that document-level models
trained with Paracrawl data can help context-aware pronoun translation.
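As a concrete illustration of the extraction step described in the abstract, here is a minimal Python sketch that merges consecutive automatic sentence alignments into parallel paragraphs. The input record format, the paragraph-ID fields, and the function name are our assumptions for illustration, not the paper's actual pipeline.

```python
# Minimal sketch: turn consecutive automatic sentence alignments into
# parallel paragraphs. Assumes each alignment record carries the paragraph
# IDs of its source and target sentences (a hypothetical input format;
# the paper's actual extraction pipeline may differ).
from typing import List, Optional, Tuple

# (src_sentence, tgt_sentence, src_paragraph_id, tgt_paragraph_id)
Alignment = Tuple[str, str, int, int]

def extract_parallel_paragraphs(alignments: List[Alignment]) -> List[Tuple[str, str]]:
    """Concatenate runs of aligned sentences that share the same source
    and target paragraph IDs into one parallel paragraph pair."""
    paragraphs: List[Tuple[str, str]] = []
    cur_src: List[str] = []
    cur_tgt: List[str] = []
    cur_key: Optional[Tuple[int, int]] = None
    for src, tgt, src_pid, tgt_pid in alignments:
        key = (src_pid, tgt_pid)
        if key != cur_key and cur_src:
            paragraphs.append((" ".join(cur_src), " ".join(cur_tgt)))
            cur_src, cur_tgt = [], []
        cur_key = key
        cur_src.append(src)
        cur_tgt.append(tgt)
    if cur_src:
        paragraphs.append((" ".join(cur_src), " ".join(cur_tgt)))
    return paragraphs

if __name__ == "__main__":
    aligned = [
        ("Hello.", "Hallo.", 0, 0),
        ("How are you?", "Wie geht es dir?", 0, 0),
        ("Goodbye.", "Auf Wiedersehen.", 1, 1),
    ]
    for src_par, tgt_par in extract_parallel_paragraphs(aligned):
        print(src_par, "|||", tgt_par)
```

The resulting paragraph pairs can then stand in for parallel documents when training a document-level NMT model.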
Related papers
- LexMatcher: Dictionary-centric Data Collection for LLM-based Machine Translation [67.24113079928668]
We present LexMatcher, a method for data curation driven by the coverage of senses found in bilingual dictionaries.
Our approach outperforms the established baselines on the WMT2022 test sets.
arXiv Detail & Related papers (2024-06-03T15:30:36Z)
- On Search Strategies for Document-Level Neural Machine Translation [51.359400776242786]
Document-level neural machine translation (NMT) models produce a more consistent output across a document.
In this work, we aim to answer the question of how best to utilize a context-aware translation model in decoding.
arXiv Detail & Related papers (2023-06-08T11:30:43Z)
- Multilingual Document-Level Translation Enables Zero-Shot Transfer From Sentences to Documents [19.59133362105703]
Document-level neural machine translation (DocNMT) delivers coherent translations by incorporating cross-sentence context.
We study whether and how contextual modeling in DocNMT is transferable from sentences to documents in a zero-shot fashion.
arXiv Detail & Related papers (2021-09-21T17:49:34Z)
- Exploring Unsupervised Pretraining Objectives for Machine Translation [99.5441395624651]
Unsupervised cross-lingual pretraining has achieved strong results in neural machine translation (NMT).
Most approaches adapt masked-language modeling (MLM) to sequence-to-sequence architectures by masking parts of the input and reconstructing them in the decoder.
We compare masking with alternative objectives that produce inputs resembling real (full) sentences by reordering and replacing words based on their context.
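To make the contrast concrete, here is a small Python sketch of the two corruption families: MLM-style masking versus a local word shuffle that keeps the input looking like a real (full) sentence. The masking probability, the shuffle window, and the function names are illustrative assumptions; the paper's actual objectives (including context-based word replacement) are more involved.

```python
import random

MASK = "<mask>"

def mask_input(tokens, p=0.15):
    # MLM-style corruption: replace a fraction of tokens with a mask
    # symbol; the model is trained to reconstruct the original sequence.
    return [MASK if random.random() < p else t for t in tokens]

def shuffle_input(tokens, k=3):
    # Reordering-style corruption: each token moves at most k positions,
    # so the corrupted input still resembles a real sentence.
    keys = [i + random.uniform(0, k) for i in range(len(tokens))]
    return [tok for _, tok in sorted(zip(keys, tokens))]

if __name__ == "__main__":
    sent = "the cat sat on the mat".split()
    print(mask_input(sent))
    print(shuffle_input(sent))
```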
arXiv Detail & Related papers (2021-06-10T10:18:23Z)
- Context-aware Decoder for Neural Machine Translation using a Target-side Document-Level Language Model [12.543106304662059]
We present a method to turn a sentence-level translation model into a context-aware model by incorporating a document-level language model into the decoder.
Our decoder is built upon only sentence-level parallel corpora and monolingual corpora.
From a theoretical viewpoint, the core of this work is a novel representation of contextual information using pointwise mutual information between the context and the current sentence.
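In our own notation (a sketch, not necessarily the paper's exact formulation), the context term and one way to combine it with a sentence-level translation model can be written as:

```latex
% Pointwise mutual information between document context c and the
% current target sentence y, estimated with a document-level LM:
\mathrm{PMI}(c, y) = \log \frac{p_{\mathrm{LM}}(y \mid c)}{p_{\mathrm{LM}}(y)}

% A context-aware decoding score that adds this term to the
% sentence-level model p(y \mid x) for source sentence x:
\mathrm{score}(y \mid x, c) = \log p(y \mid x) + \mathrm{PMI}(c, y)
```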
arXiv Detail & Related papers (2020-10-24T08:06:18Z)
- A Corpus for English-Japanese Multimodal Neural Machine Translation with Comparable Sentences [21.43163704217968]
We propose a new multimodal English-Japanese corpus with comparable sentences that are compiled from existing image captioning datasets.
Due to low translation scores in our baseline experiments, we believe that current multimodal NMT models are not designed to effectively utilize comparable sentence data.
arXiv Detail & Related papers (2020-10-17T06:12:25Z)
- Unsupervised Parallel Corpus Mining on Web Data [53.74427402568838]
We present a pipeline to mine the parallel corpus from the Internet in an unsupervised manner.
Our system produces new state-of-the-art results of 39.81 and 38.95 BLEU, even compared with supervised approaches.
arXiv Detail & Related papers (2020-09-18T02:38:01Z)
- Document-level Neural Machine Translation with Document Embeddings [82.4684444847092]
This work focuses on exploiting detailed document-level context in terms of multiple forms of document embeddings.
The proposed document-aware NMT is implemented to enhance the Transformer baseline by introducing both global and local document-level clues on the source end.
arXiv Detail & Related papers (2020-09-16T19:43:29Z)
- Using Context in Neural Machine Translation Training Objectives [23.176247496139574]
We present Neural Machine Translation (NMT) training using document-level metrics with batch-level documents.
We demonstrate that training is more robust with document-level metrics than with sequence metrics.
arXiv Detail & Related papers (2020-05-04T13:42:30Z)
- Learning Contextualized Sentence Representations for Document-Level Neural Machine Translation [59.191079800436114]
Document-level machine translation incorporates inter-sentential dependencies into the translation of a source sentence.
We propose a new framework to model cross-sentence dependencies by training neural machine translation (NMT) to predict both the target translation and surrounding sentences of a source sentence.
arXiv Detail & Related papers (2020-03-30T03:38:01Z)
- Capturing document context inside sentence-level neural machine translation models with self-training [5.129814362802968]
Document-level neural machine translation has received less attention and lags behind its sentence-level counterpart.
We propose an approach that doesn't require training a specialized model on parallel document-level corpora.
Our approach reinforces the choices made by the model, thus making it more likely that the same choices will be made in other sentences in the document.
arXiv Detail & Related papers (2020-03-11T12:36:17Z)