Parallel Corpus Filtering via Pre-trained Language Models
- URL: http://arxiv.org/abs/2005.06166v1
- Date: Wed, 13 May 2020 06:06:23 GMT
- Title: Parallel Corpus Filtering via Pre-trained Language Models
- Authors: Boliang Zhang, Ajay Nagesh, and Kevin Knight
- Abstract summary: Web-crawled data provides a good source of parallel corpora for training machine translation models.
Recent work shows that neural machine translation systems are more sensitive to noise than traditional statistical machine translation methods.
We propose a novel approach to filter out noisy sentence pairs from web-crawled corpora via pre-trained language models.
- Score: 14.689457985200141
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Web-crawled data provides a good source of parallel corpora for training
machine translation models. It is automatically obtained, but extremely noisy,
and recent work shows that neural machine translation systems are more
sensitive to noise than traditional statistical machine translation methods. In
this paper, we propose a novel approach to filter out noisy sentence pairs from
web-crawled corpora via pre-trained language models. We measure sentence
parallelism by leveraging the multilingual capability of BERT and use the
Generative Pre-training (GPT) language model as a domain filter to balance data
domains. We evaluate the proposed method on the WMT 2018 Parallel Corpus
Filtering shared task, and on our own web-crawled Japanese-Chinese parallel
corpus. Our method significantly outperforms baselines and achieves a new
state-of-the-art. In an unsupervised setting, our method achieves comparable
performance to the top-1 supervised method. We also evaluate on a web-crawled
Japanese-Chinese parallel corpus that we make publicly available.
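The abstract describes two signals: a parallelism score obtained from multilingual BERT and a domain score obtained from a GPT language model. As a rough illustration of that idea only (not the authors' released implementation), the sketch below scores a candidate sentence pair with mean-pooled multilingual BERT embeddings and filters the target side by GPT-2 perplexity; the checkpoint names, the pooling, the cosine-similarity scoring, and the thresholds are assumptions made for this example, and the paper's actual scoring functions differ.

```python
# Minimal illustrative sketch (not the paper's code): score a sentence pair with
# multilingual BERT and filter by GPT-2 perplexity as a rough domain signal.
# Checkpoints, pooling, and thresholds are assumptions made for the example.
import torch
from transformers import AutoModel, AutoTokenizer, GPT2LMHeadModel, GPT2TokenizerFast

bert_tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
bert = AutoModel.from_pretrained("bert-base-multilingual-cased").eval()
gpt_tok = GPT2TokenizerFast.from_pretrained("gpt2")   # English-only stand-in LM
gpt = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def embed(sentence: str) -> torch.Tensor:
    """Mean-pool multilingual BERT's last hidden layer into one sentence vector."""
    inputs = bert_tok(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = bert(**inputs).last_hidden_state     # (1, seq_len, hidden)
    return hidden.mean(dim=1).squeeze(0)

def parallelism_score(src: str, tgt: str) -> float:
    """Cosine similarity of the two sentence embeddings as a crude parallelism proxy."""
    return torch.nn.functional.cosine_similarity(embed(src), embed(tgt), dim=0).item()

def domain_perplexity(sentence: str) -> float:
    """Perplexity under the GPT LM; lower means closer to the LM's (clean) domain."""
    inputs = gpt_tok(sentence, return_tensors="pt")
    with torch.no_grad():
        loss = gpt(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()

def keep_pair(src: str, tgt: str, sim_min: float = 0.7, ppl_max: float = 500.0) -> bool:
    """Keep a pair only if it looks parallel and the target side looks in-domain."""
    return parallelism_score(src, tgt) >= sim_min and domain_perplexity(tgt) <= ppl_max
```

In a real pipeline the thresholds would be tuned on held-out clean data, and the English-only gpt2 checkpoint would be swapped for a language model trained on the target language and domain.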
Related papers
- Bilingual Corpus Mining and Multistage Fine-Tuning for Improving Machine Translation of Lecture Transcripts [50.00305136008848]
We propose a framework for parallel corpus mining, which provides a quick and effective way to mine a parallel corpus from publicly available lectures on Coursera.
For both English--Japanese and English--Chinese lecture translations, we extracted parallel corpora of approximately 50,000 lines and created development and test sets.
This study also suggests guidelines for gathering and cleaning corpora, mining parallel sentences, cleaning noise in the mined data, and creating high-quality evaluation splits.
arXiv Detail & Related papers (2023-11-07T03:50:25Z)
- Summarize and Generate to Back-translate: Unsupervised Translation of Programming Languages [86.08359401867577]
Back-translation is widely known for its effectiveness for neural machine translation when little to no parallel data is available.
We propose performing back-translation via code summarization and generation.
We show that our proposed approach performs competitively with state-of-the-art methods.
arXiv Detail & Related papers (2022-05-23T08:20:41Z)
- Non-Parametric Unsupervised Domain Adaptation for Neural Machine Translation [61.27321597981737]
$k$NN-MT has shown the promising capability of directly incorporating the pre-trained neural machine translation (NMT) model with domain-specific token-level $k$-nearest-neighbor retrieval.
We propose a novel framework that directly uses in-domain monolingual sentences in the target language to construct an effective datastore for $k$-nearest-neighbor retrieval.
arXiv Detail & Related papers (2021-09-14T11:50:01Z)
- Cross-language Sentence Selection via Data Augmentation and Rationale Training [22.106577427237635]
The proposed approach uses data augmentation and negative sampling techniques on noisy parallel sentence data to learn a cross-lingual embedding-based query relevance model.
Results show that this approach performs as well as or better than multiple state-of-the-art machine translation + monolingual retrieval systems trained on the same parallel data.
arXiv Detail & Related papers (2021-06-04T07:08:47Z)
- Meta Back-translation [111.87397401837286]
We propose a novel method to generate pseudo-parallel data from a pre-trained back-translation model.
Our method is a meta-learning algorithm which adapts a pre-trained back-translation model so that the pseudo-parallel data it generates would train a forward-translation model to do well on a validation set.
arXiv Detail & Related papers (2021-02-15T20:58:32Z)
- Word Alignment by Fine-tuning Embeddings on Parallel Corpora [96.28608163701055]
Word alignment over parallel corpora has a wide variety of applications, including learning translation lexicons, cross-lingual transfer of language processing tools, and automatic evaluation or analysis of translation outputs.
Recently, other work has demonstrated that pre-trained contextualized word embeddings derived from multilingually trained language models (LMs) prove an attractive alternative, achieving competitive results on the word alignment task even in the absence of explicit training on parallel data.
In this paper, we examine methods to marry the two approaches: leveraging pre-trained LMs but fine-tuning them on parallel text with objectives designed to improve alignment quality, and proposing methods to effectively extract alignments from these fine-tuned models.
arXiv Detail & Related papers (2021-01-20T17:54:47Z)
- Unsupervised Parallel Corpus Mining on Web Data [53.74427402568838]
We present a pipeline to mine the parallel corpus from the Internet in an unsupervised manner.
Our system produces new state-of-the-art results, 39.81 and 38.95 BLEU scores, even compared with supervised approaches.
arXiv Detail & Related papers (2020-09-18T02:38:01Z)
- Capturing document context inside sentence-level neural machine translation models with self-training [5.129814362802968]
Document-level neural machine translation has received less attention and lags behind its sentence-level counterpart.
We propose an approach that doesn't require training a specialized model on parallel document-level corpora.
Our approach reinforces the choices made by the model, thus making it more likely that the same choices will be made in other sentences in the document.
arXiv Detail & Related papers (2020-03-11T12:36:17Z)
- Coursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation [37.04364877980479]
We show how to mine a parallel corpus from publicly available lectures at Coursera.
Our approach determines sentence alignments, relying on machine translation and cosine similarity over continuous-space sentence representations (see the sketch after this list).
For Japanese--English lectures translation, we extracted parallel data of approximately 40,000 lines and created development and test sets.
arXiv Detail & Related papers (2019-12-26T01:12:31Z)
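Several entries above, notably the Coursera corpus mining papers, align sentences using cosine similarity over continuous-space sentence representations; the sketch referenced in that item is given here. It is only a minimal illustration under assumed choices (a multilingual BERT encoder, mean pooling, and a greedy nearest-neighbour match with a similarity threshold), not the cited papers' actual mining pipelines.

```python
# Minimal illustrative sketch of cosine-similarity sentence alignment. The encoder
# checkpoint, pooling, threshold, and greedy matching are assumptions for the example.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
enc = AutoModel.from_pretrained("bert-base-multilingual-cased").eval()

def embed_batch(sentences):
    """Mean-pooled multilingual BERT embeddings, L2-normalised so a dot product
    equals cosine similarity."""
    inputs = tok(sentences, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        hidden = enc(**inputs).last_hidden_state          # (batch, seq, hidden)
    mask = inputs["attention_mask"].unsqueeze(-1)          # zero out padding tokens
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    return torch.nn.functional.normalize(pooled, dim=-1)

def mine_pairs(src_sents, tgt_sents, threshold=0.7):
    """Greedily pair each source sentence with its most similar target sentence,
    keeping only pairs above the similarity threshold."""
    sims = embed_batch(src_sents) @ embed_batch(tgt_sents).T   # cosine matrix
    pairs = []
    for i, row in enumerate(sims):
        j = int(row.argmax())
        if float(row[j]) >= threshold:
            pairs.append((src_sents[i], tgt_sents[j], float(row[j])))
    return pairs

# Hypothetical usage on a handful of English and Japanese lines:
# pairs = mine_pairs(["Welcome to the course."],
#                    ["コースへようこそ。", "これは無関係な文です。"])
```

For real web-scale corpora the dense pairwise similarity matrix shown here would be far too large to materialise; mining systems typically rely on an approximate nearest-neighbour index instead.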
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.