Separating Grains from the Chaff: Using Data Filtering to Improve
Multilingual Translation for Low-Resourced African Languages
- URL: http://arxiv.org/abs/2210.10692v2
- Date: Thu, 20 Oct 2022 14:18:49 GMT
- Title: Separating Grains from the Chaff: Using Data Filtering to Improve
Multilingual Translation for Low-Resourced African Languages
- Authors: Idris Abdulmumin, Michael Beukman, Jesujoba O. Alabi, Chris Emezue,
Everlyn Asiko, Tosin Adewumi, Shamsuddeen Hassan Muhammad, Mofetoluwa
Adeyemi, Oreen Yousuf, Sahib Singh, Tajuddeen Rabiu Gwadabe
- Abstract summary: This work describes our approach, which is based on filtering the given noisy data using a sentence-pair classifier.
We empirically validate our approach by evaluating on two common datasets and show that data filtering generally improves overall translation quality.
- Score: 0.6947064688250465
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We participated in the WMT 2022 Large-Scale Machine Translation Evaluation
for the African Languages Shared Task. This work describes our approach, which
is based on filtering the given noisy data using a sentence-pair classifier
that was built by fine-tuning a pre-trained language model. To train the
classifier, we obtain positive samples (i.e. high-quality parallel sentences)
from a gold-standard curated dataset and extract negative samples (i.e.
low-quality parallel sentences) from automatically aligned parallel data by
choosing sentences with low alignment scores. Our final machine translation
model was then trained on filtered data, instead of the entire noisy dataset.
We empirically validate our approach by evaluating on two common datasets and
show that data filtering generally improves overall translation quality, in
some cases even significantly.
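As a rough illustration of the pipeline the abstract describes, the sketch below builds the classifier's training data from gold positives and low-alignment-score negatives, then filters the noisy corpus with a fine-tuned sentence-pair classifier. It assumes a HuggingFace-style cross-encoder; the checkpoint path, label convention, alignment-score cutoff, and acceptance threshold are all illustrative assumptions, not the authors' released artifacts.
```python
# Illustrative sketch of the paper's filtering pipeline (all names and
# thresholds here are assumptions, not the authors' released code).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

def build_classifier_data(gold_pairs, aligned_pairs_with_scores, cutoff=0.3):
    """Positives come from the gold-standard corpus; negatives are
    automatically aligned pairs whose alignment score falls below `cutoff`."""
    positives = [(src, tgt, 1) for src, tgt in gold_pairs]
    negatives = [(src, tgt, 0) for src, tgt, score in aligned_pairs_with_scores
                 if score < cutoff]
    return positives + negatives

# Hypothetical checkpoint: a pre-trained LM fine-tuned on the data above.
CHECKPOINT = "path/to/finetuned-sentence-pair-classifier"
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT)
model.eval()

def p_parallel(src: str, tgt: str) -> float:
    """Classifier probability that (src, tgt) is a high-quality pair
    (assumes label 1 = parallel, label 0 = noisy)."""
    inputs = tokenizer(src, tgt, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

def filter_corpus(noisy_pairs, threshold=0.5):
    """Keep only pairs the classifier accepts; the MT model then trains
    on this filtered subset instead of the full noisy corpus."""
    return [(s, t) for s, t in noisy_pairs if p_parallel(s, t) >= threshold]
```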
Related papers
- A Case Study on Filtering for End-to-End Speech Translation [32.676738355929466]
It is relatively easy to mine a large parallel corpus for machine learning tasks such as speech-to-text or speech-to-speech translation.
This work shows that the simplest filtering technique can trim down these big, noisy datasets to a more manageable, clean dataset.
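The summary does not name the filter; one common "simplest" baseline is a length-ratio check plus deduplication, sketched below as an assumption rather than the paper's exact method.
```python
# Minimal heuristic cleaning for a noisy parallel corpus (assumed baseline;
# the paper's exact filtering technique may differ).
def length_ratio_ok(src: str, tgt: str, max_ratio: float = 2.0) -> bool:
    """Reject pairs whose token-length ratio suggests misalignment."""
    n_src, n_tgt = len(src.split()), len(tgt.split())
    if n_src == 0 or n_tgt == 0:
        return False
    return max(n_src, n_tgt) / min(n_src, n_tgt) <= max_ratio

def clean(corpus):
    """Drop duplicate and badly length-mismatched sentence pairs."""
    seen = set()
    for src, tgt in corpus:
        key = (src.strip().lower(), tgt.strip().lower())
        if key in seen or not length_ratio_ok(src, tgt):
            continue
        seen.add(key)
        yield src, tgt
```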
arXiv Detail & Related papers (2024-02-02T22:42:33Z)
- There's no Data Like Better Data: Using QE Metrics for MT Data Filtering [25.17221095970304]
We analyze the viability of using QE metrics to filter out low-quality sentence pairs from the training data of neural machine translation (NMT) systems.
We show that by selecting the highest quality sentence pairs in the training data, we can improve translation quality while reducing the training size by half.
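A minimal sketch of the selection rule this summary describes: rank pairs by a reference-free QE score and keep the top half. Here `qe_score` is a hypothetical stand-in for a real QE metric (e.g., a COMET-style quality estimator).
```python
# Rank training pairs by a reference-free QE score and keep the best half.
# `qe_score(src, tgt) -> float` is a placeholder for a real QE model.
def filter_by_qe(pairs, qe_score, keep_fraction=0.5):
    scored = sorted(pairs, key=lambda p: qe_score(*p), reverse=True)
    return scored[:int(len(scored) * keep_fraction)]
```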
arXiv Detail & Related papers (2023-11-09T13:21:34Z)
- Bilingual Corpus Mining and Multistage Fine-Tuning for Improving Machine Translation of Lecture Transcripts [50.00305136008848]
We propose a framework for parallel corpus mining, which provides a quick and effective way to mine a parallel corpus from publicly available lectures on Coursera.
For both English–Japanese and English–Chinese lecture translations, we extracted parallel corpora of approximately 50,000 lines and created development and test sets.
This study also suggests guidelines for gathering and cleaning corpora, mining parallel sentences, cleaning noise in the mined data, and creating high-quality evaluation splits.
arXiv Detail & Related papers (2023-11-07T03:50:25Z)
- In-context Examples Selection for Machine Translation [101.50473468507697]
Large-scale generative models show an impressive ability to perform a wide range of Natural Language Processing (NLP) tasks using in-context learning.
For Machine Translation (MT), these examples are typically randomly sampled from the development dataset with a similar distribution as the evaluation set.
We show that both the translation quality and the domain of the in-context examples matter, and that a single noisy, unrelated 1-shot example can have a catastrophic impact on output quality.
arXiv Detail & Related papers (2022-12-05T17:25:15Z)
- OneAligner: Zero-shot Cross-lingual Transfer with One Rich-Resource Language Pair for Low-Resource Sentence Retrieval [91.76575626229824]
We present OneAligner, an alignment model specially designed for sentence retrieval tasks.
When trained with all language pairs of a large-scale parallel multilingual corpus (OPUS-100), this model achieves state-of-the-art results.
We conclude through empirical results and analyses that the performance of the sentence alignment task depends mostly on the monolingual and parallel data size.
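OneAligner itself is not reproduced here; as a sketch of the sentence-retrieval task it is designed for, the snippet below scores candidates by cosine similarity of multilingual sentence embeddings, using LaBSE via sentence-transformers as a stand-in encoder.
```python
# Generic embedding-based bitext retrieval (LaBSE as a stand-in encoder;
# OneAligner's own architecture and training are not reproduced here).
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/LaBSE")

def retrieve(sources, candidates):
    """For each source sentence, return the most similar target candidate."""
    src_emb = encoder.encode(sources, normalize_embeddings=True)
    cand_emb = encoder.encode(candidates, normalize_embeddings=True)
    sims = src_emb @ cand_emb.T          # cosine similarity of unit vectors
    return [candidates[i] for i in sims.argmax(axis=1)]
```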
arXiv Detail & Related papers (2022-05-17T19:52:42Z)
- Analyzing the Use of Character-Level Translation with Sparse and Noisy Datasets [20.50917929755389]
We find that character-level models cut the number of untranslated words by over 40% when applied to sparse and noisy datasets.
We explore the impact of character alignment, phrase table filtering, bitext size and the choice of pivot language on translation quality.
Neither word- nor character-BLEU correlates perfectly with human judgments, due to BLEU's sensitivity to sentence length.
arXiv Detail & Related papers (2021-09-27T07:35:47Z)
- Improving Multilingual Translation by Representation and Gradient Regularization [82.42760103045083]
We propose a joint approach to regularize NMT models at both representation-level and gradient-level.
Our results demonstrate that our approach is highly effective in both reducing off-target translation occurrences and improving zero-shot translation performance.
arXiv Detail & Related papers (2021-09-10T10:52:21Z)
- Cross-language Sentence Selection via Data Augmentation and Rationale Training [22.106577427237635]
The approach uses data augmentation and negative sampling techniques on noisy parallel sentence data to learn a cross-lingual, embedding-based query relevance model.
Results show that this approach performs as well as or better than multiple state-of-the-art machine translation + monolingual retrieval systems trained on the same parallel data.
arXiv Detail & Related papers (2021-06-04T07:08:47Z)
- Self-Training Sampling with Monolingual Data Uncertainty for Neural Machine Translation [98.83925811122795]
We propose to improve the sampling procedure by selecting the most informative monolingual sentences to complement the parallel data.
We compute the uncertainty of monolingual sentences using the bilingual dictionary extracted from the parallel data.
Experimental results on large-scale WMT English→German and English→Chinese datasets demonstrate the effectiveness of the proposed approach.
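The summary leaves the formula unstated; one natural reading, sketched below, scores a monolingual sentence by the average translation entropy of its words under a probabilistic bilingual dictionary extracted from the parallel data. Tokenization and out-of-dictionary handling are assumptions.
```python
# Uncertainty of a monolingual sentence as average per-word translation
# entropy (one plausible reading of the summary, not the paper's exact code).
import math

def word_entropy(translation_probs: dict) -> float:
    """Entropy of one word's translation distribution {tgt_word: prob}."""
    return -sum(p * math.log(p) for p in translation_probs.values() if p > 0)

def sentence_uncertainty(sentence: str, dictionary: dict) -> float:
    """Average per-word translation entropy; words missing from the
    dictionary are skipped."""
    entropies = [word_entropy(dictionary[w]) for w in sentence.split()
                 if w in dictionary]
    return sum(entropies) / len(entropies) if entropies else 0.0
```
High-uncertainty sentences would then be sampled preferentially to complement the parallel data during self-training.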
arXiv Detail & Related papers (2021-06-02T05:01:36Z)
- Meta Back-translation [111.87397401837286]
We propose a novel method to generate pseudo-parallel data from a pre-trained back-translation model.
Our method is a meta-learning algorithm which adapts a pre-trained back-translation model so that the pseudo-parallel data it generates would train a forward-translation model to do well on a validation set.
arXiv Detail & Related papers (2021-02-15T20:58:32Z)
- Parallel Corpus Filtering via Pre-trained Language Models [14.689457985200141]
Web-crawled data provides a good source of parallel corpora for training machine translation models.
Recent work shows that neural machine translation systems are more sensitive to noise than traditional statistical machine translation methods.
We propose a novel approach to filter out noisy sentence pairs from web-crawled corpora via pre-trained language models.
arXiv Detail & Related papers (2020-05-13T06:06:23Z)