"A Little is Enough": Few-Shot Quality Estimation based Corpus Filtering
improves Machine Translation
- URL: http://arxiv.org/abs/2306.03507v1
- Date: Tue, 6 Jun 2023 08:53:01 GMT
- Title: "A Little is Enough": Few-Shot Quality Estimation based Corpus Filtering
improves Machine Translation
- Authors: Akshay Batheja, Pushpak Bhattacharyya
- Abstract summary: We propose a Quality Estimation based Filtering approach to extract high-quality parallel data from the pseudo-parallel corpus.
We observe an improvement in the Machine Translation (MT) system's performance by up to 1.8 BLEU points, for English-Marathi, Chinese-English, and Hindi-Bengali language pairs.
Our Few-shot QE model, transfer learned from the English-Marathi QE model and fine-tuned on only 500 Hindi-Bengali training instances, shows an improvement of up to 0.6 BLEU points for the Hindi-Bengali language pair.
- Score: 36.9886023078247
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Quality Estimation (QE) is the task of evaluating the quality of a
translation when a reference translation is not available. The goal of QE aligns
with the task of corpus filtering, where we assign a quality score to each
sentence pair present in the pseudo-parallel corpus. We propose a Quality
Estimation based Filtering approach to extract high-quality parallel data from
the pseudo-parallel corpus. To the best of our knowledge, this is a novel
adaptation of the QE framework to extract quality parallel corpus from the
pseudo-parallel corpus. By training with this filtered corpus, we observe an
improvement in the Machine Translation (MT) system's performance by up to 1.8
BLEU points, for English-Marathi, Chinese-English, and Hindi-Bengali language
pairs, over the baseline model. The baseline model is the one that is trained
on the whole pseudo-parallel corpus. Our Few-shot QE model, transfer learned
from the English-Marathi QE model and fine-tuned on only 500 Hindi-Bengali
training instances, shows an improvement of up to 0.6 BLEU points for the
Hindi-Bengali language pair over the baseline model. This demonstrates
the promise of transfer learning in the setting under discussion. QE systems
typically require on the order of 7K-25K training instances. Our Hindi-Bengali
QE model is trained on only 500 instances, i.e., 1/40th of the usual
requirement, and achieves comparable performance. All the scripts and datasets
utilized in this study will be publicly available.
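As a rough illustration of the pipeline the abstract describes, here is a minimal Python sketch of QE-based corpus filtering: score every sentence pair in the pseudo-parallel corpus with a sentence-level QE model and keep only the pairs above a quality threshold. The `length_ratio_score` scorer below is a hypothetical stand-in; in the paper a trained QE model assigns the scores.

```python
# Minimal sketch of QE-based corpus filtering: score each sentence pair and
# keep only pairs above a threshold. The scorer here is a trivial length-ratio
# heuristic standing in for a trained sentence-level QE model.

from typing import Callable, Iterable

def length_ratio_score(src: str, tgt: str) -> float:
    """Hypothetical stand-in for a trained QE scorer; returns a value in [0, 1]."""
    ls, lt = len(src.split()), len(tgt.split())
    if ls == 0 or lt == 0:
        return 0.0
    return min(ls, lt) / max(ls, lt)

def filter_corpus(
    pairs: Iterable[tuple[str, str]],
    qe_score: Callable[[str, str], float],
    threshold: float = 0.5,
) -> list[tuple[str, str]]:
    """Keep only sentence pairs whose QE score clears the threshold."""
    return [(s, t) for s, t in pairs if qe_score(s, t) >= threshold]

if __name__ == "__main__":
    pseudo_parallel = [
        ("A little is enough.", "Thoda hi kaafi hai."),  # plausible pair
        ("Quality estimation helps.", "X"),              # noisy pair
    ]
    kept = filter_corpus(pseudo_parallel, length_ratio_score, threshold=0.5)
    print(kept)  # the noisy pair is dropped; the rest would train the MT system
```

The QE model and the threshold are the tunable parts; the filtered pairs then replace the full pseudo-parallel corpus as MT training data, which is where the reported BLEU gains over the whole-corpus baseline come from.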
Related papers
- Don't Rank, Combine! Combining Machine Translation Hypotheses Using Quality Estimation [0.6998085564793366]
This work introduces QE-fusion, a method that synthesizes translations using a quality estimation (QE) metric.
We demonstrate that our approach generates novel translations in over half of the cases.
We empirically establish that QE-fusion scales linearly with the number of candidates in the pool.
arXiv Detail & Related papers (2024-01-12T16:52:41Z)
- APE-then-QE: Correcting then Filtering Pseudo Parallel Corpora for MT Training Data Creation [48.47548479232714]
We propose a repair-filter-use methodology that uses an APE system to correct errors on the target side of the Machine Translation training data.
We select the sentence pairs from the original and corrected sentence pairs based on the quality scores computed using a Quality Estimation (QE) model.
We observe an improvement in the Machine Translation system's performance by 5.64 and 9.91 BLEU points, for English-Marathi and Marathi-English, over the baseline model.
arXiv Detail & Related papers (2023-12-18T16:06:18Z)
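The repair-filter-use recipe above lends itself to a compact sketch: for each pair, keep whichever of the original or APE-corrected target the QE model scores higher, then filter low-scoring pairs as before. Both helpers below are hypothetical stubs, not the paper's actual models.

```python
# Hypothetical sketch of the repair-filter-use recipe summarized above:
# APE proposes a corrected target, QE picks the better of original vs.
# corrected, and low-scoring pairs are filtered out.

def ape_correct(src: str, tgt: str) -> str:
    """Stand-in for an Automatic Post-Editing system."""
    return tgt  # a real APE model would return an edited translation

def qe_score(src: str, tgt: str) -> float:
    """Stand-in for a sentence-level QE model returning a score in [0, 1]."""
    return 1.0

def repair_filter(pairs, threshold=0.5):
    kept = []
    for src, tgt in pairs:
        corrected = ape_correct(src, tgt)
        # keep whichever target version the QE model prefers
        best = max((tgt, corrected), key=lambda t: qe_score(src, t))
        if qe_score(src, best) >= threshold:
            kept.append((src, best))
    return kept
```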
- Bilingual Corpus Mining and Multistage Fine-Tuning for Improving Machine Translation of Lecture Transcripts [50.00305136008848]
We propose a framework for parallel corpus mining, which provides a quick and effective way to mine a parallel corpus from publicly available lectures on Coursera.
For both English-Japanese and English-Chinese lecture translations, we extracted parallel corpora of approximately 50,000 lines and created development and test sets.
This study also suggests guidelines for gathering and cleaning corpora, mining parallel sentences, cleaning noise in the mined data, and creating high-quality evaluation splits.
arXiv Detail & Related papers (2023-11-07T03:50:25Z)
- PAXQA: Generating Cross-lingual Question Answering Examples at Training Scale [53.92008514395125]
PAXQA (Projecting annotations for cross-lingual (x) QA) decomposes cross-lingual QA into two stages.
We propose a novel use of lexically-constrained machine translation, in which constrained entities are extracted from the parallel bitexts.
We show that models fine-tuned on these datasets outperform prior synthetic data generation models over several extractive QA datasets.
arXiv Detail & Related papers (2023-04-24T15:46:26Z)
- NAIST-SIC-Aligned: an Aligned English-Japanese Simultaneous Interpretation Corpus [23.49376007047965]
It remains an open question how simultaneous interpretation (SI) data affects simultaneous machine translation (SiMT).
We introduce NAIST-SIC-Aligned, which is an automatically-aligned parallel English-Japanese SI dataset.
Our results show that models trained with SI data lead to significant improvement in translation quality and latency over baselines.
arXiv Detail & Related papers (2023-04-23T23:03:58Z)
- Original or Translated? On the Use of Parallel Data for Translation Quality Estimation [81.27850245734015]
We demonstrate a significant gap between parallel data and real QE data.
Parallel data are collected indiscriminately, and translationese may occur on either the source or the target side.
We find that using the source-original part of parallel corpus consistently outperforms its target-original counterpart.
arXiv Detail & Related papers (2022-12-20T14:06:45Z)
- OneAligner: Zero-shot Cross-lingual Transfer with One Rich-Resource Language Pair for Low-Resource Sentence Retrieval [91.76575626229824]
We present OneAligner, an alignment model specially designed for sentence retrieval tasks.
When trained with all language pairs of a large-scale parallel multilingual corpus (OPUS-100), this model achieves the state-of-the-art result.
We conclude through empirical results and analyses that the performance of the sentence alignment task depends mostly on the monolingual and parallel data size.
arXiv Detail & Related papers (2022-05-17T19:52:42Z)
- Ensemble-based Transfer Learning for Low-resource Machine Translation Quality Estimation [1.7188280334580195]
We focus on the Sentence-Level QE Shared Task of the Fifth Conference on Machine Translation (WMT20).
We propose an ensemble-based predictor-estimator QE model with transfer learning to overcome this QE data scarcity challenge.
The best performance is achieved by an ensemble that combines models pretrained on individual languages and on different amounts of parallel corpora, reaching a Pearson's correlation of 0.298.
arXiv Detail & Related papers (2021-05-17T06:02:17Z)
- Parallel Corpus Filtering via Pre-trained Language Models [14.689457985200141]
Web-crawled data provides a good source of parallel corpora for training machine translation models.
Recent work shows that neural machine translation systems are more sensitive to noise than traditional statistical machine translation methods.
We propose a novel approach to filter out noisy sentence pairs from web-crawled corpora via pre-trained language models.
arXiv Detail & Related papers (2020-05-13T06:06:23Z)
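To make the last entry concrete, here is a minimal sketch of one common variant of LM-based parallel corpus filtering: scoring target-side fluency with GPT-2 perplexity and dropping pairs above a perplexity cap. The model choice, the cap, and the fluency-only criterion are illustrative assumptions, not necessarily the cited paper's exact procedure.

```python
# Minimal sketch of LM-based corpus filtering: score target-side fluency with
# GPT-2 perplexity and drop pairs whose target reads as disfluent. The
# perplexity cap is an arbitrary illustrative choice.

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def perplexity(sentence: str) -> float:
    """Perplexity of a sentence under GPT-2; lower means more fluent."""
    enc = tokenizer(sentence, return_tensors="pt")
    loss = model(**enc, labels=enc["input_ids"]).loss
    return float(torch.exp(loss))

def filter_by_fluency(pairs, max_ppl=500.0):
    """Keep pairs whose English target side scores below the perplexity cap."""
    return [(s, t) for s, t in pairs if perplexity(t) <= max_ppl]

if __name__ == "__main__":
    noisy = [
        ("你好，世界。", "Hello, world."),
        ("你好，世界。", "wrld hllo the the"),
    ]
    print(filter_by_fluency(noisy))  # the garbled target is likely dropped
```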