APE-then-QE: Correcting then Filtering Pseudo Parallel Corpora for MT
Training Data Creation
- URL: http://arxiv.org/abs/2312.11312v1
- Date: Mon, 18 Dec 2023 16:06:18 GMT
- Title: APE-then-QE: Correcting then Filtering Pseudo Parallel Corpora for MT
Training Data Creation
- Authors: Akshay Batheja, Sourabh Deoghare, Diptesh Kanojia, Pushpak
Bhattacharyya
- Abstract summary: We propose a repair-filter-use methodology that uses an APE system to correct errors on the target side of the Machine Translation training data.
We select the sentence pairs from the original and corrected sentence pairs based on the quality scores computed using a Quality Estimation (QE) model.
We observe an improvement in the Machine Translation system's performance by 5.64 and 9.91 BLEU points, for English-Marathi and Marathi-English, over the baseline model.
- Score: 48.47548479232714
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Automatic Post-Editing (APE) is the task of automatically identifying and
correcting errors in the Machine Translation (MT) outputs. We propose a
repair-filter-use methodology that uses an APE system to correct errors on the
target side of the MT training data. We select the sentence pairs from the
original and corrected sentence pairs based on the quality scores computed
using a Quality Estimation (QE) model. To the best of our knowledge, this is a
novel adaptation of APE and QE to extract quality parallel corpus from the
pseudo-parallel corpus. By training with this filtered corpus, we observe an
improvement in the Machine Translation system's performance by 5.64 and 9.91
BLEU points, for English-Marathi and Marathi-English, over the baseline model.
The baseline model is the one that is trained on the whole pseudo-parallel
corpus. Our work is not limited by the characteristics of the English or Marathi
languages and is language-pair-agnostic, given the necessary QE and APE data.
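The repair-filter-use recipe described above reduces to a small loop: correct each target with APE, score both the original and corrected pair with QE, and keep the better candidate if it clears a quality threshold. A minimal sketch, where `ape_correct`, `qe_score`, and `threshold` are hypothetical stand-ins for the paper's actual APE model, QE model, and selection criterion:

```python
def repair_filter(pairs, ape_correct, qe_score, threshold=0.5):
    """Repair-filter-use sketch: for each (source, target) pair,
    generate an APE-corrected target, score both candidates with QE,
    and keep whichever scores higher, provided it clears the threshold."""
    kept = []
    for src, tgt in pairs:
        corrected = ape_correct(src, tgt)                     # repair step
        candidates = [(tgt, qe_score(src, tgt)),              # original pair
                      (corrected, qe_score(src, corrected))]  # corrected pair
        best_tgt, best_score = max(candidates, key=lambda c: c[1])
        if best_score >= threshold:                           # filter step
            kept.append((src, best_tgt))
    return kept  # "use": train the MT system on this corpus
```

The resulting filtered pairs would then replace the raw pseudo-parallel corpus as MT training data.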
Related papers
- LM-Combiner: A Contextual Rewriting Model for Chinese Grammatical Error Correction [49.0746090186582]
Over-correction is a critical problem in Chinese grammatical error correction (CGEC) task.
Recent work using model ensemble methods can effectively mitigate over-correction and improve the precision of the GEC system.
We propose the LM-Combiner, a rewriting model that can directly modify the over-correction of GEC system outputs without a model ensemble.
arXiv Detail & Related papers (2024-03-26T06:12:21Z) - There's no Data Like Better Data: Using QE Metrics for MT Data Filtering [25.17221095970304]
We analyze the viability of using QE metrics for filtering out bad-quality sentence pairs from the training data of neural machine translation (NMT) systems.
We show that by selecting the highest quality sentence pairs in the training data, we can improve translation quality while reducing the training size by half.
arXiv Detail & Related papers (2023-11-09T13:21:34Z) - Unify word-level and span-level tasks: NJUNLP's Participation for the
WMT2023 Quality Estimation Shared Task [59.46906545506715]
We introduce the NJUNLP team to the WMT 2023 Quality Estimation (QE) shared task.
Our team submitted predictions for the English-German language pair on both sub-tasks.
Our models achieved the best results in English-German for both word-level and fine-grained error span detection sub-tasks.
arXiv Detail & Related papers (2023-09-23T01:52:14Z) - The Devil is in the Errors: Leveraging Large Language Models for
Fine-grained Machine Translation Evaluation [93.01964988474755]
AutoMQM is a prompting technique which asks large language models to identify and categorize errors in translations.
We study the impact of labeled data through in-context learning and finetuning.
We then evaluate AutoMQM with PaLM-2 models, and we find that it improves performance compared to just prompting for scores.
arXiv Detail & Related papers (2023-08-14T17:17:21Z) - "A Little is Enough": Few-Shot Quality Estimation based Corpus Filtering
improves Machine Translation [36.9886023078247]
We propose a Quality Estimation based Filtering approach to extract high-quality parallel data from the pseudo-parallel corpus.
We observe an improvement in the Machine Translation (MT) system's performance by up to 1.8 BLEU points, for English-Marathi, Chinese-English, and Hindi-Bengali language pairs.
Our few-shot QE model, transfer-learned from the English-Marathi QE model and fine-tuned on only 500 Hindi-Bengali training instances, shows an improvement of up to 0.6 BLEU points for the Hindi-Bengali language pair.
arXiv Detail & Related papers (2023-06-06T08:53:01Z) - Bring More Attention to Syntactic Symmetry for Automatic Postediting of
High-Quality Machine Translations [4.217162744375792]
We propose a linguistically motivated method of regularization that is expected to enhance APE models' understanding of the target language.
Our analysis of the experimental results demonstrates that the proposed method helps improve the state-of-the-art architecture's APE quality for high-quality MTs.
arXiv Detail & Related papers (2023-05-17T20:25:19Z) - Original or Translated? On the Use of Parallel Data for Translation
Quality Estimation [81.27850245734015]
We demonstrate a significant gap between parallel data and real QE data.
Parallel data is collected indiscriminately, and translationese may occur on either the source or target side.
We find that using the source-original part of parallel corpus consistently outperforms its target-original counterpart.
arXiv Detail & Related papers (2022-12-20T14:06:45Z) - Rethink about the Word-level Quality Estimation for Machine Translation
from Human Judgement [57.72846454929923]
We create a benchmark dataset, HJQE, in which expert translators directly annotate poorly translated words.
We propose two tag-correcting strategies, namely a tag refinement strategy and a tree-based annotation strategy, to make the TER-based artificial QE corpus closer to HJQE.
The results show our proposed dataset is more consistent with human judgement and also confirm the effectiveness of the proposed tag correcting strategies.
arXiv Detail & Related papers (2022-09-13T02:37:12Z) - Cross-Lingual Named Entity Recognition Using Parallel Corpus: A New
Approach Using XLM-RoBERTa Alignment [5.747195707763152]
We build an entity alignment model on top of XLM-RoBERTa to project the entities detected on the English part of the parallel data to the target language sentences.
Unlike translation-based methods, this approach benefits from the natural fluency and nuances of the original target-language corpus.
We evaluate the proposed approach on benchmark datasets across 4 target languages and achieve F1 scores competitive with recent SOTA models.
arXiv Detail & Related papers (2021-01-26T22:19:52Z) - Parallel Corpus Filtering via Pre-trained Language Models [14.689457985200141]
Web-crawled data provides a good source of parallel corpora for training machine translation models.
Recent work shows that neural machine translation systems are more sensitive to noise than traditional statistical machine translation methods.
We propose a novel approach to filter out noisy sentence pairs from web-crawled corpora via pre-trained language models.
arXiv Detail & Related papers (2020-05-13T06:06:23Z)
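Several of the filtering papers listed above, whether QE-metric based or pre-trained-LM based, share the same core step: rank sentence pairs by some quality score and keep only the top portion, e.g. the best half. A minimal, model-agnostic sketch, where `score` is a hypothetical stand-in for a QE model or an LM-based scorer:

```python
def filter_top_fraction(pairs, score, keep=0.5):
    """Rank (source, target) pairs by a quality score and keep only
    the top `keep` fraction, discarding the noisier remainder."""
    ranked = sorted(pairs, key=lambda p: score(*p), reverse=True)
    return ranked[: int(len(ranked) * keep)]
```

With `keep=0.5` this mirrors the "reduce the training size by half" setting reported in the QE-filtering paper above; the scorer itself is where the approaches differ.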
This list is automatically generated from the titles and abstracts of the papers in this site.