Learning Feature Weights using Reward Modeling for Denoising Parallel
Corpora
- URL: http://arxiv.org/abs/2103.06968v1
- Date: Thu, 11 Mar 2021 21:45:45 GMT
- Title: Learning Feature Weights using Reward Modeling for Denoising Parallel
Corpora
- Authors: Gaurav Kumar, Philipp Koehn, Sanjeev Khudanpur
- Abstract summary: This work presents an alternative approach which learns weights for multiple sentence-level features.
We apply this technique to building Neural Machine Translation (NMT) systems using the Paracrawl corpus for Estonian-English.
We analyze the sensitivity of this method to different types of noise and explore if the learned weights generalize to other language pairs.
- Score: 36.292020779233056
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large web-crawled corpora represent an excellent resource for improving the
performance of Neural Machine Translation (NMT) systems across several language
pairs. However, since these corpora are typically extremely noisy, their use is
fairly limited. Current approaches to dealing with this problem mainly focus on
filtering using heuristics or single features such as language model scores or
bilingual similarity. This work presents an alternative approach which learns
weights for multiple sentence-level features. These feature weights, which are
optimized directly for the task of improving translation performance, are used
to score and filter sentences in the noisy corpora more effectively. We provide
results of applying this technique to building NMT systems using the Paracrawl
corpus for Estonian-English and show that it beats strong single-feature
baselines and hand-designed combinations. Additionally, we analyze the
sensitivity of this method to different types of noise and explore if the
learned weights generalize to other language pairs using the Maltese-English
Paracrawl corpus.
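As a rough illustration of the filtering step described above, the sketch below combines several sentence-level features into a single score with a weight vector and keeps the highest-scoring pairs. The feature values, the linear combination, and the retention fraction are illustrative assumptions, not the paper's exact formulation; the actual weights are learned via reward modeling to directly optimize translation performance.

```python
import numpy as np

def score_sentence_pairs(features: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Combine per-pair features (n_pairs x n_features) into one score per pair.

    A simple linear combination is assumed here for illustration; the learned
    weights would come from reward modeling against translation performance.
    """
    return features @ weights

def filter_corpus(pairs, features, weights, keep_fraction=0.5):
    """Keep the top `keep_fraction` of sentence pairs by combined score."""
    scores = score_sentence_pairs(features, weights)
    n_keep = int(len(pairs) * keep_fraction)
    keep_idx = np.argsort(-scores)[:n_keep]
    return [pairs[i] for i in keep_idx]

# Hypothetical usage: three features per pair, e.g. a language model score,
# a bilingual similarity score, and a length-ratio penalty.
pairs = [("Tere hommikust", "Good morning"), ("asdf qwer", "Good morning")]
features = np.array([[0.9, 0.8, 1.0],
                     [0.1, 0.2, 0.4]])
weights = np.array([0.5, 0.3, 0.2])   # learned weights would replace these
print(filter_corpus(pairs, features, weights, keep_fraction=0.5))
```

In practice, the learned weights would replace the hand-set vector above, and the retained pairs would then be used to train the NMT system.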
Related papers
- Low-Resource Machine Translation through the Lens of Personalized Federated Learning [26.436144338377755]
We present MeritFed, a new approach that can be applied to natural language tasks with heterogeneous data.
We evaluate it on the Low-Resource Machine Translation task, using the dataset from the Large-Scale Multilingual Machine Translation Shared Task.
In addition to its effectiveness, MeritFed is also highly interpretable, as it can be applied to track the impact of each language used for training.
arXiv Detail & Related papers (2024-06-18T12:50:00Z) - Cross-Lingual Cross-Modal Retrieval with Noise-Robust Learning [25.230786853723203]
We propose a noise-robust cross-lingual cross-modal retrieval method for low-resource languages.
We use Machine Translation to construct pseudo-parallel sentence pairs for low-resource languages.
We introduce a multi-view self-distillation method to learn noise-robust target-language representations.
arXiv Detail & Related papers (2022-08-26T09:32:24Z) - How Robust is Neural Machine Translation to Language Imbalance in
Multilingual Tokenizer Training? [86.48323488619629]
We analyze how translation performance changes as the data ratios among languages vary in the tokenizer training corpus.
We find that while relatively better performance is often observed when languages are more equally sampled, downstream performance is more robust to language imbalance than one might expect.
arXiv Detail & Related papers (2022-04-29T17:50:36Z) - Distributionally Robust Multilingual Machine Translation [94.51866646879337]
We propose a new learning objective for multilingual neural machine translation (MNMT) based on distributionally robust optimization.
We show how to practically optimize this objective for large translation corpora using an iterated best response scheme.
Our method consistently outperforms strong baseline methods in terms of average and per-language performance under both many-to-one and one-to-many translation settings.
arXiv Detail & Related papers (2021-09-09T03:48:35Z) - Cross-lingual Machine Reading Comprehension with Language Branch
Knowledge Distillation [105.41167108465085]
Cross-lingual Machine Reading Comprehension (CLMRC) remains a challenging problem due to the lack of large-scale datasets in low-resource languages.
We propose a novel augmentation approach named Language Branch Machine Reading (LBMRC)
LBMRC trains multiple machine reading comprehension (MRC) models, each proficient in an individual language.
We devise a multilingual distillation approach to amalgamate knowledge from multiple language branch models to a single model for all target languages.
arXiv Detail & Related papers (2020-10-27T13:12:17Z) - Building Low-Resource NER Models Using Non-Speaker Annotation [58.78968578460793]
Cross-lingual methods have had notable success in addressing the challenges of building NER models for low-resource languages.
We propose a complementary approach to building low-resource Named Entity Recognition (NER) models using "non-speaker" (NS) annotations.
We show that the use of NS annotators produces results that are consistently on par with or better than cross-lingual methods built on modern contextual representations.
arXiv Detail & Related papers (2020-06-17T03:24:38Z) - Parallel Corpus Filtering via Pre-trained Language Models [14.689457985200141]
Web-crawled data provides a good source of parallel corpora for training machine translation models.
Recent work shows that neural machine translation systems are more sensitive to noise than traditional statistical machine translation methods.
We propose a novel approach to filter out noisy sentence pairs from web-crawled corpora via pre-trained language models.
arXiv Detail & Related papers (2020-05-13T06:06:23Z) - Improving Massively Multilingual Neural Machine Translation and
Zero-Shot Translation [81.7786241489002]
Massively multilingual models for neural machine translation (NMT) are theoretically attractive, but often underperform bilingual models and deliver poor zero-shot translations.
We argue that multilingual NMT requires stronger modeling capacity to support language pairs with varying typological characteristics.
We propose random online backtranslation to enforce the translation of unseen training language pairs.
arXiv Detail & Related papers (2020-04-24T17:21:32Z)