Collective Wisdom: Improving Low-resource Neural Machine Translation
using Adaptive Knowledge Distillation
- URL: http://arxiv.org/abs/2010.05445v1
- Date: Mon, 12 Oct 2020 04:26:46 GMT
- Title: Collective Wisdom: Improving Low-resource Neural Machine Translation
using Adaptive Knowledge Distillation
- Authors: Fahimeh Saleh, Wray Buntine, Gholamreza Haffari
- Abstract summary: Scarcity of parallel sentence-pairs poses a significant hurdle for training high-quality Neural Machine Translation (NMT) models in bilingually low-resource scenarios.
We propose an adaptive knowledge distillation approach to dynamically adjust the contribution of the teacher models during the distillation process.
Experiments on transferring from a collection of six language pairs from IWSLT to five low-resource language-pairs from TED Talks demonstrate the effectiveness of our approach.
- Score: 42.38435539241788
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scarcity of parallel sentence-pairs poses a significant hurdle for training
high-quality Neural Machine Translation (NMT) models in bilingually
low-resource scenarios. A standard approach is transfer learning, which
involves taking a model trained on a high-resource language-pair and
fine-tuning it on the data of the low-resource MT condition of interest.
However, it is generally not clear which high-resource language-pair offers the
best transfer learning for the target MT setting. Furthermore, different
transferred models may have complementary semantic and/or syntactic strengths,
hence using only one model may be sub-optimal. In this paper, we tackle this
problem using knowledge distillation, where we propose to distill the knowledge
of an ensemble of teacher models into a single student model. As the quality of
these teacher models varies, we propose an effective adaptive knowledge
distillation approach to dynamically adjust the contribution of the teacher
models during the distillation process. Experiments on transferring from a
collection of six language pairs from IWSLT to five low-resource language-pairs
from TED Talks demonstrate the effectiveness of our approach, achieving up to
+0.9 BLEU score improvement compared to strong baselines.
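To make the distillation objective concrete, below is a minimal PyTorch sketch of adaptive ensemble knowledge distillation for sequence-to-sequence NMT. It is an illustration rather than the authors' implementation: the function name, the choice to weight each teacher by its negative cross-entropy on the gold target, the temperature, and the mixing coefficient between the NLL and distillation terms are all assumptions made for this example.
```python
# Minimal sketch of adaptive ensemble knowledge distillation for NMT
# (illustrative only; not the authors' released implementation).
# Assumes student/teacher models that return token-level logits of
# shape (batch, seq_len, vocab_size) for the same target sequence.
import torch
import torch.nn.functional as F


def adaptive_distillation_loss(student_logits, teacher_logits_list,
                               target_ids, pad_id=0, temperature=1.0,
                               kd_weight=0.5):
    """Mix the usual NLL loss with a teacher-weighted distillation term.

    Each teacher's contribution is weighted by how well it predicts the
    gold target on this batch (a simple quality proxy; the paper's exact
    adaptive weighting scheme may differ).
    """
    mask = (target_ids != pad_id).float()                              # (B, T)
    denom = mask.sum().clamp_min(1.0)
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)

    # Score each teacher by its negative cross-entropy on the gold tokens:
    # teachers that fit this batch better receive larger softmax weights.
    scores = []
    for t_logits in teacher_logits_list:
        ce = F.cross_entropy(t_logits.transpose(1, 2), target_ids,
                             reduction="none")                         # (B, T)
        scores.append(-(ce * mask).sum() / denom)
    weights = torch.softmax(torch.stack(scores), dim=0)                # (K,)

    # Distillation term: weighted KL(teacher || student), averaged over
    # non-padding target positions.
    kd = student_logits.new_zeros(())
    for w, t_logits in zip(weights, teacher_logits_list):
        p_teacher = F.softmax(t_logits / temperature, dim=-1)
        kl = (p_teacher * (p_teacher.clamp_min(1e-9).log()
                           - log_p_student)).sum(-1)                   # (B, T)
        kd = kd + w * (kl * mask).sum() / denom

    # Standard cross-entropy on the gold target.
    nll = F.cross_entropy(student_logits.transpose(1, 2), target_ids,
                          reduction="none")
    nll = (nll * mask).sum() / denom

    return (1.0 - kd_weight) * nll + kd_weight * kd


# Example usage (shapes only):
#   student_logits      = student(src, tgt_in)                  # (B, T, V)
#   teacher_logits_list = [t(src, tgt_in) for t in teachers]    # K tensors
#   loss = adaptive_distillation_loss(student_logits,
#                                     teacher_logits_list, tgt_out)
```
The kd_weight and temperature values here are placeholders; in practice they would be tuned on a development set.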
Related papers
- MoE-CT: A Novel Approach For Large Language Models Training With Resistance To Catastrophic Forgetting [53.77590764277568]
We introduce a novel MoE-CT architecture that separates the base model's learning from the multilingual expansion process.
Our design freezes the original LLM parameters, thus safeguarding its performance in high-resource languages, while an appended MoE module, trained on diverse language datasets, augments low-resource language proficiency.
arXiv Detail & Related papers (2024-06-25T11:03:45Z)
- MT-PATCHER: Selective and Extendable Knowledge Distillation from Large Language Models for Machine Translation [61.65537912700187]
Large Language Models (LLMs) have demonstrated strong abilities in the field of machine translation (MT).
We propose a framework called MT-Patcher, which transfers knowledge from LLMs to existing MT models in a selective, comprehensive and proactive manner.
arXiv Detail & Related papers (2024-03-14T16:07:39Z)
- Large Language Models with Controllable Working Memory [64.71038763708161]
Large language models (LLMs) have led to a series of breakthroughs in natural language processing (NLP).
What further sets these models apart is the massive amount of world knowledge they internalize during pretraining.
How the model's world knowledge interacts with the factual information presented in the context remains underexplored.
arXiv Detail & Related papers (2022-11-09T18:58:29Z)
- Too Brittle To Touch: Comparing the Stability of Quantization and Distillation Towards Developing Lightweight Low-Resource MT Models [12.670354498961492]
State-of-the-art machine translation models are often able to adapt to the paucity of data for low-resource languages.
Knowledge Distillation is one popular technique to develop competitive, lightweight models.
arXiv Detail & Related papers (2022-10-27T05:30:13Z)
- High-resource Language-specific Training for Multilingual Neural Machine Translation [109.31892935605192]
We propose the multilingual translation model with the high-resource language-specific training (HLT-MT) to alleviate the negative interference.
Specifically, we first train the multilingual model only with the high-resource pairs and select the language-specific modules at the top of the decoder.
HLT-MT is further trained on all available corpora to transfer knowledge from high-resource languages to low-resource languages.
arXiv Detail & Related papers (2022-07-11T14:33:13Z)
- UM4: Unified Multilingual Multiple Teacher-Student Model for Zero-Resource Neural Machine Translation [102.04003089261761]
Multilingual neural machine translation (MNMT) enables one-pass translation using shared semantic space for all languages.
We propose a novel method named the Unified Multilingual Multiple Teacher-Student Model for NMT (UM4).
Our method unifies source-teacher, target-teacher, and pivot-teacher models to guide the student model for the zero-resource translation.
arXiv Detail & Related papers (2022-07-11T14:22:59Z)
- Multilingual Neural Machine Translation: Can Linguistic Hierarchies Help? [29.01386302441015]
Multilingual Neural Machine Translation (MNMT) trains a single NMT model that supports translation between multiple languages.
The performance of an MNMT model is highly dependent on the type of languages used in training, as transferring knowledge from a diverse set of languages degrades the translation performance due to negative transfer.
We propose a Hierarchical Knowledge Distillation (HKD) approach for MNMT which capitalises on language groups generated according to typological features and phylogeny of languages to overcome the issue of negative transfer.
arXiv Detail & Related papers (2021-10-15T02:31:48Z)
- MergeDistill: Merging Pre-trained Language Models using Distillation [5.396915402673246]
We propose MergeDistill, a framework to merge pre-trained LMs in a way that can best leverage their assets with minimal dependencies.
We demonstrate the applicability of our framework in a practical setting by leveraging pre-existing teacher LMs and training student LMs that perform competitively with or even outperform teacher LMs trained on several orders of magnitude more data and with a fixed model capacity.
arXiv Detail & Related papers (2021-06-05T08:22:05Z)
- Fine-tuning BERT for Low-Resource Natural Language Understanding via Active Learning [30.5853328612593]
In this work, we explore fine-tuning methods for BERT, a pre-trained Transformer-based language model.
Our experimental results show an advantage in model performance by maximizing the approximate knowledge gain of the model.
We analyze the benefits of freezing layers of the language model during fine-tuning to reduce the number of trainable parameters.
arXiv Detail & Related papers (2020-12-04T08:34:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.