Life-long Learning for Multilingual Neural Machine Translation with
Knowledge Distillation
- URL: http://arxiv.org/abs/2212.02800v1
- Date: Tue, 6 Dec 2022 07:36:16 GMT
- Title: Life-long Learning for Multilingual Neural Machine Translation with
Knowledge Distillation
- Authors: Yang Zhao, Junnan Zhu, Lu Xiang, Jiajun Zhang, Yu Zhou, Feifei Zhai,
and Chengqing Zong
- Abstract summary: A common scenario of Multilingual Neural Machine Translation (MNMT) is that each translation task arrives in a sequential manner, and the training data of previous tasks is unavailable.
We propose a multilingual distillation method to make the new model (student) jointly learn the multilingual output from the old model (teacher) and the new task.
The experimental results on twelve translation tasks show that the proposed methods can better consolidate the previous knowledge and sharply alleviate the CF.
- Score: 48.96946395851039
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A common scenario of Multilingual Neural Machine Translation (MNMT) is that
each translation task arrives in a sequential manner, and the training data of
previous tasks is unavailable. In this scenario, the current methods suffer
heavily from catastrophic forgetting (CF). To alleviate the CF, we investigate
knowledge distillation based life-long learning methods. Specifically, in
the one-to-many scenario, we propose a multilingual distillation method to make the
new model (student) jointly learn the multilingual output from the old model (teacher)
and the new task. In the many-to-one scenario, we find that direct distillation faces
the extreme partial distillation problem, and we propose two different methods
to address it: pseudo input distillation and reverse teacher distillation. The
experimental results on twelve translation tasks show that the proposed methods
can better consolidate the previous knowledge and sharply alleviate the CF.
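As a rough, illustrative sketch of the kind of objective such a setup implies (not the authors' exact formulation), the distillation term below keeps the new model (student) close to the frozen old model (teacher) while a cross-entropy term fits the new task; the interpolation weight `alpha`, the temperature, and the tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def lifelong_distillation_loss(student_logits, teacher_logits, new_task_targets,
                               alpha=0.5, temperature=2.0, pad_id=0):
    """Sketch of a knowledge-distillation objective for life-long MNMT.

    student_logits : (batch, seq_len, vocab) outputs of the new model
    teacher_logits : (batch, seq_len, vocab) outputs of the frozen old model
    new_task_targets: (batch, seq_len) gold token ids for the new task
    """
    # Cross-entropy on the new translation task (hard labels).
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        new_task_targets.view(-1),
        ignore_index=pad_id,
    )

    # KL divergence to the old (teacher) model, consolidating previous tasks.
    t = temperature
    kd = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)

    # Jointly learn the new task and preserve the old multilingual knowledge.
    return (1.0 - alpha) * ce + alpha * kd
```

How the teacher term is obtained in the many-to-one case is exactly where the pseudo input and reverse teacher distillation variants mentioned in the abstract come in; the sketch above only covers the generic student/teacher mixing.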
Related papers
- Don't Throw Away Data: Better Sequence Knowledge Distillation [60.60698363739434]
In this paper we seek to integrate minimum Bayes risk (MBR) decoding more tightly into knowledge distillation training.
Our experiments on English to German and English to Japanese translation show consistent improvements over strong baseline methods.
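As background for what MBR selection does in this context, here is a minimal, self-contained sketch; the token-overlap utility and the sampled hypotheses are placeholders, and the paper integrates MBR into training more tightly than this single-winner selection.

```python
from collections import Counter

def overlap_f1(hyp_tokens, ref_tokens):
    """Toy utility: token-level F1 overlap (a stand-in for BLEU/COMET)."""
    common = sum((Counter(hyp_tokens) & Counter(ref_tokens)).values())
    if common == 0:
        return 0.0
    precision = common / len(hyp_tokens)
    recall = common / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def mbr_select(candidates):
    """Pick the candidate with the highest expected utility against the rest."""
    best, best_score = None, float("-inf")
    for hyp in candidates:
        score = sum(overlap_f1(hyp.split(), other.split())
                    for other in candidates if other is not hyp)
        if score > best_score:
            best, best_score = hyp, score
    return best

# Teacher samples for one source sentence; the MBR winner becomes a
# sequence-level distillation target for the student.
teacher_samples = ["the cat sits on the mat", "a cat sits on the mat",
                   "the cat sat on a mat"]
kd_target = mbr_select(teacher_samples)
```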
arXiv Detail & Related papers (2024-07-15T06:11:18Z)
- Multi-Granularity Semantic Revision for Large Language Model Distillation [66.03746866578274]
We propose a multi-granularity semantic revision method for LLM distillation.
At the sequence level, we propose a sequence correction and re-generation strategy.
At the token level, we design a distribution adaptive clipping Kullback-Leibler loss as the distillation objective function.
At the span level, we leverage the span priors of a sequence to compute the probability correlations within spans, and constrain the teacher and student's probability correlations to be consistent.
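As one plausible reading of the clipping idea named above (not the paper's exact definition), the sketch below clips the teacher distribution to fixed bounds before computing the KL term; making those bounds adaptive per distribution is the part left out of this simplification.

```python
import torch
import torch.nn.functional as F

def clipped_kl_distillation(student_logits, teacher_logits,
                            clip_min=1e-3, clip_max=0.99):
    """Illustrative clipped-KL term: the teacher's probabilities are clipped
    to a range and renormalised before the KL is computed, so extreme mass
    at either end does not dominate the objective. Bounds are assumed values."""
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    clipped = teacher_probs.clamp(min=clip_min, max=clip_max)
    clipped = clipped / clipped.sum(dim=-1, keepdim=True)

    log_p_student = F.log_softmax(student_logits, dim=-1)
    return F.kl_div(log_p_student, clipped, reduction="batchmean")
```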
arXiv Detail & Related papers (2024-07-14T03:51:49Z)
- Sentence-Level or Token-Level? A Comprehensive Study on Knowledge Distillation [25.58020699235669]
Knowledge distillation, transferring knowledge from a teacher model to a student model, has emerged as a powerful technique in neural machine translation.
In this study, we argue that token-level distillation, with its more complex objective (i.e., distribution), is better suited for "simple" scenarios.
We introduce a novel hybrid method that combines token-level and sentence-level distillation through a gating mechanism.
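A gated mixture of the two distillation signals could look roughly like this; the gate network, its input, and the use of the teacher's decoded sequence as the sentence-level target are illustrative assumptions rather than the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedDistillationLoss(nn.Module):
    """Illustrative gate that mixes token-level and sentence-level KD losses."""

    def __init__(self, hidden_dim):
        super().__init__()
        # Small gate network over a pooled decoder representation (assumed input).
        self.gate = nn.Linear(hidden_dim, 1)

    def forward(self, decoder_state, student_logits, teacher_probs, teacher_sequence):
        # Token-level distillation: match the teacher's per-token distribution.
        token_kd = F.kl_div(
            F.log_softmax(student_logits, dim=-1),
            teacher_probs,
            reduction="batchmean",
        )

        # Sentence-level distillation: cross-entropy against the teacher's
        # decoded output sequence (sequence-level KD in the style of Kim & Rush).
        sent_kd = F.cross_entropy(
            student_logits.view(-1, student_logits.size(-1)),
            teacher_sequence.view(-1),
        )

        # Gate in [0, 1] decides how much weight each signal receives.
        g = torch.sigmoid(self.gate(decoder_state.mean(dim=1))).mean()
        return g * token_kd + (1.0 - g) * sent_kd
```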
arXiv Detail & Related papers (2024-04-23T08:29:56Z)
- Extending Multilingual Machine Translation through Imitation Learning [60.15671816513614]
Imit-MNMT treats the task as an imitation learning process, which mimics the behavior of an expert.
We show that our approach significantly improves the translation performance between the new and the original languages.
We also demonstrate that our approach is capable of solving copy and off-target problems.
arXiv Detail & Related papers (2023-11-14T21:04:03Z)
- Distilling Efficient Language-Specific Models for Cross-Lingual Transfer [75.32131584449786]
Massively multilingual Transformers (MMTs) are widely used for cross-lingual transfer learning.
MMTs' language coverage makes them unnecessarily expensive to deploy in terms of model size, inference time, energy, and hardware cost.
We propose to extract compressed, language-specific models from MMTs which retain the capacity of the original MMTs for cross-lingual transfer.
arXiv Detail & Related papers (2023-06-02T17:31:52Z)
- Distilling a Pretrained Language Model to a Multilingual ASR Model [3.4012007729454816]
We distill the rich knowledge embedded inside a well-trained teacher text model to the student speech model.
We show the superiority of our method on 20 low-resource languages of the CommonVoice dataset with less than 100 hours of speech data.
arXiv Detail & Related papers (2022-06-25T12:36:11Z)
- Towards Lifelong Learning of Multilingual Text-To-Speech Synthesis [87.75833205560406]
This work presents a lifelong learning approach to train a multilingual Text-To-Speech (TTS) system.
It does not require pooled data from all languages altogether, and thus alleviates the storage and computation burden.
arXiv Detail & Related papers (2021-10-09T07:00:38Z)
- Towards Developing a Multilingual and Code-Mixed Visual Question Answering System by Knowledge Distillation [20.33235443471006]
We propose a knowledge distillation approach to extend an English language-vision model (teacher) into an equally effective multilingual and code-mixed model (student).
We also create a large-scale multilingual and code-mixed VQA dataset in eleven different language setups.
Experimental results and in-depth analysis show the effectiveness of the proposed VQA model over the pre-trained language-vision models on eleven diverse language setups.
arXiv Detail & Related papers (2021-09-10T03:47:29Z)
- Modelling Latent Translations for Cross-Lingual Transfer [47.61502999819699]
We propose a new technique that integrates both steps of the traditional pipeline (translation and classification) into a single model.
We evaluate our novel latent translation-based model on a series of multilingual NLU tasks.
We report gains for both zero-shot and few-shot learning setups, up to 2.7 accuracy points on average.
arXiv Detail & Related papers (2021-07-23T17:11:27Z)
- Selective Knowledge Distillation for Neural Machine Translation [24.493705133103443]
Knowledge distillation is widely applied to enhance the model's performance by transferring the teacher model's knowledge on each training sample.
Previous work rarely discusses the different impacts and connections among these samples, which serve as the medium for transferring teacher knowledge.
We propose two simple yet effective strategies, i.e., batch-level and global-level selections, to pick suitable samples for distillation.
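A batch-level selection rule can be pictured as distilling only on the top-scoring tokens within a batch; the student cross-entropy criterion and the keep ratio below are assumptions for illustration, not the paper's exact criterion.

```python
import torch
import torch.nn.functional as F

def batch_level_selective_kd(student_logits, teacher_probs, targets,
                             keep_ratio=0.5, pad_id=0):
    """Sketch of batch-level selection: distill only on a subset of tokens.

    The selection score here is the student's per-token cross-entropy on the
    gold data (an assumed criterion); the top `keep_ratio` fraction of tokens
    in the batch receives the distillation signal.
    """
    vocab = student_logits.size(-1)
    flat_logits = student_logits.view(-1, vocab)
    flat_targets = targets.view(-1)
    flat_teacher = teacher_probs.view(-1, vocab)

    # Per-token score used only for selection (no gradient needed here).
    with torch.no_grad():
        scores = F.cross_entropy(flat_logits, flat_targets,
                                 ignore_index=pad_id, reduction="none")
        mask = flat_targets.ne(pad_id)
        k = max(1, int(keep_ratio * mask.sum().item()))
        # Indices of the k highest-scoring non-pad tokens in the batch.
        selected = scores.masked_fill(~mask, float("-inf")).topk(k).indices

    # Token-level KD applied only to the selected positions.
    log_p = F.log_softmax(flat_logits[selected], dim=-1)
    return F.kl_div(log_p, flat_teacher[selected], reduction="batchmean")
```

A global-level variant would keep a running pool of scores across batches and select against that pool instead of the current batch alone.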
arXiv Detail & Related papers (2021-05-27T06:54:12Z)