Selective Knowledge Distillation for Neural Machine Translation
- URL: http://arxiv.org/abs/2105.12967v1
- Date: Thu, 27 May 2021 06:54:12 GMT
- Title: Selective Knowledge Distillation for Neural Machine Translation
- Authors: Fusheng Wang, Jianhao Yan, Fandong Meng, Jie Zhou
- Abstract summary: Knowledge distillation is widely applied to enhance the model's performance by transferring the teacher model's knowledge on each training sample.
Previous work rarely discusses the different impacts and connections among these samples, which serve as the medium for transferring teacher knowledge.
We propose two simple yet effective strategies, i.e., batch-level and global-level selections, to pick suitable samples for distillation.
- Score: 24.493705133103443
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural Machine Translation (NMT) models achieve state-of-the-art performance
on many translation benchmarks. As an active research field in NMT, knowledge
distillation is widely applied to enhance the model's performance by
transferring the teacher model's knowledge on each training sample. However,
previous work rarely discusses the different impacts and connections among
these samples, which serve as the medium for transferring teacher knowledge. In
this paper, we design a novel protocol that can effectively analyze the
different impacts of samples by comparing various sample partitions. Based on
the above protocol, we conduct extensive experiments and find that more teacher
knowledge is not always better: knowledge over specific samples may even
hurt the overall performance of knowledge distillation. Finally, to address these
issues, we propose two simple yet effective strategies, i.e., batch-level and
global-level selections, to pick suitable samples for distillation. We evaluate
our approaches on two large-scale machine translation tasks, WMT'14
English->German and WMT'19 Chinese->English. Experimental results show that our
approaches yield up to +1.28 and +0.89 BLEU points improvements over the
Transformer baseline, respectively.
Related papers
- Don't Throw Away Data: Better Sequence Knowledge Distillation [60.60698363739434]
In this paper we seek to integrate minimum Bayes risk (MBR) decoding more tightly into knowledge distillation training.
Our experiments on English to German and English to Japanese translation show consistent improvements over strong baseline methods.
arXiv Detail & Related papers (2024-07-15T06:11:18Z)
- TasTe: Teaching Large Language Models to Translate through Self-Reflection [82.83958470745381]
Large language models (LLMs) have exhibited remarkable performance in various natural language processing tasks.
We propose the TasTe framework, which stands for translating through self-reflection.
The evaluation results in four language directions on the WMT22 benchmark reveal the effectiveness of our approach compared to existing methods.
arXiv Detail & Related papers (2024-06-12T17:21:21Z)
- MT-PATCHER: Selective and Extendable Knowledge Distillation from Large Language Models for Machine Translation [61.65537912700187]
Large Language Models (LLMs) have demonstrated strong ability in the field of machine translation (MT).
We propose a framework called MT-Patcher, which transfers knowledge from LLMs to existing MT models in a selective, comprehensive and proactive manner.
arXiv Detail & Related papers (2024-03-14T16:07:39Z)
- A Comparative Analysis of Task-Agnostic Distillation Methods for Compressing Transformer Language Models [5.818750175599656]
We reproduce, compare and analyze several methods for task-agnostic (general-purpose) distillation of Transformer language models.
Our study covers Output Distribution (OD) transfer, Hidden State (HS) transfer with various layer mapping strategies, and Multi-Head Attention (MHA) transfer based on MiniLMv2.
arXiv Detail & Related papers (2023-10-13T01:00:15Z)
- Accurate Knowledge Distillation with n-best Reranking [2.9526110883017433]
We propose utilizing n-best reranking to enhance Sequence-Level Knowledge Distillation (Kim and Rush, 2016).
We leverage a diverse set of models with different inductive biases, objective functions, or architectures, including some publicly available large language models, to pick the highest-quality hypotheses as labels.
Our results demonstrate that utilizing pseudo-labels generated by our n-best reranker leads to a significantly more accurate student model (a minimal sketch of this reranking idea appears after this list).
arXiv Detail & Related papers (2023-05-20T01:53:03Z)
- Life-long Learning for Multilingual Neural Machine Translation with Knowledge Distillation [48.96946395851039]
A common scenario of Multilingual Neural Machine Translation (MNMT) is that each translation task arrives in a sequential manner, and the training data of previous tasks is unavailable.
We propose a multilingual distillation method to make the new model jointly learn multilingual output from the old model (the teacher) and the new task.
The experimental results on twelve translation tasks show that the proposed methods can better consolidate the previous knowledge and sharply alleviate catastrophic forgetting (CF).
arXiv Detail & Related papers (2022-12-06T07:36:16Z)
- Exploiting Curriculum Learning in Unsupervised Neural Machine Translation [28.75229367700697]
We propose a curriculum learning method to gradually utilize pseudo bi-texts based on their quality at multiple granularities.
Experimental results on WMT 14 En-Fr, WMT 16 En-De, WMT 16 En-Ro, and LDC En-Zh translation tasks demonstrate that the proposed method achieves consistent improvements with faster convergence speed.
arXiv Detail & Related papers (2021-09-23T07:18:06Z)
- Modelling Latent Translations for Cross-Lingual Transfer [47.61502999819699]
We propose a new technique that integrates both steps of the traditional pipeline (translation and classification) into a single model.
We evaluate our novel latent translation-based model on a series of multilingual NLU tasks.
We report gains for both zero-shot and few-shot learning setups, up to 2.7 accuracy points on average.
arXiv Detail & Related papers (2021-07-23T17:11:27Z)
- Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping [62.78338049381917]
Fine-tuning pretrained contextual word embedding models to supervised downstream tasks has become commonplace in natural language processing.
We experiment with four datasets from the GLUE benchmark, fine-tuning BERT hundreds of times on each while varying only the random seeds.
We find substantial performance increases compared to previously reported results, and we quantify how the performance of the best-found model varies as a function of the number of fine-tuning trials.
arXiv Detail & Related papers (2020-02-15T02:40:10Z)
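As a second illustration, the "Accurate Knowledge Distillation with n-best Reranking" entry above describes picking the highest-quality teacher hypotheses as labels for sequence-level distillation. The minimal, hedged sketch below shows what that recipe could look like; the scorer interface, weights, and function names are assumptions for illustration, not the cited paper's exact setup.

```python
# Hedged sketch of sequence-level KD with n-best reranking: the teacher's
# n-best hypotheses are rescored by several models and the best one is kept
# as the pseudo-label. Scorer names and weights are illustrative assumptions.
from typing import Callable, List, Tuple

Scorer = Callable[[str, str], float]  # (source, hypothesis) -> score


def rerank_nbest(
    source: str,
    nbest: List[str],
    scorers: List[Tuple[Scorer, float]],
) -> str:
    """Return the hypothesis with the highest weighted sum of scorer outputs."""
    def combined(hyp: str) -> float:
        return sum(weight * scorer(source, hyp) for scorer, weight in scorers)

    return max(nbest, key=combined)


def build_distillation_data(
    sources: List[str],
    teacher_nbest: Callable[[str], List[str]],
    scorers: List[Tuple[Scorer, float]],
) -> List[Tuple[str, str]]:
    """Pair each source with the reranked top hypothesis as its pseudo-target."""
    return [(src, rerank_nbest(src, teacher_nbest(src), scorers)) for src in sources]
```

The student would then be trained on the resulting (source, pseudo-target) pairs in the same way as on ordinary parallel data, which is the standard sequence-level distillation setup.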