Learning Light-Weight Translation Models from Deep Transformer
- URL: http://arxiv.org/abs/2012.13866v1
- Date: Sun, 27 Dec 2020 05:33:21 GMT
- Title: Learning Light-Weight Translation Models from Deep Transformer
- Authors: Bei Li, Ziyang Wang, Hui Liu, Quan Du, Tong Xiao, Chunliang Zhang and
Jingbo Zhu
- Abstract summary: We propose a novel group-permutation based knowledge distillation approach to compressing the deep Transformer model into a shallow model.
Our compressed model is 8X shallower than the deep model, with almost no loss in BLEU.
To further enhance the teacher model, we present a Skipping Sub-Layer method to randomly omit sub-layers to introduce perturbation into training.
- Score: 25.386460662408773
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, deep models have shown tremendous improvements in neural machine
translation (NMT). However, systems of this kind are computationally expensive
and memory intensive. In this paper, we take a natural step towards learning
strong but light-weight NMT systems. We propose a novel group-permutation
based knowledge distillation approach to compressing the deep Transformer model
into a shallow model. The experimental results on several benchmarks validate
the effectiveness of our method. Our compressed model is 8X shallower than the
deep model, with almost no loss in BLEU. To further enhance the teacher model,
we present a Skipping Sub-Layer method to randomly omit sub-layers to introduce
perturbation into training, which achieves a BLEU score of 30.63 on
English-German newstest2014. The code is publicly available at
https://github.com/libeineu/GPKD.
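The sketch below illustrates the two ingredients described above in generic form: a word-level distillation loss that trains a shallow student against a deep teacher, and an encoder layer whose sub-layers are randomly omitted during training. It is a minimal illustration under assumed hyper-parameters (`alpha`, `p_skip`, layer sizes), not the paper's group-permutation procedure or its exact Skipping Sub-Layer schedule; see the GPKD repository for the actual implementation.

```python
# Minimal PyTorch sketch of (a) word-level knowledge distillation from a deep
# teacher to a shallow student and (b) randomly skipping sub-layers during
# teacher training. Generic illustration only; all names are placeholders.
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, gold_ids, pad_id,
                      alpha=0.5, temperature=1.0):
    """Mix cross-entropy on the reference with KL against the teacher."""
    # student_logits, teacher_logits: (batch, seq_len, vocab); gold_ids: (batch, seq_len)
    ce = F.cross_entropy(
        student_logits.transpose(1, 2), gold_ids, ignore_index=pad_id)
    t = temperature
    kd = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits.detach() / t, dim=-1),
        reduction="batchmean") * (t * t)
    return (1.0 - alpha) * ce + alpha * kd


class SkippingEncoderLayer(torch.nn.Module):
    """Transformer encoder layer whose sub-layers are randomly omitted
    during training (a stochastic-depth-style perturbation)."""

    def __init__(self, d_model=512, nhead=8, d_ff=2048, p_skip=0.2):
        super().__init__()
        self.attn = torch.nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.ffn = torch.nn.Sequential(
            torch.nn.Linear(d_model, d_ff), torch.nn.ReLU(),
            torch.nn.Linear(d_ff, d_model))
        self.norm1 = torch.nn.LayerNorm(d_model)
        self.norm2 = torch.nn.LayerNorm(d_model)
        self.p_skip = p_skip

    def forward(self, x):
        # Self-attention sub-layer, skipped with probability p_skip at training time.
        if not (self.training and torch.rand(1).item() < self.p_skip):
            h, _ = self.attn(self.norm1(x), self.norm1(x), self.norm1(x))
            x = x + h
        # Feed-forward sub-layer, skipped independently.
        if not (self.training and torch.rand(1).item() < self.p_skip):
            x = x + self.ffn(self.norm2(x))
        return x
```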
Related papers
- BEND: Bagging Deep Learning Training Based on Efficient Neural Network Diffusion [56.9358325168226]
We propose a Bagging deep learning training algorithm based on Efficient Neural network Diffusion (BEND).
Our approach is simple but effective: it first uses the weights and biases of multiple trained models as inputs to train an autoencoder and a latent diffusion model.
Our proposed BEND algorithm can consistently outperform the mean and median accuracies of both the original trained models and the diffused models.
arXiv Detail & Related papers (2024-03-23T08:40:38Z)
- The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
arXiv Detail & Related papers (2023-09-20T10:31:17Z)
- MiniLLM: Knowledge Distillation of Large Language Models [112.93051247165089]
Knowledge Distillation (KD) is a promising technique for reducing the high computational demand of large language models (LLMs).
We propose a KD approach that distills LLMs into smaller language models.
Our method is scalable for different model families with 120M to 13B parameters.
arXiv Detail & Related papers (2023-06-14T14:44:03Z)
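The summary above does not state MiniLLM's objective; the method is commonly described as distilling with a reverse (student-led) KL divergence rather than the standard forward KL. Treating that as an assumption, the sketch below only contrasts the two KL directions on token-level distributions.

```python
# Hedged sketch: forward vs. reverse KL on per-token output distributions.
# Whether MiniLLM uses exactly this formulation is an assumption here.
import torch.nn.functional as F


def forward_kl(student_logits, teacher_logits):
    # KL(teacher || student): the student must cover every teacher mode.
    return F.kl_div(F.log_softmax(student_logits, dim=-1),
                    F.softmax(teacher_logits, dim=-1),
                    reduction="batchmean")


def reverse_kl(student_logits, teacher_logits):
    # KL(student || teacher): the student may concentrate on the teacher's
    # high-probability modes, often preferred for generative distillation.
    return F.kl_div(F.log_softmax(teacher_logits, dim=-1),
                    F.softmax(student_logits, dim=-1),
                    reduction="batchmean")
```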
- Enhancing Black-Box Few-Shot Text Classification with Prompt-Based Data Augmentation [42.05617728412819]
We show how to optimize few-shot text classification without accessing the gradients of the large-scale language models.
Our approach, dubbed BT-Classifier, significantly outperforms state-of-the-art black-box few-shot learners.
arXiv Detail & Related papers (2023-05-23T07:54:34Z)
- MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
arXiv Detail & Related papers (2022-04-15T23:19:37Z)
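To make the "Mixture-of-Experts structure" concrete, the sketch below shows a generic top-1 routed MoE feed-forward block in which each token activates a single small expert, which is how such layers can raise capacity while keeping inference cheap. It does not reproduce MoEBERT's importance-guided adaptation or its distillation step; expert count and sizes are illustrative.

```python
# Generic top-1 gated mixture-of-experts feed-forward block (illustrative,
# not MoEBERT's exact construction). Hard routing is shown for inference;
# training typically adds a differentiable or load-balanced gate.
import torch


class MoEFeedForward(torch.nn.Module):
    def __init__(self, d_model=768, d_ff=3072, num_experts=4):
        super().__init__()
        self.router = torch.nn.Linear(d_model, num_experts)
        self.experts = torch.nn.ModuleList([
            torch.nn.Sequential(
                torch.nn.Linear(d_model, d_ff // num_experts),
                torch.nn.GELU(),
                torch.nn.Linear(d_ff // num_experts, d_model))
            for _ in range(num_experts)])

    def forward(self, x):
        # x: (num_tokens, d_model). Each token is routed to its top-1 expert,
        # so only one smaller expert is evaluated per token.
        expert_ids = self.router(x).argmax(dim=-1)
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_ids == i
            if mask.any():
                out[mask] = expert(x[mask])
        return out
```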
- Learning Kernel-Smoothed Machine Translation with Retrieved Examples [30.17061384497846]
Existing non-parametric approaches that retrieve similar examples from a database to guide the translation process are promising but are prone to overfitting the retrieved examples.
We propose to learn Kernel-Smoothed Translation with Example Retrieval (KSTER), an effective approach to adapt neural machine translation models online.
arXiv Detail & Related papers (2021-09-21T06:42:53Z)
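As a rough illustration of retrieval-smoothed decoding, the sketch below interpolates the NMT model's token distribution with a distribution built from retrieved examples, weighted by a kernel over representation distances. KSTER learns the kernel bandwidth and mixing weight online; the fixed `bandwidth` and `lam` below are assumptions.

```python
# Rough sketch of kernel-smoothed, retrieval-augmented decoding for one step.
import torch


def kernel_smoothed_probs(model_probs, retrieved_keys, retrieved_token_ids,
                          query, vocab_size, bandwidth=10.0, lam=0.3):
    # model_probs: (vocab,)        query: (d,)
    # retrieved_keys: (k, d)       retrieved_token_ids: LongTensor (k,)
    dists = torch.cdist(query.unsqueeze(0), retrieved_keys).squeeze(0)  # (k,)
    weights = torch.softmax(-dists / bandwidth, dim=0)  # kernel over distances
    # Accumulate kernel weights onto the target tokens of retrieved examples.
    retrieval_probs = torch.zeros(vocab_size)
    retrieval_probs.scatter_add_(0, retrieved_token_ids, weights)
    # Interpolate the model distribution with the retrieval distribution.
    return (1.0 - lam) * model_probs + lam * retrieval_probs
```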
- R-Drop: Regularized Dropout for Neural Networks [99.42791938544012]
Dropout is a powerful and widely used technique to regularize the training of deep neural networks.
We introduce a simple regularization strategy upon dropout in model training, namely R-Drop, which forces the output distributions of different sub-models to be consistent with each other.
arXiv Detail & Related papers (2021-06-28T08:01:26Z)
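R-Drop's core idea can be sketched in a few lines: the same batch is passed through the model twice so that dropout samples two different sub-models, and a symmetric KL term pulls their output distributions together. The weight `alpha` and function names below are illustrative.

```python
# Minimal sketch of an R-Drop-style training loss.
import torch
import torch.nn.functional as F


def r_drop_loss(model, inputs, targets, alpha=1.0):
    logits1 = model(inputs)  # dropout active in training mode
    logits2 = model(inputs)  # second pass samples a different sub-model
    ce = 0.5 * (F.cross_entropy(logits1, targets) +
                F.cross_entropy(logits2, targets))
    p1, p2 = F.log_softmax(logits1, dim=-1), F.log_softmax(logits2, dim=-1)
    # Symmetric KL between the two predicted distributions.
    kl = 0.5 * (F.kl_div(p1, p2, reduction="batchmean", log_target=True) +
                F.kl_div(p2, p1, reduction="batchmean", log_target=True))
    return ce + alpha * kl
```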
- Recurrent Stacking of Layers in Neural Networks: An Application to Neural Machine Translation [18.782750537161615]
We propose to share parameters across all layers thereby leading to a recurrently stacked neural network model.
We empirically demonstrate that the translation quality of a model that recurrently stacks a single layer 6 times, despite having significantly fewer parameters, approaches that of a model that stacks 6 layers where each layer has different parameters.
arXiv Detail & Related papers (2021-06-18T08:48:01Z)
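The recurrent stacking idea reduces to reusing one layer's parameters at every depth step, as in the minimal sketch below (dimensions and the number of passes are illustrative).

```python
# A single encoder layer applied repeatedly: a "6-layer" encoder that carries
# only one layer's worth of parameters.
import torch


class RecurrentlyStackedEncoder(torch.nn.Module):
    def __init__(self, d_model=512, nhead=8, num_passes=6):
        super().__init__()
        self.layer = torch.nn.TransformerEncoderLayer(
            d_model, nhead, batch_first=True)
        self.num_passes = num_passes

    def forward(self, x):
        # The same parameters are reused at every depth step.
        for _ in range(self.num_passes):
            x = self.layer(x)
        return x
```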
- Shallow-to-Deep Training for Neural Machine Translation [42.62107851930165]
In this paper, we investigate the behavior of a well-tuned deep Transformer system.
We find that stacking layers is helpful in improving the representation ability of NMT models.
This inspires us to develop a shallow-to-deep training method that learns deep models by stacking shallow models.
arXiv Detail & Related papers (2020-10-08T02:36:07Z)
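A plausible reading of shallow-to-deep training is a copy-and-grow loop: train a shallow stack, append copies of its trained layers as initialization for a deeper stack, and continue training. The sketch below shows one such growth step; the copy pattern and schedule are assumptions, not the paper's exact recipe.

```python
# One hypothetical shallow-to-deep growth step: deepen a trained stack of
# encoder layers by appending copies of existing layers as initialization.
import copy
import torch


def grow_encoder(layers: torch.nn.ModuleList, added: int) -> torch.nn.ModuleList:
    new_layers = list(layers)
    for i in range(added):
        # Initialize each new layer from an already-trained one (cycled).
        new_layers.append(copy.deepcopy(layers[i % len(layers)]))
    return torch.nn.ModuleList(new_layers)
```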
- Very Deep Transformers for Neural Machine Translation [100.51465892354234]
We show that it is feasible to build standard Transformer-based models with up to 60 encoder layers and 12 decoder layers.
These deep models outperform their baseline 6-layer counterparts by as much as 2.5 BLEU.
arXiv Detail & Related papers (2020-08-18T07:14:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.