Weight Distillation: Transferring the Knowledge in Neural Network
Parameters
- URL: http://arxiv.org/abs/2009.09152v3
- Date: Mon, 19 Jul 2021 04:37:21 GMT
- Title: Weight Distillation: Transferring the Knowledge in Neural Network
Parameters
- Authors: Ye Lin, Yanyang Li, Ziyang Wang, Bei Li, Quan Du, Tong Xiao, Jingbo
Zhu
- Abstract summary: We propose Weight Distillation to transfer the knowledge in the large network parameters through a parameter generator.
Experiments on WMT16 En-Ro, NIST12 Zh-En, and WMT14 En-De machine translation tasks show that weight distillation can train a small network that is 1.88~2.94x faster than the large network but with competitive performance.
- Score: 48.32204633079697
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation has been proven to be effective in model acceleration
and compression. It allows a small network to learn to generalize in the same
way as a large network. Recent successes in pre-training suggest the
effectiveness of transferring model parameters. Inspired by this, we
investigate methods of model acceleration and compression in another line of
research. We propose Weight Distillation to transfer the knowledge in the large
network parameters through a parameter generator. Our experiments on WMT16
En-Ro, NIST12 Zh-En, and WMT14 En-De machine translation tasks show that weight
distillation can train a small network that is 1.88~2.94x faster than the large
network but with competitive performance. With the same sized small network,
weight distillation can outperform knowledge distillation by 0.51~1.82 BLEU
points.
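As a rough illustration of the idea (the abstract does not describe the generator's architecture, so the class ParamGenerator and its low-rank projection design below are assumptions), a parameter generator can map a large teacher weight matrix to a smaller student weight matrix:
```python
# Sketch only: `ParamGenerator` and its projection design are assumptions,
# not the architecture from the paper.
import torch
import torch.nn as nn

class ParamGenerator(nn.Module):
    """Generates a small student weight matrix from a large teacher weight matrix."""
    def __init__(self, t_out, t_in, s_out, s_in):
        super().__init__()
        self.row_proj = nn.Parameter(torch.randn(s_out, t_out) * 0.02)  # compresses output dim
        self.col_proj = nn.Parameter(torch.randn(t_in, s_in) * 0.02)    # compresses input dim

    def forward(self, teacher_weight):            # teacher_weight: (t_out, t_in)
        # The student weight is produced from the teacher's parameters, so the
        # knowledge stored in those parameters is carried over directly.
        return self.row_proj @ teacher_weight @ self.col_proj   # (s_out, s_in)

teacher_w = torch.randn(1024, 1024)               # weight of one (frozen) teacher layer
gen = ParamGenerator(1024, 1024, 512, 512)        # the generator itself is trained on the task
student_w = gen(teacher_w)                        # (512, 512) weight for the small network
print(student_w.shape)
```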
Related papers
- Learning to Compose SuperWeights for Neural Parameter Allocation Search [61.078949532440724]
We show that our approach can generate parameters for many networks using the same set of weights.
This enables us to support tasks like efficient ensembling and anytime prediction.
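A minimal sketch of parameter sharing in this spirit, assuming a single shared weight bank and a fixed index map per layer (the paper's SuperWeight composition and allocation search are not reproduced here):
```python
# Sketch only: `SharedBank` and the random index map are illustrative, not the
# paper's SuperWeight grouping or allocation search.
import torch
import torch.nn as nn

class SharedBank(nn.Module):
    def __init__(self, bank_size, num_layers, out_features, in_features):
        super().__init__()
        self.bank = nn.Parameter(torch.randn(bank_size) * 0.02)   # one shared set of weights
        # Each layer draws its weight entries from the bank via a fixed index map.
        self.register_buffer(
            "index", torch.randint(bank_size, (num_layers, out_features, in_features))
        )

    def layer_weight(self, layer_id):
        return self.bank[self.index[layer_id]]    # (out_features, in_features)

bank = SharedBank(bank_size=4096, num_layers=8, out_features=256, in_features=256)
w3 = bank.layer_weight(3)                         # layer 3's weights, backed by the shared bank
print(w3.shape)
```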
arXiv Detail & Related papers (2023-12-03T04:20:02Z)
- AdaMix: Mixture-of-Adapter for Parameter-efficient Tuning of Large Language Models [119.7093605087114]
Fine-tuning large-scale pre-trained language models on downstream tasks requires updating hundreds of millions of parameters.
This not only increases the serving cost of storing a large copy of the model weights for every task, but also leads to instability during few-shot task adaptation.
We introduce a new mechanism that improves adapter capacity without increasing parameters or computational cost, using two key techniques.
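A hedged sketch of a mixture-of-adapters layer, assuming random routing among bottleneck adapters during training and weight averaging at inference; this is a simplified reading of the abstract, not AdaMix's exact formulation:
```python
# Sketch only: a simplified mixture-of-adapters layer, not AdaMix's exact method.
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdapterMixture(nn.Module):
    def __init__(self, hidden, bottleneck, num_adapters=4):
        super().__init__()
        self.down = nn.ModuleList([nn.Linear(hidden, bottleneck) for _ in range(num_adapters)])
        self.up = nn.ModuleList([nn.Linear(bottleneck, hidden) for _ in range(num_adapters)])

    def forward(self, x):
        if self.training:                          # stochastic routing: pick one adapter per step
            i = random.randrange(len(self.down))
            return x + self.up[i](torch.relu(self.down[i](x)))
        # At inference the adapters are merged by averaging their weights,
        # so serving cost stays that of a single adapter.
        dw = torch.stack([m.weight for m in self.down]).mean(0)
        db = torch.stack([m.bias for m in self.down]).mean(0)
        uw = torch.stack([m.weight for m in self.up]).mean(0)
        ub = torch.stack([m.bias for m in self.up]).mean(0)
        return x + F.linear(torch.relu(F.linear(x, dw, db)), uw, ub)

layer = AdapterMixture(hidden=768, bottleneck=16)
print(layer(torch.randn(2, 10, 768)).shape)
```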
arXiv Detail & Related papers (2022-05-24T23:41:22Z)
- Kernel Modulation: A Parameter-Efficient Method for Training Convolutional Neural Networks [19.56633207984127]
This work proposes a novel parameter-efficient kernel modulation (KM) method that adapts all parameters of a base network instead of a subset of layers.
KM uses lightweight task-specialized kernel modulators that require only an additional 1.4% of the base network parameters.
Our results show that KM delivers up to 9% higher accuracy than other parameter-efficient methods on the Transfer Learning benchmark.
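A rough sketch of the kernel-modulation idea, assuming a per-channel-pair modulator applied to frozen convolution kernels (the exact factorisation used by KM may differ):
```python
# Sketch only: `ModulatedConv2d` assumes one scale per (output, input) channel pair;
# KM's actual modulator design may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModulatedConv2d(nn.Module):
    def __init__(self, base_conv: nn.Conv2d):
        super().__init__()
        self.base = base_conv
        for p in self.base.parameters():
            p.requires_grad_(False)                     # the base network stays frozen
        out_c, in_c = base_conv.weight.shape[:2]
        # Lightweight, task-specific modulator: far fewer parameters than the kernels themselves.
        self.modulator = nn.Parameter(torch.ones(out_c, in_c, 1, 1))

    def forward(self, x):
        w = self.base.weight * self.modulator           # every base kernel entry gets adapted
        return F.conv2d(x, w, self.base.bias,
                        stride=self.base.stride, padding=self.base.padding)

conv = ModulatedConv2d(nn.Conv2d(16, 32, kernel_size=3, padding=1))
print(conv(torch.randn(1, 16, 8, 8)).shape)             # only `modulator` receives gradients
```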
arXiv Detail & Related papers (2022-03-29T07:28:50Z)
- An Experimental Study of the Impact of Pre-training on the Pruning of a Convolutional Neural Network [0.0]
In recent years, deep neural networks have achieved wide success in various application domains.
Deep neural networks usually involve a large number of parameters, which correspond to the weights of the network.
Pruning methods attempt to reduce the size of this parameter set by identifying and removing irrelevant weights.
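For concreteness, a minimal example of magnitude pruning, one standard criterion such a study considers: weights with the smallest absolute value are treated as irrelevant and masked out.
```python
# Generic magnitude pruning; parameter names are illustrative.
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Return a 0/1 mask that keeps the (1 - sparsity) fraction of largest-magnitude weights."""
    k = int(weight.numel() * sparsity)                  # number of weights to remove
    if k == 0:
        return torch.ones_like(weight)
    threshold = weight.abs().flatten().kthvalue(k).values
    return (weight.abs() > threshold).float()

w = torch.randn(256, 256)
mask = magnitude_prune(w, sparsity=0.9)
pruned_w = w * mask
print(f"kept {mask.mean().item():.1%} of the weights")
```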
arXiv Detail & Related papers (2021-12-15T16:02:15Z)
- Powerpropagation: A sparsity inducing weight reparameterisation [65.85142037667065]
We introduce Powerpropagation, a new weight reparameterisation for neural networks that leads to inherently sparse models.
Models trained in this manner exhibit similar performance, but have a distribution with markedly higher density at zero, allowing more parameters to be pruned safely.
Here, we combine Powerpropagation with a traditional weight-pruning technique as well as recent state-of-the-art sparse-to-sparse algorithms, showing superior performance on the ImageNet benchmark.
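A short sketch of the reparameterisation, assuming the effective weight is the raw parameter raised elementwise to a power with its sign preserved (w = phi * |phi|^(alpha-1)); the training and pruning schedules from the paper are omitted:
```python
# Sketch only: the reparameterisation w = phi * |phi|**(alpha - 1); training and
# pruning schedules from the paper are omitted.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PowerpropLinear(nn.Module):
    def __init__(self, in_features, out_features, alpha=2.0):
        super().__init__()
        self.alpha = alpha
        self.phi = nn.Parameter(torch.randn(out_features, in_features) * 0.05)
        self.bias = nn.Parameter(torch.zeros(out_features))

    def effective_weight(self):
        # Sign-preserving power of the raw parameter; gradients scale with |phi|,
        # which pushes the weight distribution toward higher density at zero.
        return self.phi * self.phi.abs().pow(self.alpha - 1.0)

    def forward(self, x):
        return F.linear(x, self.effective_weight(), self.bias)

layer = PowerpropLinear(128, 64, alpha=2.0)
print(layer(torch.randn(4, 128)).shape)
# After training, many entries of effective_weight() sit near zero and can be pruned safely.
```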
arXiv Detail & Related papers (2021-10-01T10:03:57Z)
- Multi-Prize Lottery Ticket Hypothesis: Finding Accurate Binary Neural Networks by Pruning A Randomly Weighted Network [13.193734014710582]
We propose an algorithm for finding multi-prize tickets (MPTs) and test it by performing a series of experiments on CIFAR-10 and ImageNet datasets.
Our MPTs-1/32 not only set new binary weight network state-of-the-art (SOTA) Top-1 accuracy -- 94.8% on CIFAR-10 and 74.03% on ImageNet -- but also outperform their full-precision counterparts by 1.78% and 0.76%, respectively.
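A hedged sketch of extracting a binary subnetwork from a frozen randomly weighted layer, assuming a learned score per weight and a single per-layer scale; the actual MPT search procedure and its training are not reproduced:
```python
# Sketch only: scores are shown as random tensors here; in MPT-style methods they are learned.
import torch

torch.manual_seed(0)
random_w = torch.randn(64, 64)           # frozen random weights, never trained
scores = torch.rand(64, 64)              # learned in practice; random here for illustration

keep_fraction = 1 / 32                   # an "MPT-1/32"-style budget (assumption)
k = int(random_w.numel() * keep_fraction)
threshold = scores.flatten().topk(k).values.min()
mask = (scores >= threshold).float()

scale = (random_w.abs() * mask).sum() / mask.sum()      # one shared scale per layer
binary_w = scale * torch.sign(random_w) * mask          # +/- scale subnetwork
print(f"kept {int(mask.sum())} of {mask.numel()} weights")
```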
arXiv Detail & Related papers (2021-03-17T00:31:24Z)
- Learning N:M Fine-grained Structured Sparse Neural Networks From Scratch [75.69506249886622]
Sparsity in Deep Neural Networks (DNNs) has been widely studied to compress and accelerate the models on resource-constrained environments.
In this paper, we are the first to study training from scratch an N:M fine-grained structured sparse network.
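A small example of imposing an N:M pattern, where only the N largest-magnitude weights are kept in every group of M consecutive weights (e.g. 2:4):
```python
# Enforcing an N:M sparsity pattern on a weight matrix.
import torch

def nm_sparsify(weight: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    out_features, in_features = weight.shape
    assert in_features % m == 0, "input dimension must be divisible by M"
    groups = weight.reshape(out_features, in_features // m, m)
    idx = groups.abs().topk(n, dim=-1).indices          # N largest magnitudes per group of M
    mask = torch.zeros_like(groups).scatter_(-1, idx, 1.0)
    return (groups * mask).reshape(out_features, in_features)

w = torch.randn(8, 16)
sparse_w = nm_sparsify(w, n=2, m=4)
print((sparse_w != 0).float().mean().item())            # 0.5 density for a 2:4 pattern
```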
arXiv Detail & Related papers (2021-02-08T05:55:47Z)
- HALO: Learning to Prune Neural Networks with Shrinkage [5.283963846188862]
Deep neural networks achieve state-of-the-art performance in a variety of tasks by extracting a rich set of features from unstructured data.
Modern techniques for inducing sparsity and reducing model size are (1) network pruning, (2) training with a sparsity inducing penalty, and (3) training a binary mask jointly with the weights of the network.
We present a novel penalty called Hierarchical Adaptive Lasso which learns to adaptively sparsify weights of a given network via trainable parameters.
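A hedged sketch of an adaptive, trainable sparsity penalty in this spirit, assuming one trainable scale per weight; the hierarchical structure of the actual HALO penalty is not reproduced:
```python
# Sketch only: a generic adaptive L1 penalty with trainable per-weight scales,
# not the exact HALO formulation.
import torch
import torch.nn as nn

class AdaptiveLassoPenalty(nn.Module):
    def __init__(self, shape):
        super().__init__()
        self.log_scale = nn.Parameter(torch.zeros(shape))   # scale = exp(0) = 1 initially

    def forward(self, weight):
        # Per-weight L1 penalty whose strength is itself learned during training.
        return (self.log_scale.exp() * weight.abs()).sum()

w = nn.Parameter(torch.randn(64, 64))
penalty = AdaptiveLassoPenalty(w.shape)
task_loss = (w ** 2).mean()                    # stand-in for the real task loss
loss = task_loss + 1e-3 * penalty(w)           # sparsity term added to the training objective
loss.backward()                                # both the weights and the penalty scales get gradients
```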
arXiv Detail & Related papers (2020-08-24T04:08:48Z)
- Go Wide, Then Narrow: Efficient Training of Deep Thin Networks [62.26044348366186]
We propose an efficient method to train a deep thin network with a theoretical guarantee.
By training with our method, ResNet50 can outperform ResNet101, and BERT Base can be comparable with BERT Large.
arXiv Detail & Related papers (2020-07-01T23:34:35Z)
- Principal Component Networks: Parameter Reduction Early in Training [10.14522349959932]
We show how to find small networks that exhibit the same performance as their over-parameterized counterparts.
We use PCA to find a basis of high variance for layer inputs and represent layer weights using these directions.
We also show that ResNet-20 PCNs outperform deep ResNet-110 networks while training faster.
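An illustrative sketch, assuming a plain PCA of the layer inputs and a projection of the weights onto the top high-variance directions; how PCNs are trained and mapped back to dense networks is simplified away:
```python
# Sketch only: plain PCA on layer inputs; how PCNs train and convert back to a
# dense network is not shown.
import torch

torch.manual_seed(0)
X = torch.randn(1000, 64) @ torch.randn(64, 256)       # layer inputs with low-rank structure
X = X + 0.01 * torch.randn(1000, 256)                  # plus a little noise
W = torch.randn(128, 256)                              # original layer weight (out, in)

Xc = X - X.mean(dim=0, keepdim=True)
cov = Xc.T @ Xc / (X.shape[0] - 1)
eigvals, eigvecs = torch.linalg.eigh(cov)              # eigenvalues in ascending order
k = 64                                                 # keep the top-k high-variance directions
basis = eigvecs[:, -k:]                                # (256, k)

W_pcn = W @ basis                                      # layer weights expressed in the PCA basis
y_ref = Xc @ W.T
y_pcn = (Xc @ basis) @ W_pcn.T                         # approximately the same outputs, fewer parameters
err = torch.linalg.norm(y_ref - y_pcn) / torch.linalg.norm(y_ref)
print(W.numel(), "->", W_pcn.numel(), "parameters; relative error", err.item())
```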
arXiv Detail & Related papers (2020-06-23T21:40:24Z)