Slimmable Networks for Contrastive Self-supervised Learning
- URL: http://arxiv.org/abs/2209.15525v2
- Date: Tue, 23 May 2023 12:20:31 GMT
- Title: Slimmable Networks for Contrastive Self-supervised Learning
- Authors: Shuai Zhao, Xiaohan Wang, Linchao Zhu, Yi Yang
- Abstract summary: Self-supervised learning makes significant progress in pre-training large models, but struggles with small models.
We present a one-stage solution to obtain pre-trained small models without the need for extra teachers.
A slimmable network consists of a full network and several weight-sharing sub-networks, which can be pre-trained once to obtain various networks.
- Score: 67.21528544724546
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Self-supervised learning makes significant progress in pre-training large
models, but struggles with small models. Previous solutions to this problem
rely mainly on knowledge distillation, which involves a two-stage procedure:
first training a large teacher model and then distilling it to improve the
generalization ability of smaller ones. In this work, we present a one-stage
solution to obtain pre-trained small models without the need for extra
teachers, namely, slimmable networks for contrastive self-supervised learning
(\emph{SlimCLR}). A slimmable network consists of a full network and several
weight-sharing sub-networks, which can be pre-trained once to obtain various
networks, including small ones with low computation costs. However,
interference between weight-sharing networks leads to severe performance
degradation in self-supervised cases, as evidenced by \emph{gradient magnitude
imbalance} and \emph{gradient direction divergence}. The former indicates that
a small proportion of parameters produce dominant gradients during
backpropagation, while the main parameters may not be fully optimized. The
latter shows that the gradient direction is disordered, and the optimization
process is unstable. To address these issues, we introduce three techniques to
make the main parameters produce dominant gradients and sub-networks have
consistent outputs. These techniques include slow start training of
sub-networks, online distillation, and loss re-weighting according to model
sizes. Furthermore, theoretical results are presented to demonstrate that a
single slimmable linear layer is sub-optimal during linear evaluation. Thus a
switchable linear probe layer is applied during linear evaluation. We
instantiate SlimCLR with typical contrastive learning frameworks and achieve
better performance than prior methods with fewer parameters and FLOPs.
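As a rough illustration of the weight-sharing structure described in the abstract, the sketch below slices one shared weight matrix to obtain sub-networks of different widths. The layer type and the width multipliers (1.0/0.5/0.25) are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlimmableLinear(nn.Module):
    """One shared weight matrix; each sub-network uses only its leading slice."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x: torch.Tensor, width_mult: float = 1.0) -> torch.Tensor:
        out_dim = max(1, int(self.weight.shape[0] * width_mult))
        in_dim = x.shape[-1]  # the previous layer already narrowed its output
        return F.linear(x, self.weight[:out_dim, :in_dim], self.bias[:out_dim])

# The full network (width_mult=1.0) and its sub-networks (e.g. 0.5, 0.25) share parameters,
# so a single pre-training run yields several models with different computation costs.
layer = SlimmableLinear(128, 64)
x = torch.randn(4, 128)
full = layer(x, width_mult=1.0)   # shape (4, 64)
half = layer(x, width_mult=0.5)   # shape (4, 32), from the first 32 rows of the same weight
```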
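The three remedies named in the abstract could be combined in a training step roughly as below. The schedule, the distillation loss, and the width-proportional weights are assumptions made for illustration rather than the paper's exact formulation, and the switchable linear probe used at evaluation time is not shown.

```python
import torch.nn.functional as F

def slimclr_step(model, views, contrastive_loss, step, widths=(1.0, 0.5, 0.25),
                 slow_start_steps=10000, temperature=0.2):
    """One illustrative SlimCLR-style update (hypothetical helper, not the released code)."""
    x1, x2 = views
    # Full network first: a standard contrastive loss (e.g. InfoNCE) between the two views.
    z1_full = model(x1, width_mult=widths[0])
    z2_full = model(x2, width_mult=widths[0])
    total = contrastive_loss(z1_full, z2_full)

    # Slow start: sub-networks only join the objective after the full network has settled,
    # so the main parameters produce the dominant gradients early in training.
    if step >= slow_start_steps:
        for w in widths[1:]:
            z1 = model(x1, width_mult=w)
            z2 = model(x2, width_mult=w)
            # Online distillation: pull the sub-network's similarity distribution toward
            # the full network's (detached) distribution to keep outputs consistent.
            kd = F.kl_div(
                F.log_softmax(z1 @ z2.t() / temperature, dim=-1),
                F.softmax((z1_full @ z2_full.t()).detach() / temperature, dim=-1),
                reduction="batchmean",
            )
            # Loss re-weighting by model size: smaller widths get proportionally smaller weight.
            total = total + w * (contrastive_loss(z1, z2) + kd)
    return total
```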
Related papers
- Just How Flexible are Neural Networks in Practice? [89.80474583606242]
It is widely believed that a neural network can fit a training set containing at least as many samples as it has parameters.
In practice, however, we only find the solutions reachable through our training procedure, including the optimizer and regularizers, which limits this flexibility.
arXiv Detail & Related papers (2024-06-17T12:24:45Z)
- Learning to Weight Samples for Dynamic Early-exiting Networks [35.03752825893429]
Early exiting is an effective paradigm for improving the inference efficiency of deep networks.
Our work proposes to adopt a weight prediction network to weight the loss of different training samples at each exit.
We show that the proposed weighting mechanism consistently improves the trade-off between classification accuracy and inference efficiency.
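A rough sketch of the idea in this summary: a small weight-prediction network assigns a per-sample weight to the loss at each exit. The module name, architecture, and normalisation below are assumptions, not the paper's design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExitWeightNet(nn.Module):
    """Hypothetical weight-prediction network: maps a sample's features to one weight per exit."""

    def __init__(self, feat_dim: int, num_exits: int):
        super().__init__()
        self.proj = nn.Linear(feat_dim, num_exits)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # Softmax keeps the per-sample weights positive and normalised across exits.
        return torch.softmax(self.proj(feats), dim=-1)

def weighted_multi_exit_loss(exit_logits, feats, targets, weight_net):
    """exit_logits: list of (batch, num_classes) tensors, one per early exit."""
    weights = weight_net(feats)                                    # (batch, num_exits)
    losses = torch.stack(
        [F.cross_entropy(l, targets, reduction="none") for l in exit_logits], dim=1
    )                                                              # (batch, num_exits)
    return (weights * losses).mean()
```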
arXiv Detail & Related papers (2022-09-17T10:46:32Z)
- Joint inference and input optimization in equilibrium networks [68.63726855991052]
Deep equilibrium models are a class of models that forego traditional network depth and instead compute the output of a network by finding the fixed point of a single nonlinear layer.
We show that there is a natural synergy between these two settings.
We demonstrate this strategy on various tasks such as training generative models while optimizing over latent codes, training models for inverse problems like denoising and inpainting, adversarial training and gradient based meta-learning.
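The equilibrium computation itself can be sketched as below; plain fixed-point iteration stands in for the root-finding and implicit differentiation used in practice, and the joint optimization over inputs described in the summary is not shown.

```python
import torch
import torch.nn as nn

class TinyDEQ(nn.Module):
    """Minimal deep equilibrium layer: the output is z* satisfying z* = f(z*, x)."""

    def __init__(self, dim: int):
        super().__init__()
        self.lin_z = nn.Linear(dim, dim)
        self.lin_x = nn.Linear(dim, dim)

    def f(self, z, x):
        return torch.tanh(self.lin_z(z) + self.lin_x(x))

    def forward(self, x, iters: int = 50, tol: float = 1e-4):
        z = torch.zeros_like(x)
        for _ in range(iters):  # naive fixed-point iteration; real DEQs use Broyden/Anderson solvers
            z_next = self.f(z, x)
            if (z_next - z).norm() < tol * (z.norm() + 1e-8):
                return z_next
            z = z_next
        return z
```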
arXiv Detail & Related papers (2021-11-25T19:59:33Z)
- Powerpropagation: A sparsity inducing weight reparameterisation [65.85142037667065]
We introduce Powerpropagation, a new weight-parameterisation for neural networks that leads to inherently sparse models.
Models trained in this manner exhibit similar performance, but have a distribution with markedly higher density at zero, allowing more parameters to be pruned safely.
Here, we combine Powerpropagation with a traditional weight-pruning technique as well as recent state-of-the-art sparse-to-sparse algorithms, showing superior performance on the ImageNet benchmark.
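The reparameterisation can be sketched as follows, assuming the commonly cited form w = phi * |phi|^(alpha - 1) with alpha > 1; this is a reading of the abstract, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PowerpropLinear(nn.Module):
    """Linear layer with Powerpropagation-style weights: w = phi * |phi|**(alpha - 1).

    Because dw/dphi scales with |phi|**(alpha - 1), weights that are already small receive
    small updates and drift toward exactly zero, yielding a weight distribution with high
    density at zero that can be pruned safely.
    """

    def __init__(self, in_features: int, out_features: int, alpha: float = 2.0):
        super().__init__()
        self.alpha = alpha
        self.phi = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.kaiming_uniform_(self.phi, a=5 ** 0.5)
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x):
        w = self.phi * self.phi.abs().pow(self.alpha - 1.0)
        return F.linear(x, w, self.bias)
```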
arXiv Detail & Related papers (2021-10-01T10:03:57Z)
- Simultaneous Training of Partially Masked Neural Networks [67.19481956584465]
We show that it is possible to train neural networks in such a way that a predefined 'core' subnetwork can be split off from the trained full network with remarkably good performance.
We show that training a Transformer with a low-rank core gives a low-rank model that performs better than the same low-rank model trained alone.
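One possible reading of this setup, sketched below: the core is a low-rank factor of each layer, the full model adds a residual on top of it, and both are trained jointly. The layer structure and joint objective here are hypothetical illustrations, not the paper's method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoreLinear(nn.Module):
    """Linear layer with a low-rank 'core' (U @ V) plus a full-rank residual."""

    def __init__(self, in_features: int, out_features: int, rank: int = 8):
        super().__init__()
        self.U = nn.Parameter(torch.randn(out_features, rank) * 0.02)
        self.V = nn.Parameter(torch.randn(rank, in_features) * 0.02)
        self.residual = nn.Parameter(torch.zeros(out_features, in_features))

    def forward(self, x, core_only: bool = False):
        w = self.U @ self.V
        if not core_only:
            w = w + self.residual  # full model = core + residual
        return F.linear(x, w)

def joint_loss(model, x, y, loss_fn):
    # Train the full network and the split-off core simultaneously.
    return loss_fn(model(x, core_only=False), y) + loss_fn(model(x, core_only=True), y)
```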
arXiv Detail & Related papers (2021-06-16T15:57:51Z)
- Initialization and Regularization of Factorized Neural Layers [23.875225732697142]
We show how to initialize and regularize factorized layers in deep nets.
We show how these schemes lead to improved performance on both translation and unsupervised pre-training.
arXiv Detail & Related papers (2021-05-03T17:28:07Z)
- Learning Neural Network Subspaces [74.44457651546728]
Recent observations have advanced our understanding of the neural network optimization landscape.
With a similar computational cost as training one model, we learn lines, curves, and simplexes of high-accuracy neural networks.
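A minimal sketch of learning a line of networks: each layer holds two endpoint weight sets, and every training step samples a point on the segment between them. The per-layer parameterisation below is an illustrative simplification of the idea in the summary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LineLinear(nn.Module):
    """Linear layer parameterised as a line segment between two endpoint weight sets."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.w0 = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.w1 = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.b0 = nn.Parameter(torch.zeros(out_features))
        self.b1 = nn.Parameter(torch.zeros(out_features))

    def forward(self, x, t: float):
        # t in [0, 1] selects a point on the line; the same t is used for every layer.
        w = (1.0 - t) * self.w0 + t * self.w1
        b = (1.0 - t) * self.b0 + t * self.b1
        return F.linear(x, w, b)

# Each step samples t = torch.rand(()).item(), so the whole segment of networks is trained
# at roughly the cost of training a single model.
```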
arXiv Detail & Related papers (2021-02-20T23:26:58Z)
- Over-parametrized neural networks as under-determined linear systems [31.69089186688224]
We show that it is unsurprising that simple neural networks can achieve zero training loss.
We show that kernels typically associated with the ReLU activation function have fundamental flaws.
We propose new activation functions that avoid the pitfalls of ReLU in that they admit zero training loss solutions for any set of distinct data points.
arXiv Detail & Related papers (2020-10-29T21:43:00Z)
- HALO: Learning to Prune Neural Networks with Shrinkage [5.283963846188862]
Deep neural networks achieve state-of-the-art performance in a variety of tasks by extracting a rich set of features from unstructured data.
Modern techniques for inducing sparsity and reducing model size are (1) network pruning, (2) training with a sparsity inducing penalty, and (3) training a binary mask jointly with the weights of the network.
We present a novel penalty called Hierarchical Adaptive Lasso which learns to adaptively sparsify weights of a given network via trainable parameters.
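Only as a loose sketch of the idea described above (learned, adaptive shrinkage of individual weights via trainable parameters), the penalty below scales each weight's L1 term by a trainable factor; it is not the exact HALO penalty from the paper.

```python
import torch
import torch.nn as nn

class AdaptiveLassoPenalty(nn.Module):
    """Illustrative adaptive-lasso-style penalty with trainable per-weight scales (hypothetical)."""

    def __init__(self, weight_shape, lam: float = 1e-4):
        super().__init__()
        self.lam = lam
        # One trainable scale per weight; a larger scale shrinks that weight more strongly.
        self.log_scale = nn.Parameter(torch.zeros(weight_shape))

    def forward(self, weight: torch.Tensor) -> torch.Tensor:
        scale = self.log_scale.exp()
        # Penalise scaled weight magnitudes; the inverse-scale term keeps the learned
        # scales from collapsing to zero and switching the penalty off.
        return self.lam * (scale * weight.abs()).sum() + self.lam * (1.0 / scale).mean()
```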
arXiv Detail & Related papers (2020-08-24T04:08:48Z)