Initialization and Regularization of Factorized Neural Layers
- URL: http://arxiv.org/abs/2105.01029v1
- Date: Mon, 3 May 2021 17:28:07 GMT
- Title: Initialization and Regularization of Factorized Neural Layers
- Authors: Mikhail Khodak and Neil Tenenholtz and Lester Mackey and Nicolò Fusi
- Abstract summary: We show how to initialize and regularize factorized layers in deep nets using spectral initialization and Frobenius decay.
These schemes improve performance in model compression, knowledge distillation, and multi-head attention, including translation and unsupervised pre-training.
- Score: 23.875225732697142
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Factorized layers--operations parameterized by products of two or more
matrices--occur in a variety of deep learning contexts, including compressed
model training, certain types of knowledge distillation, and multi-head
self-attention architectures. We study how to initialize and regularize deep
nets containing such layers, examining two simple, understudied schemes,
spectral initialization and Frobenius decay, for improving their performance.
The guiding insight is to design optimization routines for these networks that
are as close as possible to those of their well-tuned, non-decomposed
counterparts; we back this intuition with an analysis of how the initialization
and regularization schemes impact training with gradient descent, drawing on
modern attempts to understand the interplay of weight-decay and
batch-normalization. Empirically, we highlight the benefits of spectral
initialization and Frobenius decay across a variety of settings. In model
compression, we show that they enable low-rank methods to significantly
outperform both unstructured sparsity and tensor methods on the task of
training low-memory residual networks; analogs of the schemes also improve the
performance of tensor decomposition techniques. For knowledge distillation,
Frobenius decay enables a simple, overcomplete baseline that yields a compact
model from over-parameterized training without requiring retraining with or
pruning a teacher network. Finally, we show how both schemes applied to
multi-head attention lead to improved performance on both translation and
unsupervised pre-training.
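As a concrete illustration of the two schemes, here is a minimal sketch for a single factorized linear layer; the module name, rank, and penalty coefficient are illustrative assumptions, not the paper's code or exact recipe.
```python
import torch
import torch.nn as nn

class FactorizedLinear(nn.Module):
    """Linear layer parameterized as W = U @ V^T with rank-r factors."""
    def __init__(self, in_features, out_features, rank):
        super().__init__()
        self.U = nn.Parameter(torch.empty(out_features, rank))
        self.V = nn.Parameter(torch.empty(in_features, rank))
        self.spectral_init_()

    def spectral_init_(self):
        # Spectral initialization: draw a full matrix with a standard
        # initializer, then set the factors from its truncated SVD so that
        # U @ V^T equals the rank-r approximation of that standard init.
        W = torch.empty(self.U.shape[0], self.V.shape[0])
        nn.init.kaiming_normal_(W)
        P, S, Qh = torch.linalg.svd(W, full_matrices=False)
        r = self.U.shape[1]
        scale = S[:r].sqrt()
        with torch.no_grad():
            self.U.copy_(P[:, :r] * scale)
            self.V.copy_(Qh[:r, :].T * scale)

    def forward(self, x):
        return x @ self.V @ self.U.T  # equivalent to x @ (U V^T)^T

def frobenius_decay(layer, coef):
    # Frobenius decay: penalize ||U V^T||_F^2, the norm of the product,
    # instead of applying ordinary weight decay to U and V separately,
    # mimicking weight decay on the non-decomposed layer.
    W = layer.U @ layer.V.T
    return coef * (W ** 2).sum()

# Usage sketch: add the penalty to the task loss and turn off the
# optimizer's built-in weight_decay for the factor parameters.
layer = FactorizedLinear(256, 128, rank=32)
x = torch.randn(8, 256)
loss = layer(x).pow(2).mean() + frobenius_decay(layer, 1e-4)
loss.backward()
```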
Related papers
- Structure-Preserving Network Compression Via Low-Rank Induced Training Through Linear Layers Composition [11.399520888150468]
Deep Neural Networks (DNNs) have achieved remarkable success in addressing many previously unsolvable tasks.
The storage and computational requirements associated with DNNs pose a challenge for deploying these trained models on resource-limited devices.
We present a theoretically justified novel approach, termed Low-Rank Induced Training (LoRITa).
LoRITa promotes low-rankness through the composition of linear layers and compresses by using singular value truncation.
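A rough sketch of that idea, not LoRITa itself: one layer is trained as a composition of two linear maps and later compressed by singular value truncation of their product; the shapes and the truncation rule below are assumptions.
```python
import torch
from torch import nn

d, hidden = 256, 512

# One layer written as a composition of two linear maps with no activation
# in between; the effective weight used at inference is W2 @ W1.
W1 = nn.Parameter(0.02 * torch.randn(hidden, d))
W2 = nn.Parameter(0.02 * torch.randn(d, hidden))

def effective_weight():
    return W2 @ W1  # (d, d)

# ... train W1 and W2 on the task as usual (omitted) ...

# Compression after training: singular value truncation of the product.
with torch.no_grad():
    U, S, Vh = torch.linalg.svd(effective_weight(), full_matrices=False)
    keep = int((S > 1e-3 * S[0]).sum())      # illustrative truncation rule
    W_left = U[:, :keep] * S[:keep]          # (d, keep)
    W_right = Vh[:keep, :]                   # (keep, d)
    # The two thin factors W_left @ W_right replace the trained composition.
```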
arXiv Detail & Related papers (2024-05-06T00:58:23Z)
- Stacking as Accelerated Gradient Descent [44.17524017365296]
Stacking is a technique for training deep residual networks by progressively increasing the number of layers.
We propose a theoretical explanation for the efficacy of stacking.
We prove that for certain deep linear residual networks, stacking does provide accelerated training.
arXiv Detail & Related papers (2024-03-08T01:23:25Z)
- On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function that offers more mathematical opportunities to analyze closed-form dynamics.
The unhinged loss allows for considering more practical techniques, such as time-varying learning rates and feature normalization.
arXiv Detail & Related papers (2023-12-13T02:11:07Z)
- Rank-adaptive spectral pruning of convolutional layers during training [2.3488056916440856]
We propose a low-parametric training method that factorizes the convolutions into tensor Tucker format and adaptively prunes the Tucker ranks of the convolutional kernel during training.
We obtain a robust training algorithm that provably approximates the full baseline performance and guarantees loss descent.
A variety of experiments against the full model and alternative low-rank baselines are implemented, showing that the proposed method drastically reduces the training costs, while achieving high performance, comparable to or better than the full baseline, and consistently outperforms competing low-rank approaches.
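The paper's rank-adaptive training algorithm is not reproduced here; the sketch below only shows how a convolutional kernel can be put into Tucker format with a truncated higher-order SVD, using illustrative ranks.
```python
import numpy as np

def mode_unfold(T, mode):
    # Matricize the tensor along the given mode.
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def truncated_hosvd(kernel, ranks):
    # Tucker-format approximation of a conv kernel (out_ch, in_ch, kh, kw):
    # per-mode factor matrices from truncated SVDs of the unfoldings,
    # plus a small core obtained by projecting onto those factors.
    factors = [np.linalg.svd(mode_unfold(kernel, m), full_matrices=False)[0][:, :r]
               for m, r in enumerate(ranks)]
    core = kernel
    for m, U in enumerate(factors):
        core = np.moveaxis(np.tensordot(U.T, np.moveaxis(core, m, 0), axes=1), 0, m)
    return core, factors

kernel = np.random.randn(64, 32, 3, 3)
core, factors = truncated_hosvd(kernel, ranks=(16, 8, 3, 3))
# core.shape == (16, 8, 3, 3); the factors map the small core back to full size.
```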
arXiv Detail & Related papers (2023-05-30T14:20:51Z)
- Iterative Soft Shrinkage Learning for Efficient Image Super-Resolution [91.3781512926942]
Image super-resolution (SR) has witnessed extensive neural network designs from CNN to transformer architectures.
This work investigates the potential of network pruning for super-resolution to take advantage of off-the-shelf network designs and reduce the underlying computational overhead.
We propose a novel Iterative Soft Shrinkage-Percentage (ISS-P) method that optimizes the sparse structure of a randomly initialized network at each iteration and tweaks unimportant weights on-the-fly by a small amount proportional to their magnitudes.
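ISS-P's exact importance criterion and shrinkage schedule are not given here; the following is only a generic soft-shrinkage step under assumed hyperparameters.
```python
import torch

def soft_shrink_step(weight, sparsity=0.5, shrink=0.1):
    # Instead of zeroing the smallest-magnitude weights outright, shrink
    # them by a small fraction of their own value each step, so weights
    # marked unimportant can still recover later in training.
    with torch.no_grad():
        k = max(1, int(sparsity * weight.numel()))
        threshold = weight.abs().flatten().kthvalue(k).values
        unimportant = weight.abs() <= threshold
        weight[unimportant] *= (1.0 - shrink)

w = torch.nn.Parameter(torch.randn(128, 128))
# ... called after each optimizer step during sparse training ...
soft_shrink_step(w, sparsity=0.5, shrink=0.1)
```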
arXiv Detail & Related papers (2023-03-16T21:06:13Z)
- Layerwise Sparsifying Training and Sequential Learning Strategy for Neural Architecture Adaptation [0.0]
This work presents a two-stage framework for developing neural architectures to adapt/generalize well on a given training data set.
In the first stage, a manifold-regularized layerwise sparsifying training approach is adopted where a new layer is added each time and trained independently by freezing parameters in the previous layers.
In the second stage, a sequential learning process is adopted where a sequence of small networks is employed to extract information from the residual produced in stage I.
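A minimal sketch of the stage-I mechanics (append a new block and freeze everything trained so far); the manifold regularizer and the stage-II residual learners are omitted, and all names are illustrative.
```python
import torch
import torch.nn as nn

def grow_and_freeze(model: nn.Sequential, new_block: nn.Module):
    # Freeze the layers trained in earlier rounds, then append a new block
    # that will be trained on its own in the next round.
    for p in model.parameters():
        p.requires_grad = False
    model.append(new_block)
    return list(new_block.parameters())

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU())
trainable = grow_and_freeze(model, nn.Sequential(nn.Linear(64, 64), nn.ReLU()))
optimizer = torch.optim.Adam(trainable, lr=1e-3)  # optimize the new layer only
```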
arXiv Detail & Related papers (2022-11-13T09:51:16Z)
- Slimmable Networks for Contrastive Self-supervised Learning [67.21528544724546]
Self-supervised learning makes significant progress in pre-training large models, but struggles with small models.
We present a one-stage solution to obtain pre-trained small models without the need for extra teachers.
A slimmable network consists of a full network and several weight-sharing sub-networks, which can be pre-trained once to obtain various networks.
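A toy weight-sharing layer in the spirit of slimmable networks; the width multipliers and shapes below are assumptions for illustration.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlimmableLinear(nn.Module):
    # One shared weight matrix; each sub-network uses only the leading slice
    # of output features, so every width shares the same parameters.
    def __init__(self, in_features, max_out_features):
        super().__init__()
        self.weight = nn.Parameter(0.02 * torch.randn(max_out_features, in_features))
        self.bias = nn.Parameter(torch.zeros(max_out_features))

    def forward(self, x, width_mult=1.0):
        out = max(1, int(width_mult * self.weight.shape[0]))
        return F.linear(x, self.weight[:out], self.bias[:out])

layer = SlimmableLinear(128, 256)
x = torch.randn(4, 128)
# The same parameters serve the full network and the slimmed sub-networks.
full, half, quarter = layer(x, 1.0), layer(x, 0.5), layer(x, 0.25)
```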
arXiv Detail & Related papers (2022-09-30T15:15:05Z)
- Defensive Tensorization [113.96183766922393]
We propose defensive tensorization, an adversarial defence technique that leverages a latent high-order factorization of the network.
We empirically demonstrate the effectiveness of our approach on standard image classification benchmarks.
We validate the versatility of our approach across domains and low-precision architectures by considering an audio task and binary networks.
arXiv Detail & Related papers (2021-10-26T17:00:16Z)
- Powerpropagation: A sparsity inducing weight reparameterisation [65.85142037667065]
We introduce Powerpropagation, a new weight-parameterisation for neural networks that leads to inherently sparse models.
Models trained in this manner exhibit similar performance, but have a distribution with markedly higher density at zero, allowing more parameters to be pruned safely.
Here, we combine Powerpropagation with a traditional weight-pruning technique as well as recent state-of-the-art sparse-to-sparse algorithms, showing superior performance on the ImageNet benchmark.
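A hedged sketch of the reparameterisation, often written as w = theta * |theta|^(alpha - 1) with alpha > 1; the exact variant, initialization, and pruning step used in the paper may differ.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PowerpropLinear(nn.Module):
    # Effective weight w = theta * |theta|**(alpha - 1). For alpha > 1 the
    # gradient with respect to theta picks up a factor proportional to
    # |theta|**(alpha - 1), so small weights move little and the trained
    # model is easier to prune by magnitude.
    def __init__(self, in_features, out_features, alpha=2.0):
        super().__init__()
        self.alpha = alpha
        self.theta = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.kaiming_normal_(self.theta)

    def forward(self, x):
        w = self.theta * self.theta.abs().pow(self.alpha - 1.0)
        return F.linear(x, w)

layer = PowerpropLinear(64, 32, alpha=2.0)
out = layer(torch.randn(8, 64))
# After training, magnitude pruning is applied to the effective weights w.
```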
arXiv Detail & Related papers (2021-10-01T10:03:57Z)
- Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix.
Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool.
arXiv Detail & Related papers (2020-04-20T18:12:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.