Initialization and Regularization of Factorized Neural Layers
- URL: http://arxiv.org/abs/2105.01029v1
- Date: Mon, 3 May 2021 17:28:07 GMT
- Title: Initialization and Regularization of Factorized Neural Layers
- Authors: Mikhail Khodak, Neil Tenenholtz, Lester Mackey, and Nicolò Fusi
- Abstract summary: We show how to initialize and regularize factorized layers in deep nets.
We show how these schemes lead to improved performance on both translation and unsupervised pre-training.
- Score: 23.875225732697142
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Factorized layers--operations parameterized by products of two or more
matrices--occur in a variety of deep learning contexts, including compressed
model training, certain types of knowledge distillation, and multi-head
self-attention architectures. We study how to initialize and regularize deep
nets containing such layers, examining two simple, understudied schemes,
spectral initialization and Frobenius decay, for improving their performance.
The guiding insight is to design optimization routines for these networks that
are as close as possible to that of their well-tuned, non-decomposed
counterparts; we back this intuition with an analysis of how the initialization
and regularization schemes impact training with gradient descent, drawing on
modern attempts to understand the interplay of weight-decay and
batch-normalization. Empirically, we highlight the benefits of spectral
initialization and Frobenius decay across a variety of settings. In model
compression, we show that they enable low-rank methods to significantly
outperform both unstructured sparsity and tensor methods on the task of
training low-memory residual networks; analogs of the schemes also improve the
performance of tensor decomposition techniques. For knowledge distillation,
Frobenius decay enables a simple, overcomplete baseline that yields a compact
model from over-parameterized training without requiring retraining with or
pruning a teacher network. Finally, we show how both schemes applied to
multi-head attention lead to improved performance on both translation and
unsupervised pre-training.
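The two schemes are concrete enough to sketch in code. Below is a minimal PyTorch sketch, assuming a two-factor parameterization W ≈ U Vᵀ and a Kaiming-style dense initialization; the paper's exact initialization scale and decay coefficient are not reproduced here.

```python
import torch

def spectral_init(out_dim, in_dim, rank):
    """Spectral initialization of a rank-r factorization U @ V.T of a dense layer.

    A minimal sketch: initialize the full matrix as the non-factorized layer
    would be (Kaiming here; the paper's choice may differ), then split its
    truncated SVD between the two factors.
    """
    W = torch.empty(out_dim, in_dim)
    torch.nn.init.kaiming_normal_(W)                 # init as the unfactorized layer
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    sqrt_S = S[:rank].sqrt()
    U_r = U[:, :rank] * sqrt_S                       # (out_dim, rank)
    V_r = Vh[:rank, :].T * sqrt_S                    # (in_dim, rank)
    return torch.nn.Parameter(U_r), torch.nn.Parameter(V_r)

def frobenius_decay(U, V, coeff=1e-4):
    """Frobenius decay: penalize ||U V^T||_F^2 of the product, not the factors."""
    return coeff * (U @ V.T).pow(2).sum()

# usage: add frobenius_decay(U, V) to the task loss in place of weight decay on U, V
U, V = spectral_init(256, 512, rank=32)
penalty = frobenius_decay(U, V)
```

The key design choice is that both the initialization and the penalty are defined through the product U Vᵀ, so the factorized layer starts and is regularized like its well-tuned, non-decomposed counterpart.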
Related papers
- Component-based Sketching for Deep ReLU Nets [55.404661149594375]
We develop a sketching scheme based on deep net components for various tasks.
We transform deep net training into a linear empirical risk minimization problem.
We show that the proposed component-based sketching provides almost optimal rates in approximating saturated functions.
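To illustrate the reduction to linear empirical risk minimization, here is a generic random-feature-style sketch in which fixed (non-trained) deep ReLU components produce features and only a linear output layer is fit. The paper's actual component construction and guarantees are more involved; treat this purely as an illustration of the reduction.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def random_deep_features(X, widths, rng):
    """Fixed deep ReLU components acting as a feature sketch (not trained)."""
    H = X
    for w in widths:
        W = rng.standard_normal((H.shape[1], w)) / np.sqrt(H.shape[1])
        H = relu(H @ W)
    return H

# With the components frozen, training reduces to linear empirical risk
# minimization (here: ridge regression) over the output layer only.
rng = np.random.default_rng(0)
X, y = rng.standard_normal((128, 20)), rng.standard_normal(128)
Phi = random_deep_features(X, widths=[64, 64], rng=rng)
lam = 1e-2
w_out = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)
```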
arXiv Detail & Related papers (2024-09-21T15:30:43Z) - Enhancing Robustness of Vision-Language Models through Orthogonality Learning and Self-Regularization [77.62516752323207]
We introduce an orthogonal fine-tuning method for efficiently fine-tuning pretrained weights and enabling enhanced robustness and generalization.
A self-regularization strategy is further employed to maintain the zero-shot generalization ability of the VLM; the method is dubbed OrthSR.
For the first time, we revisit CLIP and CoOp with our method to effectively improve these models in few-shot image classification scenarios.
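As an illustration of orthogonal fine-tuning in general (not necessarily OrthSR's exact parameterization or its self-regularization term), the sketch below rotates a frozen pretrained weight by a learned orthogonal matrix obtained from a Cayley transform.

```python
import torch

class OrthogonalFineTune(torch.nn.Module):
    """Generic sketch of orthogonal fine-tuning: rotate a frozen pretrained
    weight by a learned orthogonal matrix (Cayley transform)."""

    def __init__(self, pretrained_weight):
        super().__init__()
        out_dim = pretrained_weight.shape[0]
        self.register_buffer("W0", pretrained_weight)      # frozen pretrained weight
        self.skew = torch.nn.Parameter(torch.zeros(out_dim, out_dim))

    def forward(self, x):
        A = self.skew - self.skew.T                         # skew-symmetric
        I = torch.eye(A.shape[0], device=A.device)
        R = torch.linalg.solve(I + A, I - A)                # Cayley: R is orthogonal
        return x @ (R @ self.W0).T                          # rotated pretrained layer

# usage with a stand-in pretrained weight of shape (out_dim, in_dim)
layer = OrthogonalFineTune(torch.randn(10, 32))
out = layer(torch.randn(4, 32))                             # (4, 10)
```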
arXiv Detail & Related papers (2024-07-11T10:35:53Z) - Contrastive-Adversarial and Diffusion: Exploring pre-training and fine-tuning strategies for sulcal identification [3.0398616939692777]
Techniques like adversarial learning, contrastive learning, diffusion denoising learning, and ordinary reconstruction learning have become standard.
The study aims to elucidate the advantages of pre-training techniques and fine-tuning strategies to enhance the learning process of neural networks.
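Of the pre-training objectives listed above, contrastive learning is the easiest to pin down in a few lines. Below is a minimal NT-Xent (SimCLR-style) loss over two augmented views, included only as a reference implementation of that family, not as this study's exact setup.

```python
import torch
import torch.nn.functional as F

def ntxent_loss(z1, z2, temperature=0.5):
    """Minimal NT-Xent contrastive loss over two augmented views (N pairs)."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)       # (2N, d)
    sim = z @ z.T / temperature                               # cosine similarities
    n = z1.shape[0]
    mask = torch.eye(2 * n, dtype=torch.bool)
    sim = sim.masked_fill(mask, float("-inf"))                # drop self-similarity
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)                      # positives as targets
```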
arXiv Detail & Related papers (2024-05-29T15:44:51Z) - Structure-Preserving Network Compression Via Low-Rank Induced Training Through Linear Layers Composition [11.399520888150468]
We present a theoretically-justified technique termed Low-Rank Induced Training (LoRITa).
LoRITa promotes low-rankness through the composition of linear layers and compresses by using singular value truncation.
We demonstrate the effectiveness of our approach using MNIST on Fully Connected Networks, CIFAR10 on Vision Transformers, and CIFAR10/100 and ImageNet on Convolutional Neural Networks.
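A minimal sketch of that recipe, assuming a two-factor composition and leaving out the paper's choice of depth, weight decay, and truncation criterion:

```python
import torch

class ComposedLinear(torch.nn.Module):
    """Training-time composition of two full-size linear maps in place of one."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.A = torch.nn.Linear(in_dim, out_dim, bias=False)
        self.B = torch.nn.Linear(out_dim, out_dim, bias=False)

    def forward(self, x):
        return self.B(self.A(x))

    def compress(self, rank):
        """Post-training compression by singular value truncation of the product."""
        W = self.B.weight @ self.A.weight                    # effective (out, in) matrix
        U, S, Vh = torch.linalg.svd(W, full_matrices=False)
        return U[:, :rank] * S[:rank], Vh[:rank, :]          # factors of the rank-r layer
```

At compression time the trained matrices are collapsed into one effective weight and truncated, matching the "composition of linear layers, then singular value truncation" recipe described above.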
arXiv Detail & Related papers (2024-05-06T00:58:23Z) - Stacking as Accelerated Gradient Descent [44.17524017365296]
Stacking is a technique for training deep residual networks by progressively increasing the number of layers.
We propose a theoretical explanation for the efficacy of stacking.
We prove that for certain deep linear residual networks, stacking does provide accelerated training.
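For concreteness, the sketch below grows a small residual MLP by appending a copy of its last trained block, which is the commonly used stacking heuristic; the paper's contribution is the acceleration analysis, not this code.

```python
import copy
import torch

def grow_by_stacking(blocks):
    """Stacking sketch: append a new residual block initialized as a copy of
    the last trained block, then continue training the deeper network."""
    new_block = copy.deepcopy(blocks[-1])
    return torch.nn.ModuleList(list(blocks) + [new_block])

class ResidualMLP(torch.nn.Module):
    def __init__(self, dim, depth):
        super().__init__()
        self.blocks = torch.nn.ModuleList(
            torch.nn.Linear(dim, dim) for _ in range(depth))

    def forward(self, x):
        for block in self.blocks:
            x = x + torch.relu(block(x))                      # residual connection
        return x

model = ResidualMLP(dim=64, depth=2)
# ... train `model` for a while, then deepen it:
model.blocks = grow_by_stacking(model.blocks)
```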
arXiv Detail & Related papers (2024-03-08T01:23:25Z) - On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function that offers more mathematical opportunities to analyze closed-form dynamics.
The unhinged loss allows for considering more practical techniques, such as time-varying learning rates and feature normalization.
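For reference, the classical binary unhinged loss is linear in the margin, which is what makes closed-form dynamics tractable; the multi-class definition used in the paper may differ, so the display below is only an orientation point.

```latex
% Classical binary unhinged loss (linear in the margin y f(x)):
\[
  \ell_{\mathrm{unhinged}}\bigl(y, f(x)\bigr) = 1 - y\, f(x), \qquad y \in \{-1, +1\},
\]
% so its gradient with respect to f(x) is the constant -y, giving training
% dynamics that can be integrated in closed form.
```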
arXiv Detail & Related papers (2023-12-13T02:11:07Z) - Uncovering mesa-optimization algorithms in Transformers [61.06055590704677]
Some autoregressive models can learn as an input sequence is processed, without undergoing any parameter changes, and without being explicitly trained to do so.
We show that standard next-token prediction error minimization gives rise to a subsidiary learning algorithm that adjusts the model as new inputs are revealed.
Our findings explain in-context learning as a product of autoregressive loss minimization and inform the design of new optimization-based Transformer layers.
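The flavor of such a subsidiary learning algorithm can be seen in the standard toy construction where unnormalized linear attention over in-context (x_i, y_i) pairs reproduces one gradient-descent step on the in-context least-squares loss; this is an illustrative identity, not the paper's trained models.

```python
import torch

torch.manual_seed(0)
d, n, eta = 8, 32, 0.1
X = torch.randn(n, d)                      # in-context inputs
y = torch.randn(n)                         # in-context targets
x_q = torch.randn(d)                       # query token

# One gradient-descent step on the in-context least-squares loss, from w = 0:
#   w_1 = eta * sum_i y_i x_i   =>   prediction  <w_1, x_q>
w_1 = eta * (y.unsqueeze(1) * X).sum(dim=0)
pred_gd = w_1 @ x_q

# Unnormalized linear attention with keys x_i, values y_i, and query x_q
# computes the same quantity (up to the factor eta).
pred_attn = eta * (y * (X @ x_q)).sum()

print(torch.allclose(pred_gd, pred_attn))  # expected: True
```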
arXiv Detail & Related papers (2023-09-11T22:42:50Z) - Geometry-aware training of factorized layers in tensor Tucker format [6.701651480567394]
We introduce a novel approach to train the factors of a Tucker decomposition of the weight tensors.
Our proposed training scheme is provably optimal at locally approximating the original unfactorized dynamics.
We provide a theoretical analysis of the algorithm, showing convergence, approximation and local descent guarantees.
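To make the trained object concrete, the sketch below stores a 2-D convolution kernel in Tucker format (a core tensor plus one factor matrix per mode) and reconstructs it on the fly. The paper's geometry-aware update of these factors is not reproduced, and the initialization scale here is an arbitrary placeholder.

```python
import torch
import torch.nn.functional as F

class TuckerConv2d(torch.nn.Module):
    """Convolution whose kernel is stored in Tucker format (core + factors)."""

    def __init__(self, in_ch, out_ch, k, ranks):
        super().__init__()
        r_out, r_in, r_h, r_w = ranks
        self.core = torch.nn.Parameter(torch.randn(r_out, r_in, r_h, r_w) * 0.1)
        self.U_out = torch.nn.Parameter(torch.randn(out_ch, r_out) * 0.1)
        self.U_in = torch.nn.Parameter(torch.randn(in_ch, r_in) * 0.1)
        self.U_h = torch.nn.Parameter(torch.randn(k, r_h) * 0.1)
        self.U_w = torch.nn.Parameter(torch.randn(k, r_w) * 0.1)

    def forward(self, x):
        # Reconstruct the full (out_ch, in_ch, k, k) kernel from the Tucker factors.
        kernel = torch.einsum("abcd,oa,ib,hc,wd->oihw",
                              self.core, self.U_out, self.U_in, self.U_h, self.U_w)
        return F.conv2d(x, kernel, padding="same")

layer = TuckerConv2d(in_ch=16, out_ch=32, k=3, ranks=(8, 8, 3, 3))
out = layer(torch.randn(1, 16, 28, 28))    # (1, 32, 28, 28)
```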
arXiv Detail & Related papers (2023-05-30T14:20:51Z) - Slimmable Networks for Contrastive Self-supervised Learning [69.9454691873866]
Self-supervised learning has made significant progress in pre-training large models but still struggles with small models.
We introduce a one-stage solution for obtaining pre-trained small models without the need for extra teachers.
A slimmable network consists of a full network and several weight-sharing sub-networks, which can be pre-trained once to obtain various networks.
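A minimal sketch of the weight-sharing idea: sub-networks of different widths reuse slices of one weight matrix (the contrastive objective and the paper's training schedule are omitted).

```python
import torch
import torch.nn.functional as F

class SlimmableLinear(torch.nn.Module):
    """Weight-sharing slimmable layer: slimmer sub-networks use weight slices."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(out_dim, in_dim) * 0.02)
        self.bias = torch.nn.Parameter(torch.zeros(out_dim))

    def forward(self, x, width_ratio=1.0):
        out = int(self.weight.shape[0] * width_ratio)
        in_ = min(x.shape[-1], self.weight.shape[1])
        return F.linear(x, self.weight[:out, :in_], self.bias[:out])

# The same parameters serve the full model (ratio 1.0) and slimmer sub-models.
layer = SlimmableLinear(128, 64)
x = torch.randn(4, 128)
full, half = layer(x, 1.0), layer(x, 0.5)   # shapes (4, 64) and (4, 32)
```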
arXiv Detail & Related papers (2022-09-30T15:15:05Z) - DessiLBI: Exploring Structural Sparsity of Deep Networks via Differential Inclusion Paths [45.947140164621096]
We propose a new approach based on differential inclusions of inverse scale spaces.
We show that DessiLBI unveils "winning tickets" in early epochs.
arXiv Detail & Related papers (2020-07-04T04:40:16Z) - Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix.
Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool.
arXiv Detail & Related papers (2020-04-20T18:12:56Z)
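A generic way to track such a Hessian norm during training is power iteration with Hessian-vector products, sketched below; this is a standard estimator, not necessarily the paper's exact procedure.

```python
import torch

def hessian_spectral_norm(loss, params, iters=20):
    """Estimate the largest Hessian eigenvalue of `loss` w.r.t. `params`
    by power iteration with Hessian-vector products."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    for _ in range(iters):
        norm = torch.sqrt(sum((x * x).sum() for x in v))
        v = [x / norm for x in v]                             # unit-norm direction
        hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
        eig = sum((h * x).sum() for h, x in zip(hv, v))       # Rayleigh quotient
        v = [h.detach() for h in hv]
    return eig.detach()

# usage (hypothetical model and loss):
#   params = [p for p in model.parameters() if p.requires_grad]
#   lam_max = hessian_spectral_norm(loss, params)
```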
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the accuracy of the information provided (including all content) and is not responsible for any consequences arising from its use.