Logit Attenuating Weight Normalization
- URL: http://arxiv.org/abs/2108.05839v1
- Date: Thu, 12 Aug 2021 16:44:24 GMT
- Title: Logit Attenuating Weight Normalization
- Authors: Aman Gupta, Rohan Ramanath, Jun Shi, Anika Ramachandran, Sirou Zhou,
Mingzhou Zhou, S. Sathiya Keerthi
- Abstract summary: Deep networks trained using gradient-based optimizers are a popular choice for solving classification and ranking problems.
Without appropriately tuned $\ell_2$ regularization or weight decay, such networks have the tendency to make output scores (logits) and network weights large.
We propose a method called Logit Attenuating Weight Normalization (LAWN) that can be stacked onto any gradient-based optimizer.
- Score: 5.856897366207895
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Over-parameterized deep networks trained using gradient-based optimizers are
a popular choice for solving classification and ranking problems. Without
appropriately tuned $\ell_2$ regularization or weight decay, such networks have
the tendency to make output scores (logits) and network weights large, causing
training loss to become too small and the network to lose its adaptivity
(ability to move around) in the parameter space. Although regularization is
typically understood from an overfitting perspective, we highlight its role in
making the network more adaptive and enabling it to escape more easily from
weights that generalize poorly. To provide such a capability, we propose a
method called Logit Attenuating Weight Normalization (LAWN), that can be
stacked onto any gradient-based optimizer. LAWN controls the logits by
constraining the weight norms of layers in the final homogeneous sub-network.
Empirically, we show that the resulting LAWN variant of the optimizer makes a
deep network more adaptive to finding minima with superior generalization
performance on large-scale image classification and recommender systems. While
LAWN is particularly impressive in improving Adam, it greatly improves all
optimizers when used with large batch sizes.
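As a reading aid, here is a minimal, hypothetical sketch of the mechanism the abstract describes: after each optimizer step, the weights of the final homogeneous layers are rescaled so their norms stay within a fixed budget, which caps the logits. This is not the authors' code; `lawn_step`, `final_layer_params`, and `norm_budget` are assumed names, and the projection rule is one plausible reading of "constraining the weight norms".

```python
# Hypothetical sketch of the LAWN idea (not the authors' implementation): take an
# ordinary optimizer step, then project the weights of the final homogeneous layers
# back onto a fixed-norm ball so the logits they produce stay bounded.
import torch


def lawn_step(optimizer, final_layer_params, norm_budget=10.0):
    """One step of any gradient-based optimizer, followed by a weight-norm
    constraint on the listed parameters (assumed to be the final sub-network)."""
    optimizer.step()
    with torch.no_grad():
        for p in final_layer_params:
            norm = p.norm()
            if norm > norm_budget:
                p.mul_(norm_budget / norm)  # rescale back onto the norm ball


# Usage sketch (names are illustrative):
#   loss.backward()
#   lawn_step(adam, [model.classifier.weight, model.classifier.bias], norm_budget=10.0)
```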
Related papers
- FedNAR: Federated Optimization with Normalized Annealing Regularization [54.42032094044368]
We explore the choice of weight decay and identify that its value appreciably influences the convergence of existing FL algorithms.
We develop Federated Optimization with Normalized Annealing Regularization (FedNAR), a plug-in that can be seamlessly integrated into any existing FL algorithm.
arXiv Detail & Related papers (2023-10-04T21:11:40Z)
- Weight Compander: A Simple Weight Reparameterization for Regularization [5.744133015573047]
We introduce weight compander, a novel and effective method to improve the generalization of deep neural networks.
We show experimentally that using weight compander in addition to standard regularization methods improves the performance of neural networks.
arXiv Detail & Related papers (2023-06-29T14:52:04Z)
- Combining Explicit and Implicit Regularization for Efficient Learning in Deep Networks [3.04585143845864]
In deep linear networks, gradient descent implicitly regularizes toward low-rank solutions on matrix completion/factorization tasks.
We propose an explicit penalty to mirror this implicit bias, which only takes effect with certain adaptive gradient optimizers.
This combination can enable a single-layer network to achieve low-rank approximations with error comparable to deep linear networks.
arXiv Detail & Related papers (2023-06-01T04:47:17Z)
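The entry above describes an explicit penalty that mirrors the implicit low-rank bias of gradient descent. As a hedged illustration only, the sketch below uses the nuclear norm (sum of singular values) as a generic low-rank-encouraging penalty; the paper's actual penalty, and how it couples to adaptive optimizers, may differ.

```python
# Hedged illustration of an explicit low-rank-encouraging penalty; the nuclear norm is
# used purely as an example and is not claimed to be the paper's regularizer.
import torch


def nuclear_norm_penalty(weight: torch.Tensor, strength: float = 1e-3) -> torch.Tensor:
    """Penalty proportional to the sum of singular values of a 2-D weight matrix,
    which encourages low-rank solutions when added to the training loss."""
    return strength * torch.linalg.svdvals(weight).sum()


# Usage sketch:
#   loss = task_loss + nuclear_norm_penalty(model.weight)
```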
- Iterative Soft Shrinkage Learning for Efficient Image Super-Resolution [91.3781512926942]
Image super-resolution (SR) has witnessed extensive neural network designs from CNN to transformer architectures.
This work investigates the potential of network pruning for super-resolution to take advantage of off-the-shelf network designs and reduce the underlying computational overhead.
We propose a novel Iterative Soft Shrinkage-Percentage (ISS-P) method that optimizes the sparse structure of a randomly initialized network at each iteration and tweaks unimportant weights on-the-fly by a small amount proportional to their magnitude.
arXiv Detail & Related papers (2023-03-16T21:06:13Z)
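The ISS-P entry above describes shrinking unimportant weights of a randomly initialized network by a small amount proportional to their magnitude at each iteration. The sketch below is a hedged reading of that step, assuming "unimportant" means the smallest fraction `p` of weights by magnitude; `p` and `shrink` are illustrative names and values, not the paper's notation.

```python
# Hedged sketch of the per-iteration soft shrinkage step described in the ISS-P entry.
import torch


def soft_shrink_(weight: torch.Tensor, p: float = 0.1, shrink: float = 0.01) -> None:
    """Multiply the smallest `p` fraction of weights (by magnitude) by (1 - shrink),
    shrinking them proportionally to their own magnitude."""
    with torch.no_grad():
        flat = weight.abs().flatten()
        k = max(1, int(p * flat.numel()))
        threshold = flat.kthvalue(k).values      # magnitude of the k-th smallest weight
        mask = weight.abs() <= threshold         # the "unimportant" weights
        weight[mask] *= (1.0 - shrink)           # proportional soft shrinkage
```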
- Slimmable Networks for Contrastive Self-supervised Learning [69.9454691873866]
Self-supervised learning has made significant progress in pre-training large models, but it struggles with small models.
We introduce a one-stage solution to obtain pre-trained small models without the need for extra teachers.
A slimmable network consists of a full network and several weight-sharing sub-networks, which can be pre-trained once to obtain various networks.
arXiv Detail & Related papers (2022-09-30T15:15:05Z)
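The slimmable-network entry above describes a full network whose sub-networks share its weights. The sketch below shows one way such weight sharing can look for a single linear layer, with each sub-network taking the leading slice of the full weight matrix; the class name and slicing scheme are assumptions, not the paper's architecture.

```python
# Hedged sketch of weight sharing between a full layer and its slimmed sub-networks.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SlimmableLinear(nn.Module):
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.full = nn.Linear(in_features, out_features)

    def forward(self, x: torch.Tensor, width: float = 1.0) -> torch.Tensor:
        """Run the sub-network that keeps a `width` fraction of the units,
        sharing its weights with the full layer."""
        in_w = max(1, int(width * self.full.in_features))
        out_w = max(1, int(width * self.full.out_features))
        w = self.full.weight[:out_w, :in_w]    # shared leading slice of the weights
        b = self.full.bias[:out_w]
        return F.linear(x[..., :in_w], w, b)
```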
- Adaptive Low-Rank Regularization with Damping Sequences to Restrict Lazy Weights in Deep Networks [13.122543280692641]
This paper detects a subset of the weighting layers that cause overfitting; overfitting is recognized via matrix and tensor condition numbers.
An adaptive regularization scheme entitled Adaptive Low-Rank (ALR) is proposed that converges a subset of the weighting layers to their Low-Rank Factorization (LRF).
The experimental results show that ALR regularizes deep networks well, with high training speed and low resource usage.
arXiv Detail & Related papers (2021-06-17T17:28:14Z)
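The ALR entry above recognizes overfitting-prone layers via matrix and tensor condition numbers. The sketch below illustrates that detection step for 2-D weight matrices using singular values; the threshold and function name are illustrative, and the subsequent low-rank factorization step is not shown.

```python
# Hedged sketch of condition-number-based detection of layers to regularize.
import torch


def ill_conditioned_layers(named_weights, threshold: float = 1e3):
    """Return names of 2-D weight matrices whose condition number (ratio of largest
    to smallest singular value) exceeds `threshold`."""
    flagged = []
    for name, w in named_weights:
        if w.dim() != 2:
            continue                           # higher-order tensors would need matricization
        s = torch.linalg.svdvals(w)            # singular values, descending
        cond = s[0] / s[-1].clamp_min(1e-12)
        if cond > threshold:
            flagged.append(name)
    return flagged
```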
- Rethinking Skip Connection with Layer Normalization in Transformers and ResNets [49.87919454950763]
Skip connection is a widely-used technique to improve the performance of deep neural networks.
In this work, we investigate how scale factors affect the effectiveness of the skip connection.
arXiv Detail & Related papers (2021-05-15T11:44:49Z)
- Layer-adaptive sparsity for the Magnitude-based Pruning [88.37510230946478]
We propose a novel importance score for global pruning, coined the layer-adaptive magnitude-based pruning (LAMP) score.
LAMP consistently outperforms popular existing schemes for layerwise sparsity selection.
arXiv Detail & Related papers (2020-10-15T09:14:02Z)
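The entry above names the LAMP score without stating it. As commonly described, a weight's score is its squared magnitude divided by the sum of squared magnitudes of all weights in the same layer that are at least as large; the sketch below implements that reading and should be checked against the paper.

```python
# Hedged sketch of the LAMP importance score as commonly stated (not the paper's code).
import torch


def lamp_scores(weight: torch.Tensor) -> torch.Tensor:
    """Return LAMP scores with the same shape as `weight`."""
    flat = weight.flatten() ** 2
    order = torch.argsort(flat)                  # ascending by squared magnitude
    sorted_sq = flat[order]
    # Denominator: sum of squared magnitudes of this weight and all larger ones.
    suffix_sums = torch.flip(torch.cumsum(torch.flip(sorted_sq, [0]), 0), [0])
    scores = torch.empty_like(flat)
    scores[order] = sorted_sq / suffix_sums.clamp_min(1e-12)
    return scores.view_as(weight)


# Global pruning sketch: gather scores from every layer, then remove the weights with
# the smallest LAMP scores across the whole network until the target sparsity is met.
```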
- Improve Generalization and Robustness of Neural Networks via Weight Scale Shifting Invariant Regularizations [52.493315075385325]
We show that a family of regularizers, including weight decay, is ineffective at penalizing the intrinsic norms of weights for networks with homogeneous activation functions.
We propose an improved regularizer that is invariant to weight scale shifting and thus effectively constrains the intrinsic norm of a neural network.
arXiv Detail & Related papers (2020-08-07T02:55:28Z)
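The last entry above argues that weight decay cannot penalize the intrinsic norm of networks with homogeneous activations, because consecutive layers can trade scale factors without changing the function. As a hedged illustration of scale-shift invariance (not the paper's exact regularizer), the sketch below penalizes the product of per-layer norms, which is unchanged when one layer is scaled by c and the next by 1/c.

```python
# Hedged illustration of a penalty that is invariant to weight scale shifting across
# homogeneous layers; plain weight decay (sum of squared norms) is not invariant.
import torch


def scale_shift_invariant_penalty(layer_weights):
    """Product of per-layer Frobenius norms, computed as exp(sum of log norms)
    for numerical stability."""
    log_norms = [w.norm().clamp_min(1e-12).log() for w in layer_weights]
    return torch.exp(torch.stack(log_norms).sum())


# Quick check of the invariance (illustrative):
#   w1, w2 = torch.randn(8, 4), torch.randn(3, 8)
#   scale_shift_invariant_penalty([w1, w2]) == scale_shift_invariant_penalty([2 * w1, 0.5 * w2])
```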