Logit Attenuating Weight Normalization
- URL: http://arxiv.org/abs/2108.05839v1
- Date: Thu, 12 Aug 2021 16:44:24 GMT
- Title: Logit Attenuating Weight Normalization
- Authors: Aman Gupta, Rohan Ramanath, Jun Shi, Anika Ramachandran, Sirou Zhou,
Mingzhou Zhou, S. Sathiya Keerthi
- Abstract summary: Deep networks trained using gradient-based optimizers are a popular choice for solving classification and ranking problems.
Without appropriately tuned $\ell_2$ regularization or weight decay, such networks have the tendency to make output scores (logits) and network weights large.
We propose a method called Logit Attenuating Weight Normalization (LAWN) that can be stacked onto any gradient-based optimizer.
- Score: 5.856897366207895
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Over-parameterized deep networks trained using gradient-based optimizers are
a popular choice for solving classification and ranking problems. Without
appropriately tuned $\ell_2$ regularization or weight decay, such networks have
the tendency to make output scores (logits) and network weights large, causing
training loss to become too small and the network to lose its adaptivity
(ability to move around) in the parameter space. Although regularization is
typically understood from an overfitting perspective, we highlight its role in
making the network more adaptive and enabling it to escape more easily from
weights that generalize poorly. To provide such a capability, we propose a
method called Logit Attenuating Weight Normalization (LAWN), that can be
stacked onto any gradient-based optimizer. LAWN controls the logits by
constraining the weight norms of layers in the final homogeneous sub-network.
Empirically, we show that the resulting LAWN variant of the optimizer makes a
deep network more adaptive to finding minima with superior generalization
performance on large-scale image classification and recommender systems. While
LAWN is particularly impressive in improving Adam, it greatly improves all
optimizers when used with large batch sizes.
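As a reading aid, here is a minimal, hypothetical sketch of the mechanism the abstract describes: after each optimizer step, the weights of the final homogeneous layers are rescaled so their norms stay within a fixed budget, which caps the logits. This is not the authors' code; `lawn_step`, `final_layer_params`, and `norm_budget` are assumed names, and the projection rule is one plausible reading of "constraining the weight norms".

```python
# Hypothetical sketch of the LAWN idea (not the authors' implementation): take an
# ordinary optimizer step, then project the weights of the final homogeneous layers
# back onto a fixed-norm ball so the logits they produce stay bounded.
import torch


def lawn_step(optimizer, final_layer_params, norm_budget=10.0):
    """One step of any gradient-based optimizer, followed by a weight-norm
    constraint on the listed parameters (assumed to be the final sub-network)."""
    optimizer.step()
    with torch.no_grad():
        for p in final_layer_params:
            norm = p.norm()
            if norm > norm_budget:
                p.mul_(norm_budget / norm)  # rescale back onto the norm ball


# Usage sketch (names are illustrative):
#   loss.backward()
#   lawn_step(adam, [model.classifier.weight, model.classifier.bias], norm_budget=10.0)
```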
Related papers
- FedNAR: Federated Optimization with Normalized Annealing Regularization [54.42032094044368]
We explore the choice of weight decay and identify that its value appreciably influences the convergence of existing FL algorithms.
We develop Federated Optimization with Normalized Annealing Regularization (FedNAR), a plug-in that can be seamlessly integrated into any existing FL algorithm.
arXiv Detail & Related papers (2023-10-04T21:11:40Z)
- Weight Compander: A Simple Weight Reparameterization for Regularization [5.744133015573047]
We introduce weight compander, a novel and effective method to improve the generalization of deep neural networks.
We show experimentally that using weight compander in addition to standard regularization methods improves the performance of neural networks.
arXiv Detail & Related papers (2023-06-29T14:52:04Z)
- Combining Explicit and Implicit Regularization for Efficient Learning in Deep Networks [3.04585143845864]
In deep linear networks, gradient descent implicitly regularizes toward low-rank solutions on matrix completion/factorization tasks.
We propose an explicit penalty to mirror this implicit bias, which only takes effect with certain adaptive gradient optimizers.
This combination can enable a single-layer network to achieve low-rank approximations with error comparable to deep linear networks.
arXiv Detail & Related papers (2023-06-01T04:47:17Z)
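The entry above describes an explicit penalty that mirrors the implicit low-rank bias of gradient descent. As a hedged illustration only, the sketch below uses the nuclear norm (sum of singular values) as a generic low-rank-encouraging penalty; the paper's actual penalty, and how it couples to adaptive optimizers, may differ.

```python
# Hedged illustration of an explicit low-rank-encouraging penalty; the nuclear norm is
# used purely as an example and is not claimed to be the paper's regularizer.
import torch


def nuclear_norm_penalty(weight: torch.Tensor, strength: float = 1e-3) -> torch.Tensor:
    """Penalty proportional to the sum of singular values of a 2-D weight matrix,
    which encourages low-rank solutions when added to the training loss."""
    return strength * torch.linalg.svdvals(weight).sum()


# Usage sketch:
#   loss = task_loss + nuclear_norm_penalty(model.weight)
```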
- Iterative Soft Shrinkage Learning for Efficient Image Super-Resolution [91.3781512926942]
Image super-resolution (SR) has witnessed extensive neural network designs from CNN to transformer architectures.
This work investigates the potential of network pruning for super-resolution to take advantage of off-the-shelf network designs and reduce the underlying computational overhead.
We propose a novel Iterative Soft Shrinkage-Percentage (ISS-P) method that optimizes the sparse structure of a randomly initialized network at each iteration and tweaks unimportant weights on-the-fly by a small amount proportional to their magnitude.
arXiv Detail & Related papers (2023-03-16T21:06:13Z)
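The ISS-P entry above describes shrinking unimportant weights of a randomly initialized network by a small amount proportional to their magnitude at each iteration. The sketch below is a hedged reading of that step, assuming "unimportant" means the smallest fraction `p` of weights by magnitude; `p` and `shrink` are illustrative names and values, not the paper's notation.

```python
# Hedged sketch of the per-iteration soft shrinkage step described in the ISS-P entry.
import torch


def soft_shrink_(weight: torch.Tensor, p: float = 0.1, shrink: float = 0.01) -> None:
    """Multiply the smallest `p` fraction of weights (by magnitude) by (1 - shrink),
    shrinking them proportionally to their own magnitude."""
    with torch.no_grad():
        flat = weight.abs().flatten()
        k = max(1, int(p * flat.numel()))
        threshold = flat.kthvalue(k).values      # magnitude of the k-th smallest weight
        mask = weight.abs() <= threshold         # the "unimportant" weights
        weight[mask] *= (1.0 - shrink)           # proportional soft shrinkage
```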
- Slimmable Networks for Contrastive Self-supervised Learning [69.9454691873866]
Self-supervised learning has made significant progress in pre-training large models, but it struggles with small models.
We introduce a one-stage solution to obtain pre-trained small models without the need for extra teachers.
A slimmable network consists of a full network and several weight-sharing sub-networks, which can be pre-trained once to obtain various networks.
arXiv Detail & Related papers (2022-09-30T15:15:05Z)
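The slimmable-network entry above describes a full network whose sub-networks share its weights. The sketch below shows one way such weight sharing can look for a single linear layer, with each sub-network taking the leading slice of the full weight matrix; the class name and slicing scheme are assumptions, not the paper's architecture.

```python
# Hedged sketch of weight sharing between a full layer and its slimmed sub-networks.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SlimmableLinear(nn.Module):
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.full = nn.Linear(in_features, out_features)

    def forward(self, x: torch.Tensor, width: float = 1.0) -> torch.Tensor:
        """Run the sub-network that keeps a `width` fraction of the units,
        sharing its weights with the full layer."""
        in_w = max(1, int(width * self.full.in_features))
        out_w = max(1, int(width * self.full.out_features))
        w = self.full.weight[:out_w, :in_w]    # shared leading slice of the weights
        b = self.full.bias[:out_w]
        return F.linear(x[..., :in_w], w, b)
```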
- Adaptive Low-Rank Regularization with Damping Sequences to Restrict Lazy Weights in Deep Networks [13.122543280692641]
This paper detects a subset of the weighting layers that cause overfitting; overfitting is recognized via matrix and tensor condition numbers.
An adaptive regularization scheme entitled Adaptive Low-Rank (ALR) is proposed that converges a subset of the weighting layers to their Low-Rank Factorization (LRF).
The experimental results show that ALR regularizes deep networks well, with high training speed and low resource usage.
arXiv Detail & Related papers (2021-06-17T17:28:14Z)
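The ALR entry above recognizes overfitting-prone layers via matrix and tensor condition numbers. The sketch below illustrates that detection step for 2-D weight matrices using singular values; the threshold and function name are illustrative, and the subsequent low-rank factorization step is not shown.

```python
# Hedged sketch of condition-number-based detection of layers to regularize.
import torch


def ill_conditioned_layers(named_weights, threshold: float = 1e3):
    """Return names of 2-D weight matrices whose condition number (ratio of largest
    to smallest singular value) exceeds `threshold`."""
    flagged = []
    for name, w in named_weights:
        if w.dim() != 2:
            continue                           # higher-order tensors would need matricization
        s = torch.linalg.svdvals(w)            # singular values, descending
        cond = s[0] / s[-1].clamp_min(1e-12)
        if cond > threshold:
            flagged.append(name)
    return flagged
```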
- Rethinking Skip Connection with Layer Normalization in Transformers and ResNets [49.87919454950763]
Skip connection is a widely-used technique to improve the performance of deep neural networks.
In this work, we investigate how scale factors affect the effectiveness of the skip connection.
arXiv Detail & Related papers (2021-05-15T11:44:49Z)
- Layer-adaptive sparsity for the Magnitude-based Pruning [88.37510230946478]
We propose a novel importance score for global pruning, coined the layer-adaptive magnitude-based pruning (LAMP) score.
LAMP consistently outperforms popular existing schemes for layerwise sparsity selection.
arXiv Detail & Related papers (2020-10-15T09:14:02Z)
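The entry above names the LAMP score without stating it. As commonly described, a weight's score is its squared magnitude divided by the sum of squared magnitudes of all weights in the same layer that are at least as large; the sketch below implements that reading and should be checked against the paper.

```python
# Hedged sketch of the LAMP importance score as commonly stated (not the paper's code).
import torch


def lamp_scores(weight: torch.Tensor) -> torch.Tensor:
    """Return LAMP scores with the same shape as `weight`."""
    flat = weight.flatten() ** 2
    order = torch.argsort(flat)                  # ascending by squared magnitude
    sorted_sq = flat[order]
    # Denominator: sum of squared magnitudes of this weight and all larger ones.
    suffix_sums = torch.flip(torch.cumsum(torch.flip(sorted_sq, [0]), 0), [0])
    scores = torch.empty_like(flat)
    scores[order] = sorted_sq / suffix_sums.clamp_min(1e-12)
    return scores.view_as(weight)


# Global pruning sketch: gather scores from every layer, then remove the weights with
# the smallest LAMP scores across the whole network until the target sparsity is met.
```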
- Improve Generalization and Robustness of Neural Networks via Weight Scale Shifting Invariant Regularizations [52.493315075385325]
We show that a family of regularizers, including weight decay, is ineffective at penalizing the intrinsic norms of weights for networks with homogeneous activation functions.
We propose an improved regularizer that is invariant to weight scale shifting and thus effectively constrains the intrinsic norm of a neural network.
arXiv Detail & Related papers (2020-08-07T02:55:28Z)
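The last entry above argues that weight decay cannot penalize the intrinsic norm of networks with homogeneous activations, because consecutive layers can trade scale factors without changing the function. As a hedged illustration of scale-shift invariance (not the paper's exact regularizer), the sketch below penalizes the product of per-layer norms, which is unchanged when one layer is scaled by c and the next by 1/c.

```python
# Hedged illustration of a penalty that is invariant to weight scale shifting across
# homogeneous layers; plain weight decay (sum of squared norms) is not invariant.
import torch


def scale_shift_invariant_penalty(layer_weights):
    """Product of per-layer Frobenius norms, computed as exp(sum of log norms)
    for numerical stability."""
    log_norms = [w.norm().clamp_min(1e-12).log() for w in layer_weights]
    return torch.exp(torch.stack(log_norms).sum())


# Quick check of the invariance (illustrative):
#   w1, w2 = torch.randn(8, 4), torch.randn(3, 8)
#   scale_shift_invariant_penalty([w1, w2]) == scale_shift_invariant_penalty([2 * w1, 0.5 * w2])
```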