Is Batch Norm unique? An empirical investigation and prescription to
emulate the best properties of common normalizers without batch dependence
- URL: http://arxiv.org/abs/2010.10687v1
- Date: Wed, 21 Oct 2020 00:41:38 GMT
- Title: Is Batch Norm unique? An empirical investigation and prescription to
emulate the best properties of common normalizers without batch dependence
- Authors: Vinay Rao, Jascha Sohl-Dickstein
- Abstract summary: We study the statistical properties of Batch Norm and other common normalizers.
We propose two simple normalizers, PreLayerNorm and RegNorm, which better match these desirable properties without involving operations along the batch dimension.
- Score: 33.07255026021875
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We perform an extensive empirical study of the statistical properties of
Batch Norm and other common normalizers. This includes an examination of the
correlation between representations of minibatches, gradient norms, and Hessian
spectra both at initialization and over the course of training. Through this
analysis, we identify several statistical properties which appear linked to
Batch Norm's superior performance. We propose two simple normalizers,
PreLayerNorm and RegNorm, which better match these desirable properties without
involving operations along the batch dimension. We show that PreLayerNorm and
RegNorm achieve much of the performance of Batch Norm without requiring batch
dependence, that they reliably outperform LayerNorm, and that they can be
applied in situations where Batch Norm is ineffective.
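As a rough illustration of the batch-dependence distinction the abstract draws, the sketch below contrasts a Batch Norm-style normalizer (statistics computed across the batch axis) with a batch-independent, LayerNorm-style normalizer (statistics computed per example). This is a minimal NumPy sketch for intuition only; it omits learned scale and shift parameters and does not reproduce the paper's PreLayerNorm or RegNorm definitions.

```python
# Minimal sketch: which axis the normalization statistics are computed over.
# Not the paper's PreLayerNorm/RegNorm; illustrative only.
import numpy as np

def batch_norm(x, eps=1e-5):
    """Batch-dependent: each feature is normalized using statistics
    gathered across every example in the minibatch (axis 0)."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def layer_norm(x, eps=1e-5):
    """Batch-independent: each example is normalized using only its own
    feature statistics (axis -1), so batch size and composition do not matter."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(loc=3.0, scale=2.0, size=(32, 64))  # (batch, features)
    print(batch_norm(x)[:, 0].std())   # ~1: per-feature statistics across the batch
    print(layer_norm(x)[0, :].std())   # ~1: per-example statistics across features
```

Because the second normalizer never touches the batch axis, it can run with batch size 1 or in streaming inference, which is the regime where batch-independent alternatives such as the proposed PreLayerNorm and RegNorm are meant to apply.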
Related papers
- MaNo: Exploiting Matrix Norm for Unsupervised Accuracy Estimation Under Distribution Shifts [25.643876327918544]
Leveraging the models' outputs, specifically the logits, is a common approach to estimating the test accuracy of a pre-trained neural network on out-of-distribution samples.
Despite their ease of implementation and computational efficiency, current logit-based methods are vulnerable to overconfidence issues, leading to prediction bias.
We propose MaNo, which applies a data-dependent normalization to the logits to reduce prediction bias and takes the $L_p$ norm of the matrix of normalized logits as the estimation score (a minimal illustrative sketch appears after this list).
arXiv Detail & Related papers (2024-05-29T10:45:06Z)
- Iterative Regularization with k-support Norm: An Important Complement to Sparse Recovery [33.26163081551751]
We propose a novel iterative regularization algorithm, IRKSN, based on the $k$-support norm regularizer.
We provide conditions for sparse recovery with IRKSN, and compare them with traditional conditions for recovery with $\ell_1$ norm regularizers.
We also give an early stopping bound on the model error of IRKSN with explicit constants, achieving the standard linear rate for sparse recovery.
arXiv Detail & Related papers (2023-12-19T09:21:08Z)
- The Implicit Bias of Batch Normalization in Linear Models and Two-layer Linear Convolutional Neural Networks [117.93273337740442]
We show that gradient descent converges to a uniform margin classifier on the training data with an $\exp(-\Omega(\log^2 t))$ convergence rate.
We also show that batch normalization has an implicit bias towards a patch-wise uniform margin.
arXiv Detail & Related papers (2023-06-20T16:58:00Z)
- An Empirical Analysis of the Shift and Scale Parameters in BatchNorm [3.198144010381572]
Batch Normalization (BatchNorm) is a technique that improves the training of deep neural networks.
This paper examines the relative contribution of the normalization step to BatchNorm's success.
arXiv Detail & Related papers (2023-03-22T12:41:12Z)
- Sample-Then-Optimize Batch Neural Thompson Sampling [50.800944138278474]
We introduce two algorithms for black-box optimization based on the Thompson sampling (TS) policy.
To choose an input query, we only need to train an NN and then select the query that maximizes the trained NN's output.
Our algorithms sidestep the need to invert the large parameter matrix yet still preserve the validity of the TS policy.
arXiv Detail & Related papers (2022-10-13T09:01:58Z)
- Contrastive Neural Ratio Estimation for Simulation-based Inference [15.354874711988662]
Likelihood-to-evidence ratio estimation is usually cast as either a binary (NRE-A) or a multiclass (NRE-B) classification task.
In contrast to the binary classification framework, the current formulation of the multiclass version has an intrinsic and unknown bias term.
We propose a multiclass framework free from the bias inherent to NRE-B at optimum, leaving us in the position to run diagnostics that practitioners depend on.
arXiv Detail & Related papers (2022-10-11T00:12:51Z)
- Clipped Stochastic Methods for Variational Inequalities with Heavy-Tailed Noise [64.85879194013407]
We prove the first high-probability results with logarithmic dependence on the confidence level for methods that solve monotone and structured non-monotone VIPs.
Our results match the best-known ones in the light-tails case and are novel for structured non-monotone problems.
In addition, we numerically validate that the gradient noise of many practical formulations is heavy-tailed and show that clipping improves the performance of SEG/SGDA.
arXiv Detail & Related papers (2022-06-02T15:21:55Z)
- Deconfounding Scores: Feature Representations for Causal Effect Estimation with Weak Overlap [140.98628848491146]
We introduce deconfounding scores, which induce better overlap without biasing the target of estimation.
We show that deconfounding scores satisfy a zero-covariance condition that is identifiable in observed data.
In particular, we show that this technique could be an attractive alternative to standard regularizations.
arXiv Detail & Related papers (2021-04-12T18:50:11Z)
- Pre-training Is (Almost) All You Need: An Application to Commonsense Reasoning [61.32992639292889]
Fine-tuning of pre-trained transformer models has become the standard approach for solving common NLP tasks.
We introduce a new scoring method that casts a plausibility ranking task in a full-text format.
We show that our method provides a much more stable training phase across random restarts.
arXiv Detail & Related papers (2020-04-29T10:54:40Z)
- Separating the Effects of Batch Normalization on CNN Training Speed and Stability Using Classical Adaptive Filter Theory [40.55789598448379]
Batch Normalization (BatchNorm) is commonly used in Convolutional Neural Networks (CNNs) to improve training speed and stability.
This paper uses concepts from the traditional adaptive filter domain to provide insight into the dynamics and inner workings of BatchNorm.
arXiv Detail & Related papers (2020-02-25T05:25:40Z)
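For the MaNo entry above, the following is a hypothetical sketch of a logit-based score of that general shape: normalize each logit vector, then take an entry-wise $L_p$ norm of the resulting matrix. The softmax normalization and the size-based rescaling used here are assumptions for illustration; the paper's exact data-dependent normalization may differ.

```python
# Hypothetical sketch of a MaNo-style accuracy-estimation score.
# Assumptions: softmax as the "data-dependent normalization" and a
# size-based rescaling; the actual MaNo recipe may differ.
import numpy as np

def mano_style_score(logits, p=4):
    """logits: (num_samples, num_classes) outputs of a pre-trained model
    on (possibly distribution-shifted) unlabeled data."""
    # Normalize each logit vector (softmax, with the usual max-shift for stability).
    z = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    # Entry-wise L_p norm of the normalized-logit matrix, rescaled by the
    # matrix size so scores are comparable across datasets of different sizes.
    n, k = probs.shape
    return (np.abs(probs) ** p).sum() ** (1.0 / p) / (n * k) ** (1.0 / p)

# Usage idea: compute the score on each unlabeled test set; per the abstract
# above, such a score is intended to track the model's (unknown) test accuracy.
```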
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.