Batch Normalization Decomposed
- URL: http://arxiv.org/abs/2412.02843v1
- Date: Tue, 03 Dec 2024 21:18:27 GMT
- Title: Batch Normalization Decomposed
- Authors: Ido Nachum, Marco Bondaschi, Michael Gastpar, Anatoly Khina
- Abstract summary: A neural network layer with batch normalization comprises three components that affect the representation induced by the network.
In our work, we present an analysis of the other two key components of networks with batch normalization, namely, the recentering and the non-linearity.
- Score: 21.226713936233423
- Abstract: \emph{Batch normalization} is a successful building block of neural network architectures. Yet, it is not well understood. A neural network layer with batch normalization comprises three components that affect the representation induced by the network: \emph{recentering} the mean of the representation to zero, \emph{rescaling} the variance of the representation to one, and finally applying a \emph{non-linearity}. Our work follows the work of Hadi Daneshmand, Amir Joudaki, Francis Bach [NeurIPS~'21], which studied deep \emph{linear} neural networks with only the rescaling stage between layers at initialization. In our work, we present an analysis of the other two key components of networks with batch normalization, namely, the recentering and the non-linearity. When these two components are present, we observe a curious behavior at initialization. Through the layers, the representation of the batch converges to a single cluster except for an odd data point that breaks far away from the cluster in an orthogonal direction. We shed light on this behavior from two perspectives: (1) we analyze the geometrical evolution of a simplified indicative model; (2) we prove a stability result for the aforementioned~configuration.
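To make the setup described in the abstract concrete, below is a minimal NumPy sketch (not the authors' code) of a deep network at random initialization in which each layer recenters the batch, rescales it to unit variance per coordinate, and applies a non-linearity. The width, depth, batch size, and the choice of ReLU are arbitrary assumptions for illustration, and the abstract does not guarantee that the cluster-plus-outlier behavior appears in exactly this toy configuration.

```python
# Hypothetical sketch of a batch passing through layers with recentering,
# rescaling, and a non-linearity at random initialization.
import numpy as np

rng = np.random.default_rng(0)
batch_size, width, depth = 32, 512, 50

X = rng.standard_normal((batch_size, width))   # input batch

for layer in range(depth):
    W = rng.standard_normal((width, width)) / np.sqrt(width)  # random weights
    H = X @ W
    H = H - H.mean(axis=0, keepdims=True)                  # recentering
    H = H / (H.std(axis=0, keepdims=True) + 1e-8)           # rescaling
    X = np.maximum(H, 0.0)                                   # non-linearity

# Inspect the geometry of the final representation via pairwise cosine similarities.
norms = np.linalg.norm(X, axis=1, keepdims=True)
cos = (X @ X.T) / (norms @ norms.T + 1e-12)
mean_sim = cos[~np.eye(batch_size, dtype=bool)].mean()
outlier = np.argmin(cos.sum(axis=1))
print(f"mean pairwise cosine similarity: {mean_sim:.3f}")
print(f"point {outlier} vs. the rest (mean): "
      f"{np.delete(cos[outlier], outlier).mean():.3f}")
# If the behavior described in the abstract emerges, most pairs have similarity
# close to 1 (a single cluster) while one point has near-zero similarity to the
# rest (an outlier in an approximately orthogonal direction).
```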
Related papers
- Emergence of Globally Attracting Fixed Points in Deep Neural Networks With Nonlinear Activations [24.052411316664017]
We introduce a theoretical framework for the evolution of the kernel sequence, which measures the similarity between the hidden representations of two different inputs.
For nonlinear activations, the kernel sequence converges globally to a unique fixed point, which can correspond to similar representations depending on the activation and network architecture.
This work provides new insights into the implicit biases of deep neural networks and how architectural choices influence the evolution of representations across layers.
arXiv Detail & Related papers (2024-10-26T07:10:47Z) - Differential Equation Scaling Limits of Shaped and Unshaped Neural Networks [8.716913598251386]
We find a similar differential-equation-based characterization for two types of unshaped networks.
We derive the first order correction to the layerwise correlation.
These results together provide a connection between shaped and unshaped network architectures.
arXiv Detail & Related papers (2023-10-18T16:15:10Z) - From Complexity to Clarity: Analytical Expressions of Deep Neural Network Weights via Clifford's Geometric Algebra and Convexity [54.01594785269913]
We show that the optimal weights of deep ReLU neural networks are given by the wedge product of training samples when trained with a standard regularized loss.
The training problem reduces to convex optimization over wedge product features, which encode the geometric structure of the training dataset.
arXiv Detail & Related papers (2023-09-28T15:19:30Z) - Improved Convergence Guarantees for Shallow Neural Networks [91.3755431537592]
We prove convergence of depth 2 neural networks, trained via gradient descent, to a global minimum.
Our model has the following features: regression with a quadratic loss function, a fully connected feedforward architecture, ReLU activations, Gaussian data instances, and adversarial labels.
These results strongly suggest that, at least in our model, the convergence phenomenon extends well beyond the NTK regime.
arXiv Detail & Related papers (2022-12-05T14:47:52Z) - Global Convergence Analysis of Deep Linear Networks with A One-neuron Layer [18.06634056613645]
We consider optimizing deep linear networks which have a layer with one neuron under quadratic loss.
We describe the convergent point of trajectories with an arbitrary starting point under gradient flow.
We show specific convergence rates of trajectories that converge to the global minimizer by stages.
arXiv Detail & Related papers (2022-01-08T04:44:59Z) - Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks [83.58049517083138]
We consider a two-layer ReLU network trained via gradient descent.
We show that SGD is biased towards a simple solution.
We also provide empirical evidence that knots at locations distinct from the data points might occur.
arXiv Detail & Related papers (2021-11-03T15:14:20Z) - Dynamic Inference with Neural Interpreters [72.90231306252007]
We present Neural Interpreters, an architecture that factorizes inference in a self-attention network as a system of modules.
Inputs to the model are routed through a sequence of functions in a way that is learned end-to-end.
We show that Neural Interpreters perform on par with the vision transformer using fewer parameters, while being transferable to a new task in a sample-efficient manner.
arXiv Detail & Related papers (2021-10-12T23:22:45Z) - Neural networks behave as hash encoders: An empirical study [79.38436088982283]
The input space of a neural network with ReLU-like activations is partitioned into multiple linear regions.
We demonstrate that this partition exhibits consistent encoding properties across a variety of deep learning models.
Simple algorithms, such as $K$-Means, $K$-NN, and logistic regression, can achieve fairly good performance on both training and test data.
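A hypothetical sketch of this idea (not the paper's code, and assuming scikit-learn is available): treat the binary ReLU activation pattern of a layer as a "hash code" and fit a simple classifier on top of it. The paper studies trained networks layer by layer; using a single randomly initialized layer and synthetic data below is a simplification for illustration.

```python
# Binary activation pattern as a code, with a simple classifier on top.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Which linear region an input falls into is determined by the sign
# pattern of its pre-activations in a ReLU layer.
W = rng.standard_normal((20, 256)) / np.sqrt(20)
codes = (X @ W > 0).astype(float)              # binary activation pattern

Xtr, Xte, ytr, yte = train_test_split(codes, y, test_size=0.25, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
print("test accuracy on activation codes:", clf.score(Xte, yte))
```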
arXiv Detail & Related papers (2021-01-14T07:50:40Z) - Implicit Geometric Regularization for Learning Shapes [34.052738965233445]
We offer a new paradigm for computing high fidelity implicit neural representations directly from raw data.
We show that our method leads to state-of-the-art implicit neural representations with a higher level of detail and fidelity compared to previous methods.
arXiv Detail & Related papers (2020-02-24T07:36:32Z) - Kernel and Rich Regimes in Overparametrized Models [69.40899443842443]
We show that gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms.
We also demonstrate this transition empirically for more complex matrix factorization models and multilayer non-linear networks.
arXiv Detail & Related papers (2020-02-20T15:43:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.