On the Nonlinearity of Layer Normalization
- URL: http://arxiv.org/abs/2406.01255v1
- Date: Mon, 3 Jun 2024 12:11:34 GMT
- Title: On the Nonlinearity of Layer Normalization
- Authors: Yunhao Ni, Yuxin Guo, Junlong Jia, Lei Huang
- Abstract summary: We investigate the representation capacity of a network with layerwise composition of linear and LN transformations, referred to as LN-Net.
We show that, given $m$ samples with any label assignment, an LN-Net with only 3 neurons in each layer and $O(m)$ LN layers can correctly classify them.
- Score: 5.0464797863553414
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Layer normalization (LN) is a ubiquitous technique in deep learning, but our theoretical understanding of it remains elusive. This paper investigates a new theoretical direction for LN, regarding its nonlinearity and representation capacity. We investigate the representation capacity of a network with layerwise composition of linear and LN transformations, referred to as LN-Net. We theoretically show that, given $m$ samples with any label assignment, an LN-Net with only 3 neurons in each layer and $O(m)$ LN layers can correctly classify them. We further show a lower bound on the VC dimension of an LN-Net. The nonlinearity of LN can be amplified by group partition, which we demonstrate theoretically under mild assumptions and support empirically with experiments. Based on our analyses, we consider designing neural architectures by exploiting and amplifying the nonlinearity of LN, and the effectiveness of this approach is supported by our experiments.
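As a rough illustration of the LN-Net described in the abstract, the PyTorch sketch below composes linear layers with standard layer normalization and uses no other activation function, so any nonlinearity comes entirely from LN. The width of 3 mirrors the "3 neurons per layer" setting; the class name, depth, read-out layer, and all other details are illustrative assumptions, not the authors' exact construction.

```python
# Minimal sketch (not the authors' exact construction) of an LN-Net:
# a layerwise composition of linear maps and layer normalization, with no
# ReLU/GELU activations, so the only nonlinearity comes from LN itself.
# LN standardizes each sample across its features:
#   LN(x) = gamma * (x - mean(x)) / sqrt(var(x) + eps) + beta
import torch
import torch.nn as nn

class LNNet(nn.Module):
    def __init__(self, in_dim: int, num_classes: int, width: int = 3, depth: int = 8):
        super().__init__()
        layers = [nn.Linear(in_dim, width)]
        for _ in range(depth):
            layers += [nn.LayerNorm(width), nn.Linear(width, width)]
        layers += [nn.LayerNorm(width), nn.Linear(width, num_classes)]
        self.net = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

if __name__ == "__main__":
    x = torch.randn(16, 10)                      # 16 samples, 10 features
    logits = LNNet(in_dim=10, num_classes=2)(x)
    print(logits.shape)                          # torch.Size([16, 2])
```

In PyTorch terms, the group-partition idea mentioned in the abstract roughly corresponds to swapping nn.LayerNorm for a group-wise normalization such as nn.GroupNorm, which normalizes within groups of neurons per sample; the paper's exact grouping scheme may differ from that built-in module.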
Related papers
- Peri-LN: Revisiting Layer Normalization in the Transformer Architecture [57.08322913112157]
Pre-LN and Post-LN have long dominated standard practices despite their limitations in large-scale training.
Several open-source large-scale models have recently begun silently adopting a third strategy without much explanation.
We show that Peri-LN consistently achieves more balanced variance growth, steadier gradient flow, and convergence stability (the three LN placements are sketched below).
arXiv Detail & Related papers (2025-02-04T21:29:47Z)
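The Pre-LN/Post-LN/Peri-LN distinction concerns where layer normalization sits relative to each residual sublayer. The sketch below contrasts the three placements around a generic sublayer; it assumes Peri-LN normalizes both the sublayer's input and its output (as the name suggests), so the exact formulation should be taken from the cited paper rather than from this sketch.

```python
# Rough sketch of three LN placements around a generic residual sublayer
# (e.g., attention or an MLP). The Peri-LN variant here assumes LN on both
# the sublayer input and output; see the cited paper for the exact form.
import torch
import torch.nn as nn

def post_ln(x, sublayer, ln):
    return ln(x + sublayer(x))              # Post-LN: normalize after the residual add

def pre_ln(x, sublayer, ln):
    return x + sublayer(ln(x))              # Pre-LN: normalize the sublayer input only

def peri_ln(x, sublayer, ln_in, ln_out):
    return x + ln_out(sublayer(ln_in(x)))   # Peri-LN: normalize input and output

d = 8
x = torch.randn(4, d)
sublayer = nn.Linear(d, d)                  # stand-in for an attention or MLP block
ln1, ln2 = nn.LayerNorm(d), nn.LayerNorm(d)
print(post_ln(x, sublayer, ln1).shape,
      pre_ln(x, sublayer, ln1).shape,
      peri_ln(x, sublayer, ln1, ln2).shape)
```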
- Neural Network Verification with Branch-and-Bound for General Nonlinearities [63.39918329535165]
Branch-and-bound (BaB) is among the most effective techniques for neural network (NN) verification.
We develop a general framework, named GenBaB, to conduct BaB on general nonlinearities to verify NNs with general architectures.
Our framework also allows the verification of general nonlinear graphs and enables verification applications beyond simple NNs.
arXiv Detail & Related papers (2024-05-31T17:51:07Z)
- Benign Overfitting in Deep Neural Networks under Lazy Training [72.28294823115502]
We show that when the data distribution is well-separated, DNNs can achieve Bayes-optimal test error for classification.
Our results indicate that interpolating with smoother functions leads to better generalization.
arXiv Detail & Related papers (2023-05-30T19:37:44Z)
- Extrapolation and Spectral Bias of Neural Nets with Hadamard Product: a Polynomial Net Study [55.12108376616355]
The study of NTK has been devoted to typical neural network architectures but is incomplete for neural networks with Hadamard products (NNs-Hp).
In this work, we derive the finite-width NTK formulation for a special class of NNs-Hp, i.e., polynomial neural networks.
We prove their equivalence to the kernel regression predictor with the associated NTK, which expands the application scope of NTK.
arXiv Detail & Related papers (2022-09-16T06:36:06Z)
- On Feature Learning in Neural Networks with Global Convergence Guarantees [49.870593940818715]
We study the optimization of wide neural networks (NNs) via gradient flow (GF).
We show that when the input dimension is no less than the size of the training set, the training loss converges to zero at a linear rate under GF.
We also show empirically that, unlike in the Neural Tangent Kernel (NTK) regime, our multi-layer model exhibits feature learning and can achieve better generalization performance than its NTK counterpart.
arXiv Detail & Related papers (2022-04-22T15:56:43Z)
- Explicitising The Implicit Intrepretability of Deep Neural Networks Via Duality [5.672223170618133]
Recent work by Lakshminarayanan and Singh provided a dual view for fully connected deep neural networks (DNNs) with rectified linear units (ReLU).
arXiv Detail & Related papers (2022-03-01T03:08:21Z)
- On the Equivalence between Neural Network and Support Vector Machine [23.174679357972984]
The dynamics of an infinitely wide neural network (NN) trained by gradient descent can be characterized by the Neural Tangent Kernel (NTK).
We establish the equivalence between the NN and the support vector machine (SVM).
Our main theoretical results include establishing the equivalence between the NN and a broad family of $\ell_2$ regularized kernel machines (KMs) with finite-width bounds.
arXiv Detail & Related papers (2021-11-11T06:05:00Z)
- Disentangling deep neural networks with rectified linear units using duality [4.683806391173103]
We propose a novel interpretable counterpart of deep neural networks (DNNs) with rectified linear units (ReLUs).
We show that convolution with global pooling and skip connections provide rotational invariance and ensemble structure, respectively, to the neural path kernel (NPK).
arXiv Detail & Related papers (2021-10-06T16:51:59Z)
- A Survey of Label-noise Representation Learning: Past, Present and Future [172.28865582415628]
Label-Noise Representation Learning (LNRL) methods can robustly train deep models with noisy labels.
LNRL methods can be classified into three directions: instance-dependent LNRL, adversarial LNRL, and new datasets.
We envision potential directions beyond LNRL, such as learning with feature-noise, preference-noise, domain-noise, similarity-noise, graph-noise and demonstration-noise.
arXiv Detail & Related papers (2020-11-09T13:16:02Z)