Parallel Layer Normalization for Universal Approximation
- URL: http://arxiv.org/abs/2505.13142v1
- Date: Mon, 19 May 2025 14:10:23 GMT
- Title: Parallel Layer Normalization for Universal Approximation
- Authors: Yunhao Ni, Yuhe Liu, Wenxin Sun, Yitong Tang, Yuxin Guo, Peilin Feng, Wenjun Wu, Lei Huang
- Abstract summary: Universal approximation theorem (UAT) is a fundamental theory for deep neural networks (DNNs). We theoretically prove that an infinitely wide network has universal approximation capacity. We identify that parallel layer normalization (PLN) can act as both activation function and normalization in deep neural networks.
- Score: 6.723179803370419
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The universal approximation theorem (UAT) is a fundamental result for deep neural networks (DNNs), demonstrating their capacity to represent and approximate any function. Existing analyses and proofs of the UAT are based on traditional networks built only from linear layers and nonlinear activation functions, omitting the normalization layers commonly employed to enhance the training of modern networks. This paper is the first to study the UAT of DNNs with normalization layers. We theoretically prove that an infinitely wide network -- composed solely of parallel layer normalization (PLN) and linear layers -- has universal approximation capacity. Additionally, we investigate the minimum number of neurons required to approximate $L$-Lipschitz continuous functions with a single-hidden-layer network. We theoretically compare the approximation capacity of PLN with that of traditional activation functions. Unlike traditional activation functions, PLN can act as both activation function and normalization in deep neural networks at the same time. We also find that PLN improves performance when it replaces LN in transformer architectures, revealing the potential of PLN in neural architecture design.
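The abstract does not spell out the PLN construction, so the following is only a minimal PyTorch sketch of one plausible reading: PLN partitions the hidden vector into groups and applies standard, parameter-free layer normalization to each group in parallel, so that it can serve as the nonlinearity between linear layers. The names `ParallelLayerNorm`, `PLNSingleHiddenLayer`, and `num_groups` are hypothetical and not taken from the paper.

```python
# Hypothetical sketch of Parallel Layer Normalization (PLN).
# Assumption (not confirmed by the abstract): PLN splits the hidden vector into
# groups and layer-normalizes each group independently, acting as a nonlinearity.
import torch
import torch.nn as nn


class ParallelLayerNorm(nn.Module):
    """Apply LayerNorm independently to `num_groups` equal slices of the last dim."""

    def __init__(self, hidden_dim: int, num_groups: int):
        super().__init__()
        assert hidden_dim % num_groups == 0, "hidden_dim must be divisible by num_groups"
        # Groups of size 1 would normalize to zero, so keep group_size >= 2.
        self.num_groups = num_groups
        self.group_size = hidden_dim // num_groups

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (..., hidden_dim) -> (..., num_groups, group_size): normalize each
        # group to zero mean / unit variance, then flatten back.
        shape = x.shape
        x = x.reshape(*shape[:-1], self.num_groups, self.group_size)
        x = nn.functional.layer_norm(x, (self.group_size,))
        return x.reshape(shape)


class PLNSingleHiddenLayer(nn.Module):
    """Single-hidden-layer network built only from linear maps and PLN,
    mirroring the width-vs-approximation setting discussed in the abstract."""

    def __init__(self, in_dim: int, hidden_dim: int, out_dim: int, num_groups: int):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.pln = ParallelLayerNorm(hidden_dim, num_groups)
        self.fc2 = nn.Linear(hidden_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(self.pln(self.fc1(x)))


if __name__ == "__main__":
    # Toy check: fit the 1-Lipschitz target |x| on [-1, 1].
    model = PLNSingleHiddenLayer(in_dim=1, hidden_dim=64, out_dim=1, num_groups=16)
    x = torch.linspace(-1.0, 1.0, 256).unsqueeze(-1)
    y = torch.abs(x)
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(500):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()
    print(f"final MSE: {loss.item():.4f}")
```

The toy fit of a 1-Lipschitz target only illustrates the single-hidden-layer setting mentioned in the abstract; it is a sanity check of the sketch, not a reproduction of the paper's construction or results.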
Related papers
- Neural Network Verification with Branch-and-Bound for General Nonlinearities [63.39918329535165]
Branch-and-bound (BaB) is among the most effective techniques for neural network (NN) verification. We develop a general framework, named GenBaB, to conduct BaB on general nonlinearities to verify NNs with general architectures. Our framework also allows the verification of general nonlinear graphs and enables verification applications beyond simple NNs.
arXiv Detail & Related papers (2024-05-31T17:51:07Z) - Analyzing the Neural Tangent Kernel of Periodically Activated Coordinate
Networks [30.92757082348805]
We provide a theoretical understanding of periodically activated networks through an analysis of their Neural Tangent Kernel (NTK).
Our findings indicate that periodically activated networks are notably more well-behaved, from the NTK perspective, than ReLU-activated networks.
arXiv Detail & Related papers (2024-02-07T12:06:52Z) - Improving the Expressive Power of Deep Neural Networks through Integral
Activation Transform [12.36064367319084]
We generalize the traditional fully connected deep neural network (DNN) through the concept of continuous width.
We show that IAT-ReLU exhibits a continuous activation pattern when continuous basis functions are employed.
Our numerical experiments demonstrate that IAT-ReLU outperforms regular ReLU in terms of trainability and smoothness.
arXiv Detail & Related papers (2023-12-19T20:23:33Z) - Benign Overfitting in Deep Neural Networks under Lazy Training [72.28294823115502]
We show that when the data distribution is well-separated, DNNs can achieve Bayes-optimal test error for classification.
Our results indicate that interpolating with smoother functions leads to better generalization.
arXiv Detail & Related papers (2023-05-30T19:37:44Z) - Globally Optimal Training of Neural Networks with Threshold Activation
Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z) - Exploring the Approximation Capabilities of Multiplicative Neural
Networks for Smooth Functions [9.936974568429173]
We consider two classes of target functions: generalized bandlimited functions and Sobolev-Type balls.
Our results demonstrate that multiplicative neural networks can approximate these functions with significantly fewer layers and neurons.
These findings suggest that multiplicative gates can outperform standard feed-forward layers and have potential for improving neural network design.
arXiv Detail & Related papers (2023-01-11T17:57:33Z) - Extrapolation and Spectral Bias of Neural Nets with Hadamard Product: a
Polynomial Net Study [55.12108376616355]
Prior NTK studies have focused on typical neural network architectures, but remain incomplete for neural networks with Hadamard products (NNs-Hp).
In this work, we derive the finite-width NTK formulation for a special class of NNs-Hp, i.e., polynomial neural networks.
We prove their equivalence to the kernel regression predictor with the associated NTK, which expands the application scope of NTK.
arXiv Detail & Related papers (2022-09-16T06:36:06Z) - On Feature Learning in Neural Networks with Global Convergence
Guarantees [49.870593940818715]
We study the optimization of wide neural networks (NNs) via gradient flow (GF)
We show that when the input dimension is no less than the size of the training set, the training loss converges to zero at a linear rate under GF.
We also show empirically that, unlike in the Neural Tangent Kernel (NTK) regime, our multi-layer model exhibits feature learning and can achieve better generalization performance than its NTK counterpart.
arXiv Detail & Related papers (2022-04-22T15:56:43Z) - Optimization Theory for ReLU Neural Networks Trained with Normalization
Layers [82.61117235807606]
The success of deep neural networks is in part due to the use of normalization layers.
Our analysis shows how the introduction of normalization changes the landscape and can enable faster convergence.
arXiv Detail & Related papers (2020-06-11T23:55:54Z) - On Approximation Capabilities of ReLU Activation and Softmax Output
Layer in Neural Networks [6.852561400929072]
We prove that a sufficiently large neural network using the ReLU activation function can approximate any function in $L^1$ up to any arbitrary precision.
We also show that a large enough neural network with a nonlinear softmax output layer can approximate any indicator function in $L^1$.
arXiv Detail & Related papers (2020-02-10T19:48:47Z)