Related papers: The Spectral Bias of Shallow Neural Network Learning is Shaped by the Choice of Non-linearity

The Spectral Bias of Shallow Neural Network Learning is Shaped by the Choice of Non-linearity

URL: http://arxiv.org/abs/2503.10587v1
Date: Thu, 13 Mar 2025 17:36:46 GMT
Title: The Spectral Bias of Shallow Neural Network Learning is Shaped by the Choice of Non-linearity
Authors: Justin Sahs, Ryan Pyle, Fabio Anselmi, Ankit Patel,
Abstract summary: We study how non-linear activation functions contribute to shaping neural networks' implicit bias.<n>We show that local dynamical attractors facilitate the formation of clusters of hyperplanes where the input to a neuron's activation function is zero.
Score: 0.7499722271664144
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Despite classical statistical theory predicting severe overfitting, modern massively overparameterized neural networks still generalize well. This unexpected property is attributed to the network's so-called implicit bias, which describes its propensity to converge to solutions that generalize effectively, among the many possible that correctly label the training data. The aim of our research is to explore this bias from a new perspective, focusing on how non-linear activation functions contribute to shaping it. First, we introduce a reparameterization which removes a continuous weight rescaling symmetry. Second, in the kernel regime, we leverage this reparameterization to generalize recent findings that relate shallow Neural Networks to the Radon transform, deriving an explicit formula for the implicit bias induced by a broad class of activation functions. Specifically, by utilizing the connection between the Radon transform and the Fourier transform, we interpret the kernel regime's inductive bias as minimizing a spectral seminorm that penalizes high-frequency components, in a manner dependent on the activation function. Finally, in the adaptive regime, we demonstrate the existence of local dynamical attractors that facilitate the formation of clusters of hyperplanes where the input to a neuron's activation function is zero, yielding alignment between many neurons' response functions. We confirm these theoretical results with simulations. All together, our work provides a deeper understanding of the mechanisms underlying the generalization capabilities of overparameterized neural networks and its relation with the implicit bias, offering potential pathways for designing more efficient and robust models.

Related papers

Analysis of Fourier Neural Operators via Effective Field Theory [1.7102697561186413]
We present the first systematic effective-field-theory analysis of FNOs in an infinite-dimensional function space.<n>We show that nonlinear activations inevitably couple frequency inputs to high-frequency modes that are otherwise discarded by spectral truncation.
arXiv Detail & Related papers (2025-07-29T14:10:46Z)
Langevin Flows for Modeling Neural Latent Dynamics [81.81271685018284]
We introduce LangevinFlow, a sequential Variational Auto-Encoder where the time evolution of latent variables is governed by the underdamped Langevin equation.<n>Our approach incorporates physical priors -- such as inertia, damping, a learned potential function, and forces -- to represent both autonomous and non-autonomous processes in neural systems.<n>Our method outperforms state-of-the-art baselines on synthetic neural populations generated by a Lorenz attractor.
arXiv Detail & Related papers (2025-07-15T17:57:48Z)
Uncertainty propagation in feed-forward neural network models [3.987067170467799]
We develop new uncertainty propagation methods for feed-forward neural network architectures. We derive analytical expressions for the probability density function (PDF) of the neural network output. A key finding is that an appropriate linearization of the leaky ReLU activation function yields accurate statistical results.
arXiv Detail & Related papers (2025-03-27T00:16:36Z)
On the Asymptotic Learning Curves of Kernel Ridge Regression under Power-law Decay [17.306230523610864]
We show that the 'benign overfitting phenomenon' exists in very wide neural networks only when the noise level is small. Our results suggest that the phenomenon exists in very wide neural networks only when the noise level is small.
arXiv Detail & Related papers (2023-09-23T11:18:13Z)
Benign Overfitting in Deep Neural Networks under Lazy Training [72.28294823115502]
We show that when the data distribution is well-separated, DNNs can achieve Bayes-optimal test error for classification. Our results indicate that interpolating with smoother functions leads to better generalization.
arXiv Detail & Related papers (2023-05-30T19:37:44Z)
A Scalable Walsh-Hadamard Regularizer to Overcome the Low-degree Spectral Bias of Neural Networks [79.28094304325116]
Despite the capacity of neural nets to learn arbitrary functions, models trained through gradient descent often exhibit a bias towards simpler'' functions. We show how this spectral bias towards low-degree frequencies can in fact hurt the neural network's generalization on real-world datasets. We propose a new scalable functional regularization scheme that aids the neural network to learn higher degree frequencies.
arXiv Detail & Related papers (2023-05-16T20:06:01Z)
On the Activation Function Dependence of the Spectral Bias of Neural Networks [0.0]
We study the phenomenon from the point of view of the spectral bias of neural networks. We provide a theoretical explanation for the spectral bias of ReLU neural networks by leveraging connections with the theory of finite element methods. We show that neural networks with the Hat activation function are trained significantly faster using gradient descent and ADAM.
arXiv Detail & Related papers (2022-08-09T17:40:57Z)
Momentum Diminishes the Effect of Spectral Bias in Physics-Informed Neural Networks [72.09574528342732]
Physics-informed neural network (PINN) algorithms have shown promising results in solving a wide range of problems involving partial differential equations (PDEs) They often fail to converge to desirable solutions when the target function contains high-frequency features, due to a phenomenon known as spectral bias. In the present work, we exploit neural tangent kernels (NTKs) to investigate the training dynamics of PINNs evolving under gradient descent with momentum (SGDM)
arXiv Detail & Related papers (2022-06-29T19:03:10Z)
Phenomenology of Double Descent in Finite-Width Neural Networks [29.119232922018732]
Double descent delineates the behaviour of models depending on the regime they belong to. We use influence functions to derive suitable expressions of the population loss and its lower bound. Building on our analysis, we investigate how the loss function affects double descent.
arXiv Detail & Related papers (2022-03-14T17:39:49Z)
Modeling Implicit Bias with Fuzzy Cognitive Maps [0.0]
This paper presents a Fuzzy Cognitive Map model to quantify implicit bias in structured datasets. We introduce a new reasoning mechanism equipped with a normalization-like transfer function that prevents neurons from saturating.
arXiv Detail & Related papers (2021-12-23T17:04:12Z)
Gradient Starvation: A Learning Proclivity in Neural Networks [97.02382916372594]
Gradient Starvation arises when cross-entropy loss is minimized by capturing only a subset of features relevant for the task. This work provides a theoretical explanation for the emergence of such feature imbalance in neural networks.
arXiv Detail & Related papers (2020-11-18T18:52:08Z)
Provably Efficient Neural Estimation of Structural Equation Model: An Adversarial Approach [144.21892195917758]
We study estimation in a class of generalized Structural equation models (SEMs) We formulate the linear operator equation as a min-max game, where both players are parameterized by neural networks (NNs), and learn the parameters of these neural networks using a gradient descent. For the first time we provide a tractable estimation procedure for SEMs based on NNs with provable convergence and without the need for sample splitting.
arXiv Detail & Related papers (2020-07-02T17:55:47Z)
Measuring Model Complexity of Neural Networks with Curve Activation Functions [100.98319505253797]
We propose the linear approximation neural network (LANN) to approximate a given deep model with curve activation function. We experimentally explore the training process of neural networks and detect overfitting. We find that the $L1$ and $L2$ regularizations suppress the increase of model complexity.
arXiv Detail & Related papers (2020-06-16T07:38:06Z)

This list is automatically generated from the titles and abstracts of the papers in this site.