Do Neural Networks Need Gradient Descent to Generalize? A Theoretical Study
- URL: http://arxiv.org/abs/2506.03931v1
- Date: Wed, 04 Jun 2025 13:25:04 GMT
- Title: Do Neural Networks Need Gradient Descent to Generalize? A Theoretical Study
- Authors: Yotam Alexander, Yonatan Slutzky, Yuval Ran-Milo, Nadav Cohen
- Abstract summary: We investigate the validity of the volume hypothesis for wide and deep neural networks. We first prove that generalization under Guess & Check deteriorates with increasing width, establishing what is, to our knowledge, the first case where G&C is provably inferior to gradient descent. Conversely, we prove that generalization under G&C improves with increasing depth, revealing a stark contrast between wide and deep networks.
- Score: 7.823526894538709
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Conventional wisdom attributes the mysterious generalization abilities of overparameterized neural networks to gradient descent (and its variants). The recent volume hypothesis challenges this view: it posits that these generalization abilities persist even when gradient descent is replaced by Guess & Check (G&C), i.e., by drawing weight settings until one that fits the training data is found. The validity of the volume hypothesis for wide and deep neural networks remains an open question. In this paper, we theoretically investigate this question for matrix factorization (with linear and non-linear activation)--a common testbed in neural network theory. We first prove that generalization under G&C deteriorates with increasing width, establishing what is, to our knowledge, the first case where G&C is provably inferior to gradient descent. Conversely, we prove that generalization under G&C improves with increasing depth, revealing a stark contrast between wide and deep networks, which we further validate empirically. These findings suggest that even in simple settings, there may not be a simple answer to the question of whether neural networks need gradient descent to generalize well.
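A minimal sketch (not from the paper) of the Guess & Check procedure the abstract describes, applied to a depth-2 linear matrix factorization on a toy matrix-completion task: weight settings are drawn at random until one fits the observed training entries, and generalization is then measured on the held-out entries. The matrix size, width, tolerance, and sampling scale are illustrative assumptions, not values from the paper.

```python
# Guess & Check (G&C) sketch for a depth-2 matrix factorization W2 @ W1.
# The problem is kept tiny so that plain rejection sampling has a realistic
# chance of finding a fit; all hyperparameters below are illustrative.
import numpy as np

rng = np.random.default_rng(0)

d = 3
u = np.array([1.0, -1.0, 0.5])
v = np.array([1.0, 0.5, -1.0])
target = np.outer(u, v)            # rank-1 ground-truth matrix

# Observed (training) entries; the rest are held out to measure generalization.
mask = np.zeros((d, d), dtype=bool)
mask[[0, 0, 1, 2], [0, 1, 1, 2]] = True

width, tol, scale = 3, 0.3, 0.6    # hidden dimension, fit tolerance, sampling scale

for attempt in range(1, 200_001):
    # Guess: draw a random weight setting for the factorization.
    W1 = scale * rng.standard_normal((width, d))
    W2 = scale * rng.standard_normal((d, width))
    pred = W2 @ W1
    # Check: accept the first setting that fits the observed training entries.
    if np.max(np.abs((pred - target)[mask])) < tol:
        test_err = np.mean(np.abs((pred - target)[~mask]))
        print(f"accepted after {attempt} draws; held-out error {test_err:.3f}")
        break
else:
    print("no weight setting fit the training entries within the budget")
```

No gradient information is used anywhere: the accepted solution is simply the first random draw that fits the training data, which is the procedure whose generalization the paper compares against gradient descent as the width and depth of the factorization vary.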
Related papers
- A Near Complete Nonasymptotic Generalization Theory For Multilayer Neural Networks: Beyond the Bias-Variance Tradeoff [57.25901375384457]
We propose a nonasymptotic generalization theory for multilayer neural networks with arbitrary Lipschitz activations and general Lipschitz loss functions. In particular, it does not require the loss function to be bounded, as is commonly assumed in the literature. We show the near minimax optimality of our theory for multilayer ReLU networks on regression problems.
arXiv Detail & Related papers (2025-03-03T23:34:12Z) - Deep Grokking: Would Deep Neural Networks Generalize Better? [51.24007462968805]
Grokking refers to a sharp rise of the network's generalization accuracy on the test set.
We find that deep neural networks can be more susceptible to grokking than their shallower counterparts.
We also observe an intriguing multi-stage generalization phenomenon when increasing the depth of the model.
arXiv Detail & Related papers (2024-05-29T19:05:11Z) - Implicit Bias of Gradient Descent for Two-layer ReLU and Leaky ReLU Networks on Nearly-orthogonal Data [66.1211659120882]
The implicit bias towards solutions with favorable properties is believed to be a key reason why neural networks trained by gradient-based optimization can generalize well.
While the implicit bias of gradient flow has been widely studied for homogeneous neural networks (including ReLU and leaky ReLU networks), the implicit bias of gradient descent is currently only understood for smooth neural networks.
arXiv Detail & Related papers (2023-10-29T08:47:48Z) - A Scalable Walsh-Hadamard Regularizer to Overcome the Low-degree Spectral Bias of Neural Networks [79.28094304325116]
Despite the capacity of neural nets to learn arbitrary functions, models trained through gradient descent often exhibit a bias towards "simpler" functions.
We show how this spectral bias towards low-degree frequencies can in fact hurt the neural network's generalization on real-world datasets.
We propose a new scalable functional regularization scheme that aids the neural network to learn higher degree frequencies.
arXiv Detail & Related papers (2023-05-16T20:06:01Z) - Theoretical Characterization of How Neural Network Pruning Affects its Generalization [131.1347309639727]
This work makes the first attempt to study how different pruning fractions affect the model's gradient descent dynamics and generalization.
It is shown that as long as the pruning fraction is below a certain threshold, gradient descent can drive the training loss toward zero.
More surprisingly, the generalization bound gets better as the pruning fraction gets larger.
arXiv Detail & Related papers (2023-01-01T03:10:45Z) - Neural Networks with Sparse Activation Induced by Large Bias: Tighter Analysis with Bias-Generalized NTK [86.45209429863858]
We study training one-hidden-layer ReLU networks in the neural tangent kernel (NTK) regime.
We show that the neural networks possess a different limiting kernel, which we call the bias-generalized NTK.
We also study various properties of the neural networks with this new kernel.
arXiv Detail & Related papers (2023-01-01T02:11:39Z) - On the optimization and generalization of overparameterized implicit neural networks [25.237054775800164]
Implicit neural networks have become increasingly attractive in the machine learning community.
We show that global convergence is guaranteed, even if only the implicit layer is trained.
This paper investigates the generalization error for implicit neural networks.
arXiv Detail & Related papers (2022-09-30T16:19:46Z) - A global convergence theory for deep ReLU implicit networks via over-parameterization [26.19122384935622]
Implicit deep learning has received increasing attention recently.
This paper analyzes the gradient flow of Rectified Linear Unit (ReLU) activated implicit neural networks.
arXiv Detail & Related papers (2021-10-11T23:22:50Z) - On the Implicit Biases of Architecture & Gradient Descent [46.34988166338264]
This paper finds that while typical networks that fit the training data already generalise fairly well, gradient descent can further improve generalisation by selecting networks with a large margin.
New technical tools suggest a nuanced portrait of generalisation involving both the implicit biases of architecture and gradient descent.
arXiv Detail & Related papers (2021-10-08T17:36:37Z) - Redundant representations help generalization in wide neural networks [71.38860635025907]
We study the last hidden layer representations of various state-of-the-art convolutional neural networks.
We find that if the last hidden representation is wide enough, its neurons tend to split into groups that carry identical information, and differ from each other only by statistically independent noise.
arXiv Detail & Related papers (2021-06-07T10:18:54Z)