Implicit Bias of Gradient Descent for Non-Homogeneous Deep Networks
- URL: http://arxiv.org/abs/2502.16075v1
- Date: Sat, 22 Feb 2025 04:40:45 GMT
- Title: Implicit Bias of Gradient Descent for Non-Homogeneous Deep Networks
- Authors: Yuhang Cai, Kangjie Zhou, Jingfeng Wu, Song Mei, Michael Lindsey, Peter L. Bartlett
- Abstract summary: We establish the implicit bias of gradient descent (GD) for generic non-homogeneous deep networks under exponential loss. Our results apply to a broad class of non-homogeneous networks satisfying a mild near-homogeneity condition.
- Score: 39.38604477897272
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We establish the asymptotic implicit bias of gradient descent (GD) for generic non-homogeneous deep networks under exponential loss. Specifically, we characterize three key properties of GD iterates starting from a sufficiently small empirical risk, where the threshold is determined by a measure of the network's non-homogeneity. First, we show that a normalized margin induced by the GD iterates increases nearly monotonically. Second, we prove that while the norm of the GD iterates diverges to infinity, the iterates themselves converge in direction. Finally, we establish that this directional limit satisfies the Karush-Kuhn-Tucker (KKT) conditions of a margin maximization problem. Prior works on implicit bias have focused exclusively on homogeneous networks; in contrast, our results apply to a broad class of non-homogeneous networks satisfying a mild near-homogeneity condition. In particular, our results apply to networks with residual connections and non-homogeneous activation functions, thereby resolving an open problem posed by Ji and Telgarsky (2020).
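For concreteness, the objects named in the abstract can be written in the standard form used in the homogeneous-network literature that this paper extends; the exact normalization for near-homogeneous networks is given in the paper, so the exponent $L$ below should be read as an illustrative assumption.

```latex
% Margin-maximization problem whose KKT points the directional limit of GD
% satisfies (standard formulation from the homogeneous case):
\begin{align*}
  \min_{\theta}\ \tfrac{1}{2}\|\theta\|_2^2
  \quad\text{s.t.}\quad y_i\, f(\theta; x_i) \ge 1, \qquad i = 1, \dots, n.
\end{align*}
% Normalized margin tracked along the GD trajectory; for an L-homogeneous
% network it is invariant to rescaling of theta, and near-homogeneity keeps
% an analogous quantity (nearly) monotone.
\begin{align*}
  \gamma(\theta) \;=\; \frac{\min_{i}\, y_i\, f(\theta; x_i)}{\|\theta\|_2^{L}}.
\end{align*}
```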
Related papers
- A Near Complete Nonasymptotic Generalization Theory For Multilayer Neural Networks: Beyond the Bias-Variance Tradeoff [57.25901375384457]
We propose a nonasymptotic generalization theory for multilayer neural networks with arbitrary Lipschitz activations and general Lipschitz loss functions.
In particular, it does not require the boundedness of the loss function, which is commonly assumed in the literature.
We show the near minimax optimality of our theory for multilayer ReLU networks for regression problems.
arXiv Detail & Related papers (2025-03-03T23:34:12Z) - The late-stage training dynamics of (stochastic) subgradient descent on homogeneous neural networks [2.1178416840822027]
We consider the setting of classification with homogeneous neural networks. We show that normalized SGD iterates converge to the set of critical points of the normalized margin at late-stage training.
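As a toy illustration of these margin dynamics (plain full-batch GD on the exponential loss rather than the normalized SGD analyzed in this entry; the data, width, step size, and iteration count are all assumptions), the sketch below trains a 2-homogeneous, bias-free two-layer ReLU network and prints the normalized margin along the trajectory.

```python
# Toy sketch (synthetic data and hyperparameters are assumptions): full-batch
# GD on the exponential loss for a 2-homogeneous, bias-free two-layer ReLU
# network, tracking the normalized margin min_i y_i f(x_i) / ||theta||^2.
import torch

torch.manual_seed(0)
n, d, width, lr = 40, 5, 64, 0.05
X = torch.randn(n, d)
y = torch.sign(X[:, 0] + 0.5 * X[:, 1])        # linearly separable labels

W1 = (0.1 * torch.randn(width, d)).requires_grad_()
w2 = (0.1 * torch.randn(width)).requires_grad_()

def f(x):
    return torch.relu(x @ W1.T) @ w2           # 2-homogeneous in (W1, w2)

for step in range(2001):
    margins = y * f(X)
    loss = torch.exp(-margins).mean()          # exponential loss
    if step % 500 == 0:
        with torch.no_grad():
            norm_sq = W1.pow(2).sum() + w2.pow(2).sum()
            gamma = margins.min() / norm_sq    # normalized margin (L = 2)
        print(f"step {step:4d}  loss {loss.item():.3e}  "
              f"normalized margin {gamma.item():.4f}")
    loss.backward()
    with torch.no_grad():
        for p in (W1, w2):
            p -= lr * p.grad
            p.grad.zero_()
```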
arXiv Detail & Related papers (2025-02-08T19:09:16Z) - Towards Training Without Depth Limits: Batch Normalization Without Gradient Explosion [83.90492831583997]
We show that a batch-normalized network can keep the optimal signal propagation properties, but avoid exploding gradients in depth.
We use a Multi-Layer Perceptron (MLP) with linear activations and batch normalization that provably has bounded gradients at any depth.
We also design an activation shaping scheme that empirically achieves the same properties for certain non-linear activations.
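A minimal probe of the phenomenon this entry addresses (the architecture, initialization, and sizes below are assumptions, not the paper's construction): measure how the gradient reaching the first layer of a batch-normalized linear MLP scales as depth grows; gradient explosion here is what the paper's construction is designed to avoid.

```python
# Minimal probe (assumed architecture and sizes, not the paper's construction):
# how does the first-layer gradient norm of a batch-normalized linear MLP
# scale with depth?
import torch
import torch.nn as nn

def bn_linear_mlp(depth, width):
    layers = []
    for _ in range(depth):
        layers += [nn.Linear(width, width, bias=False), nn.BatchNorm1d(width)]
    layers.append(nn.Linear(width, 1, bias=False))
    return nn.Sequential(*layers)

torch.manual_seed(0)
x = torch.randn(256, 32)                      # a batch of 256 inputs

for depth in (5, 20, 80):
    net = bn_linear_mlp(depth, width=32)
    net(x).mean().backward()
    g = net[0].weight.grad.norm().item()      # gradient at the input layer
    print(f"depth {depth:3d}: first-layer gradient norm = {g:.3e}")
```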
arXiv Detail & Related papers (2023-10-03T12:35:02Z) - Convergence and concentration properties of constant step-size SGD through Markov chains [0.0]
We consider the optimization of a smooth and strongly convex objective using constant step-size stochastic gradient descent (SGD).
We show that, for unbiased gradient estimates with mildly controlled variance, the iterates converge to an invariant distribution in total variation distance.
All our results are non-asymptotic and their consequences are discussed through a few applications.
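A self-contained toy example of this Markov-chain picture (the quadratic objective, noise model, and constants are assumptions): with a constant step size, the iterates fluctuate around the minimizer, and their tail behaves like samples from an invariant distribution rather than converging to a point.

```python
# Toy example (objective, noise, and constants are assumptions): constant
# step-size SGD on the strongly convex quadratic (mu/2) * x^2 with unbiased
# Gaussian gradient noise behaves like an ergodic Markov chain: the iterates
# settle into a stationary distribution around the minimizer.
import numpy as np

rng = np.random.default_rng(0)
mu, step, sigma = 1.0, 0.1, 0.5              # curvature, step size, noise std

x, tail = 5.0, []
for t in range(20000):
    grad = mu * x + sigma * rng.standard_normal()   # unbiased noisy gradient
    x -= step * grad
    if t >= 10000:                           # discard burn-in
        tail.append(x)

tail = np.asarray(tail)
# For this linear toy case the chain is AR(1), so the stationary standard
# deviation has a closed form we can compare against.
theory_std = sigma * np.sqrt(step / (2 * mu - step * mu ** 2))
print(f"empirical mean {tail.mean():+.3f}, std {tail.std():.3f}  "
      f"(stationary std for this toy case: {theory_std:.3f})")
```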
arXiv Detail & Related papers (2023-06-20T12:36:28Z) - Neural Networks with Sparse Activation Induced by Large Bias: Tighter Analysis with Bias-Generalized NTK [86.45209429863858]
We study training one-hidden-layer ReLU networks in the neural tangent kernel (NTK) regime.
We show that the neural networks possess a different limiting kernel, which we call the bias-generalized NTK.
We also study various properties of the neural networks with this new kernel.
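For orientation, the empirical (finite-width) NTK of such a network is just the Gram matrix of parameter gradients; the sketch below computes it for a one-hidden-layer ReLU network with a deliberately large bias. The limiting bias-generalized kernel derived in the paper has its own closed form, which is not reproduced here; the sizes and bias scale are assumptions.

```python
# Empirical NTK K(x, x') = <grad_theta f(x), grad_theta f(x')> for a
# one-hidden-layer ReLU network with a large bias (finite-width kernel only;
# the paper's limiting bias-generalized NTK is not reproduced here).
import torch

torch.manual_seed(0)
d, width, bias_scale = 4, 512, 2.0           # assumed sizes and bias scale
W = (torch.randn(width, d) / d ** 0.5).requires_grad_()
b = (bias_scale * torch.randn(width)).requires_grad_()
a = (torch.randn(width) / width ** 0.5).requires_grad_()
params = (W, b, a)

def f(x):
    return torch.relu(x @ W.T + b) @ a       # scalar output for input x

def ntk(x1, x2):
    g1 = torch.autograd.grad(f(x1), params)
    g2 = torch.autograd.grad(f(x2), params)
    return sum((u * v).sum() for u, v in zip(g1, g2)).item()

x1, x2 = torch.randn(d), torch.randn(d)
print("K(x1, x1) =", round(ntk(x1, x1), 4))
print("K(x1, x2) =", round(ntk(x1, x2), 4))
```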
arXiv Detail & Related papers (2023-01-01T02:11:39Z) - On the Effective Number of Linear Regions in Shallow Univariate ReLU Networks: Convergence Guarantees and Implicit Bias [50.84569563188485]
We show that gradient flow converges in direction when labels are determined by the sign of a target network with $r$ neurons.
Our result may already hold for mild over-parameterization, where the width is $\tilde{\mathcal{O}}(r)$ and independent of the sample size.
arXiv Detail & Related papers (2022-05-18T16:57:10Z) - On generalization bounds for deep networks based on loss surface implicit regularization [5.68558935178946]
Modern deep neural networks generalize well despite having a large number of parameters, which contradicts classical statistical learning theory.
arXiv Detail & Related papers (2022-01-12T16:41:34Z) - Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks [83.58049517083138]
We consider a two-layer ReLU network trained via gradient descent.
We show that SGD is biased towards a simple solution.
We also provide empirical evidence that knots at locations distinct from the data points might occur.
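To make the term concrete (random weights stand in for a trained network; nothing below is taken from the paper): the knots of a univariate two-layer ReLU network are the breakpoints of its piecewise linear graph, located at $-b_i/w_i$, so they need not coincide with any training input.

```python
# Knots of a univariate two-layer ReLU net f(x) = sum_i a_i * relu(w_i*x + b_i)
# lie at x = -b_i / w_i (one per active unit), so they can fall away from the
# training inputs. Random weights below stand in for a trained network.
import numpy as np

rng = np.random.default_rng(0)
m = 10                                       # hidden units
w, b, a = rng.standard_normal((3, m))        # input weights, biases, outputs

def f(x):
    return np.maximum(np.outer(x, w) + b, 0.0) @ a

knots = np.sort(-b[w != 0] / w[w != 0])      # breakpoints of the network
data = np.linspace(-1.0, 1.0, 5)             # hypothetical training inputs
print("knots :", np.round(knots, 3))
print("inputs:", np.round(data, 3))
print("f(inputs):", np.round(f(data), 3))
```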
arXiv Detail & Related papers (2021-11-03T15:14:20Z) - Directional Convergence Analysis under Spherically Symmetric Distribution [21.145823611499104]
We consider the fundamental problem of learning linear predictors (i.e., separable datasets with zero margin) using neural networks with gradient flow or gradient descent.
We show directional convergence guarantees with exact convergence rate for two-layer non-linear networks with only two hidden nodes, and (deep) linear networks.
arXiv Detail & Related papers (2021-05-09T08:59:58Z) - On the Convex Behavior of Deep Neural Networks in Relation to the Layers' Width [99.24399270311069]
We observe that for wider networks, minimizing the loss with gradient descent maneuvers through surfaces of positive curvature at the start and end of training, and close-to-zero curvature in between.
In other words, it seems that during crucial parts of the training process, the Hessian in wide networks is dominated by the component G.
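The component G referred to here is presumably the Gauss-Newton term in the standard decomposition of the loss Hessian; written generically (notation assumed, not taken from the paper):

```latex
% Standard decomposition of the Hessian of L(theta) = sum_i ell(f_theta(x_i), y_i),
% with J_i = d f_theta(x_i) / d theta. The first (Gauss-Newton) term is the
% component usually denoted G; it is PSD whenever ell is convex in f.
\begin{align*}
  \nabla^2_\theta L(\theta)
  \;=\; \underbrace{\sum_i J_i^\top\, \nabla^2_f \ell_i\, J_i}_{G}
  \;+\; \sum_i \nabla_f \ell_i \cdot \nabla^2_\theta f_\theta(x_i).
\end{align*}
```

Under this reading, the observation above says that in wide networks the second, non-convex term stays comparatively small during the crucial phases of training.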
arXiv Detail & Related papers (2020-01-14T16:30:01Z)