Optimal Rates for Generalization of Gradient Descent for Deep ReLU Classification
- URL: http://arxiv.org/abs/2510.02779v1
- Date: Fri, 03 Oct 2025 07:22:36 GMT
- Title: Optimal Rates for Generalization of Gradient Descent for Deep ReLU Classification
- Authors: Yuanfan Li, Yunwen Lei, Zheng-Chu Guo, Yiming Ying,
- Abstract summary: We establish optimal generalization rates for gradient descent networks with deep ReLU networks by carefully trading off optimization and generalization errors.<n>A key technical contribution is our novel control of activation patterns near a reference model, enabling a sharper Rademacher complexity bound for deep ReLU networks trained with gradient descent.
- Score: 29.11075530919662
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances have significantly improved our understanding of the generalization performance of gradient descent (GD) methods in deep neural networks. A natural and fundamental question is whether GD can achieve generalization rates comparable to the minimax optimal rates established in the kernel setting. Existing results either yield suboptimal rates of $O(1/\sqrt{n})$, or focus on networks with smooth activation functions, incurring exponential dependence on network depth $L$. In this work, we establish optimal generalization rates for GD with deep ReLU networks by carefully trading off optimization and generalization errors, achieving only polynomial dependence on depth. Specifically, under the assumption that the data are NTK separable from the margin $\gamma$, we prove an excess risk rate of $\widetilde{O}(L^4 (1 + \gamma L^2) / (n \gamma^2))$, which aligns with the optimal SVM-type rate $\widetilde{O}(1 / (n \gamma^2))$ up to depth-dependent factors. A key technical contribution is our novel control of activation patterns near a reference model, enabling a sharper Rademacher complexity bound for deep ReLU networks trained with gradient descent.
Related papers
- Decentralized Nonconvex Composite Federated Learning with Gradient Tracking and Momentum [78.27945336558987]
Decentralized server (DFL) eliminates reliance on client-client architecture.<n>Non-smooth regularization is often incorporated into machine learning tasks.<n>We propose a novel novel DNCFL algorithm to solve these problems.
arXiv Detail & Related papers (2025-04-17T08:32:25Z) - Local Loss Optimization in the Infinite Width: Stable Parameterization of Predictive Coding Networks and Target Propagation [8.35644084613785]
We introduce the maximal update parameterization ($mu$P) in the infinite-width limit for two representative designs of local targets.<n>By analyzing deep linear networks, we found that PC's gradients interpolate between first-order and Gauss-Newton-like gradients.<n>We demonstrate that, in specific standard settings, PC in the infinite-width limit behaves more similarly to the first-order gradient.
arXiv Detail & Related papers (2024-11-04T11:38:27Z) - Sharp Generalization for Nonparametric Regression in Interpolation Space by Over-Parameterized Neural Networks Trained with Preconditioned Gradient Descent and Early-Stopping [19.988762532185884]
We train a neural network with a novel Preconditioned Gradient Descent (PGD) algorithm.<n>We show that PGD achieves a sharp regression rate of $mathcal O(n-frac2alpha2alpha+1)$ when the target function is in the space $[mathcal H_K]s'$ with $s' ge 3$.
arXiv Detail & Related papers (2024-07-16T03:38:34Z) - Convex Relaxations of ReLU Neural Networks Approximate Global Optima in Polynomial Time [45.72323731094864]
In this paper, we study the optimality gap between two-layer ReLULU networks regularized with weight decay and their convex relaxations.
Our study sheds new light on understanding why local methods work well.
arXiv Detail & Related papers (2024-02-06T01:29:35Z) - Optimization dependent generalization bound for ReLU networks based on
sensitivity in the tangent bundle [0.0]
We propose a PAC type bound on the generalization error of feedforward ReLU networks.
The obtained bound does not explicitly depend on the depth of the network.
arXiv Detail & Related papers (2023-10-26T13:14:13Z) - Stable Nonconvex-Nonconcave Training via Linear Interpolation [51.668052890249726]
This paper presents a theoretical analysis of linearahead as a principled method for stabilizing (large-scale) neural network training.
We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear can help by leveraging the theory of nonexpansive operators.
arXiv Detail & Related papers (2023-10-20T12:45:12Z) - Fixing the NTK: From Neural Network Linearizations to Exact Convex
Programs [63.768739279562105]
We show that for a particular choice of mask weights that do not depend on the learning targets, this kernel is equivalent to the NTK of the gated ReLU network on the training data.
A consequence of this lack of dependence on the targets is that the NTK cannot perform better than the optimal MKL kernel on the training set.
arXiv Detail & Related papers (2023-09-26T17:42:52Z) - Optimal Approximation and Learning Rates for Deep Convolutional Neural
Networks [17.075804626858748]
This paper focuses on approximation and learning performance analysis for deep convolutional neural networks with zero-padding and max-pooling.
We prove that, to approximate $r$-smooth function, the approximation rates of deep convolutional neural networks with depth $L$ are of order $ (L2/log L)-2r/d $, which is optimal up to a logarithmic factor.
arXiv Detail & Related papers (2023-08-07T02:37:02Z) - Benign Overfitting in Deep Neural Networks under Lazy Training [72.28294823115502]
We show that when the data distribution is well-separated, DNNs can achieve Bayes-optimal test error for classification.
Our results indicate that interpolating with smoother functions leads to better generalization.
arXiv Detail & Related papers (2023-05-30T19:37:44Z) - Mirror Descent Maximizes Generalized Margin and Can Be Implemented
Efficiently [34.438887960077025]
We show that $p$-$textsfGD$ can noticeably affect the structure and the generalization performance of the learned models.
We also show that $p$-$textsfGD$ is fully parallel in the same manner as SGD and can be used to train deep neural networks.
arXiv Detail & Related papers (2022-05-25T14:33:13Z) - On the Effective Number of Linear Regions in Shallow Univariate ReLU
Networks: Convergence Guarantees and Implicit Bias [50.84569563188485]
We show that gradient flow converges in direction when labels are determined by the sign of a target network with $r$ neurons.
Our result may already hold for mild over- parameterization, where the width is $tildemathcalO(r)$ and independent of the sample size.
arXiv Detail & Related papers (2022-05-18T16:57:10Z) - Low-Pass Filtering SGD for Recovering Flat Optima in the Deep Learning
Optimization Landscape [15.362190838843915]
We show that LPF-SGD converges to a better optimal point with smaller generalization error than SGD.
We show that our algorithm achieves superior generalization performance compared to the common DL training strategies.
arXiv Detail & Related papers (2022-01-20T07:13:04Z) - MSE-Optimal Neural Network Initialization via Layer Fusion [68.72356718879428]
Deep neural networks achieve state-of-the-art performance for a range of classification and inference tasks.
The use of gradient combined nonvolutionity renders learning susceptible to novel problems.
We propose fusing neighboring layers of deeper networks that are trained with random variables.
arXiv Detail & Related papers (2020-01-28T18:25:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.