Agnostic Learning of General ReLU Activation Using Gradient Descent
- URL: http://arxiv.org/abs/2208.02711v2
- Date: Mon, 04 Nov 2024 02:43:49 GMT
- Title: Agnostic Learning of General ReLU Activation Using Gradient Descent
- Authors: Pranjal Awasthi, Alex Tang, Aravindan Vijayaraghavan
- Abstract summary: We consider the more challenging scenario when the bias of the ReLU function is non-zero.
Gradient descent finds a ReLU function whose error is within a constant factor of the optimal error of the best ReLU function with moderate bias.
- Score: 38.28136172081834
- License:
- Abstract: We provide a convergence analysis of gradient descent for the problem of agnostically learning a single ReLU function with moderate bias under Gaussian distributions. Unlike prior work that studies the setting of zero bias, we consider the more challenging scenario when the bias of the ReLU function is non-zero. Our main result establishes that starting from random initialization, in a polynomial number of iterations gradient descent outputs, with high probability, a ReLU function that achieves an error that is within a constant factor of the optimal error of the best ReLU function with moderate bias. We also provide finite sample guarantees, and these techniques generalize to a broader class of marginal distributions beyond Gaussians.
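As a rough illustration of the setting described in the abstract (not the paper's exact algorithm or parameters), the sketch below runs full-batch gradient descent on the square loss of a single ReLU with non-zero bias under Gaussian inputs, starting from a random initialization. The step size, iteration count, and label-noise model are illustrative assumptions.

```python
# Minimal numpy sketch: fit relu(w.x + b) to noisy labels under a Gaussian
# marginal by gradient descent on the square loss from random initialization.
# Hyperparameters and the data-generating process are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
d, n = 10, 5000

# Gaussian inputs; labels are a "ground truth" ReLU with moderate bias plus
# noise, so the best ReLU has non-zero error (agnostic setting).
X = rng.standard_normal((n, d))
w_star, b_star = rng.standard_normal(d), 0.5
y = np.maximum(X @ w_star + b_star, 0.0) + 0.1 * rng.standard_normal(n)

def sq_loss(w, b):
    return 0.5 * np.mean((np.maximum(X @ w + b, 0.0) - y) ** 2)

# Random initialization, as in the analysis summarized above.
w, b = rng.standard_normal(d) / np.sqrt(d), 0.0
eta = 0.05
for _ in range(2000):
    z = X @ w + b
    active = (z > 0).astype(float)                 # ReLU "gate"
    resid = np.maximum(z, 0.0) - y
    grad_w = (X * (resid * active)[:, None]).mean(axis=0)
    grad_b = (resid * active).mean()
    w, b = w - eta * grad_w, b - eta * grad_b

print(f"final square loss: {sq_loss(w, b):.4f}")
```

Because the labels are noisy, the loss does not go to zero; the guarantee summarized above is that the learned ReLU's error comes within a constant factor of the best achievable ReLU error.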
Related papers
- Error Feedback under $(L_0,L_1)$-Smoothness: Normalization and Momentum [56.37522020675243]
We provide the first proof of convergence for normalized error feedback algorithms across a wide range of machine learning problems.
We show that due to their larger allowable stepsizes, our new normalized error feedback algorithms outperform their non-normalized counterparts on various tasks.
arXiv Detail & Related papers (2024-10-22T10:19:27Z) - Non-asymptotic Analysis of Biased Adaptive Stochastic Approximation [0.8192907805418583]
We show that biased gradients converge to critical points for smooth non-convex functions.
We show how the effect of bias can be reduced by appropriate tuning.
arXiv Detail & Related papers (2024-02-05T10:17:36Z) - Experimental Design for Linear Functionals in Reproducing Kernel Hilbert Spaces [102.08678737900541]
We provide algorithms for constructing bias-aware designs for linear functionals.
We derive non-asymptotic confidence sets for fixed and adaptive designs under sub-Gaussian noise.
arXiv Detail & Related papers (2022-05-26T20:56:25Z) - On the inability of Gaussian process regression to optimally learn compositional functions [3.6525095710982916]
Deep Gaussian process priors can outperform Gaussian process priors if the target function has a compositional structure.
We show that if the true function is a generalized additive function, then the posterior based on any mean-zero Gaussian process can only recover the truth at a rate that is strictly slower than the minimax rate.
arXiv Detail & Related papers (2022-05-16T15:42:25Z) - Domain-Adjusted Regression or: ERM May Already Learn Features Sufficient for Out-of-Distribution Generalization [52.7137956951533]
We argue that devising simpler methods for learning predictors on existing features is a promising direction for future research.
We introduce Domain-Adjusted Regression (DARE), a convex objective for learning a linear predictor that is provably robust under a new model of distribution shift.
Under a natural model, we prove that the DARE solution is the minimax-optimal predictor for a constrained set of test distributions.
arXiv Detail & Related papers (2022-02-14T16:42:16Z) - On the Double Descent of Random Features Models Trained with SGD [78.0918823643911]
We study properties of random features (RF) regression in high dimensions optimized by stochastic gradient descent (SGD).
We derive precise non-asymptotic error bounds of RF regression under both constant and adaptive step-size SGD setting.
We observe the double descent phenomenon both theoretically and empirically.
arXiv Detail & Related papers (2021-10-13T17:47:39Z) - Information-Theoretic Generalization Bounds for Stochastic Gradient Descent [13.757095663704858]
We provide bounds on the generalization error that depend on local statistics.
Key factors are the variance of the gradients, the smoothness of the objective, and the sensitivity of the loss function to perturbations.
Our key tool is combining the information-theoretic generalization bounds previously used for analyzing randomized variants of SGD with perturbation analysis of the paths.
arXiv Detail & Related papers (2021-02-01T16:00:34Z) - On the Convergence of SGD with Biased Gradients [28.400751656818215]
We analyze the convergence of biased stochastic gradient methods (SGD), where individual updates are corrupted by compression.
We quantify how the magnitude of the bias impacts the attainable accuracy and convergence rates (a short sketch appears after this list).
arXiv Detail & Related papers (2020-07-31T19:37:59Z) - Approximation Schemes for ReLU Regression [80.33702497406632]
We consider the fundamental problem of ReLU regression.
The goal is to output the best fitting ReLU with respect to square loss given access to draws from some unknown distribution.
arXiv Detail & Related papers (2020-05-26T16:26:17Z)
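To make the biased-update setting of "On the Convergence of SGD with Biased Gradients" concrete, here is a minimal sketch. The least-squares objective and top-k sparsification are hypothetical choices made for illustration; the paper's analysis covers general biased updates.

```python
# Minimal sketch of SGD with biased (compressed) gradients: each stochastic
# gradient is top-k sparsified before the update, which introduces a
# deterministic bias. Objective, step size, and k are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
d = 50
A = rng.standard_normal((200, d))
y = rng.standard_normal(200)

def top_k(g, k):
    """Keep the k largest-magnitude coordinates of g, zero the rest (biased)."""
    out = np.zeros_like(g)
    idx = np.argsort(np.abs(g))[-k:]
    out[idx] = g[idx]
    return out

x, eta, k = np.zeros(d), 0.01, 5
for t in range(3000):
    i = rng.integers(len(y))                  # sample one data point
    g = (A[i] @ x - y[i]) * A[i]              # stochastic gradient of 0.5*(a.x - y)^2
    x -= eta * top_k(g, k)                    # biased (compressed) update

print(f"final least-squares loss: {0.5 * np.mean((A @ x - y) ** 2):.4f}")
```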