Agnostic Learning of General ReLU Activation Using Gradient Descent
- URL: http://arxiv.org/abs/2208.02711v2
- Date: Mon, 04 Nov 2024 02:43:49 GMT
- Title: Agnostic Learning of General ReLU Activation Using Gradient Descent
- Authors: Pranjal Awasthi, Alex Tang, Aravindan Vijayaraghavan
- Abstract summary: We consider the more challenging scenario when the bias of the ReLU function is non-zero.
Gradient descent finds a ReLU function whose error is within a constant factor of the optimal error of the best ReLU function with moderate bias.
- Score: 38.28136172081834
- License:
- Abstract: We provide a convergence analysis of gradient descent for the problem of agnostically learning a single ReLU function with moderate bias under Gaussian distributions. Unlike prior work that studies the setting of zero bias, we consider the more challenging scenario when the bias of the ReLU function is non-zero. Our main result establishes that starting from random initialization, in a polynomial number of iterations gradient descent outputs, with high probability, a ReLU function that achieves an error that is within a constant factor of the optimal error of the best ReLU function with moderate bias. We also provide finite sample guarantees, and these techniques generalize to a broader class of marginal distributions beyond Gaussians.
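As a rough illustration of the setting described in the abstract (not the paper's exact algorithm or parameters), the sketch below runs full-batch gradient descent on the square loss of a single ReLU with non-zero bias under Gaussian inputs, starting from a random initialization. The step size, iteration count, and label-noise model are illustrative assumptions.

```python
# Minimal numpy sketch: fit relu(w.x + b) to noisy labels under a Gaussian
# marginal by gradient descent on the square loss from random initialization.
# Hyperparameters and the data-generating process are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
d, n = 10, 5000

# Gaussian inputs; labels are a "ground truth" ReLU with moderate bias plus
# noise, so the best ReLU has non-zero error (agnostic setting).
X = rng.standard_normal((n, d))
w_star, b_star = rng.standard_normal(d), 0.5
y = np.maximum(X @ w_star + b_star, 0.0) + 0.1 * rng.standard_normal(n)

def sq_loss(w, b):
    return 0.5 * np.mean((np.maximum(X @ w + b, 0.0) - y) ** 2)

# Random initialization, as in the analysis summarized above.
w, b = rng.standard_normal(d) / np.sqrt(d), 0.0
eta = 0.05
for _ in range(2000):
    z = X @ w + b
    active = (z > 0).astype(float)                 # ReLU "gate"
    resid = np.maximum(z, 0.0) - y
    grad_w = (X * (resid * active)[:, None]).mean(axis=0)
    grad_b = (resid * active).mean()
    w, b = w - eta * grad_w, b - eta * grad_b

print(f"final square loss: {sq_loss(w, b):.4f}")
```

Because the labels are noisy, the loss does not go to zero; the guarantee summarized above is that the learned ReLU's error comes within a constant factor of the best achievable ReLU error.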
Related papers
- Error Feedback under $(L_0,L_1)$-Smoothness: Normalization and Momentum [56.37522020675243]
We provide the first proof of convergence for normalized error feedback algorithms across a wide range of machine learning problems.
We show that due to their larger allowable stepsizes, our new normalized error feedback algorithms outperform their non-normalized counterparts on various tasks.
arXiv Detail & Related papers (2024-10-22T10:19:27Z) - Non-asymptotic Analysis of Biased Adaptive Stochastic Approximation [0.8192907805418583]
We show that biased gradients converge to critical points for smooth non-convex functions.
We show how the effect of bias can be reduced by appropriate tuning.
arXiv Detail & Related papers (2024-02-05T10:17:36Z) - Experimental Design for Linear Functionals in Reproducing Kernel Hilbert Spaces [102.08678737900541]
We provide algorithms for constructing bias-aware designs for linear functionals.
We derive non-asymptotic confidence sets for fixed and adaptive designs under sub-Gaussian noise.
arXiv Detail & Related papers (2022-05-26T20:56:25Z) - On the inability of Gaussian process regression to optimally learn compositional functions [3.6525095710982916]
Deep Gaussian process priors can outperform Gaussian process priors if the target function has a compositional structure.
We show that if the true function is a generalized additive function, then the posterior based on any mean-zero Gaussian process can only recover the truth at a rate that is strictly slower than the minimax rate.
arXiv Detail & Related papers (2022-05-16T15:42:25Z) - Domain-Adjusted Regression or: ERM May Already Learn Features Sufficient for Out-of-Distribution Generalization [52.7137956951533]
We argue that devising simpler methods for learning predictors on existing features is a promising direction for future research.
We introduce Domain-Adjusted Regression (DARE), a convex objective for learning a linear predictor that is provably robust under a new model of distribution shift.
Under a natural model, we prove that the DARE solution is the minimax-optimal predictor for a constrained set of test distributions.
arXiv Detail & Related papers (2022-02-14T16:42:16Z) - On the Double Descent of Random Features Models Trained with SGD [78.0918823643911]
We study properties of random features (RF) regression in high dimensions optimized by stochastic gradient descent (SGD).
We derive precise non-asymptotic error bounds of RF regression under both constant and adaptive step-size SGD setting.
We observe the double descent phenomenon both theoretically and empirically.
arXiv Detail & Related papers (2021-10-13T17:47:39Z) - Information-Theoretic Generalization Bounds for Stochastic Gradient Descent [13.757095663704858]
We provide bounds on the generalization error that depend on local statistics.
Key factors are the variance of the gradients, the smoothness of the objective, and the sensitivity of the loss function to perturbations.
Our key tool is combining the information-theoretic generalization bounds previously used for analyzing randomized variants of SGD with perturbation analysis of the paths.
arXiv Detail & Related papers (2021-02-01T16:00:34Z) - On the Convergence of SGD with Biased Gradients [28.400751656818215]
We analyze the convergence of biased stochastic gradient methods (SGD), where individual updates are corrupted by compression.
We quantify how the magnitude of the bias impacts the attainable accuracy and convergence rates (a short sketch appears after this list).
arXiv Detail & Related papers (2020-07-31T19:37:59Z) - Approximation Schemes for ReLU Regression [80.33702497406632]
We consider the fundamental problem of ReLU regression.
The goal is to output the best fitting ReLU with respect to square loss given access to draws from some unknown distribution.
arXiv Detail & Related papers (2020-05-26T16:26:17Z)
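To make the biased-update setting of "On the Convergence of SGD with Biased Gradients" concrete, here is a minimal sketch. The least-squares objective and top-k sparsification are hypothetical choices made for illustration; the paper's analysis covers general biased updates.

```python
# Minimal sketch of SGD with biased (compressed) gradients: each stochastic
# gradient is top-k sparsified before the update, which introduces a
# deterministic bias. Objective, step size, and k are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
d = 50
A = rng.standard_normal((200, d))
y = rng.standard_normal(200)

def top_k(g, k):
    """Keep the k largest-magnitude coordinates of g, zero the rest (biased)."""
    out = np.zeros_like(g)
    idx = np.argsort(np.abs(g))[-k:]
    out[idx] = g[idx]
    return out

x, eta, k = np.zeros(d), 0.01, 5
for t in range(3000):
    i = rng.integers(len(y))                  # sample one data point
    g = (A[i] @ x - y[i]) * A[i]              # stochastic gradient of 0.5*(a.x - y)^2
    x -= eta * top_k(g, k)                    # biased (compressed) update

print(f"final least-squares loss: {0.5 * np.mean((A @ x - y) ** 2):.4f}")
```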