Black-Box Generalization
- URL: http://arxiv.org/abs/2202.06880v1
- Date: Mon, 14 Feb 2022 17:14:48 GMT
- Title: Black-Box Generalization
- Authors: Konstantinos E. Nikolakakis, Farzin Haddadpour, Dionysios S.
Kalogerias and Amin Karbasi
- Abstract summary: We provide the first generalization error analysis for black-box learning through derivative-free optimization.
We show that the generalization bounds are independent of $d$, $K$ and the batch size $m$ under appropriate choices of a slightly decreased learning rate.
- Score: 31.80268332522017
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We provide the first generalization error analysis for black-box learning
through derivative-free optimization. Under the assumption of a Lipschitz and
smooth unknown loss, we consider the Zeroth-order Stochastic Search (ZoSS)
algorithm, which updates a $d$-dimensional model by replacing stochastic
gradient directions with stochastic differences of $K+1$ perturbed loss
evaluations per dataset (example) query. For both unbounded and bounded
possibly nonconvex losses, we present the first generalization bounds for the
ZoSS algorithm. These bounds coincide with those for SGD, and rather
surprisingly are independent of $d$, $K$ and the batch size $m$, under
appropriate choices of a slightly decreased learning rate. For bounded
nonconvex losses and a batch size $m=1$, we additionally show that both
generalization error and learning rate are independent of $d$ and $K$, and
remain essentially the same as for the SGD, even for two function evaluations.
Our results extensively extend and consistently recover established results for
SGD in prior work, on both generalization bounds and corresponding learning
rates. If additionally $m=n$, where $n$ is the dataset size, we derive
generalization guarantees for full-batch GD as well.
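
To make the ZoSS update concrete, the sketch below instantiates it with a standard multi-point forward-difference gradient estimator (Gaussian search directions and a smoothing parameter `mu`). The helper names `zo_gradient` and `zoss`, the choice of Gaussian directions, and the exact scaling are illustrative assumptions, not the paper's precise construction.

```python
import numpy as np

def zo_gradient(loss, w, z, K, mu, rng):
    """Estimate the gradient of loss(w, z) from K+1 function evaluations:
    one unperturbed query plus K forward differences along random directions.
    (A common zeroth-order estimator; the paper's exact scaling may differ.)"""
    d = w.shape[0]
    base = loss(w, z)                                  # unperturbed evaluation
    g = np.zeros(d)
    for _ in range(K):
        u = rng.standard_normal(d)                     # random search direction
        g += (loss(w + mu * u, z) - base) / mu * u     # forward difference
    return g / K

def zoss(loss, w0, data, steps, lr, K=1, mu=1e-3, batch_size=1, seed=0):
    """ZoSS-style training loop: SGD with the stochastic gradient replaced
    by the derivative-free estimate above (a sketch under assumed conventions)."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float).copy()
    n = len(data)
    for t in range(steps):
        idx = rng.choice(n, size=batch_size, replace=False)   # sample a batch
        g = np.mean([zo_gradient(loss, w, data[i], K, mu, rng) for i in idx],
                    axis=0)
        step = lr(t) if callable(lr) else lr           # allows a slightly decreased schedule
        w = w - step * g
    return w
```

With `batch_size=1` each update uses only $K+1$ loss evaluations (two when $K=1$), the regime in which the abstract reports SGD-like generalization independent of $d$ and $K$; taking `batch_size=n` corresponds to the full-batch GD setting mentioned at the end of the abstract.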
Related papers
- MGDA Converges under Generalized Smoothness, Provably [27.87166415148172]
Multi-objective optimization (MOO) is receiving more attention in various fields such as multi-task learning.
Recent works provide some effective algorithms with theoretical analysis but they are limited by the standard $L$-smooth or bounded-gradient assumptions.
We study a more general and realistic class of generalized $\ell$-smooth loss functions, where $\ell$ is a general non-decreasing function of the gradient norm.
arXiv Detail & Related papers (2024-05-29T18:36:59Z) - Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic
Shortest Path [80.60592344361073]
We study the Stochastic Shortest Path (SSP) problem with a linear mixture transition kernel.
An agent repeatedly interacts with an environment and seeks to reach a certain goal state while minimizing the cumulative cost.
Existing works often assume a strictly positive lower bound of the iteration cost function or an upper bound of the expected length for the optimal policy.
arXiv Detail & Related papers (2024-02-14T07:52:00Z) - Adam-like Algorithm with Smooth Clipping Attains Global Minima: Analysis
Based on Ergodicity of Functional SDEs [0.0]
We show that an Adam-type algorithm with smooth clipping attains a global minimizer of the regularized non-convex loss function.
We also apply the ergodic theory of functional SDEs to investigate approaches to learning the inverse temperature and time.
arXiv Detail & Related papers (2023-11-29T14:38:59Z) - Topology-aware Generalization of Decentralized SGD [89.25765221779288]
This paper studies the generalizability of decentralized stochastic gradient descent (D-SGD).
We prove that the generalizability of D-SGD has a positive correlation with connectivity in the initial training phase.
arXiv Detail & Related papers (2022-06-25T16:03:48Z) - Generalization Bounds for Gradient Methods via Discrete and Continuous
Prior [8.76346911214414]
We show a new high-probability generalization bound of order $O\left(\frac{1}{n} + \frac{L^2}{n^2}\sum_{t=1}^{T}(\gamma_t/\varepsilon_t)^2\right)$ for Gradient Langevin Dynamics (GLD).
We can also obtain new bounds for certain variants of SGD.
arXiv Detail & Related papers (2022-05-27T07:23:01Z) - What Happens after SGD Reaches Zero Loss? --A Mathematical Framework [35.31946061894308]
Understanding the implicit bias of Stochastic Gradient Descent (SGD) is one of the key challenges in deep learning.
This paper gives a general framework for such analysis by adapting ideas from Katzenberger (1991).
It yields some new results: (1) a global analysis of the implicit bias valid for $\eta^{-2}$ steps, in contrast to the local analysis of Blanc et al. (2020) that is only valid for $\eta^{-1.6}$ steps, and (2) allowing arbitrary noise covariance.
arXiv Detail & Related papers (2021-10-13T17:50:46Z) - High-probability Bounds for Non-Convex Stochastic Optimization with
Heavy Tails [55.561406656549686]
We consider non-convex stochastic optimization using first-order algorithms for which the gradient estimates may have heavy tails.
We show that a combination of gradient clipping, momentum, and normalized gradient descent converges to critical points in high probability with the best-known iteration complexity for smooth losses.
arXiv Detail & Related papers (2021-06-28T00:17:01Z) - Correcting Momentum with Second-order Information [50.992629498861724]
We develop a new algorithm for non-convex stochastic optimization that finds an $\epsilon$-critical point using an optimal number of stochastic gradient and Hessian-vector product evaluations.
We validate our results on a variety of large-scale deep learning benchmarks and architectures.
arXiv Detail & Related papers (2021-03-04T19:01:20Z) - Gradient-Based Empirical Risk Minimization using Local Polynomial
Regression [39.29885444997579]
A major goal of this literature has been to compare different algorithms, such as gradient descent (GD) or stochastic gradient descent (SGD).
We demonstrate that when the loss function is smooth in the data, we can learn the oracle at every iteration and beat the oracle complexities of both GD and SGD.
arXiv Detail & Related papers (2020-11-04T20:10:31Z) - Optimal Robust Linear Regression in Nearly Linear Time [97.11565882347772]
We study the problem of high-dimensional robust linear regression where a learner is given access to $n$ samples from the generative model $Y = \langle X, w^* \rangle + \epsilon$.
We propose estimators for this problem under two settings: (i) $X$ is $L_4$-$L_2$ hypercontractive, $\mathbb{E}[XX^\top]$ has bounded condition number and $\epsilon$ has bounded variance, and (ii) $X$ is sub-Gaussian with identity second moment and $\epsilon$ is
arXiv Detail & Related papers (2020-07-16T06:44:44Z) - Private Stochastic Non-Convex Optimization: Adaptive Algorithms and
Tighter Generalization Bounds [72.63031036770425]
We propose differentially private (DP) algorithms for stochastic non-convex optimization.
We demonstrate the empirical advantages of our methods over standard gradient methods on two popular deep learning tasks.
arXiv Detail & Related papers (2020-06-24T06:01:24Z)