Generalization Properties of Stochastic Optimizers via Trajectory
Analysis
- URL: http://arxiv.org/abs/2108.00781v1
- Date: Mon, 2 Aug 2021 10:58:32 GMT
- Title: Generalization Properties of Stochastic Optimizers via Trajectory
Analysis
- Authors: Liam Hodgkinson, Umut Şimşekli, Rajiv Khanna, Michael W. Mahoney
- Abstract summary: We show that both the Fernique-Talagrand functional and the local power-law exponent are predictive of generalization performance.
- Score: 48.38493838310503
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the ubiquitous use of stochastic optimization algorithms in machine
learning, the precise impact of these algorithms on generalization performance
in realistic non-convex settings is still poorly understood. In this paper, we
provide an encompassing theoretical framework for investigating the
generalization properties of stochastic optimizers, which is based on their
dynamics. We first prove a generalization bound attributable to the optimizer
dynamics in terms of the celebrated Fernique-Talagrand functional applied to
the trajectory of the optimizer. This data- and algorithm-dependent bound is
shown to be the sharpest possible in the absence of further assumptions. We
then specialize this result by exploiting the Markovian structure of stochastic
optimizers, deriving generalization bounds in terms of the (data-dependent)
transition kernels associated with the optimization algorithms. In line with
recent work that has revealed connections between generalization and
heavy-tailed behavior in stochastic optimization, we link the generalization
error to the local tail behavior of the transition kernels. We illustrate that
the local power-law exponent of the kernel acts as an effective dimension,
which decreases as the transitions become "less Gaussian". We support our
theory with empirical results from a variety of neural networks, and we show
that both the Fernique-Talagrand functional and the local power-law exponent
are predictive of generalization performance.
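As a rough, self-contained illustration of the trajectory-level quantities discussed above, the Python sketch below estimates a power-law tail exponent from the magnitudes of optimizer parameter updates using the classical Hill estimator. This is a minimal sketch under our own assumptions: the function name hill_tail_exponent, the choice of the Hill estimator, and the simulated heavy-tailed increments are illustrative stand-ins and are not the paper's procedure, whose local power-law exponent is defined through the transition kernels of the optimizer.

import numpy as np

def hill_tail_exponent(increments, k=None):
    # Estimate a power-law tail exponent from step magnitudes via the
    # classical Hill estimator (an illustrative choice, not the paper's
    # exact estimator).
    x = np.sort(np.abs(np.asarray(increments, dtype=float)))[::-1]  # descending order
    x = x[x > 0]
    if k is None:
        k = max(10, len(x) // 10)          # use roughly the top 10% of order statistics
    logs = np.log(x[:k]) - np.log(x[k])    # log-spacings above the (k+1)-th largest value
    return 1.0 / np.mean(logs)             # Hill estimate of the tail index

# Example: in practice the increments would be collected along an optimizer
# trajectory; here they are simulated with heavy-tailed noise purely for
# demonstration (Student-t with 3 degrees of freedom has tail index ~3).
rng = np.random.default_rng(0)
steps = rng.standard_t(df=3, size=10_000)
print(f"estimated tail exponent: {hill_tail_exponent(steps):.2f}")

In the spirit of the abstract, a smaller estimated exponent would correspond to heavier-tailed ("less Gaussian") transitions and hence a smaller effective dimension.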
Related papers
- The Unified Balance Theory of Second-Moment Exponential Scaling Optimizers in Visual Tasks [4.309676284145538]
We suggest that SGD and adaptive optimizers can be unified under a broader framework.
We conducted tests on some classic datasets and networks to confirm the impact of different balance coefficients on the overall training process.
arXiv Detail & Related papers (2024-05-28T18:09:22Z) - Promises and Pitfalls of the Linearized Laplace in Bayesian Optimization [73.80101701431103]
The linearized-Laplace approximation (LLA) has been shown to be effective and efficient in constructing Bayesian neural networks.
We study the usefulness of the LLA in Bayesian optimization and highlight its strong performance and flexibility.
arXiv Detail & Related papers (2023-04-17T14:23:43Z) - Instance-Dependent Generalization Bounds via Optimal Transport [51.71650746285469]
Existing generalization bounds fail to explain crucial factors that drive the generalization of modern neural networks.
We derive instance-dependent generalization bounds that depend on the local Lipschitz regularity of the learned prediction function in the data space.
We empirically analyze our generalization bounds for neural networks, showing that the bound values are meaningful and capture the effect of popular regularization methods during training.
arXiv Detail & Related papers (2022-11-02T16:39:42Z) - Revisiting Optimal Convergence Rate for Smooth and Non-convex Stochastic
Decentralized Optimization [25.831902182404388]
Decentralized optimization is effective for reducing communication in large-scale machine learning.
This paper develops an algorithm that attains the optimal convergence rate with general weight matrices.
arXiv Detail & Related papers (2022-10-14T14:34:32Z) - Exploring the Algorithm-Dependent Generalization of AUPRC Optimization
with List Stability [107.65337427333064]
Optimization of the Area Under the Precision-Recall Curve (AUPRC) is a crucial problem in machine learning.
In this work, we present the first study of the algorithm-dependent generalization of AUPRC optimization.
Experiments on three image retrieval datasets speak to the effectiveness and soundness of our framework.
arXiv Detail & Related papers (2022-09-27T09:06:37Z) - Fractal Structure and Generalization Properties of Stochastic
Optimization Algorithms [71.62575565990502]
We prove that the generalization error of an optimization algorithm can be bounded by the 'complexity' of the fractal structure that underlies its invariant measure.
We further specialize our results to specific problems (e.g., linear/logistic regression, one-hidden-layer neural networks) and algorithms.
arXiv Detail & Related papers (2021-06-09T08:05:36Z) - Directed particle swarm optimization with Gaussian-process-based
function forecasting [15.733136147164032]
Particle swarm optimization (PSO) is an iterative search method that moves a set of candidate solutions around a search space towards the best-known global and local solutions with randomized step lengths.
We show that our algorithm attains desirable properties for exploratory and exploitative behavior.
arXiv Detail & Related papers (2021-02-08T13:02:57Z) - Multiplicative noise and heavy tails in stochastic optimization [62.993432503309485]
Stochastic optimization is central to modern machine learning, but the precise role of stochasticity in its success is still unclear.
We show that heavy-tailed fluctuations commonly arise in the optimization parameters as a result of multiplicative noise.
A detailed analysis is conducted in which we describe how key factors, including step size and data, shape this behavior, and similar results are observed on state-of-the-art neural network models.
arXiv Detail & Related papers (2020-06-11T09:58:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.