Beyond Single-Model Views for Deep Learning: Optimization versus
Generalizability of Stochastic Optimization Algorithms
- URL: http://arxiv.org/abs/2403.00574v1
- Date: Fri, 1 Mar 2024 14:55:22 GMT
- Title: Beyond Single-Model Views for Deep Learning: Optimization versus
Generalizability of Stochastic Optimization Algorithms
- Authors: Toki Tahmid Inan, Mingrui Liu, Amarda Shehu
- Abstract summary: This paper adopts a novel approach to deep learning optimization, focusing on stochastic gradient descent (SGD) and its variants.
We show that SGD and its variants demonstrate performance on par with flat-minima optimizers like SAM, albeit with half the gradient evaluations.
Our study uncovers several key findings regarding the relationship between training loss and hold-out accuracy, as well as the comparable performance of SGD and noise-enabled variants.
- Score: 13.134564730161983
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite an extensive body of literature on deep learning optimization, our
current understanding of what makes an optimization algorithm effective is
fragmented. In particular, we do not understand well whether enhanced
optimization translates to improved generalizability. Current research
overlooks the inherent stochastic nature of stochastic gradient descent (SGD)
and its variants, resulting in a lack of comprehensive benchmarking and insight
into their statistical performance. This paper aims to address this gap by
adopting a novel approach. Rather than solely evaluating the endpoint of
individual optimization trajectories, we draw from an ensemble of trajectories
to estimate the stationary distribution of stochastic optimizers. Our
investigation encompasses a wide array of techniques, including SGD and its
variants, flat-minima optimizers, and new algorithms we propose under the Basin
Hopping framework. Through our evaluation, which encompasses synthetic
functions with known minima and real-world problems in computer vision and
natural language processing, we emphasize fair benchmarking under a statistical
framework, comparing stationary distributions and establishing statistical
significance. Our study uncovers several key findings regarding the
relationship between training loss and hold-out accuracy, as well as the
comparable performance of SGD, noise-enabled variants, and novel optimizers
utilizing the BH framework. Notably, these algorithms demonstrate performance
on par with flat-minima optimizers like SAM, albeit with half the gradient
evaluations. We anticipate that our work will catalyze further exploration in
deep learning optimization, encouraging a shift away from single-model
approaches towards methodologies that acknowledge and leverage the stochastic
nature of optimizers.
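
As a rough, hedged illustration of the protocol sketched in the abstract, the snippet below (not the authors' code) runs an ensemble of independent trajectories for plain SGD and a Basin-Hopping-style SGD variant on a synthetic objective with a known minimum, treats the endpoint losses as samples from each optimizer's stationary distribution, and compares the two empirical distributions with a nonparametric test. The objective, the hop-acceptance rule, and every hyperparameter are illustrative assumptions, not the paper's benchmark setup.

```python
# Minimal sketch of an ensemble-of-trajectories comparison between SGD and a
# Basin-Hopping-style SGD variant. All specifics here are assumptions chosen
# for illustration; they are not the paper's actual objectives or settings.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def loss(w):
    # Synthetic non-convex objective with known global minimum at w = 0.
    return np.sum(w**2) + 0.5 * np.sum(np.sin(3.0 * w)**2)

def grad(w):
    # Exact gradient of the objective above.
    return 2.0 * w + 1.5 * np.sin(6.0 * w)

def sgd_trajectory(steps=2000, lr=0.02, noise=0.1, dim=10):
    # Plain SGD: gradient step plus isotropic gradient noise.
    w = rng.normal(size=dim)
    for _ in range(steps):
        w -= lr * (grad(w) + noise * rng.normal(size=dim))
    return loss(w)

def bh_sgd_trajectory(steps=2000, lr=0.02, noise=0.1, dim=10,
                      hop_every=200, hop_scale=0.5):
    # Basin-Hopping-style variant: periodically perturb the iterate and keep
    # the hop only if it does not worsen the loss (a simplified accept rule,
    # assumed here for brevity instead of a Metropolis criterion).
    w = rng.normal(size=dim)
    for t in range(steps):
        w -= lr * (grad(w) + noise * rng.normal(size=dim))
        if (t + 1) % hop_every == 0:
            candidate = w + hop_scale * rng.normal(size=dim)
            if loss(candidate) <= loss(w):
                w = candidate
    return loss(w)

# Ensemble of independent trajectories per optimizer: the endpoint losses
# approximate samples from each optimizer's stationary distribution.
n_runs = 50
sgd_samples = np.array([sgd_trajectory() for _ in range(n_runs)])
bh_samples = np.array([bh_sgd_trajectory() for _ in range(n_runs)])

# Compare the two empirical distributions rather than single best runs.
u_stat, p_value = stats.mannwhitneyu(sgd_samples, bh_samples)
print(f"SGD    median endpoint loss: {np.median(sgd_samples):.4f}")
print(f"BH-SGD median endpoint loss: {np.median(bh_samples):.4f}")
print(f"Mann-Whitney U = {u_stat:.1f}, p = {p_value:.3f}")
```

In the paper's setting, the endpoint of each trajectory would instead be a trained model evaluated on hold-out data, and the comparison would span SGD variants, flat-minima optimizers such as SAM, and the proposed BH-based algorithms; this sketch only mirrors the ensemble-and-compare structure.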
Related papers
- Parameter-Free Algorithms for Performative Regret Minimization under
Decision-Dependent Distributions [15.396561118589577]
Performative risk minimization is a formulation of optimization under decision-dependent distributions.
Our algorithms significantly improve upon existing methods based on the Lipschitz constant parameter of the distribution.
We provide experimental results that demonstrate the numerical superiority of our algorithms over the existing method and other black-box optimistic optimization methods.
arXiv Detail & Related papers (2024-02-23T08:36:28Z) - Federated Conditional Stochastic Optimization [110.513884892319]
Conditional stochastic optimization has found applications in a wide range of machine learning tasks, such as invariant learning, AUPRC maximization, and meta-learning.
This paper proposes conditional stochastic optimization algorithms for the distributed federated learning setting.
arXiv Detail & Related papers (2023-10-04T01:47:37Z) - PAO: A general particle swarm algorithm with exact dynamics and
closed-form transition densities [0.0]
Particle swarm optimisation (PSO) approaches have proven to be highly effective in a number of application areas.
In this work, a highly-general, interpretable variant of the PSO algorithm -- particle attractor algorithm (PAO) -- is proposed.
arXiv Detail & Related papers (2023-04-28T16:19:27Z) - An Empirical Evaluation of Zeroth-Order Optimization Methods on
AI-driven Molecule Optimization [78.36413169647408]
We study the effectiveness of various ZO optimization methods for optimizing molecular objectives.
We show the advantages of ZO sign-based gradient descent (ZO-signGD); a minimal sketch of the sign-based update appears after this list.
We demonstrate the potential effectiveness of ZO optimization methods on widely used benchmark tasks from the Guacamol suite.
arXiv Detail & Related papers (2022-10-27T01:58:10Z) - Bézier Flow: a Surface-wise Gradient Descent Method for
Multi-objective Optimization [12.487037582320804]
We extend the stability of optimization algorithms in the sense of Probably Approximately Correct (PAC) learning.
We show that multi-objective optimization algorithms derived from a gradient descent-based single-objective optimization algorithm are PAC stable.
arXiv Detail & Related papers (2022-05-23T07:47:58Z) - Amortized Implicit Differentiation for Stochastic Bilevel Optimization [53.12363770169761]
We study a class of algorithms for solving bilevel optimization problems in both deterministic and stochastic settings.
We exploit a warm-start strategy to amortize the estimation of the exact gradient.
By using this framework, our analysis shows these algorithms to match the computational complexity of methods that have access to an unbiased estimate of the gradient.
arXiv Detail & Related papers (2021-11-29T15:10:09Z) - Outlier-Robust Sparse Estimation via Non-Convex Optimization [73.18654719887205]
We explore the connection between high-dimensional statistics and non-convex optimization in the presence of sparsity constraints.
We develop novel and simple optimization formulations for these problems.
As a corollary, we obtain that any first-order method that efficiently converges to stationarity yields an efficient algorithm for these tasks.
arXiv Detail & Related papers (2021-09-23T17:38:24Z) - Optimum-statistical Collaboration Towards General and Efficient
Black-box Optimization [23.359363844344408]
We introduce an algorithmic framework for managing the interaction between the optimization error flux and the statistical error flux evolving in the optimization process.
Our framework and its analysis can be applied to a large family of functions and partitions that satisfy different local smoothness assumptions.
In theory, we prove the algorithm enjoys rate-optimal regret bounds under different local smoothness assumptions.
arXiv Detail & Related papers (2021-06-17T02:37:39Z) - Bilevel Optimization: Convergence Analysis and Enhanced Design [63.64636047748605]
Bilevel optimization is a tool for many machine learning problems.
We propose a novel stochastic bilevel optimizer, stocBiO, built on a sample-efficient hypergradient estimator.
arXiv Detail & Related papers (2020-10-15T18:09:48Z) - Global Optimization of Gaussian processes [52.77024349608834]
We propose a reduced-space formulation with Gaussian processes trained on few data points.
The approach also leads to a significantly smaller and computationally cheaper sub-solver for lower bounding.
In total, we reduce the time until convergence of the proposed method by orders of magnitude.
arXiv Detail & Related papers (2020-05-21T20:59:11Z) - Learning to be Global Optimizer [28.88646928299302]
We learn an optimization algorithm that combines a neural-network-based update with a local-optimum escaping capability on some benchmark functions.
We show that the learned algorithm significantly outperforms some well-known classical optimization algorithms.
arXiv Detail & Related papers (2020-03-10T03:46:25Z)
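
For the zeroth-order entry above, the following is a minimal sketch of a sign-based zeroth-order update in the style of ZO-signGD, assuming a forward-difference Gaussian-direction gradient estimator; the toy objective and all hyperparameters are assumptions for illustration, not the molecule-optimization benchmarks from that paper.

```python
# Sketch of a sign-based zeroth-order update: estimate the gradient from
# function queries only, then step in the direction of its sign.
import numpy as np

rng = np.random.default_rng(1)

def objective(x):
    # Black-box objective queried without gradients (here: a toy quadratic).
    return np.sum((x - 1.0)**2)

def zo_signgd(x0, lr=0.05, mu=0.01, n_dirs=20, steps=200):
    x = x0.copy()
    for _ in range(steps):
        g = np.zeros_like(x)
        for _ in range(n_dirs):
            u = rng.normal(size=x.shape)
            # Forward-difference estimate of the directional derivative.
            g += (objective(x + mu * u) - objective(x)) / mu * u
        g /= n_dirs
        x -= lr * np.sign(g)  # use the sign of the estimate, not its magnitude
    return x

x_final = zo_signgd(rng.normal(size=5))
print("final objective:", objective(x_final))
```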
This list is automatically generated from the titles and abstracts of the papers on this site.