Beyond Single-Model Views for Deep Learning: Optimization versus
Generalizability of Stochastic Optimization Algorithms
- URL: http://arxiv.org/abs/2403.00574v1
- Date: Fri, 1 Mar 2024 14:55:22 GMT
- Title: Beyond Single-Model Views for Deep Learning: Optimization versus
Generalizability of Stochastic Optimization Algorithms
- Authors: Toki Tahmid Inan, Mingrui Liu, Amarda Shehu
- Abstract summary: This paper adopts a novel approach to deep learning optimization, focusing on stochastic gradient descent (SGD) and its variants.
We show that SGD and its variants demonstrate performance on par with flat-minima optimizers like SAM, albeit with half the gradient evaluations.
Our study uncovers several key findings regarding the relationship between training loss and hold-out accuracy, as well as the comparable performance of SGD and noise-enabled variants.
- Score: 13.134564730161983
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite an extensive body of literature on deep learning optimization, our
current understanding of what makes an optimization algorithm effective is
fragmented. In particular, we do not understand well whether enhanced
optimization translates to improved generalizability. Current research
overlooks the inherent stochastic nature of stochastic gradient descent (SGD)
and its variants, resulting in a lack of comprehensive benchmarking and insight
into their statistical performance. This paper aims to address this gap by
adopting a novel approach. Rather than solely evaluating the endpoint of
individual optimization trajectories, we draw from an ensemble of trajectories
to estimate the stationary distribution of stochastic optimizers. Our
investigation encompasses a wide array of techniques, including SGD and its
variants, flat-minima optimizers, and new algorithms we propose under the Basin
Hopping framework. Through our evaluation, which encompasses synthetic
functions with known minima and real-world problems in computer vision and
natural language processing, we emphasize fair benchmarking under a statistical
framework, comparing stationary distributions and establishing statistical
significance. Our study uncovers several key findings regarding the
relationship between training loss and hold-out accuracy, as well as the
comparable performance of SGD, noise-enabled variants, and novel optimizers
utilizing the BH framework. Notably, these algorithms demonstrate performance
on par with flat-minima optimizers like SAM, albeit with half the gradient
evaluations. We anticipate that our work will catalyze further exploration in
deep learning optimization, encouraging a shift away from single-model
approaches towards methodologies that acknowledge and leverage the stochastic
nature of optimizers.
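
As a rough, hedged illustration of the protocol sketched in the abstract, the snippet below (not the authors' code) runs an ensemble of independent trajectories for plain SGD and a Basin-Hopping-style SGD variant on a synthetic objective with a known minimum, treats the endpoint losses as samples from each optimizer's stationary distribution, and compares the two empirical distributions with a nonparametric test. The objective, the hop-acceptance rule, and every hyperparameter are illustrative assumptions, not the paper's benchmark setup.

```python
# Minimal sketch of an ensemble-of-trajectories comparison between SGD and a
# Basin-Hopping-style SGD variant. All specifics here are assumptions chosen
# for illustration; they are not the paper's actual objectives or settings.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def loss(w):
    # Synthetic non-convex objective with known global minimum at w = 0.
    return np.sum(w**2) + 0.5 * np.sum(np.sin(3.0 * w)**2)

def grad(w):
    # Exact gradient of the objective above.
    return 2.0 * w + 1.5 * np.sin(6.0 * w)

def sgd_trajectory(steps=2000, lr=0.02, noise=0.1, dim=10):
    # Plain SGD: gradient step plus isotropic gradient noise.
    w = rng.normal(size=dim)
    for _ in range(steps):
        w -= lr * (grad(w) + noise * rng.normal(size=dim))
    return loss(w)

def bh_sgd_trajectory(steps=2000, lr=0.02, noise=0.1, dim=10,
                      hop_every=200, hop_scale=0.5):
    # Basin-Hopping-style variant: periodically perturb the iterate and keep
    # the hop only if it does not worsen the loss (a simplified accept rule,
    # assumed here for brevity instead of a Metropolis criterion).
    w = rng.normal(size=dim)
    for t in range(steps):
        w -= lr * (grad(w) + noise * rng.normal(size=dim))
        if (t + 1) % hop_every == 0:
            candidate = w + hop_scale * rng.normal(size=dim)
            if loss(candidate) <= loss(w):
                w = candidate
    return loss(w)

# Ensemble of independent trajectories per optimizer: the endpoint losses
# approximate samples from each optimizer's stationary distribution.
n_runs = 50
sgd_samples = np.array([sgd_trajectory() for _ in range(n_runs)])
bh_samples = np.array([bh_sgd_trajectory() for _ in range(n_runs)])

# Compare the two empirical distributions rather than single best runs.
u_stat, p_value = stats.mannwhitneyu(sgd_samples, bh_samples)
print(f"SGD    median endpoint loss: {np.median(sgd_samples):.4f}")
print(f"BH-SGD median endpoint loss: {np.median(bh_samples):.4f}")
print(f"Mann-Whitney U = {u_stat:.1f}, p = {p_value:.3f}")
```

In the paper's setting, the endpoint of each trajectory would instead be a trained model evaluated on hold-out data, and the comparison would span SGD variants, flat-minima optimizers such as SAM, and the proposed BH-based algorithms; this sketch only mirrors the ensemble-and-compare structure.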
Related papers
- Parameter-Free Algorithms for Performative Regret Minimization under
Decision-Dependent Distributions [15.396561118589577]
Performative risk minimization is a formulation of optimization under decision-dependent distributions.
Our algorithms significantly improve upon existing methods based on the Lipschitz constant parameter of the distribution.
We provide experimental results that demonstrate the numerical superiority of our algorithms over the existing method and other black-box optimistic optimization methods.
arXiv Detail & Related papers (2024-02-23T08:36:28Z) - Federated Conditional Stochastic Optimization [110.513884892319]
Conditional stochastic optimization has found applications in a wide range of machine learning tasks, such as invariant learning, AUPRC maximization, and meta-learning.
This paper proposes conditional stochastic optimization algorithms for the distributed federated learning setting.
arXiv Detail & Related papers (2023-10-04T01:47:37Z) - PAO: A general particle swarm algorithm with exact dynamics and
closed-form transition densities [0.0]
Particle swarm optimisation (PSO) approaches have proven to be highly effective in a number of application areas.
In this work, a highly-general, interpretable variant of the PSO algorithm -- particle attractor algorithm (PAO) -- is proposed.
arXiv Detail & Related papers (2023-04-28T16:19:27Z) - An Empirical Evaluation of Zeroth-Order Optimization Methods on
AI-driven Molecule Optimization [78.36413169647408]
We study the effectiveness of various ZO optimization methods for optimizing molecular objectives.
We show the advantages of ZO sign-based gradient descent (ZO-signGD); a minimal sketch of the sign-based update appears after this list.
We demonstrate the potential effectiveness of ZO optimization methods on widely used benchmark tasks from the Guacamol suite.
arXiv Detail & Related papers (2022-10-27T01:58:10Z) - Bézier Flow: a Surface-wise Gradient Descent Method for
Multi-objective Optimization [12.487037582320804]
We extend the stability of optimization algorithms in the sense of Probably Approximately Correct (PAC) learning.
We show that multi-objective optimization algorithms derived from a gradient descent-based single-objective optimization algorithm are PAC stable.
arXiv Detail & Related papers (2022-05-23T07:47:58Z) - Amortized Implicit Differentiation for Stochastic Bilevel Optimization [53.12363770169761]
We study a class of algorithms for solving bilevel optimization problems in both deterministic and stochastic settings.
We exploit a warm-start strategy to amortize the estimation of the exact gradient.
By using this framework, our analysis shows these algorithms to match the computational complexity of methods that have access to an unbiased estimate of the gradient.
arXiv Detail & Related papers (2021-11-29T15:10:09Z) - Outlier-Robust Sparse Estimation via Non-Convex Optimization [73.18654719887205]
We explore the connection between high-dimensional statistics and non-convex optimization in the presence of sparsity constraints.
We develop novel and simple optimization formulations for these problems.
As a corollary, we obtain that any first-order method that efficiently converges to stationarity yields an efficient algorithm for these tasks.
arXiv Detail & Related papers (2021-09-23T17:38:24Z) - Optimum-statistical Collaboration Towards General and Efficient
Black-box Optimization [23.359363844344408]
We introduce an algorithmic framework for managing the interaction between the optimization error flux and the statistical error flux evolving in the optimization process.
Our framework and its analysis can be applied to a large family of functions and partitions that satisfy different local smoothness assumptions.
In theory, we prove the algorithm enjoys rate-optimal regret bounds under different local smoothness assumptions.
arXiv Detail & Related papers (2021-06-17T02:37:39Z) - Bilevel Optimization: Convergence Analysis and Enhanced Design [63.64636047748605]
Bilevel optimization is a tool for many machine learning problems.
We propose a novel stochastic bilevel optimizer, stocBiO, built on a sample-efficient hypergradient estimator.
arXiv Detail & Related papers (2020-10-15T18:09:48Z) - Global Optimization of Gaussian processes [52.77024349608834]
We propose a reduced-space formulation with Gaussian processes trained on few data points.
The approach also leads to a significantly smaller and computationally cheaper sub-solver for lower bounding.
In total, we reduce the time until convergence of the proposed method by orders of magnitude.
arXiv Detail & Related papers (2020-05-21T20:59:11Z) - Learning to be Global Optimizer [28.88646928299302]
We learn an optimization algorithm that combines a neural-network-based update with a local-optimum escaping capability on some benchmark functions.
We show that the learned algorithm significantly outperforms some well-known classical optimization algorithms.
arXiv Detail & Related papers (2020-03-10T03:46:25Z)
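
For the zeroth-order entry above, the following is a minimal sketch of a sign-based zeroth-order update in the style of ZO-signGD, assuming a forward-difference Gaussian-direction gradient estimator; the toy objective and all hyperparameters are assumptions for illustration, not the molecule-optimization benchmarks from that paper.

```python
# Sketch of a sign-based zeroth-order update: estimate the gradient from
# function queries only, then step in the direction of its sign.
import numpy as np

rng = np.random.default_rng(1)

def objective(x):
    # Black-box objective queried without gradients (here: a toy quadratic).
    return np.sum((x - 1.0)**2)

def zo_signgd(x0, lr=0.05, mu=0.01, n_dirs=20, steps=200):
    x = x0.copy()
    for _ in range(steps):
        g = np.zeros_like(x)
        for _ in range(n_dirs):
            u = rng.normal(size=x.shape)
            # Forward-difference estimate of the directional derivative.
            g += (objective(x + mu * u) - objective(x)) / mu * u
        g /= n_dirs
        x -= lr * np.sign(g)  # use the sign of the estimate, not its magnitude
    return x

x_final = zo_signgd(rng.normal(size=5))
print("final objective:", objective(x_final))
```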
This list is automatically generated from the titles and abstracts of the papers on this site.