Nonconvex Stochastic Bregman Proximal Gradient Method with Application to Deep Learning
- URL: http://arxiv.org/abs/2306.14522v5
- Date: Mon, 20 Jan 2025 00:26:58 GMT
- Title: Nonconvex Stochastic Bregman Proximal Gradient Method with Application to Deep Learning
- Authors: Kuangyu Ding, Jingyang Li, Kim-Chuan Toh,
- Abstract summary: quadratic methods for minimizing robustness for non composite objective functions typically rely on the Lipschitz smoothness of the differentiable part.<n>We propose a family of Bregman (SBPG) methods that only adaptivity.<n>MSBPG, a momentum-based variant, enhances convergence sensitivity by relaxing the minibatch size requirement.
- Score: 9.202586157819693
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Stochastic gradient methods for minimizing nonconvex composite objective functions typically rely on the Lipschitz smoothness of the differentiable part, but this assumption fails in many important problem classes like quadratic inverse problems and neural network training, leading to instability of the algorithms in both theory and practice. To address this, we propose a family of stochastic Bregman proximal gradient (SBPG) methods that only require smooth adaptivity. SBPG replaces the quadratic approximation in SGD with a Bregman proximity measure, offering a better approximation model that handles non-Lipschitz gradients in nonconvex objectives. We establish the convergence properties of vanilla SBPG and show it achieves optimal sample complexity in the nonconvex setting. Experimental results on quadratic inverse problems demonstrate SBPG's robustness in terms of stepsize selection and sensitivity to the initial point. Furthermore, we introduce a momentum-based variant, MSBPG, which enhances convergence by relaxing the mini-batch size requirement while preserving the optimal oracle complexity. We apply MSBPG to the training of deep neural networks, utilizing a polynomial kernel function to ensure smooth adaptivity of the loss function. Experimental results on benchmark datasets confirm the effectiveness and robustness of MSBPG in training neural networks. Given its negligible additional computational cost compared to SGD in large-scale optimization, MSBPG shows promise as a universal open-source optimizer for future applications.
Related papers
- Decentralized Nonconvex Composite Federated Learning with Gradient Tracking and Momentum [78.27945336558987]
Decentralized server (DFL) eliminates reliance on client-client architecture.
Non-smooth regularization is often incorporated into machine learning tasks.
We propose a novel novel DNCFL algorithm to solve these problems.
arXiv Detail & Related papers (2025-04-17T08:32:25Z) - SeWA: Selective Weight Average via Probabilistic Masking [51.015724517293236]
We show that only a few points are needed to achieve better and faster convergence.
We transform the discrete selection problem into a continuous subset optimization framework.
We derive the SeWA's stability bounds, which are sharper than that under both convex image checkpoints.
arXiv Detail & Related papers (2025-02-14T12:35:21Z) - A Stochastic Approach to Bi-Level Optimization for Hyperparameter Optimization and Meta Learning [74.80956524812714]
We tackle the general differentiable meta learning problem that is ubiquitous in modern deep learning.
These problems are often formalized as Bi-Level optimizations (BLO)
We introduce a novel perspective by turning a given BLO problem into a ii optimization, where the inner loss function becomes a smooth distribution, and the outer loss becomes an expected loss over the inner distribution.
arXiv Detail & Related papers (2024-10-14T12:10:06Z) - Stochastic Inexact Augmented Lagrangian Method for Nonconvex Expectation
Constrained Optimization [88.0031283949404]
Many real-world problems have complicated non functional constraints and use a large number of data points.
Our proposed method outperforms an existing method with the previously best-known result.
arXiv Detail & Related papers (2022-12-19T14:48:54Z) - Smoothing Policy Iteration for Zero-sum Markov Games [9.158672246275348]
We propose the smoothing policy robustness (SPI) algorithm to solve the zero-sum MGs approximately.
Specially, the adversarial policy is served as the weight function to enable an efficient sampling over action spaces.
We also propose a model-based algorithm called Smooth adversarial Actor-critic (SaAC) by extending SPI with the function approximations.
arXiv Detail & Related papers (2022-12-03T14:39:06Z) - Gradient-Free Methods for Deterministic and Stochastic Nonsmooth
Nonconvex Optimization [94.19177623349947]
Non-smooth non optimization problems emerge in machine learning and business making.
Two core challenges impede the development of efficient methods with finitetime convergence guarantee.
Two-phase versions of GFM and SGFM are also proposed and proven to achieve improved large-deviation results.
arXiv Detail & Related papers (2022-09-12T06:53:24Z) - High Probability Complexity Bounds for Non-Smooth Stochastic Optimization with Heavy-Tailed Noise [51.31435087414348]
It is essential to theoretically guarantee that algorithms provide small objective residual with high probability.
Existing methods for non-smooth convex optimization have complexity bounds with dependence on confidence level.
We propose novel stepsize rules for two methods with gradient clipping.
arXiv Detail & Related papers (2021-06-10T17:54:21Z) - GOALS: Gradient-Only Approximations for Line Searches Towards Robust and
Consistent Training of Deep Neural Networks [0.0]
Mini-batch sub-sampling (MBSS) is favored in deep neural network training to reduce the computational cost.
We propose a gradient-only approximation line search (GOALS) with strong convergence characteristics with defined optimality criterion.
arXiv Detail & Related papers (2021-05-23T11:21:01Z) - A Retrospective Approximation Approach for Smooth Stochastic
Optimization [0.2867517731896504]
Gradient (SG) is the defactorandom iterative technique to solve optimization (SO) problems with a smooth (non-fimation) objective $imation.
arXiv Detail & Related papers (2021-03-07T16:29:36Z) - Global Convergence of Model Function Based Bregman Proximal Minimization
Algorithms [17.740376367999705]
Lipschitz mapping of a continuously differentiable function plays a crucial role in various optimization algorithms.
We propose a globally convergent algorithm called Model $$L$mad property.
arXiv Detail & Related papers (2020-12-24T08:09:22Z) - Balancing Rates and Variance via Adaptive Batch-Size for Stochastic
Optimization Problems [120.21685755278509]
In this work, we seek to balance the fact that attenuating step-size is required for exact convergence with the fact that constant step-size learns faster in time up to an error.
Rather than fixing the minibatch the step-size at the outset, we propose to allow parameters to evolve adaptively.
arXiv Detail & Related papers (2020-07-02T16:02:02Z) - Bayesian Sparse learning with preconditioned stochastic gradient MCMC
and its applications [5.660384137948734]
The proposed algorithm converges to the correct distribution with a controllable bias under mild conditions.
We show that the proposed algorithm canally converge to the correct distribution with a controllable bias under mild conditions.
arXiv Detail & Related papers (2020-06-29T20:57:20Z) - Stability of Stochastic Gradient Descent on Nonsmooth Convex Losses [52.039438701530905]
We provide sharp upper and lower bounds for several forms of gradient descent (SGD) on arbitrary Lipschitz nonsmooth convex losses.
Our bounds allow us to derive a new algorithm for differentially private nonsmooth convex optimization with optimal excess population risk.
arXiv Detail & Related papers (2020-06-12T02:45:21Z) - Stochastic Proximal Gradient Algorithm with Minibatches. Application to
Large Scale Learning Models [2.384873896423002]
We develop and analyze minibatch variants of gradient algorithm for general composite objective functions with nonsmooth components.
We provide complexity for constant and variable stepsize iteration policies obtaining that, for minibatch size $N$, after $mathcalO(frac1Nepsilon)$ $epsilon-$subity is attained in expected quadratic distance to optimal solution.
arXiv Detail & Related papers (2020-03-30T10:43:56Z) - Towards Better Understanding of Adaptive Gradient Algorithms in
Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper we analyze a variant of OptimisticOA algorithm for nonconcave minmax problems.
Our experiments show that adaptive GAN non-adaptive gradient algorithms can be observed empirically.
arXiv Detail & Related papers (2019-12-26T22:10:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.