Stochastic Average Gradient : A Simple Empirical Investigation
- URL: http://arxiv.org/abs/2310.12771v1
- Date: Thu, 27 Jul 2023 17:34:26 GMT
- Title: Stochastic Average Gradient : A Simple Empirical Investigation
- Authors: Pascal Junior Tikeng Notsawo
- Abstract summary: Stochastic average gradient (SAG) is a method for optimizing the sum of a finite number of smooth convex functions.
SAG converges faster than other optimizers on simple toy problems and performs better than many other optimizers on simple machine learning problems.
We also propose a combination of SAG with the momentum algorithm and Adam.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the recent growth of theoretical studies and empirical successes of
neural networks, gradient backpropagation is still the most widely used
algorithm for training such networks. On the one hand, we have deterministic or
full gradient (FG) approaches, whose per-iteration cost is proportional to the amount of training data but which enjoy a linear convergence rate, and, on the other hand, stochastic gradient (SG) methods, whose per-iteration cost is independent of the size of the dataset but whose convergence rate is slower than that of the deterministic approaches. To combine the cost of the stochastic approach with the convergence rate of the deterministic approach, the stochastic average gradient (SAG) method has
been proposed. SAG is a method for optimizing the sum of a finite number of
smooth convex functions. Like SG methods, the SAG method's iteration cost is
independent of the number of terms in the sum. In this work, we propose to
compare SAG to some standard optimizers used in machine learning. SAG converges
faster than other optimizers on simple toy problems and performs better than
many other optimizers on simple machine learning problems. We also propose a
combination of SAG with the momentum algorithm and Adam. These combinations
empirically achieve higher speed and better performance than the other methods, especially when the landscape of the function being optimized presents obstacles or is ill-conditioned.
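For concreteness, here is a minimal NumPy sketch of the SAG update for minimizing an average of n smooth terms, assuming grad_i(x, i) returns the gradient of the i-th term; the function and variable names are illustrative, not taken from the paper. Only one component gradient is recomputed per step, so the per-iteration cost is independent of n, while the update direction is the average of the most recently stored component gradients.

```python
import numpy as np

def sag(grad_i, x0, n, lr, epochs=50, rng=None):
    """Minimal SAG sketch for minimizing (1/n) * sum_i f_i(x).

    grad_i(x, i): gradient of the i-th smooth term at x.
    Only one component gradient is refreshed per iteration; the update
    uses the running average of the stored component gradients.
    """
    rng = rng or np.random.default_rng(0)
    x = np.array(x0, dtype=float)
    memory = np.zeros((n, x.size))   # last gradient seen for each term
    grad_sum = np.zeros(x.size)      # sum of the stored gradients

    for _ in range(epochs * n):
        i = rng.integers(n)              # pick one term uniformly at random
        g_new = grad_i(x, i)
        grad_sum += g_new - memory[i]    # refresh the running sum in O(d)
        memory[i] = g_new
        x -= lr * grad_sum / n           # step along the averaged direction
    return x

# Toy usage on a least-squares problem (hypothetical data).
rng = np.random.default_rng(1)
A = rng.normal(size=(100, 5))
b = A @ np.ones(5)
L_max = np.max(np.sum(A * A, axis=1))    # per-term Lipschitz estimate
x_hat = sag(lambda x, i: (A[i] @ x - b[i]) * A[i],
            x0=np.zeros(5), n=100, lr=1.0 / (16.0 * L_max))
```

The step size 1/(16 L_max) follows the conservative choice commonly used in SAG analyses; in practice a larger step is often tuned empirically.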
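The abstract also mentions combining SAG with momentum and with Adam, but the exact update rules are not given above. The sketch below is one plausible reading, assuming the SAG averaged direction is simply fed into a classical heavy-ball (momentum) update; the coefficient beta and the function name are illustrative assumptions, not the paper's formulation. An analogous substitution would feed the same averaged direction into Adam's first- and second-moment estimates.

```python
import numpy as np

def sag_momentum(grad_i, x0, n, lr, beta=0.9, epochs=50, rng=None):
    """Hypothetical SAG + heavy-ball sketch (assumed combination).

    Same gradient bookkeeping as plain SAG; the averaged direction is
    accumulated into a velocity term before the parameter update.
    """
    rng = rng or np.random.default_rng(0)
    x = np.array(x0, dtype=float)
    memory = np.zeros((n, x.size))
    grad_sum = np.zeros(x.size)
    velocity = np.zeros(x.size)

    for _ in range(epochs * n):
        i = rng.integers(n)
        g_new = grad_i(x, i)
        grad_sum += g_new - memory[i]
        memory[i] = g_new
        velocity = beta * velocity + grad_sum / n   # momentum on the SAG direction
        x -= lr * velocity
    return x
```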
Related papers
- The Stochastic Conjugate Subgradient Algorithm For Kernel Support Vector Machines [1.738375118265695]
This paper proposes an innovative method specifically designed for kernel support vector machines (SVMs).
It is not only faster per iteration but also exhibits enhanced convergence compared to conventional stochastic first-order (SFO) techniques.
Our experimental results demonstrate that the proposed algorithm not only maintains but potentially exceeds the scalability of SFO methods.
arXiv Detail & Related papers (2024-07-30T17:03:19Z) - Learning rate adaptive stochastic gradient descent optimization methods: numerical simulations for deep learning methods for partial differential equations and convergence analyses [5.052293146674794]
It is known that the standard stochastic gradient descent (SGD) optimization method, as well as accelerated and adaptive SGD optimization methods such as Adam, fail to converge if the learning rates do not converge to zero.
In this work we propose and study a learning-rate-adaptive approach for SGD optimization methods in which the learning rate is adjusted based on empirical estimates.
arXiv Detail & Related papers (2024-06-20T14:07:39Z) - Federated Conditional Stochastic Optimization [110.513884892319]
Conditional stochastic optimization has found applications in a wide range of machine learning tasks, such as invariant learning, AUPRC maximization, and MAML.
This paper proposes conditional stochastic optimization algorithms for the federated learning setting.
arXiv Detail & Related papers (2023-10-04T01:47:37Z) - Faster One-Sample Stochastic Conditional Gradient Method for Composite
Convex Minimization [61.26619639722804]
We propose a conditional gradient method (CGM) for minimizing convex finite-sum objectives formed as a sum of smooth and non-smooth terms.
The proposed method, equipped with a stochastic average gradient (SAG) estimator, requires only one sample per iteration. Nevertheless, it guarantees fast convergence rates on par with more sophisticated variance reduction techniques.
arXiv Detail & Related papers (2022-02-26T19:10:48Z) - Stochastic Reweighted Gradient Descent [4.355567556995855]
We propose an importance-sampling-based algorithm we call SRG (stochastic reweighted gradient); a generic importance-sampled SGD sketch appears after this list.
We pay particular attention to the time and memory overhead of our proposed method.
We present empirical results to support our findings.
arXiv Detail & Related papers (2021-03-23T04:09:43Z) - Zeroth-Order Hybrid Gradient Descent: Towards A Principled Black-Box
Optimization Framework [100.36569795440889]
This work focuses on zeroth-order (ZO) optimization, which does not require first-order gradient information (a generic two-point ZO gradient estimator is sketched after this list).
We show that, with a graceful design in coordinate importance sampling, the proposed ZO optimization method is efficient both in terms of iteration complexity and function query cost.
arXiv Detail & Related papers (2020-12-21T17:29:58Z) - Single-Timescale Stochastic Nonconvex-Concave Optimization for Smooth
Nonlinear TD Learning [145.54544979467872]
We propose two single-timescale single-loop algorithms that require only one data point each step.
Our results are expressed in the form of simultaneous primal- and dual-side convergence.
arXiv Detail & Related papers (2020-08-23T20:36:49Z) - Balancing Rates and Variance via Adaptive Batch-Size for Stochastic
Optimization Problems [120.21685755278509]
In this work, we seek to balance the fact that an attenuating step-size is required for exact convergence against the fact that a constant step-size learns faster in finite time, although only up to an error.
Rather than fixing the minibatch size and the step-size at the outset, we propose to allow these parameters to evolve adaptively.
arXiv Detail & Related papers (2020-07-02T16:02:02Z) - The Strength of Nesterov's Extrapolation in the Individual Convergence
of Nonsmooth Optimization [0.0]
We prove that Nesterov's extrapolation has the strength to make the individual convergence of gradient descent methods optimal for nonsmooth problems.
We give an extension of the derived algorithms to solve regularized learning tasks with nonsmooth losses in stochastic settings.
Our method is applicable as an efficient tool for solving large-scale $\ell_1$-regularized hinge-loss learning problems.
arXiv Detail & Related papers (2020-06-08T03:35:41Z) - Stochastic Proximal Gradient Algorithm with Minibatches. Application to
Large Scale Learning Models [2.384873896423002]
We develop and analyze minibatch variants of the stochastic proximal gradient algorithm for general composite objective functions with nonsmooth components.
We provide complexity bounds for constant and variable stepsize iteration policies, obtaining that, for minibatch size $N$, after $\mathcal{O}(\frac{1}{N\epsilon})$ iterations, $\epsilon$-suboptimality is attained in expected quadratic distance to the optimal solution.
arXiv Detail & Related papers (2020-03-30T10:43:56Z) - Variance Reduction with Sparse Gradients [82.41780420431205]
Variance reduction methods such as SVRG and SpiderBoost use a mixture of large and small batch gradients.
We introduce a new sparsity operator, the random-top-k operator (an illustrative sketch follows this list).
Our algorithm consistently outperforms SpiderBoost on various tasks including image classification, natural language processing, and sparse matrix factorization.
arXiv Detail & Related papers (2020-01-27T08:23:58Z)
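The last entry above names a random-top-k sparsity operator but does not define it here. Below is a minimal sketch of one plausible form, assuming the operator keeps the k_top largest-magnitude coordinates of a gradient plus k_rand coordinates drawn uniformly from the remainder and zeroes everything else; the exact operator in that paper may differ.

```python
import numpy as np

def random_top_k(g, k_top, k_rand, rng=None):
    """Assumed form of a blended sparsifier: keep the k_top largest-
    magnitude entries of g plus k_rand uniformly random entries from
    the remainder; zero out everything else."""
    rng = rng or np.random.default_rng(0)
    g = np.asarray(g, dtype=float)
    keep = np.zeros(g.size, dtype=bool)
    top = np.argsort(np.abs(g))[-k_top:]        # indices of largest entries
    keep[top] = True
    rest = np.flatnonzero(~keep)                # candidates for random picks
    if rest.size:
        rand = rng.choice(rest, size=min(k_rand, rest.size), replace=False)
        keep[rand] = True
    return np.where(keep, g, 0.0)

# Example: sparsify a dense gradient vector (illustrative values).
g = np.array([0.1, -2.0, 0.05, 1.5, -0.3, 0.02])
print(random_top_k(g, k_top=2, k_rand=1))
```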
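For the Stochastic Reweighted Gradient entry above, the exact SRG weighting scheme is not spelled out in the summary. The following generic importance-sampled SGD sketch, with sampling probabilities supplied by the caller, illustrates the unbiased reweighted estimator such methods build on; names and defaults are assumptions, not the paper's code.

```python
import numpy as np

def importance_sampled_sgd(grad_i, x0, n, probs, lr, steps=1000, rng=None):
    """Generic importance-sampled SGD sketch (not the exact SRG scheme).

    Index i is drawn with probability probs[i]; dividing the gradient by
    n * probs[i] keeps the estimator unbiased for the gradient of
    (1/n) * sum_i f_i(x).
    """
    rng = rng or np.random.default_rng(0)
    x = np.array(x0, dtype=float)
    probs = np.asarray(probs, dtype=float)
    probs = probs / probs.sum()              # normalize to a distribution
    for _ in range(steps):
        i = rng.choice(n, p=probs)
        x -= lr * grad_i(x, i) / (n * probs[i])
    return x
```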
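For the zeroth-order optimization entry above, a standard ingredient is a gradient estimate built purely from function evaluations. The sketch below shows a common two-point random-direction estimator; it is a generic illustration, not the specific hybrid estimator of that paper, and the function f, the smoothing radius mu, and the Gaussian direction are assumptions.

```python
import numpy as np

def zo_two_point_grad(f, x, mu=1e-4, rng=None):
    """Two-point zeroth-order gradient estimate of f at x.

    Uses only function values: g = (f(x + mu*u) - f(x)) / mu * u for a
    random Gaussian direction u. Averaging several such estimates
    reduces variance at the cost of extra function queries.
    """
    rng = rng or np.random.default_rng()
    u = rng.normal(size=x.shape)
    return (f(x + mu * u) - f(x)) / mu * u

# Example: plug the estimator into plain gradient descent (illustrative).
f = lambda x: np.sum((x - 1.0) ** 2)
rng = np.random.default_rng(2)
x = np.zeros(3)
for _ in range(500):
    x -= 0.05 * zo_two_point_grad(f, x, rng=rng)
print(x)  # approaches the minimizer near [1, 1, 1]
```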