Anderson acceleration of coordinate descent
- URL: http://arxiv.org/abs/2011.10065v3
- Date: Thu, 28 Oct 2021 16:17:24 GMT
- Title: Anderson acceleration of coordinate descent
- Authors: Quentin Bertrand and Mathurin Massias
- Abstract summary: On multiple Machine Learning problems, coordinate descent achieves performance significantly superior to full-gradient methods.
We propose an accelerated version of coordinate descent using extrapolation, showing considerable speed up in practice.
- Score: 5.794599007795348
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Acceleration of first-order methods is mainly obtained via inertial techniques à la Nesterov, or via nonlinear extrapolation. The latter has seen a recent surge of interest, with successful applications to gradient and proximal gradient techniques. On multiple Machine Learning problems, coordinate descent achieves performance significantly superior to full-gradient methods. Speeding up coordinate descent in practice is not easy: inertially accelerated versions of coordinate descent are theoretically accelerated, but might not always lead to practical speed-ups. We propose an accelerated version of coordinate descent using extrapolation, showing considerable speed-up in practice compared to inertial accelerated coordinate descent and extrapolated (proximal) gradient descent. Experiments on least squares, Lasso, elastic net and logistic regression validate the approach.
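As a rough sketch of nonlinear (Anderson-type) extrapolation applied to coordinate descent iterates, the snippet below runs cyclic coordinate descent on a Lasso problem and periodically replaces the current iterate with an extrapolated combination of the last few iterates. The window size K = 5, the small Tikhonov term on the Gram system, and the guarded acceptance test are illustrative choices, not the authors' reference implementation:

```python
import numpy as np

def soft_thresh(x, tau):
    """Soft-thresholding operator, the prox of tau * |.|_1."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def cd_epoch(w, X, y, alpha, lips):
    """One pass of cyclic coordinate descent on the Lasso
    min_w 0.5 * ||y - X w||^2 + alpha * ||w||_1."""
    r = y - X @ w                                   # residual, kept up to date
    for j in range(X.shape[1]):
        old = w[j]
        w[j] = soft_thresh(old + X[:, j] @ r / lips[j], alpha / lips[j])
        r += (old - w[j]) * X[:, j]
    return w

def anderson_extrapolate(iters, reg=1e-10):
    """Offline Anderson extrapolation of a short window of iterates:
    coefficients c solve min ||U c|| s.t. sum(c) = 1, where U stacks
    consecutive iterate differences."""
    W = np.asarray(iters)                           # (K+1, n_features)
    U = np.diff(W, axis=0)                          # (K, n_features)
    G = U @ U.T                                     # small K x K Gram matrix
    z = np.linalg.solve(G + reg * np.eye(G.shape[0]), np.ones(G.shape[0]))
    c = z / z.sum()                                 # coefficients summing to 1
    return c @ W[1:]                                # combine the last K iterates

# toy run: Lasso on random data, extrapolating every K = 5 epochs
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 50))
y = rng.standard_normal(100)
alpha = 0.1 * np.max(np.abs(X.T @ y))
lips = (X ** 2).sum(axis=0)                         # per-coordinate Lipschitz constants
w = np.zeros(50)

def objective(w):
    return 0.5 * np.sum((y - X @ w) ** 2) + alpha * np.abs(w).sum()

for outer in range(20):
    window = [w.copy()]
    for _ in range(5):
        w = cd_epoch(w, X, y, alpha, lips)
        window.append(w.copy())
    w_acc = anderson_extrapolate(window)
    if objective(w_acc) < objective(w):             # guarded acceptance
        w = w_acc
print(objective(w))
```

The extrapolation coefficients come from a K x K linear system, so the per-window overhead is negligible compared to the coordinate descent epochs themselves.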
Related papers
- Acceleration and Implicit Regularization in Gaussian Phase Retrieval [5.484345596034159]
In this setting, methods with Polyak or Nesterov momentum are shown to enjoy an implicit regularization that ensures a well-behaved, convex-like descent.
Experimental evidence shows that these methods converge faster than gradient descent in practice.
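As a concrete, hypothetical illustration of momentum methods in this setting, the sketch below runs heavy-ball (Polyak momentum) gradient descent on the standard smooth Gaussian phase retrieval loss f(x) = 1/(4m) * sum_i ((a_i^T x)^2 - y_i)^2; the random initialization, step size and momentum value are illustrative choices, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 400
x_star = rng.standard_normal(n)
A = rng.standard_normal((m, n))                 # Gaussian sensing vectors a_i
y = (A @ x_star) ** 2                           # phaseless measurements

def grad(x):
    """Gradient of f(x) = 1/(4m) * sum_i ((a_i^T x)^2 - y_i)^2."""
    Ax = A @ x
    return A.T @ ((Ax ** 2 - y) * Ax) / m

eta = 0.02 / y.mean()                           # illustrative step size
beta = 0.8                                      # illustrative Polyak momentum
x = 0.1 * rng.standard_normal(n)                # (a spectral initialization is common)
x_prev = x.copy()
for _ in range(3000):
    x_next = x - eta * grad(x) + beta * (x - x_prev)
    x_prev, x = x, x_next

# the signal is recovered only up to a global sign flip
print(min(np.linalg.norm(x - x_star), np.linalg.norm(x + x_star)))
```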
arXiv Detail & Related papers (2023-11-21T04:10:03Z) - ELRA: Exponential learning rate adaption gradient descent optimization
method [83.88591755871734]
We present a novel, fast (exponential rate), ab initio (hyper-parameter-free) gradient-based adaption method.
The main idea of the method is to adapt the learning rate $\alpha$ by situational awareness.
It can be applied to problems of any dimension $n$ and scales only linearly in $n$.
arXiv Detail & Related papers (2023-09-12T14:36:13Z) - Accelerated gradient methods for nonconvex optimization: Escape
trajectories from strict saddle points and convergence to local minima [9.66353475649122]
This paper considers the problem of understanding the behavior of a general class of accelerated gradient methods on nonconvex functions.
It shows that Nesterov's accelerated gradient method (NAG) with variable momentum avoids strict saddle points almost surely.
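For reference, the NAG iteration with a variable momentum parameter reads (the exact momentum schedule analyzed in the paper may differ):
$y_k = x_k + \beta_k (x_k - x_{k-1})$, $\quad x_{k+1} = y_k - \eta \nabla f(y_k)$, with e.g. $\beta_k = (k-1)/(k+2)$.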
arXiv Detail & Related papers (2023-07-13T19:11:07Z) - Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep
Models [158.19276683455254]
Adaptive gradient algorithms borrow the moving-average idea of heavy-ball acceleration to estimate accurate first- and second-order moments of the gradient for accelerating convergence.
Nesterov acceleration converges faster than heavy-ball acceleration in theory and also in many empirical cases.
In this paper we develop a new Nesterov momentum estimation (NME) method, which avoids the extra computation and memory overhead of computing the gradient at the extrapolation point.
We show that Adan surpasses the corresponding SoTA optimizers on both vision transformers (ViTs) and CNNs, and sets new SoTAs for many popular networks.
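Read loosely, the NME idea is to replace the gradient evaluated at an extrapolated (look-ahead) point with an estimate built from the current gradient and the last gradient difference. The sketch below contrasts the two updates; it is a hypothetical illustration of that single idea, not the full Adan optimizer, which additionally maintains adaptive second moments and decoupled weight decay:

```python
import numpy as np

def nag_step(x, x_prev, grad_fn, eta=0.1, beta=0.9):
    """Classical Nesterov step: the gradient is evaluated at the
    extrapolated (look-ahead) point y = x + beta * (x - x_prev)."""
    y = x + beta * (x - x_prev)
    return x, x - eta * grad_fn(y)                 # (new x_prev, new x)

def nme_step(x, g_prev, grad_fn, eta=0.1, beta=0.9):
    """NME-flavoured step: estimate the look-ahead gradient from the current
    gradient and the last gradient difference instead of evaluating the
    gradient at a second point (conceptual sketch, not Adan's exact update)."""
    g = grad_fn(x)
    return g, x - eta * (g + beta * (g - g_prev))  # (new g_prev, new x)

# exercise both updates on a toy quadratic f(z) = 0.5 * ||z||^2
grad_fn = lambda z: z
xa, xa_prev = np.ones(3), np.ones(3)
xb, gb_prev = np.ones(3), grad_fn(np.ones(3))
for _ in range(100):
    xa_prev, xa = nag_step(xa, xa_prev, grad_fn)
    gb_prev, xb = nme_step(xb, gb_prev, grad_fn)
print(np.linalg.norm(xa), np.linalg.norm(xb))
```

To first order, $g_k + \beta (g_k - g_{k-1})$ approximates $\nabla f(x_k + \beta(x_k - x_{k-1}))$, which is what motivates reusing the gradient difference instead of a second gradient evaluation.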
arXiv Detail & Related papers (2022-08-13T16:04:39Z) - Adapting Stepsizes by Momentumized Gradients Improves Optimization and
Generalization [89.66571637204012]
AdaMomentum performs well on vision tasks, and achieves state-of-the-art results consistently on other tasks including language processing.
arXiv Detail & Related papers (2021-06-22T03:13:23Z) - Scaling transition from momentum stochastic gradient descent to plain
stochastic gradient descent [1.7874193862154875]
Momentum stochastic gradient descent uses the accumulated gradient as the update direction for the current parameters.
Plain stochastic gradient descent, in contrast, is not corrected by the accumulated gradient.
The proposed TSGD algorithm achieves faster training, higher accuracy and better stability.
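A hypothetical sketch of such a scaling transition, in which a decaying coefficient shifts the update direction from the momentum buffer to the raw gradient; the schedule and the exact combination rule are assumptions for illustration and may differ from the paper's TSGD:

```python
import numpy as np

def tsgd_like_step(w, m, g, t, eta=0.05, beta=0.9, decay=5e-3):
    """One step of a hypothetical momentum-to-plain-SGD transition:
    lam -> 1 early (momentum direction), lam -> 0 late (raw gradient)."""
    m = beta * m + (1 - beta) * g            # accumulated (momentum) gradient
    lam = 1.0 / (1.0 + decay * t)            # decreasing scaling coefficient
    w = w - eta * (lam * m + (1 - lam) * g)  # blended update direction
    return w, m

# toy usage on f(w) = 0.5 * ||w||^2 with noisy gradients
rng = np.random.default_rng(0)
w, m = np.ones(10), np.zeros(10)
for t in range(1000):
    g = w + 0.1 * rng.standard_normal(10)    # stochastic gradient
    w, m = tsgd_like_step(w, m, g, t)
print(np.linalg.norm(w))
```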
arXiv Detail & Related papers (2021-06-12T11:42:04Z) - Decreasing scaling transition from adaptive gradient descent to
stochastic gradient descent [1.7874193862154875]
We propose DSTAda, a decreasing scaling transition from adaptive gradient descent to stochastic gradient descent.
Our experimental results show that DSTAda achieves faster convergence, higher accuracy, and better stability and robustness.
arXiv Detail & Related papers (2021-06-12T11:28:58Z) - Leveraging Non-uniformity in First-order Non-convex Optimization [93.6817946818977]
Non-uniform refinement of objective functions leads to Non-uniform Smoothness (NS) and the Non-uniform Łojasiewicz inequality (NL).
New definitions inspire new geometry-aware first-order methods that converge to global optimality faster than the classical $\Omega(1/t^2)$ lower bounds.
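For orientation, conditions of this kind are usually stated along the following lines (a hedged paraphrase, not necessarily the paper's exact definitions): $f$ is non-uniformly smooth if $\|\nabla^2 f(x)\| \le \ell(\|\nabla f(x)\|)$ for some non-decreasing function $\ell$, and satisfies a non-uniform Łojasiewicz inequality if $\|\nabla f(x)\| \ge C(x)\,|f(x) - f^*|^{\xi}$ for some exponent $\xi$ and positive function $C(x)$.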
arXiv Detail & Related papers (2021-05-13T04:23:07Z) - Cogradient Descent for Bilinear Optimization [124.45816011848096]
We introduce a Cogradient Descent algorithm (CoGD) to address the bilinear problem.
We solve for one variable by considering its coupling relationship with the other, leading to a synchronous gradient descent.
Our algorithm is applied to solve problems in which one variable is subject to a sparsity constraint.
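As a toy illustration of the bilinear setting (not the paper's CoGD algorithm, which also handles the sparsity constraint and the coupling between variables in a more refined way), the sketch below performs simultaneous ("synchronous") gradient steps on the two coupled factors of a rank-one fitting problem:

```python
import numpy as np

# toy bilinear problem: min_{u, v} 0.5 * ||A - u v^T||_F^2
rng = np.random.default_rng(0)
A = np.outer(rng.standard_normal(30), rng.standard_normal(20))  # rank-1 target
u, v = rng.standard_normal(30), rng.standard_normal(20)
eta = 0.01
for _ in range(2000):
    R = A - np.outer(u, v)                  # residual couples u and v
    gu, gv = -R @ v, -R.T @ u               # partial gradients w.r.t. u and v
    u, v = u - eta * gu, v - eta * gv       # synchronous update of both blocks
print(np.linalg.norm(A - np.outer(u, v)))
```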
arXiv Detail & Related papers (2020-06-16T13:41:54Z) - Adaptive Gradient Methods Can Be Provably Faster than SGD after Finite
Epochs [25.158203665218164]
We show that adaptive gradient methods can be faster than random shuffling SGD after finite time.
To the best of our knowledge, this is the first result demonstrating that adaptive gradient methods can be faster than SGD after finite time.
arXiv Detail & Related papers (2020-06-12T09:39:47Z) - Towards Better Understanding of Adaptive Gradient Algorithms in
Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper we analyze a variant of the Optimistic Adagrad (OAdagrad) algorithm for nonconcave min-max problems.
Our experiments show that the advantage of adaptive over non-adaptive gradient algorithms in GAN training can be observed empirically.
arXiv Detail & Related papers (2019-12-26T22:10:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.