Fast Margin Maximization via Dual Acceleration
- URL: http://arxiv.org/abs/2107.00595v1
- Date: Thu, 1 Jul 2021 16:36:39 GMT
- Title: Fast Margin Maximization via Dual Acceleration
- Authors: Ziwei Ji, Nathan Srebro, Matus Telgarsky
- Abstract summary: We present and analyze a momentum-based method for training linear classifiers with an exponentially-tailed loss.
This momentum-based method is derived via the convex dual of the maximum-margin problem, and specifically by applying Nesterov acceleration to this dual.
- Score: 52.62944011696364
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present and analyze a momentum-based gradient method for training linear
classifiers with an exponentially-tailed loss (e.g., the exponential or
logistic loss), which maximizes the classification margin on separable data at
a rate of $\widetilde{\mathcal{O}}(1/t^2)$. This contrasts with a rate of
$\mathcal{O}(1/\log(t))$ for standard gradient descent, and $\mathcal{O}(1/t)$
for normalized gradient descent. This momentum-based method is derived via the
convex dual of the maximum-margin problem, and specifically by applying
Nesterov acceleration to this dual, which manages to result in a simple and
intuitive method in the primal. This dual view can also be used to derive a
stochastic variant, which performs adaptive non-uniform sampling via the dual
variables.
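To make the contrast concrete, here is a minimal sketch that trains a linear classifier with the exponential loss on separable toy data and compares the normalized margin reached by plain gradient descent against a generic Nesterov-style momentum update on normalized gradients. The momentum update, toy data, step sizes, and momentum schedule are all illustrative assumptions: the paper's actual iterates are derived by applying Nesterov acceleration to the dual of the maximum-margin problem and differ from this stand-in.

```python
# A minimal, illustrative sketch (not the paper's algorithm): compare margin
# growth for plain gradient descent on the exponential loss with a generic
# Nesterov-style momentum update on normalized gradients.
import numpy as np

rng = np.random.default_rng(0)

# Linearly separable toy data: labels come from a random ground-truth direction.
n, d = 200, 5
X = rng.normal(size=(n, d))
w_star = rng.normal(size=d)
y = np.sign(X @ w_star)
X *= y[:, None]  # fold labels into the data: row i becomes y_i * x_i

def exp_loss_grad(w):
    """Gradient of the empirical exponential loss (1/n) * sum_i exp(-<w, y_i x_i>)."""
    z = np.exp(-X @ w)
    return -(X * z[:, None]).mean(axis=0)

def margin(w):
    """Normalized margin min_i <w, y_i x_i> / ||w||."""
    norm_w = np.linalg.norm(w)
    return (X @ w).min() / norm_w if norm_w > 0 else 0.0

T, eta = 2000, 0.1  # illustrative horizon and step size
w_gd = np.zeros(d)
w_mom, v = np.zeros(d), np.zeros(d)

for t in range(1, T + 1):
    # Plain gradient descent on the exponential loss.
    w_gd -= eta * exp_loss_grad(w_gd)

    # Nesterov-style momentum on the normalized gradient (a stand-in only;
    # the paper derives its momentum form from acceleration on the dual).
    beta = (t - 1) / (t + 2)
    g = exp_loss_grad(w_mom + beta * v)
    g_norm = np.linalg.norm(g)
    step = -eta * g / g_norm if g_norm > 0 else np.zeros(d)
    v = beta * v + step
    w_mom = w_mom + v

print(f"normalized margin, gradient descent:  {margin(w_gd):.4f}")
print(f"normalized margin, momentum (sketch): {margin(w_mom):.4f}")
print(f"normalized margin of w_star (lower bound on max margin): {margin(w_star):.4f}")
```

On separable data, both runs drive the normalized margin upward; the paper's point is that its dual-accelerated method does so at a $\widetilde{\mathcal{O}}(1/t^2)$ rate rather than the $\mathcal{O}(1/\log(t))$ rate of standard gradient descent.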
Related papers
- Methods for Convex $(L_0,L_1)$-Smooth Optimization: Clipping, Acceleration, and Adaptivity [50.25258834153574]
We focus on the class of (strongly) convex $(L_0,L_1)$-smooth functions and derive new convergence guarantees for several existing methods.
In particular, we derive improved convergence rates for Gradient Descent with Smoothed Gradient Clipping and for Gradient Descent with Polyak Stepsizes.
arXiv Detail & Related papers (2024-09-23T13:11:37Z) - The Implicit Bias of Batch Normalization in Linear Models and Two-layer
Linear Convolutional Neural Networks [117.93273337740442]
We show that gradient descent converges to a uniform margin classifier on the training data with an $\exp(-\Omega(\log^2 t))$ convergence rate.
We also show that batch normalization has an implicit bias towards a patch-wise uniform margin.
arXiv Detail & Related papers (2023-06-20T16:58:00Z) - Accelerated Quasi-Newton Proximal Extragradient: Faster Rate for Smooth
Convex Optimization [26.328847475942894]
We prove that our method can achieve a convergence rate of $\mathcal{O}\bigl(\min\bigl\{\frac{1}{k^2}, \frac{\sqrt{d}\,\log k}{k^{2.5}}\bigr\}\bigr)$, where $d$ is the problem dimension and $k$ is the number of iterations.
To the best of our knowledge, this result is the first to demonstrate a provable gain of a quasi-Newton-type method over Nesterov's accelerated gradient.
arXiv Detail & Related papers (2023-06-03T23:31:27Z) - Optimal Extragradient-Based Bilinearly-Coupled Saddle-Point Optimization [116.89941263390769]
We consider the smooth convex-concave bilinearly-coupled saddle-point problem $\min_{\mathbf{x}}\max_{\mathbf{y}} F(\mathbf{x}) + H(\mathbf{x},\mathbf{y}) - G(\mathbf{y})$, where one has access to first-order oracles for $F$, $G$ as well as the bilinear coupling function $H$.
We present an accelerated gradient-extragradient (AG-EG) descent-ascent algorithm that combines extragradient steps with Nesterov acceleration.
arXiv Detail & Related papers (2022-06-17T06:10:20Z) - High-probability Bounds for Non-Convex Stochastic Optimization with
Heavy Tails [55.561406656549686]
We consider non-convex stochastic optimization using first-order algorithms for which the gradient estimates may have heavy tails.
We show that a combination of gradient clipping, momentum, and normalized gradient descent converges to critical points in high probability with the best-known rates for smooth losses.
arXiv Detail & Related papers (2021-06-28T00:17:01Z) - A Variance Controlled Stochastic Method with Biased Estimation for
Faster Non-convex Optimization [0.0]
We propose a new technique, variance controlled stochastic gradient (VCSG), to improve the performance of stochastic variance reduced gradient (SVRG).
A parameter $\lambda$ is introduced in VCSG to avoid over-reducing the variance by SVRG.
The method requires $\mathcal{O}(\min\{1/\epsilon^{3/2}, n^{1/4}/\epsilon\})$ gradient evaluations, which improves the leading gradient complexity.
arXiv Detail & Related papers (2021-02-19T12:22:56Z) - A Unified Analysis of First-Order Methods for Smooth Games via Integral
Quadratic Constraints [10.578409461429626]
In this work, we adapt integral quadratic constraints theory to first-order methods for smooth and strongly monotone games.
We provide, for the first time, a global convergence rate for the negative momentum method (NM) with an iteration complexity of $\mathcal{O}(\kappa^{1.5})$, which matches its known lower bound.
We show that it is impossible for an algorithm with one step of memory to achieve acceleration if it only queries the gradient once per batch.
arXiv Detail & Related papers (2020-09-23T20:02:00Z) - Adaptive Gradient Methods Can Be Provably Faster than SGD after Finite
Epochs [25.158203665218164]
We show that adaptive gradient methods can be faster than random shuffling SGD after finite time.
To the best of our knowledge, this is the first result demonstrating that adaptive gradient methods can be faster than SGD after finite time.
arXiv Detail & Related papers (2020-06-12T09:39:47Z) - Stochastic gradient-free descents [8.663453034925363]
We propose stochastic gradient-free methods and accelerated variants with momentum for solving optimization problems.
We analyze the convergence behavior of these methods under the mean-variance framework.
arXiv Detail & Related papers (2019-12-31T13:56:36Z) - Towards Better Understanding of Adaptive Gradient Algorithms in
Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper we analyze a variant of the Optimistic AdaGrad (OAdagrad) algorithm for nonconvex-nonconcave min-max problems.
Our experiments show that the advantage of adaptive gradient algorithms over non-adaptive ones in GAN training can be observed empirically.
arXiv Detail & Related papers (2019-12-26T22:10:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented here and is not responsible for any consequences arising from its use.