Achieving Margin Maximization Exponentially Fast via Progressive Norm
Rescaling
- URL: http://arxiv.org/abs/2311.14387v3
- Date: Mon, 29 Jan 2024 00:53:58 GMT
- Title: Achieving Margin Maximization Exponentially Fast via Progressive Norm
Rescaling
- Authors: Mingze Wang, Zeping Min, Lei Wu
- Abstract summary: We investigate margin-maximization bias by gradient-based algorithms in classifying linearly separable data.
We propose a novel algorithm called Progressive Rescaling Gradient (PRGD) and show that PRGD can maximize the margin at an em exponential rate
PRGD also shows promise in enhancing the generalization performance when applied to linearly non-separable datasets and deep neural networks.
- Score: 7.6730288475318815
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we investigate the margin-maximization bias exhibited by
gradient-based algorithms in classifying linearly separable data. We present an
in-depth analysis of the specific properties of the velocity field associated
with (normalized) gradients, focusing on their role in margin maximization.
Inspired by this analysis, we propose a novel algorithm called Progressive
Rescaling Gradient Descent (PRGD) and show that PRGD can maximize the margin at
an exponential rate. This stands in stark contrast to all existing
algorithms, which maximize the margin at a slow polynomial rate.
Specifically, we identify mild conditions on data distribution under which
existing algorithms such as gradient descent (GD) and normalized gradient
descent (NGD) provably fail in maximizing the margin efficiently. To
validate our theoretical findings, we present both synthetic and real-world
experiments. Notably, PRGD also shows promise in enhancing the generalization
performance when applied to linearly non-separable datasets and deep neural
networks.
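The abstract describes PRGD only at a high level, so below is a minimal sketch of the progressive-norm-rescaling idea for a linear classifier trained on the logistic (exponentially-tailed) loss. The stage count, step counts, learning rate, and geometric scaling factor rho are illustrative assumptions, and the helper names (prgd_sketch, normalized_margin) are hypothetical; the paper's exact rescaling schedule and analysis conditions are not reproduced here.

```python
import numpy as np

def normalized_margin(w, X, y):
    """min_i y_i <x_i, w> / ||w||, the quantity the algorithm aims to maximize."""
    return np.min(y * (X @ w)) / np.linalg.norm(w)

def prgd_sketch(X, y, stages=10, steps_per_stage=200, lr=0.1, rho=2.0):
    """Sketch of progressive norm rescaling: run normalized gradient steps on the
    logistic loss, then periodically rescale w to a geometrically growing norm.
    All hyperparameters here are illustrative, not the paper's schedule."""
    n, d = X.shape
    rng = np.random.default_rng(0)
    w = 1e-3 * rng.standard_normal(d)          # small non-zero init so ||w|| > 0
    for k in range(1, stages + 1):
        for _ in range(steps_per_stage):
            m = y * (X @ w)                    # per-sample margins
            s = np.exp(-np.logaddexp(0.0, m))  # sigmoid(-m), computed stably
            grad = -(X.T @ (y * s)) / n        # gradient of the logistic loss
            g = np.linalg.norm(grad)
            if g > 0:
                w = w - lr * grad / g          # normalized gradient step
        # progressive rescaling: push ||w|| up geometrically; for separable data
        # and exponentially-tailed losses this sharpens the directional dynamics
        w = (rho ** k) * w / np.linalg.norm(w)
    return w, normalized_margin(w, X, y)
```

On linearly separable data, the returned normalized margin should grow across stages; the paper's contribution is showing that, under suitable conditions, a schedule of this kind drives the margin to its maximum at an exponential rate, whereas GD and NGD only achieve polynomial rates.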
Related papers
- Nesterov acceleration in benignly non-convex landscapes [0.0]
We show that momentum-based optimization algorithms can be used in the notoriously non-convex setting of deep learning problems.
In this article, we partially close the gap between the theory of accelerated methods and this practical setting.
arXiv Detail & Related papers (2024-10-10T22:02:10Z) - Robust Stochastic Optimization via Gradient Quantile Clipping [6.2844649973308835]
We introduce a quantile clipping strategy for Stochastic Gradient Descent (SGD).
Quantiles of the gradient norm are used as clipping thresholds, making the updates robust to heavy-tailed noise and outliers; a minimal sketch of the idea appears after this list.
We propose an implementation of the algorithm using rolling quantiles.
arXiv Detail & Related papers (2023-09-29T15:24:48Z) - Stochastic Marginal Likelihood Gradients using Neural Tangent Kernels [78.6096486885658]
We introduce lower bounds to the linearized Laplace approximation of the marginal likelihood.
These bounds are amenable to gradient-based optimization and allow one to trade off estimation accuracy against computational complexity.
arXiv Detail & Related papers (2023-06-06T19:02:57Z) - Rigorous dynamical mean field theory for stochastic gradient descent
methods [17.90683687731009]
We prove closed-form equations for the exact high-dimensional asymptotics of a family of first-order gradient-based methods.
This includes widely used algorithms such as stochastic gradient descent (SGD) or Nesterov acceleration.
arXiv Detail & Related papers (2022-10-12T21:10:55Z) - Differentiable Annealed Importance Sampling and the Perils of Gradient
Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z) - Fast Margin Maximization via Dual Acceleration [52.62944011696364]
We present and analyze a momentum-based method for training linear classifiers with an exponentially-tailed loss.
This momentum-based method is derived via the convex dual of the maximum-margin problem, and specifically by applying Nesterov acceleration to this dual.
arXiv Detail & Related papers (2021-07-01T16:36:39Z) - Zeroth-Order Hybrid Gradient Descent: Towards A Principled Black-Box
Optimization Framework [100.36569795440889]
This work studies zeroth-order (ZO) optimization, which does not require first-order gradient information.
We show that with a graceful design in coordinate importance sampling, the proposed ZO optimization method is efficient both in terms of iteration complexity and function query cost.
arXiv Detail & Related papers (2020-12-21T17:29:58Z) - Improved Analysis of Clipping Algorithms for Non-convex Optimization [19.507750439784605]
Recently, Zhang et al. (2019) showed that clipped (stochastic) Gradient Descent (GD) converges faster than vanilla GD/SGD.
Experiments confirm the superiority of clipping-based methods in deep learning tasks.
arXiv Detail & Related papers (2020-10-05T14:36:59Z) - Channel-Directed Gradients for Optimization of Convolutional Neural
Networks [50.34913837546743]
We introduce optimization methods for convolutional neural networks that can be used to improve existing gradient-based optimization in terms of generalization error.
We show that defining the gradients along the output channel direction leads to a performance boost, while other directions can be detrimental.
arXiv Detail & Related papers (2020-08-25T00:44:09Z) - Dynamical mean-field theory for stochastic gradient descent in Gaussian
mixture classification [25.898873960635534]
We analyze in closed form the learning dynamics of stochastic gradient descent (SGD) for a single-layer neural network classifying a high-dimensional Gaussian mixture.
We define a prototype process which can be extended to a continuous-time stochastic gradient flow.
In the full-batch limit, we recover the standard gradient flow.
arXiv Detail & Related papers (2020-06-10T22:49:41Z) - Towards Better Understanding of Adaptive Gradient Algorithms in
Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper we analyze a variant of the Optimistic AdaGrad algorithm for nonconvex-nonconcave minmax problems.
Our experiments show that the advantage of adaptive over non-adaptive gradient algorithms in GAN training can be observed empirically.
arXiv Detail & Related papers (2019-12-26T22:10:10Z)
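As referenced in the gradient-quantile-clipping entry above, here is a hedged sketch of that idea: clip each stochastic gradient at a rolling quantile of recently observed gradient norms. The window size, quantile level, and names (sgd_quantile_clip, grad_fn) are assumptions for illustration, not the paper's exact estimator or guarantees.

```python
import numpy as np

def sgd_quantile_clip(grad_fn, w0, n_steps=1000, lr=0.01,
                      q=0.9, window=100, seed=0):
    """SGD where the clipping threshold is a rolling quantile of recent gradient
    norms, so heavy-tailed gradients and outliers get shrunk. Hedged sketch of
    the quantile-clipping idea, not the paper's exact method."""
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    recent = []                                  # recent gradient norms
    for _ in range(n_steps):
        g = grad_fn(w, rng)                      # stochastic gradient (user-supplied)
        gnorm = np.linalg.norm(g)
        recent.append(gnorm)
        if len(recent) > window:
            recent.pop(0)
        tau = np.quantile(recent, q)             # rolling quantile threshold
        if gnorm > tau > 0:
            g = g * (tau / gnorm)                # clip to the threshold
        w = w - lr * g
    return w
```

For instance, grad_fn(w, rng) might return a minibatch gradient corrupted by occasional heavy-tailed noise; clipping at the 0.9 quantile leaves typical steps untouched while shrinking outliers.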
This list is automatically generated from the titles and abstracts of the papers on this site.