Rethinking SIGN Training: Provable Nonconvex Acceleration without First- and Second-Order Gradient Lipschitz
- URL: http://arxiv.org/abs/2310.14616v1
- Date: Mon, 23 Oct 2023 06:48:43 GMT
- Title: Rethinking SIGN Training: Provable Nonconvex Acceleration without First- and Second-Order Gradient Lipschitz
- Authors: Tao Sun, Congliang Chen, Peng Qiao, Li Shen, Xinwang Liu, Dongsheng Li
- Abstract summary: Sign-based methods have gained attention due to their ability to achieve robust performance despite using only the sign information for parameter updates.
The current convergence analysis of sign-based methods relies on the strong assumptions of first-order and second-order gradient Lipschitz.
In this paper we analyze their convergence under more realistic assumptions of first- and second-order smoothness.
- Score: 66.22095739795068
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sign-based stochastic methods have gained attention due to their ability to
achieve robust performance despite using only the sign information for
parameter updates. However, the current convergence analysis of sign-based
methods relies on the strong assumptions of first-order gradient Lipschitz and
second-order gradient Lipschitz, which may not hold in practical tasks like
deep neural network training that involve high non-smoothness. In this paper,
we revisit sign-based methods and analyze their convergence under more
realistic assumptions of first- and second-order smoothness. We first establish
the convergence of the sign-based method under a weak first-order Lipschitz condition.
Motivated by this weak first-order Lipschitz condition, we propose a relaxed second-order
condition that still allows for nonconvex acceleration in sign-based methods.
Based on our theoretical results, we gain insights into the computational
advantages of the recently developed LION algorithm. In distributed settings,
we prove that this nonconvex acceleration persists with linear speedup in the
number of nodes, when utilizing fast communication compression gossip
protocols. The novelty of our theoretical results is that they are derived
under much weaker assumptions, thereby expanding the provable applicability of
sign-based algorithms to a wider range of problems.
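As a concrete reference for the updates discussed above, here is a minimal NumPy sketch of a signSGD step and a LION-style step. Hyperparameter values are illustrative, and this is our reading of the commonly stated update rules, not the paper's code.

```python
import numpy as np

def signsgd_step(x, grad, lr=1e-3):
    """One signSGD step: only the sign of the stochastic gradient is used."""
    return x - lr * np.sign(grad)

def lion_step(x, m, grad, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """One LION-style step: the update direction is the sign of an
    interpolation between the momentum buffer and the current gradient."""
    update = np.sign(beta1 * m + (1.0 - beta1) * grad)   # sign of interpolated direction
    x_new = x - lr * (update + weight_decay * x)         # decoupled weight decay
    m_new = beta2 * m + (1.0 - beta2) * grad             # momentum tracks the gradient
    return x_new, m_new
```

In the distributed setting analyzed in the paper, each node would additionally exchange compressed (sign) messages over a gossip protocol before applying such a step; that part is omitted from the sketch.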
Related papers
- Stochastic Zeroth-Order Optimization under Strongly Convexity and Lipschitz Hessian: Minimax Sample Complexity [59.75300530380427]
We consider the problem of optimizing second-order smooth and strongly convex functions where the algorithm only has access to noisy evaluations of the objective function it queries.
We provide the first tight characterization for the rate of the minimax simple regret by developing matching upper and lower bounds.
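For context, a generic two-point zeroth-order gradient estimator of the kind analyzed in this setting (a sketch; the Gaussian direction and the smoothing radius mu are illustrative choices, not the paper's minimax-optimal algorithm):

```python
import numpy as np

def two_point_grad_estimate(f, x, mu=1e-4, rng=None):
    """Two-point zeroth-order gradient estimate along a random Gaussian direction u:
    g = (f(x + mu*u) - f(x - mu*u)) / (2*mu) * u,
    an unbiased estimate of the gradient of the Gaussian-smoothed objective."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    u = rng.standard_normal(x.shape)              # random search direction
    return (f(x + mu * u) - f(x - mu * u)) / (2.0 * mu) * u
```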
arXiv Detail & Related papers (2024-06-28T02:56:22Z)
- Stochastic Approximation with Delayed Updates: Finite-Time Rates under Markovian Sampling [73.5602474095954]
We study the non-asymptotic performance of stochastic approximation schemes with delayed updates under Markovian sampling.
Our theoretical findings shed light on the finite-time effects of delays for a broad class of algorithms.
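A schematic of a delayed-update scheme of this kind (a generic sketch with a fixed delay tau; the paper's setting with Markovian sampling is more general):

```python
from collections import deque
import numpy as np

def delayed_sgd(grad_fn, x0, steps=100, lr=0.01, tau=5):
    """SGD with a fixed update delay: the step at time t uses the gradient
    evaluated at the iterate from tau steps earlier."""
    x = np.asarray(x0, dtype=float)
    history = deque([x.copy()] * (tau + 1), maxlen=tau + 1)  # oldest entry is the stale iterate
    for _ in range(steps):
        stale_x = history[0]              # iterate from tau steps ago
        x = x - lr * grad_fn(stale_x)     # update with the stale gradient
        history.append(x.copy())
    return x
```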
arXiv Detail & Related papers (2024-02-19T03:08:02Z)
- Efficient Bound of Lipschitz Constant for Convolutional Layers by Gram Iteration [122.51142131506639]
We introduce a precise, fast, and differentiable upper bound for the spectral norm of convolutional layers using circulant matrix theory.
We show through a comprehensive set of experiments that our approach outperforms other state-of-the-art methods in terms of precision, computational cost, and scalability.
It proves highly effective for the Lipschitz regularization of convolutional neural networks, with competitive results against concurrent approaches.
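The core Gram-iteration bound, sketched for a plain weight matrix. This is our reading of the idea; the paper's contribution is extending it efficiently to convolutional layers via circulant matrix theory.

```python
import numpy as np

def gram_spectral_bound(W, n_iter=6):
    """Upper bound on the spectral norm sigma(W) by repeated Gram squaring:
    sigma(W)^(2^k) = sigma(W_k) <= ||W_k||_F, hence sigma(W) <= ||W_k||_F^(1/2^k).
    The bound tightens quickly with k; rescaling keeps the iterates finite."""
    A = np.asarray(W, dtype=np.float64)
    log_norm = 0.0
    for _ in range(n_iter):
        A = A @ A.T                      # Gram step: squares every singular value
        f = np.linalg.norm(A)            # Frobenius norm of the rescaled iterate
        log_norm = 2.0 * log_norm + np.log(f)
        A = A / f                        # rescale to avoid overflow
    return float(np.exp(log_norm / 2 ** n_iter))

# Sanity check against the exact spectral norm:
W = np.random.default_rng(0).standard_normal((64, 32))
print(gram_spectral_bound(W), np.linalg.svd(W, compute_uv=False)[0])
```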
arXiv Detail & Related papers (2023-05-25T15:32:21Z)
- Revisiting and Advancing Fast Adversarial Training Through The Lens of Bi-Level Optimization [60.72410937614299]
We propose a new tractable bi-level optimization problem, and design and analyze a new set of algorithms termed Fast Bi-level AT (Fast-BAT).
Fast-BAT is capable of defending against sign-based projected gradient descent (PGD) attacks without calling any gradient sign method or explicit robust regularization.
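For reference, the sign-based l-infinity PGD attack step that such defenses target looks roughly like this (a generic sketch; alpha and eps are illustrative values):

```python
import numpy as np

def pgd_step(x_adv, x_clean, grad_x, alpha=2 / 255, eps=8 / 255):
    """One l-infinity PGD step: move along the sign of the input gradient,
    then project back into the eps-ball around the clean input and into [0, 1]."""
    x_adv = x_adv + alpha * np.sign(grad_x)               # signed ascent on the loss
    x_adv = np.clip(x_adv, x_clean - eps, x_clean + eps)  # project onto the eps-ball
    return np.clip(x_adv, 0.0, 1.0)                       # keep a valid pixel range
```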
arXiv Detail & Related papers (2021-12-23T06:25:36Z)
- Learning Frequency Domain Approximation for Binary Neural Networks [68.79904499480025]
We propose to estimate the gradient of the sign function in the Fourier frequency domain using a combination of sine functions for training BNNs.
The experiments on several benchmark datasets and neural architectures illustrate that the binary network learned using our method achieves the state-of-the-art accuracy.
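The underlying idea, sketched below: the sign function is a square wave whose Fourier series is (4/pi) * sum_k sin((2k+1)x)/(2k+1), so the derivative of a truncated series can serve as a surrogate gradient for the almost-everywhere-zero derivative of sign. The period and truncation here are illustrative, not necessarily the paper's exact parameterization.

```python
import numpy as np

def sign_fourier(x, n_terms=5):
    """Truncated Fourier sine series of the square wave equal to sign(x) on (-pi, pi)."""
    x = np.asarray(x, dtype=float)
    k = np.arange(n_terms)
    return (4.0 / np.pi) * np.sum(np.sin((2 * k + 1) * x[..., None]) / (2 * k + 1), axis=-1)

def sign_fourier_grad(x, n_terms=5):
    """Derivative of the truncated series: a smooth surrogate gradient for sign."""
    x = np.asarray(x, dtype=float)
    k = np.arange(n_terms)
    return (4.0 / np.pi) * np.sum(np.cos((2 * k + 1) * x[..., None]), axis=-1)
```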
arXiv Detail & Related papers (2021-03-01T08:25:26Z)
- Exact Asymptotics for Linear Quadratic Adaptive Control [6.287145010885044]
We study the simplest non-bandit reinforcement learning problem: linear quadratic adaptive control (LQAC).
We derive expressions for the regret, estimation error, and prediction error of a stepwise-updating LQAC algorithm.
In simulations on both stable and unstable systems, we find that our theory also describes the algorithm's finite-sample behavior remarkably well.
arXiv Detail & Related papers (2020-11-02T22:43:30Z)
- Improved Analysis of Clipping Algorithms for Non-convex Optimization [19.507750439784605]
Recently, Zhang et al. (2019) showed that clipped (stochastic) gradient descent converges faster than vanilla GD/SGD.
Experiments confirm the superiority of clipping-based methods in deep learning tasks.
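The clipped update analyzed in this line of work, as a short sketch (the threshold value is illustrative):

```python
import numpy as np

def clipped_gd_step(x, grad, lr=0.1, clip_threshold=1.0):
    """Clipped gradient descent: rescale the gradient so its norm never
    exceeds clip_threshold, then take an ordinary descent step."""
    g_norm = np.linalg.norm(grad)
    scale = min(1.0, clip_threshold / (g_norm + 1e-12))   # no-op when the norm is small
    return x - lr * scale * grad
```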
arXiv Detail & Related papers (2020-10-05T14:36:59Z)
- A Unified Analysis of First-Order Methods for Smooth Games via Integral Quadratic Constraints [10.578409461429626]
In this work, we adapt the integral quadratic constraints theory to first-order methods for smooth and strongly monotone games.
We provide, for the first time, a global convergence rate for the negative momentum method (NM) with a complexity of $\mathcal{O}(\kappa^{1.5})$, which matches its known lower bound.
We show that it is impossible for an algorithm with one step of memory to achieve acceleration if it only queries the gradient once per batch.
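A sketch of the negative momentum iteration for a two-player smooth game, i.e., simultaneous gradient steps with a heavy-ball coefficient beta < 0 (step sizes are illustrative; this is the generic method, not the IQC analysis itself):

```python
import numpy as np

def negative_momentum(grad_x, grad_y, x, y, steps=1000, lr=0.01, beta=-0.5):
    """Simultaneous gradient descent/ascent on a min-max game with a
    negative heavy-ball momentum term (beta < 0)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    x_prev, y_prev = x.copy(), y.copy()
    for _ in range(steps):
        x_new = x - lr * grad_x(x, y) + beta * (x - x_prev)   # descent for the min player
        y_new = y + lr * grad_y(x, y) + beta * (y - y_prev)   # ascent for the max player
        x_prev, y_prev, x, y = x, y, x_new, y_new
    return x, y
```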
arXiv Detail & Related papers (2020-09-23T20:02:00Z)
- Towards Understanding Label Smoothing [36.54164997035046]
Label smoothing regularization (LSR) has great success in training deep neural networks.
We show that an appropriate LSR can help to speed up convergence by reducing the variance.
We propose a simple yet effective strategy, namely the Two-Stage LAbel smoothing algorithm (TSLA).
arXiv Detail & Related papers (2020-06-20T20:36:17Z)
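For completeness, the label-smoothing target that LSR trains against, as a quick sketch (epsilon is illustrative):

```python
import numpy as np

def smooth_labels(labels, num_classes, eps=0.1):
    """Replace one-hot targets with (1 - eps) * one_hot + eps / num_classes."""
    one_hot = np.eye(num_classes)[labels]
    return (1.0 - eps) * one_hot + eps / num_classes

# Example: 3 samples, 5 classes
print(smooth_labels(np.array([0, 2, 4]), num_classes=5))
```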