Bidirectional Looking with A Novel Double Exponential Moving Average to
Adaptive and Non-adaptive Momentum Optimizers
- URL: http://arxiv.org/abs/2307.00631v1
- Date: Sun, 2 Jul 2023 18:16:06 GMT
- Title: Bidirectional Looking with A Novel Double Exponential Moving Average to
Adaptive and Non-adaptive Momentum Optimizers
- Authors: Yineng Chen, Zuchao Li, Lefei Zhang, Bo Du, Hai Zhao
- Abstract summary: We propose a novel \textsc{Admeta} (\textbf{A} \textbf{D}ouble exponential \textbf{M}oving averag\textbf{E} \textbf{T}o \textbf{A}daptive and non-adaptive momentum) framework.
We provide two implementations, \textsc{AdmetaR} and \textsc{AdmetaS}, the former based on RAdam and the latter based on SGDM.
- Score: 109.52244418498974
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The optimizer is an essential component for the success of deep learning; it
guides the neural network to update its parameters according to the loss on the
training set. SGD and Adam are two classical and effective optimizers on which
researchers have proposed many variants, such as SGDM and RAdam. In this paper,
we innovatively combine the backward-looking and forward-looking aspects of the
optimizer algorithm and propose a novel \textsc{Admeta} (\textbf{A}
\textbf{D}ouble exponential \textbf{M}oving averag\textbf{E} \textbf{T}o
\textbf{A}daptive and non-adaptive momentum) optimizer framework. For the
backward-looking part, we propose a DEMA variant scheme, motivated by
a metric from the stock market, to replace the common exponential moving average
scheme. For the forward-looking part, we present a dynamic lookahead
strategy that asymptotically approaches a set value, maintaining speed in the
early stage and high convergence performance in the final stage. Based on this
idea, we provide two optimizer implementations, \textsc{AdmetaR} and
\textsc{AdmetaS}, the former based on RAdam and the latter based on SGDM.
Through extensive experiments on diverse tasks, we find that the proposed
\textsc{Admeta} optimizer outperforms our base optimizers and shows advantages
over recently proposed competitive optimizers. We also provide a theoretical
proof for these two algorithms, which verifies the convergence of the proposed
\textsc{Admeta}.
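The abstract describes the two components only at a high level, so the following is a minimal, self-contained sketch of how a DEMA-style momentum (backward-looking) and a dynamic lookahead coefficient (forward-looking) could be wired together. It is not the paper's AdmetaS or AdmetaR update: the DEMA form below is the textbook 2*EMA - EMA(EMA) from stock-market technical analysis, the schedule alpha_t = alpha_max * (1 - exp(-t / tau)) is an assumed stand-in for the paper's "asymptotically approaches a set value" strategy, and all hyperparameter values are illustrative.

```python
import numpy as np

def admeta_like_step(param, grad, state, lr=0.01, beta=0.9,
                     alpha_max=0.5, tau=1000.0):
    """One illustrative update combining a DEMA-style momentum with a
    dynamically growing lookahead coefficient.

    NOT the paper's exact algorithm: the DEMA here is the textbook
    2*EMA - EMA(EMA), the lookahead is applied every step (rather than
    every k steps), and the schedule for alpha_t is an assumption.
    """
    state.setdefault("step", 0)
    state.setdefault("ema", np.zeros_like(param))    # EMA of gradients
    state.setdefault("ema2", np.zeros_like(param))   # EMA of that EMA
    state.setdefault("slow", param.copy())           # slow (lookahead) weights

    state["step"] += 1
    t = state["step"]

    # Backward-looking: double exponential moving average of the gradient.
    state["ema"] = beta * state["ema"] + (1 - beta) * grad
    state["ema2"] = beta * state["ema2"] + (1 - beta) * state["ema"]
    dema = 2.0 * state["ema"] - state["ema2"]

    # Fast (inner) update driven by the DEMA momentum, as in SGDM.
    fast = param - lr * dema

    # Forward-looking: lookahead coefficient that asymptotically
    # approaches alpha_max as training proceeds.
    alpha_t = alpha_max * (1.0 - np.exp(-t / tau))
    state["slow"] = state["slow"] + alpha_t * (fast - state["slow"])

    # Return the interpolated weights as the new parameters.
    return state["slow"], state
```

A training loop would call admeta_like_step(param, grad, state) once per mini-batch and feed the returned parameters back in; a full Lookahead-style implementation would normally synchronize the slow weights only every k inner steps rather than every step.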
Related papers
- An Adaptive Dual-Domain Prediction Strategy based on Second-order Derivatives for Dynamic Multi-Objective Optimization [7.272641346606365]
This paper demonstrates new approaches to change prediction strategies within an evolutionary algorithm paradigm.
We propose a novel adaptive prediction strategy, which utilizes the concept of second-order derivatives adaptively in different domains.
We compare the performance of the proposed algorithm against four other state-of-the-art algorithms from the literature, using DMOPs benchmark problems.
arXiv Detail & Related papers (2024-10-08T08:13:49Z)
- Adam with model exponential moving average is effective for nonconvex optimization [45.242009309234305]
We offer a theoretical analysis of two modern optimization techniques for training large and complex models: (i) adaptive optimization algorithms such as Adam, and (ii) the exponential moving average (EMA) model; a minimal sketch of the weight-EMA idea appears after this list.
arXiv Detail & Related papers (2024-05-28T14:08:04Z)
- SGD with Partial Hessian for Deep Neural Networks Optimization [18.78728272603732]
We propose a compound optimizer, which combines a second-order optimizer with a precise partial Hessian matrix for updating channel-wise parameters and the first-order stochastic gradient descent (SGD) optimizer for updating the other parameters.
Compared with first-order optimizers, it adopts a certain amount of information from the Hessian matrix to assist optimization; compared with existing second-order optimizers, it keeps the good generalization performance of first-order optimizers.
arXiv Detail & Related papers (2024-03-05T06:10:21Z)
- MADA: Meta-Adaptive Optimizers through hyper-gradient Descent [73.1383658672682]
We introduce Meta-Adaptive Optimizers (MADA), a unified framework that can generalize several known optimizers and dynamically learn the most suitable one during training.
We empirically compare MADA to other popular optimizers on vision and language tasks, and find that MADA consistently outperforms Adam and the other popular optimizers.
We also propose AVGrad, a modification of AMSGrad that replaces the maximum operator with averaging, which is more suitable for hyper-gradient optimization.
arXiv Detail & Related papers (2024-01-17T00:16:46Z)
- Backpropagation of Unrolled Solvers with Folded Optimization [55.04219793298687]
The integration of constrained optimization models as components in deep networks has led to promising advances on many specialized learning tasks.
One typical strategy is algorithm unrolling, which relies on automatic differentiation through the operations of an iterative solver.
This paper provides theoretical insights into the backward pass of unrolled optimization, leading to a system for generating efficiently solvable analytical models of backpropagation.
arXiv Detail & Related papers (2023-01-28T01:50:42Z)
- Moment Centralization based Gradient Descent Optimizers for Convolutional Neural Networks [12.90962626557934]
Convolutional neural networks (CNNs) have shown very appealing performance for many computer vision applications.
In this paper, we propose a moment centralization-based SGD optimizer for CNNs.
The proposed moment centralization is generic in nature and can be integrated with any of the existing adaptive momentum-based optimizers.
arXiv Detail & Related papers (2022-07-19T04:38:01Z)
- RoMA: Robust Model Adaptation for Offline Model-based Optimization [115.02677045518692]
We consider the problem of searching for an input that maximizes a black-box objective function, given a static dataset of input-output queries.
A popular approach to solving this problem is maintaining a proxy model that approximates the true objective function.
Here, the main challenge is how to avoid adversarially optimized inputs during the search.
arXiv Detail & Related papers (2021-10-27T05:37:12Z)
- Meta-Learning with Neural Tangent Kernels [58.06951624702086]
We propose the first meta-learning paradigm in the Reproducing Kernel Hilbert Space (RKHS) induced by the meta-model's Neural Tangent Kernel (NTK).
Within this paradigm, we introduce two meta-learning algorithms, which no longer need a sub-optimal iterative inner-loop adaptation as in the MAML framework.
We achieve this goal by 1) replacing the adaptation with a fast-adaptive regularizer in the RKHS; and 2) solving the adaptation analytically based on the NTK theory.
arXiv Detail & Related papers (2021-02-07T20:53:23Z)
- Bilevel Optimization: Convergence Analysis and Enhanced Design [63.64636047748605]
Bilevel optimization is a tool for many machine learning problems.
We propose a novel stochastic bilevel optimizer named stocBiO, which features a sample-efficient hypergradient estimator.
arXiv Detail & Related papers (2020-10-15T18:09:48Z)
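As a side note on the "Adam with model exponential moving average" entry above, the weight-EMA idea it analyzes can be sketched in a few lines. This is a generic illustration rather than that paper's method; the decay value and the dict-of-arrays representation are assumptions.

```python
import numpy as np

def ema_update(ema_params, params, decay=0.999):
    """Maintain an exponential moving average of the model parameters.

    ema_params and params are dicts of name -> numpy array; decay=0.999
    is an illustrative default, not a value taken from the cited paper.
    """
    for name, p in params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * p
    return ema_params

# Typical usage: after each optimizer step (e.g., an Adam update of
# `params`), call ema_update(ema_params, params); evaluate with
# ema_params rather than the raw training weights.
```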
This list is automatically generated from the titles and abstracts of the papers on this site.