Frankenstein Optimizer: Harnessing the Potential by Revisiting Optimization Tricks
- URL: http://arxiv.org/abs/2503.02147v1
- Date: Tue, 04 Mar 2025 00:25:54 GMT
- Title: Frankenstein Optimizer: Harnessing the Potential by Revisiting Optimization Tricks
- Authors: Chia-Wei Hsu, Nien-Ti Tsou, Yu-Cheng Chen, Yang Jeong Park, Ju Li
- Abstract summary: We propose the Frankenstein optimizer, which combines the mechanisms of various adaptive algorithms. We show that Frankenstein surpasses existing adaptive algorithms and stochastic gradient descent (SGD). This research deepens our understanding of adaptive algorithms through centered kernel alignment analysis and loss landscape visualization during the learning process.
- Score: 2.932254642052481
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Gradient-based optimization drives the unprecedented performance of modern deep neural network models across diverse applications. Adaptive algorithms have accelerated neural network training due to their rapid convergence rates; however, they struggle to find "flat minima" reliably, resulting in suboptimal generalization compared to stochastic gradient descent (SGD). By revisiting various adaptive algorithms' mechanisms, we propose the Frankenstein optimizer, which combines their advantages. The proposed Frankenstein dynamically adjusts first- and second-momentum coefficients according to the optimizer's current state to directly maintain consistent learning dynamics and immediately reflect sudden gradient changes. Extensive experiments across several research domains such as computer vision, natural language processing, few-shot learning, and scientific simulations show that Frankenstein surpasses existing adaptive algorithms and SGD empirically regarding convergence speed and generalization performance. Furthermore, this research deepens our understanding of adaptive algorithms through centered kernel alignment analysis and loss landscape visualization during the learning process.
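The abstract describes the mechanism only at a high level (state-dependent first- and second-momentum coefficients); the exact Frankenstein update rule is not given here. As a rough illustration of that idea, the NumPy sketch below performs an Adam-style step whose first-momentum coefficient shrinks when the incoming gradient disagrees with the accumulated momentum, so sudden gradient changes are reflected immediately. The alignment-based schedule for `beta1_t` is an assumption made for this sketch, not the authors' formula.

```python
import numpy as np

def adaptive_momentum_step(param, grad, state, lr=1e-3,
                           beta1_max=0.9, beta2=0.999, eps=1e-8):
    """One Adam-style step with a state-dependent first-momentum coefficient.

    The schedule for beta1_t below (scaling by the alignment between the
    current gradient and the stored momentum) is purely illustrative and is
    not the Frankenstein update rule.
    """
    m, v, t = state["m"], state["v"], state["t"] + 1

    # Alignment in [0, 1]: near 1 when grad agrees with momentum, near 0 when opposed.
    denom = np.linalg.norm(grad) * np.linalg.norm(m) + eps
    align = 0.5 * (1.0 + np.dot(grad.ravel(), m.ravel()) / denom)

    # Shrink the first-momentum coefficient when the gradient direction changes
    # abruptly, so the update reacts to the new gradient immediately.
    beta1_t = beta1_max * align

    m = beta1_t * m + (1.0 - beta1_t) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2

    m_hat = m / (1.0 - beta1_max ** t)   # standard Adam-style bias correction
    v_hat = v / (1.0 - beta2 ** t)

    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    state.update(m=m, v=v, t=t)
    return param, state

# Toy usage: minimize 0.5 * ||x||^2, whose gradient is x itself.
x = np.array([1.0, -2.0])
state = {"m": np.zeros_like(x), "v": np.zeros_like(x), "t": 0}
for _ in range(200):
    x, state = adaptive_momentum_step(x, x, state, lr=0.05)
print(x)  # driven toward [0, 0], up to oscillations on the order of lr
```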
Related papers
- Enhancing CNN Classification with Lamarckian Memetic Algorithms and Local Search [0.0]
We propose a novel approach integrating a two-stage training technique with population-based optimization algorithms incorporating local search capabilities.
Our experiments demonstrate that the proposed method outperforms state-of-the-art gradient-based techniques.
arXiv Detail & Related papers (2024-10-26T17:31:15Z)
- An Automatic Learning Rate Schedule Algorithm for Achieving Faster Convergence and Steeper Descent [10.061799286306163]
We investigate the convergence behavior of the delta-bar-delta algorithm in real-world neural network optimization.
To address potential convergence challenges, we propose a novel approach called RDBD (Regrettable Delta-Bar-Delta).
Our approach allows for prompt correction of biased learning rate adjustments and ensures the convergence of the optimization process.
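For background, the delta-bar-delta rule that RDBD builds on keeps a per-parameter learning rate that grows additively while the current gradient agrees in sign with a smoothed trace of past gradients and shrinks multiplicatively when it disagrees. The sketch below shows only that classic rule with illustrative constants; it is not the RDBD correction proposed in the paper.

```python
import numpy as np

def delta_bar_delta_step(param, grad, state, kappa=0.01, phi=0.5, theta=0.7):
    """One step of the classic delta-bar-delta rule (illustrative constants).

    Per-parameter learning rates grow by `kappa` when the current gradient
    agrees in sign with a smoothed past gradient, and shrink by the factor
    `phi` when it disagrees.
    """
    lr, bar = state["lr"], state["bar"]       # per-parameter LR and smoothed gradient
    agree = grad * bar > 0                    # sign agreement with the past
    disagree = grad * bar < 0

    lr = np.where(agree, lr + kappa, lr)      # additive increase
    lr = np.where(disagree, lr * phi, lr)     # multiplicative decrease

    bar = (1.0 - theta) * grad + theta * bar  # exponential trace of past gradients
    param = param - lr * grad
    state.update(lr=lr, bar=bar)
    return param, state

# Toy usage: minimize 0.5 * ||x||^2, whose gradient is x itself.
x = np.array([3.0, -1.5])
state = {"lr": np.full_like(x, 0.05), "bar": np.zeros_like(x)}
for _ in range(100):
    x, state = delta_bar_delta_step(x, x, state)
print(x)  # close to [0, 0] after 100 steps
```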
arXiv Detail & Related papers (2023-10-17T14:15:57Z)
- Towards Theoretically Inspired Neural Initialization Optimization [66.04735385415427]
We propose a differentiable quantity, named GradCosine, with theoretical insights to evaluate the initial state of a neural network.
We show that both the training and test performance of a network can be improved by maximizing GradCosine under norm constraint.
Generalized from the sample-wise analysis to the real batch setting, the resulting Neural Initialization Optimization (NIO) algorithm is able to automatically find a better initialization with negligible cost.
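The summary does not spell out how GradCosine is computed; a natural reading, assumed here, is the average cosine similarity between sample-wise gradients at initialization. The sketch below computes such a score for a toy linear regression; the model, loss, and the helper names `per_sample_grads` and `grad_cosine` are illustrative choices rather than the paper's definitions.

```python
import numpy as np

def per_sample_grads(w, X, y):
    """Per-sample gradients of the squared error 0.5 * (x.w - y)^2 w.r.t. w."""
    residuals = X @ w - y          # shape (n,)
    return residuals[:, None] * X  # shape (n, d): one gradient row per sample

def grad_cosine(w, X, y, eps=1e-12):
    """Average pairwise cosine similarity of sample-wise gradients.

    Higher values mean the per-sample gradients point in similar directions,
    used here as a simple proxy score for the initial state.
    """
    G = per_sample_grads(w, X, y)
    G = G / (np.linalg.norm(G, axis=1, keepdims=True) + eps)  # normalize rows
    S = G @ G.T                                               # pairwise cosines
    n = len(G)
    return (S.sum() - n) / (n * (n - 1))                      # exclude self-similarity

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=32)

# Compare two random initializations by their GradCosine-style score.
w_a, w_b = rng.normal(size=5), rng.normal(size=5)
print(grad_cosine(w_a, X, y), grad_cosine(w_b, X, y))
```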
arXiv Detail & Related papers (2022-10-12T06:49:16Z) - SUPER-ADAM: Faster and Universal Framework of Adaptive Gradients [99.13839450032408]
It is desirable to design a universal framework for adaptive algorithms that can solve general problems.
In particular, our novel framework provides convergence analysis for adaptive methods in the nonconvex setting.
arXiv Detail & Related papers (2021-06-15T15:16:28Z) - Meta-Regularization: An Approach to Adaptive Choice of the Learning Rate
in Gradient Descent [20.47598828422897]
We propose Meta-Regularization, a novel approach for the adaptive choice of the learning rate in first-order descent methods.
Our approach modifies the objective function by adding a regularization term and casts the joint updating of the parameters and the learning rate as a single process.
arXiv Detail & Related papers (2021-04-12T13:13:34Z) - A Dynamical View on Optimization Algorithms of Overparameterized Neural
Networks [23.038631072178735]
We consider a broad class of optimization algorithms that are commonly used in practice.
As a consequence, we can characterize the convergence behavior of these algorithms on overparameterized neural networks.
We believe our approach can also be extended to other optimization algorithms and network architectures.
arXiv Detail & Related papers (2020-10-25T17:10:22Z)
- Adaptive Gradient Method with Resilience and Momentum [120.83046824742455]
We propose an Adaptive Gradient Method with Resilience and Momentum (AdaRem).
AdaRem adjusts the parameter-wise learning rate according to whether the direction in which a parameter changed in the past is aligned with the direction of the current gradient.
Our method outperforms previous adaptive learning rate-based algorithms in terms of the training speed and the test error.
arXiv Detail & Related papers (2020-10-21T14:49:00Z)
- Communication-Efficient Distributed Stochastic AUC Maximization with Deep Neural Networks [50.42141893913188]
We study distributed stochastic AUC maximization for large-scale data with a deep neural network as the predictive model.
Our algorithm requires a much smaller number of communication rounds in theory.
Experiments on several benchmark datasets demonstrate the effectiveness of our algorithm and corroborate our theory.
arXiv Detail & Related papers (2020-05-05T18:08:23Z)
- Learning to be Global Optimizer [28.88646928299302]
We learn an optimization network with local-search and escaping capabilities on a set of benchmark functions.
We show that the learned algorithm significantly outperforms some well-known classical optimization algorithms.
arXiv Detail & Related papers (2020-03-10T03:46:25Z)
- Large Batch Training Does Not Need Warmup [111.07680619360528]
Training deep neural networks using a large batch size has shown promising results and benefits many real-world applications.
In this paper, we propose a novel Complete Layer-wise Adaptive Rate Scaling (CLARS) algorithm for large-batch training.
Based on our analysis, we bridge the gap and illustrate the theoretical insights for three popular large-batch training techniques.
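The CLARS algorithm itself is not detailed in this summary. As background, layer-wise adaptive rate scaling methods of this family give each layer a local learning rate proportional to the ratio of its weight norm to its gradient norm; the sketch below shows that generic LARS-style scaling with illustrative constants, not the CLARS algorithm.

```python
import numpy as np

def layerwise_scaled_step(layers, grads, base_lr=1.0, trust=0.02,
                          weight_decay=1e-4, eps=1e-8):
    """One generic LARS-style layer-wise update (illustrative, not CLARS itself).

    Each layer gets a local learning rate proportional to the ratio of its
    weight norm to its gradient norm, so layers with very different scales
    take steps of similar relative size.
    """
    new_layers = []
    for w, g in zip(layers, grads):
        w_norm, g_norm = np.linalg.norm(w), np.linalg.norm(g)
        local_lr = trust * w_norm / (g_norm + weight_decay * w_norm + eps)
        new_layers.append(w - base_lr * local_lr * (g + weight_decay * w))
    return new_layers

# Toy usage: two "layers" of very different scale, gradient equal to the weights.
layers = [np.ones((4, 4)), 0.1 * np.ones(4)]
for _ in range(50):
    layers = layerwise_scaled_step(layers, grads=[l.copy() for l in layers])
print([float(np.linalg.norm(l)) for l in layers])
# Both norms shrink at a similar relative rate despite the scale difference.
```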
arXiv Detail & Related papers (2020-02-04T23:03:12Z)
- Towards Better Understanding of Adaptive Gradient Algorithms in Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper we analyze a variant of the Optimistic Adagrad (OAdagrad) algorithm for nonconvex-nonconcave min-max problems.
Our experiments show that the advantage of adaptive gradient algorithms over non-adaptive ones in GAN training can be observed empirically.
arXiv Detail & Related papers (2019-12-26T22:10:10Z)