Mixing ADAM and SGD: a Combined Optimization Method
- URL: http://arxiv.org/abs/2011.08042v1
- Date: Mon, 16 Nov 2020 15:48:38 GMT
- Title: Mixing ADAM and SGD: a Combined Optimization Method
- Authors: Nicola Landro, Ignazio Gallo, Riccardo La Grassa
- Abstract summary: We propose a new type of optimization method called MAS (Mixing ADAM and SGD)
Rather than trying to improve SGD or ADAM, we exploit both at the same time by taking the best of both.
We conducted several experiments on image and text document classification, using various CNNs, and demonstrated experimentally that the proposed MAS produces better performance than the single SGD or ADAM optimizers.
- Score: 0.9569316316728905
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Optimization methods (optimizers) get special attention for the efficient
training of neural networks in the field of deep learning. In the literature there are many papers that compare neural models trained with different optimizers. Each paper demonstrates that for a particular problem one optimizer is better than the others, but as soon as the problem changes this kind of result no longer holds and we have to start from scratch. In this paper we propose combining two very different optimizers that, when used simultaneously, can outperform either single optimizer across very different problems. We propose a new optimizer called MAS (Mixing ADAM and SGD) that integrates SGD and ADAM simultaneously by weighing the contributions of both through the assignment of constant weights. Rather than trying to improve SGD or ADAM, we exploit both at the same time by taking the best of both. We conducted several experiments on image and text document classification, using various CNNs, and demonstrated experimentally that the proposed MAS optimizer produces better performance than the single SGD or ADAM optimizers. The source code and all the results of the experiments are available online at https://gitlab.com/nicolalandro/multi_optimizer
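The mixing rule described in the abstract, a constant-weight combination of the SGD and ADAM contributions, can be illustrated with a minimal PyTorch-style sketch. This is not the authors' implementation (that is in the repository linked above): the single learning rate, the equal default weights lambda_sgd and lambda_adam, and the plain (momentum-free) SGD direction are simplifications made for this sketch.

```python
import torch


class MixedSGDAdam(torch.optim.Optimizer):
    """Illustrative sketch: mix the SGD and Adam update directions with
    constant weights. Not the paper's exact MAS formulation."""

    def __init__(self, params, lr=1e-3, lambda_sgd=0.5, lambda_adam=0.5,
                 betas=(0.9, 0.999), eps=1e-8):
        defaults = dict(lr=lr, lambda_sgd=lambda_sgd, lambda_adam=lambda_adam,
                        betas=betas, eps=eps)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None if closure is None else closure()
        for group in self.param_groups:
            lr, eps = group["lr"], group["eps"]
            l_sgd, l_adam = group["lambda_sgd"], group["lambda_adam"]
            beta1, beta2 = group["betas"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                g = p.grad
                state = self.state[p]
                if len(state) == 0:
                    state["step"] = 0
                    state["exp_avg"] = torch.zeros_like(p)
                    state["exp_avg_sq"] = torch.zeros_like(p)
                state["step"] += 1
                t = state["step"]
                m, v = state["exp_avg"], state["exp_avg_sq"]

                # Adam first/second moment estimates with bias correction.
                m.mul_(beta1).add_(g, alpha=1 - beta1)
                v.mul_(beta2).addcmul_(g, g, value=1 - beta2)
                m_hat = m / (1 - beta1 ** t)
                v_hat = v / (1 - beta2 ** t)

                sgd_dir = g                              # plain SGD direction
                adam_dir = m_hat / (v_hat.sqrt() + eps)  # Adam direction

                # Constant-weight mix of the two update directions.
                p.add_(l_sgd * sgd_dir + l_adam * adam_dir, alpha=-lr)
        return loss


# Hypothetical usage:
# opt = MixedSGDAdam(model.parameters(), lr=1e-3, lambda_sgd=0.4, lambda_adam=0.6)
```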
Related papers
- Unlearning as multi-task optimization: A normalized gradient difference approach with an adaptive learning rate [105.86576388991713]
We introduce a normalized gradient difference (NGDiff) algorithm, enabling us to have better control over the trade-off between the objectives.
We provide a theoretical analysis and empirically demonstrate the superior performance of NGDiff among state-of-the-art unlearning methods on the TOFU and MUSE datasets.
arXiv Detail & Related papers (2024-10-29T14:41:44Z) - Should I try multiple optimizers when fine-tuning pre-trained
Transformers for NLP tasks? Should I tune their hyperparameters? [14.349943044268471]
Stochastic Gradient Descent (SGD) or an adaptive optimizer is typically employed to select neural models for training.
Tuning just the learning rate is in most cases as good as tuning all the hyperparameters.
We recommend picking any of the best-behaved adaptive optimizers (e.g., Adam) and tuning its learning rate.
arXiv Detail & Related papers (2024-02-10T13:26:14Z) - MADA: Meta-Adaptive Optimizers through hyper-gradient Descent [73.1383658672682]
We introduce Meta-Adaptive Optimizers (MADA), a unified framework that can generalize several known optimizers and dynamically learn the most suitable one during training.
We empirically compare MADA to other popular optimizers on vision and language tasks, and find that MADA consistently outperforms Adam and other popular optimizers.
We also propose AVGrad, a modification of AMSGrad that replaces the maximum operator with averaging, which is more suitable for hyper-gradient optimization (see the sketch after this list of related papers).
arXiv Detail & Related papers (2024-01-17T00:16:46Z) - Multimodal Optimization with k-Cluster Big Bang-Big Crunch Algorithm and Postprocessing Methods for Identification and Quantification of Optima [0.7639610349097473]
Multimodal optimization is often encountered in engineering problems, especially when different and alternative solutions are sought.
This paper investigates whether a less-known method, the Big Bang-Big Crunch (BBBC) algorithm, is suitable for multimodal optimization.
arXiv Detail & Related papers (2023-12-21T06:16:32Z) - Judging Adam: Studying the Performance of Optimization Methods on ML4SE
Tasks [2.8961929092154697]
We test the performance of various optimizers on deep learning models for source code.
We find that the choice of optimizer can have a significant impact on the model quality.
We suggest that the ML4SE community should consider using RAdam instead of Adam as the default optimizer for code-related deep learning tasks.
arXiv Detail & Related papers (2023-03-06T22:49:20Z) - Deep Negative Correlation Classification [82.45045814842595]
Existing deep ensemble methods naively train many different models and then aggregate their predictions.
We propose deep negative correlation classification (DNCC)
DNCC yields a deep classification ensemble where the individual estimator is both accurate and negatively correlated.
arXiv Detail & Related papers (2022-12-14T07:35:20Z) - VeLO: Training Versatile Learned Optimizers by Scaling Up [67.90237498659397]
We leverage the same scaling approach behind the success of deep learning to learn versatile optimizers.
We train an optimizer for deep learning which is itself a small neural network that ingests gradients and outputs parameter updates.
We open source our learned optimizer, meta-training code, the associated train and test data, and an extensive benchmark suite with baselines at velo-code.io.
arXiv Detail & Related papers (2022-11-17T18:39:07Z) - Optimizer Amalgamation [124.33523126363728]
We are motivated to study a new problem named Optimizer Amalgamation: how can we best combine a pool of "teacher" optimizers into a single "student" optimizer that can have stronger problem-specific performance?
First, we define three differentiable mechanisms to amalgamate a pool of analytical optimizers by gradient descent.
In order to reduce the variance of the amalgamation process, we also explore methods to stabilize it by perturbing the amalgamation target.
arXiv Detail & Related papers (2022-03-12T16:07:57Z) - Tasks, stability, architecture, and compute: Training more effective
learned optimizers, and using them to train themselves [53.37905268850274]
We introduce a new, neural network parameterized, hierarchical optimizer with access to additional features such as validation loss to enable automatic regularization.
Most learned optimizers have been trained on only a single task, or a small number of tasks.
We train our optimizer on thousands of tasks, making use of orders of magnitude more compute, resulting in optimizers that generalize better to unseen tasks.
arXiv Detail & Related papers (2020-09-23T16:35:09Z)
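As referenced in the MADA entry above, the difference between AMSGrad's maximum operator and the averaging used by AVGrad can be sketched as follows. This sketch is inferred only from the one-line summary: the helper names are hypothetical, and bias correction and the rest of the update are omitted.

```python
import torch


def amsgrad_second_moment(v_hat_prev: torch.Tensor, v_t: torch.Tensor) -> torch.Tensor:
    # AMSGrad: element-wise running maximum, giving a non-decreasing denominator.
    return torch.maximum(v_hat_prev, v_t)


def avgrad_second_moment(v_sum_prev: torch.Tensor, v_t: torch.Tensor, t: int):
    # AVGrad (as summarized): replace the maximum with a running average of v_t,
    # which varies more smoothly and is reported to suit hyper-gradient updates.
    v_sum = v_sum_prev + v_t
    return v_sum, v_sum / t
```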
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.