Mixing ADAM and SGD: a Combined Optimization Method
- URL: http://arxiv.org/abs/2011.08042v1
- Date: Mon, 16 Nov 2020 15:48:38 GMT
- Title: Mixing ADAM and SGD: a Combined Optimization Method
- Authors: Nicola Landro, Ignazio Gallo, Riccardo La Grassa
- Abstract summary: We propose a new type of optimization method called MAS (Mixing ADAM and SGD)
Rather than trying to improve SGD or ADAM, we exploit both at the same time by taking the best of both.
We conducted several experiments on image and text document classification, using various CNNs, and demonstrated experimentally that the proposed MAS produces better performance than the single SGD or ADAM optimizers.
- Score: 0.9569316316728905
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Optimization methods (optimizers) get special attention for the efficient
training of neural networks in the field of deep learning. In the literature there are many papers that compare neural models trained with different optimizers. Each paper demonstrates that for a particular problem one optimizer is better than the others, but as soon as the problem changes this kind of result no longer holds and we have to start from scratch. In this paper we propose combining two very different optimizers that, when used simultaneously, can outperform either single optimizer across very different problems. We propose a new optimizer called MAS (Mixing ADAM and SGD) that integrates SGD and ADAM simultaneously by weighing the contributions of both through the assignment of constant weights. Rather than trying to improve SGD or ADAM, we exploit both at the same time by taking the best of both. We conducted several experiments on image and text document classification, using various CNNs, and demonstrated experimentally that the proposed MAS optimizer produces better performance than the single SGD or ADAM optimizers. The source code and all the results of the experiments are available online at https://gitlab.com/nicolalandro/multi_optimizer
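The mixing rule described in the abstract, a constant-weight combination of the SGD and ADAM contributions, can be illustrated with a minimal PyTorch-style sketch. This is not the authors' implementation (that is in the repository linked above): the single learning rate, the equal default weights lambda_sgd and lambda_adam, and the plain (momentum-free) SGD direction are simplifications made for this sketch.

```python
import torch


class MixedSGDAdam(torch.optim.Optimizer):
    """Illustrative sketch: mix the SGD and Adam update directions with
    constant weights. Not the paper's exact MAS formulation."""

    def __init__(self, params, lr=1e-3, lambda_sgd=0.5, lambda_adam=0.5,
                 betas=(0.9, 0.999), eps=1e-8):
        defaults = dict(lr=lr, lambda_sgd=lambda_sgd, lambda_adam=lambda_adam,
                        betas=betas, eps=eps)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None if closure is None else closure()
        for group in self.param_groups:
            lr, eps = group["lr"], group["eps"]
            l_sgd, l_adam = group["lambda_sgd"], group["lambda_adam"]
            beta1, beta2 = group["betas"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                g = p.grad
                state = self.state[p]
                if len(state) == 0:
                    state["step"] = 0
                    state["exp_avg"] = torch.zeros_like(p)
                    state["exp_avg_sq"] = torch.zeros_like(p)
                state["step"] += 1
                t = state["step"]
                m, v = state["exp_avg"], state["exp_avg_sq"]

                # Adam first/second moment estimates with bias correction.
                m.mul_(beta1).add_(g, alpha=1 - beta1)
                v.mul_(beta2).addcmul_(g, g, value=1 - beta2)
                m_hat = m / (1 - beta1 ** t)
                v_hat = v / (1 - beta2 ** t)

                sgd_dir = g                              # plain SGD direction
                adam_dir = m_hat / (v_hat.sqrt() + eps)  # Adam direction

                # Constant-weight mix of the two update directions.
                p.add_(l_sgd * sgd_dir + l_adam * adam_dir, alpha=-lr)
        return loss


# Hypothetical usage:
# opt = MixedSGDAdam(model.parameters(), lr=1e-3, lambda_sgd=0.4, lambda_adam=0.6)
```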
Related papers
- Unlearning as multi-task optimization: A normalized gradient difference approach with an adaptive learning rate [105.86576388991713]
We introduce a normalized gradient difference (NGDiff) algorithm, enabling us to have better control over the trade-off between the objectives.
We provide a theoretical analysis and empirically demonstrate the superior performance of NGDiff among state-of-the-art unlearning methods on the TOFU and MUSE datasets.
arXiv Detail & Related papers (2024-10-29T14:41:44Z) - Should I try multiple optimizers when fine-tuning pre-trained
Transformers for NLP tasks? Should I tune their hyperparameters? [14.349943044268471]
Stochastic Gradient Descent (SGD) or an adaptive optimizer is typically employed to select neural models for training.
Tuning just the learning rate is in most cases as good as tuning all the hyperparameters.
We recommend picking any of the best-behaved adaptive optimizers (e.g., Adam) and tuning its learning rate.
arXiv Detail & Related papers (2024-02-10T13:26:14Z) - MADA: Meta-Adaptive Optimizers through hyper-gradient Descent [73.1383658672682]
We introduce Meta-Adaptive Optimizers (MADA), a unified framework that can generalize several known optimizers and dynamically learn the most suitable one during training.
We empirically compare MADA to other popular optimizers on vision and language tasks, and find that MADA consistently outperforms Adam and other popular optimizers.
We also propose AVGrad, a modification of AMSGrad that replaces the maximum operator with averaging, which is more suitable for hyper-gradient optimization (see the sketch after this list of related papers).
arXiv Detail & Related papers (2024-01-17T00:16:46Z) - Multimodal Optimization with k-Cluster Big Bang-Big Crunch Algorithm and Postprocessing Methods for Identification and Quantification of Optima [0.7639610349097473]
Multimodal optimization is often encountered in engineering problems, especially when different and alternative solutions are sought.
This paper investigates whether a less-known method, the Big Bang-Big Crunch (BBBC) algorithm, is suitable for multimodal optimization.
arXiv Detail & Related papers (2023-12-21T06:16:32Z) - Judging Adam: Studying the Performance of Optimization Methods on ML4SE
Tasks [2.8961929092154697]
We test the performance of various optimizers on deep learning models for source code.
We find that the choice of optimizer can have a significant impact on the model quality.
We suggest that the ML4SE community should consider using RAdam instead of Adam as the default optimizer for code-related deep learning tasks.
arXiv Detail & Related papers (2023-03-06T22:49:20Z) - Deep Negative Correlation Classification [82.45045814842595]
Existing deep ensemble methods naively train many different models and then aggregate their predictions.
We propose deep negative correlation classification (DNCC)
DNCC yields a deep classification ensemble where the individual estimator is both accurate and negatively correlated.
arXiv Detail & Related papers (2022-12-14T07:35:20Z) - VeLO: Training Versatile Learned Optimizers by Scaling Up [67.90237498659397]
We leverage the same scaling approach behind the success of deep learning to learn versatile optimizers.
We train an optimizer for deep learning which is itself a small neural network that ingests gradients and outputs parameter updates.
We open source our learned optimizer, meta-training code, the associated train and test data, and an extensive benchmark suite with baselines at velo-code.io.
arXiv Detail & Related papers (2022-11-17T18:39:07Z) - Optimizer Amalgamation [124.33523126363728]
We are motivated to study a new problem named Optimizer Amalgamation: how can we best combine a pool of "teacher" optimizers into a single "student" optimizer that can have stronger problem-specific performance?
First, we define three differentiable mechanisms to amalgamate a pool of analytical optimizers by gradient descent.
In order to reduce the variance of the amalgamation process, we also explore methods to stabilize it by perturbing the amalgamation target.
arXiv Detail & Related papers (2022-03-12T16:07:57Z) - Tasks, stability, architecture, and compute: Training more effective
learned optimizers, and using them to train themselves [53.37905268850274]
We introduce a new, neural network parameterized, hierarchical optimizer with access to additional features such as validation loss to enable automatic regularization.
Most learned optimizers have been trained on only a single task, or a small number of tasks.
We train our optimizer on thousands of tasks, making use of orders of magnitude more compute, resulting in optimizers that generalize better to unseen tasks.
arXiv Detail & Related papers (2020-09-23T16:35:09Z)
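As referenced in the MADA entry above, the difference between AMSGrad's maximum operator and the averaging used by AVGrad can be sketched as follows. This sketch is inferred only from the one-line summary: the helper names are hypothetical, and bias correction and the rest of the update are omitted.

```python
import torch


def amsgrad_second_moment(v_hat_prev: torch.Tensor, v_t: torch.Tensor) -> torch.Tensor:
    # AMSGrad: element-wise running maximum, giving a non-decreasing denominator.
    return torch.maximum(v_hat_prev, v_t)


def avgrad_second_moment(v_sum_prev: torch.Tensor, v_t: torch.Tensor, t: int):
    # AVGrad (as summarized): replace the maximum with a running average of v_t,
    # which varies more smoothly and is reported to suit hyper-gradient updates.
    v_sum = v_sum_prev + v_t
    return v_sum, v_sum / t
```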
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.