Dynamic Batch Adaptation
- URL: http://arxiv.org/abs/2208.00815v1
- Date: Mon, 1 Aug 2022 12:52:09 GMT
- Title: Dynamic Batch Adaptation
- Authors: Cristian Simionescu, George Stoica, Robert Herscovici
- Abstract summary: Current deep learning adaptive methods adjust the step magnitude of parameter updates by altering the effective learning rate used by each parameter.
Motivated by the known inverse relation between batch size and learning rate on update step magnitudes, we introduce a novel training procedure that dynamically decides the dimension and the composition of the current update step.
- Score: 2.861848675707603
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current deep learning adaptive optimization methods adjust the step magnitude of parameter updates by altering the effective learning rate used by each parameter. Motivated by the known inverse relation between batch size and learning rate on update step magnitudes, we introduce a novel training procedure that dynamically decides the dimension and the composition of the current update step. Our procedure, Dynamic Batch Adaptation (DBA), analyzes the gradients of every sample and selects the subset that best improves certain metrics, such as gradient variance, for each layer of the network. We present results showing that DBA significantly improves the speed of model convergence. Additionally, we find that DBA yields a larger improvement over standard optimizers when used in data-scarce conditions, where, in addition to convergence speed, it also significantly improves model generalization, managing to train a network with a single fully connected hidden layer using only 1% of the MNIST dataset to reach 97.79% test accuracy. In an even more extreme scenario, it reaches 97.44% test accuracy using only 10 samples per class. These results represent relative error rate reductions of 81.78% and 88.07%, respectively, compared to the standard optimizers Stochastic Gradient Descent (SGD) and Adam.
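The abstract describes DBA only at a high level: per-sample gradients are analyzed and a subset of the batch is selected per layer using metrics such as gradient variance. The PyTorch sketch below is a minimal illustration of that idea under assumed details (a naive per-sample gradient loop, a keep ratio of 0.5, and a greedy rule that keeps the samples whose layer gradients lie closest to the mean); it is not the authors' implementation.

```python
# Minimal, illustrative sketch of the DBA idea (NOT the authors' code):
# compute per-sample gradients for one layer, score each sample by how much
# it contributes to the layer's gradient variance, keep the lowest-variance
# subset, and update with the gradient of that subset only.
# `keep_ratio` and the selection rule are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
lr, keep_ratio = 0.1, 0.5

# 1) per-sample gradients of the first layer's weight (naive loop for clarity)
layer = model[0]
per_sample_grads = []
for i in range(x.size(0)):
    loss_i = F.cross_entropy(model(x[i:i + 1]), y[i:i + 1])
    g = torch.autograd.grad(loss_i, layer.weight)[0]
    per_sample_grads.append(g.detach().flatten())
G = torch.stack(per_sample_grads)                # (batch, n_params)

# 2) score samples by squared distance to the mean gradient
#    (samples far from the mean inflate the gradient variance)
dist = ((G - G.mean(dim=0)) ** 2).sum(dim=1)
keep = dist.argsort()[: int(keep_ratio * x.size(0))]

# 3) SGD step on the whole model using only the selected subset
loss = F.cross_entropy(model(x[keep]), y[keep])
model.zero_grad()
loss.backward()
with torch.no_grad():
    for p in model.parameters():
        p -= lr * p.grad
print(f"kept {keep.numel()} of {x.size(0)} samples, loss {loss.item():.4f}")
```

In practice the per-sample gradient loop would be vectorized (for example with torch.func.grad and torch.func.vmap), and the selection metric and subset size would follow the paper rather than the assumptions above.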
Related papers
- Stochastic Re-weighted Gradient Descent via Distributionally Robust Optimization [14.23697277904244]
We present Reweighted Gradient Descent (RGD), a novel optimization technique that improves the performance of deep neural networks through dynamic sample re-weighting.
We demonstrate the effectiveness of RGD on various learning tasks, including supervised learning, meta-learning, and out-of-domain generalization.
arXiv Detail & Related papers (2023-06-15T15:58:04Z)
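As a rough illustration of the dynamic sample re-weighting idea in the RGD entry above, the sketch below upweights high-loss samples with a softmax over per-sample losses, a common distributionally robust heuristic. The temperature tau and the exact weighting rule are assumptions for illustration, not the paper's update.

```python
# Illustrative loss-based sample re-weighting in the spirit of the RGD entry
# above (a generic DRO-style scheme, not necessarily the paper's exact rule).
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(20, 3)
x, y = torch.randn(64, 20), torch.randint(0, 3, (64,))
tau, lr = 1.0, 0.1                      # assumed temperature and step size

per_sample_loss = F.cross_entropy(model(x), y, reduction="none")   # (64,)
weights = torch.softmax(per_sample_loss.detach() / tau, dim=0)     # upweight hard samples
loss = (weights * per_sample_loss).sum()

model.zero_grad()
loss.backward()
with torch.no_grad():
    for p in model.parameters():
        p -= lr * p.grad
```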
- Read the Signs: Towards Invariance to Gradient Descent's Hyperparameter Initialization [3.1153758106426603]
We propose ActiveLR, an optimization meta-algorithm that localizes the learning rate, $\alpha$, and adapts it at each epoch according to whether the gradient changes sign or not.
We implement the Active version (ours) of widely used and recently published gradient descent optimizers, namely SGD with momentum, AdamW, RAdam, and AdaBelief.
arXiv Detail & Related papers (2023-01-24T16:57:00Z)
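The ActiveLR entry above describes adapting localized learning rates each epoch based on whether the gradient changes sign. Below is a generic sign-based adaptation sketch in PyTorch; the grow/shrink factors (1.2 and 0.5) and the toy objective are assumptions, and this is not the paper's exact algorithm.

```python
# Generic sign-based per-parameter learning-rate adaptation, illustrative only.
import torch

def adapt_lrs(lrs, grad, prev_grad, grow=1.2, shrink=0.5):
    """Grow the per-parameter rate where the gradient sign is unchanged,
    shrink it where the sign flipped since the previous epoch."""
    same_sign = (grad * prev_grad) > 0
    return torch.where(same_sign, lrs * grow, lrs * shrink)

w = torch.randn(5, requires_grad=True)
lrs = torch.full_like(w, 0.01)
prev_grad = None
for epoch in range(3):
    loss = ((w - 1.0) ** 2).sum()       # toy objective standing in for one epoch
    grad, = torch.autograd.grad(loss, w)
    if prev_grad is not None:
        lrs = adapt_lrs(lrs, grad, prev_grad)
    with torch.no_grad():
        w -= lrs * grad
    prev_grad = grad
    print(epoch, loss.item())
```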
- Input Normalized Stochastic Gradient Descent Training of Deep Neural Networks [2.1485350418225244]
In this paper, we propose a novel optimization algorithm for training machine learning models called Input Normalized Stochastic Gradient Descent (INSGD).
Our algorithm updates the network weights using gradient descent with $\ell_1$- and $\ell_2$-based normalizations applied to the learning rate, similar to the NLMS (normalized least mean squares) algorithm.
We evaluate the efficiency of our training algorithm on benchmark datasets using ResNet-18, WResNet-20, ResNet-50, and a toy neural network.
arXiv Detail & Related papers (2022-12-20T00:08:37Z)
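The INSGD entry above describes normalizing the learning rate by the input, in the spirit of NLMS. The sketch below shows a classic NLMS-style update for a single linear neuron, dividing the step by the input energy $\|x\|^2$; it illustrates the underlying idea rather than the paper's algorithm, and the step size mu is an assumed value.

```python
# Classic NLMS-style input-normalized update for one linear neuron,
# illustrating the general idea behind the INSGD entry (not the paper's code).
import torch
import torch.nn.functional as F

torch.manual_seed(0)
W = torch.zeros(1, 8)                       # weights of a single linear neuron
mu, eps = 0.5, 1e-8                         # assumed step size and stabilizer

for _ in range(200):
    x = torch.randn(8)                      # input sample
    target = x.sum()                        # toy regression target (true W = ones)
    err = target - W @ x
    # NLMS: scale the step by the inverse input energy ||x||^2
    W += mu * err * x / (x.dot(x) + eps)

print(F.mse_loss(W, torch.ones(1, 8)))      # W should approach all-ones
```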
- Exploiting Adam-like Optimization Algorithms to Improve the Performance of Convolutional Neural Networks [82.61182037130405]
Stochastic gradient descent (SGD) is the main approach for training deep networks.
In this work, we compare Adam-based variants built on the difference between the present and the past gradients.
We test ensembles of networks and their fusion with a ResNet50 trained with SGD.
arXiv Detail & Related papers (2021-03-26T18:55:08Z)
- Adaptive Gradient Method with Resilience and Momentum [120.83046824742455]
We propose an Adaptive Gradient Method with Resilience and Momentum (AdaRem).
AdaRem adjusts the parameter-wise learning rate according to whether the direction in which a parameter changed in the past is aligned with the direction of the current gradient.
Our method outperforms previous adaptive learning rate-based algorithms in terms of the training speed and the test error.
arXiv Detail & Related papers (2020-10-21T14:49:00Z)
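The AdaRem entry above describes scaling each parameter's learning rate by whether its past change direction agrees with the current gradient. The sketch below keeps an exponential moving average of past gradients and boosts or damps the step based on sign agreement; the scaling rule and constants are assumptions, not the paper's method.

```python
# Illustrative alignment-based per-parameter step scaling in the spirit of the
# AdaRem entry (EMA of past gradients; boost aligned steps, damp opposed ones).
import torch

w = torch.randn(10, requires_grad=True)
ema = torch.zeros_like(w)                   # moving average of past gradients
lr, beta = 0.1, 0.9

for step in range(50):
    loss = ((w - 2.0) ** 2).sum()           # toy objective
    grad, = torch.autograd.grad(loss, w)
    align = torch.sign(ema) * torch.sign(grad)   # +1 aligned, -1 opposed, 0 unknown
    scale = 1.0 + 0.5 * align                    # 1.5x if aligned, 0.5x if opposed
    with torch.no_grad():
        w -= lr * scale * grad
    ema = beta * ema + (1 - beta) * grad

print(loss.item())
```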
- Evaluating Prediction-Time Batch Normalization for Robustness under Covariate Shift [81.74795324629712]
We study prediction-time batch normalization, which significantly improves model accuracy and calibration under covariate shift.
We show that prediction-time batch normalization provides complementary benefits to existing state-of-the-art approaches for improving robustness.
The method has mixed results when used alongside pre-training, and does not seem to perform as well under more natural types of dataset shift.
arXiv Detail & Related papers (2020-06-19T05:08:43Z)
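Prediction-time batch normalization, as described above, amounts to computing BatchNorm statistics from the incoming test batch instead of the stored training-time running averages. A minimal PyTorch illustration (switching only the BN layers back to training mode for the forward pass) is shown below; the tiny model and shifted test batch are stand-ins for illustration.

```python
# Prediction-time batch normalization, minimally illustrated in PyTorch:
# use statistics of the incoming test batch instead of the stored running
# averages when covariate shift is expected.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.BatchNorm1d(32), nn.ReLU(), nn.Linear(32, 4))
test_batch = torch.randn(128, 16) * 3.0 + 1.0      # shifted/rescaled inputs

# Standard evaluation: BN uses running mean/var gathered during training.
model.eval()
with torch.no_grad():
    preds_standard = model(test_batch)

# Prediction-time BN: switch BN layers to current-batch statistics.
for m in model.modules():
    if isinstance(m, nn.BatchNorm1d):
        m.train()                                   # use the test batch's statistics
with torch.no_grad():
    preds_prediction_time = model(test_batch)

print((preds_standard - preds_prediction_time).abs().mean())
```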
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
- Dynamic Scale Training for Object Detection [111.33112051962514]
We propose a Dynamic Scale Training paradigm (abbreviated as DST) to mitigate scale variation challenge in object detection.
Experimental results demonstrate the efficacy of the proposed DST for handling scale variation.
It does not introduce inference overhead and could serve as a free lunch for general detection configurations.
arXiv Detail & Related papers (2020-04-26T16:48:17Z)
- ScopeFlow: Dynamic Scene Scoping for Optical Flow [94.42139459221784]
We propose to modify the common training protocols of optical flow models.
The improvement is based on observing the bias in sampling challenging data.
We find that both regularization and augmentation should decrease during the training protocol.
arXiv Detail & Related papers (2020-02-25T09:58:49Z)