Scalable and Practical Natural Gradient for Large-Scale Deep Learning
- URL: http://arxiv.org/abs/2002.06015v1
- Date: Thu, 13 Feb 2020 11:55:37 GMT
- Title: Scalable and Practical Natural Gradient for Large-Scale Deep Learning
- Authors: Kazuki Osawa, Yohei Tsuji, Yuichiro Ueno, Akira Naruse, Chuan-Sheng
Foo, and Rio Yokota
- Abstract summary: SP-NGD scales to large mini-batch sizes with a negligible computational overhead as compared to first-order methods.
We demonstrate convergence to a top-1 validation accuracy of 75.4% in 5.5 minutes using a mini-batch size of 32,768 with 1,024 GPUs, as well as an accuracy of 74.9% with an extremely large mini-batch size of 131,072 in 873 steps of SP-NGD.
- Score: 19.220930193896404
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale distributed training of deep neural networks results in models
with worse generalization performance as a result of the increase in the
effective mini-batch size. Previous approaches attempt to address this problem
by varying the learning rate and batch size over epochs and layers, or ad hoc
modifications of batch normalization. We propose Scalable and Practical Natural
Gradient Descent (SP-NGD), a principled approach for training models that
allows them to attain similar generalization performance to models trained with
first-order optimization methods, but with accelerated convergence.
Furthermore, SP-NGD scales to large mini-batch sizes with a negligible
computational overhead as compared to first-order methods. We evaluated SP-NGD
on a benchmark task where highly optimized first-order methods are available as
references: training a ResNet-50 model for image classification on ImageNet. We
demonstrate convergence to a top-1 validation accuracy of 75.4% in 5.5 minutes
using a mini-batch size of 32,768 with 1,024 GPUs, as well as an accuracy of
74.9% with an extremely large mini-batch size of 131,072 in 873 steps of
SP-NGD.
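At its core, natural gradient descent preconditions the loss gradient with the inverse Fisher information matrix, updating θ ← θ − η F⁻¹∇L(θ). The NumPy sketch below illustrates that update with a naive empirical-Fisher estimate; it is a toy illustration of the idea only, not SP-NGD, which builds on layer-wise Kronecker-factored (K-FAC-style) approximations and distributed computation so that no full d × d matrix is ever formed.

```python
import numpy as np

def natural_gradient_step(params, per_example_grads, lr=0.1, damping=1e-3):
    """One natural-gradient update using a naive empirical Fisher estimate.

    params:            flat parameter vector, shape (d,)
    per_example_grads: per-example gradients, shape (n, d)
    Toy illustration only: SP-NGD instead uses layer-wise Kronecker-factored
    Fisher blocks so that a full (d, d) matrix is never materialized.
    """
    n, d = per_example_grads.shape
    mean_grad = per_example_grads.mean(axis=0)

    # Empirical Fisher: average outer product of per-example gradients,
    # plus Tikhonov damping to keep it invertible.
    fisher = per_example_grads.T @ per_example_grads / n + damping * np.eye(d)

    # Precondition the gradient: solve F @ direction = mean_grad.
    direction = np.linalg.solve(fisher, mean_grad)
    return params - lr * direction


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    params = rng.normal(size=8)
    grads = rng.normal(size=(32, 8))  # stand-in for per-example gradients
    print(natural_gradient_step(params, grads))
```

Forming and inverting the full Fisher is only feasible at this toy scale; the point of the paper is making the preconditioning cheap and distributed at ResNet-50 scale.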
Related papers
- Speeding Up Image Classifiers with Little Companions [5.9999780224657195]
Scaling up neural networks has been a key recipe for the success of large language and vision models.
We develop a simple two-pass Little-Big model: a lightweight "little" model first makes predictions for all samples, and only the harder samples are passed on to the "big" model.
Little-Big also speeds up InternImage-G-512 while achieving 90% ImageNet-1K top-1 accuracy.
arXiv Detail & Related papers (2024-06-24T20:11:46Z)
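The two-pass Little-Big scheme above can be read as a confidence-gated cascade: the cheap model classifies every sample, and the large model is consulted only for samples the cheap model is unsure about. The PyTorch sketch below shows that generic cascade pattern under this reading; the models, the confidence rule, and the threshold are placeholders rather than the paper's actual configuration.

```python
import torch

@torch.no_grad()
def two_pass_predict(little, big, images, threshold=0.9):
    """Generic little-then-big cascade (illustrative, not the paper's exact
    routing rule). `little` and `big` are classifiers returning logits of
    shape (batch, num_classes)."""
    probs = torch.softmax(little(images), dim=-1)
    confidence, preds = probs.max(dim=-1)

    # Re-run only the low-confidence samples through the big model.
    hard = confidence < threshold
    if hard.any():
        preds[hard] = big(images[hard]).argmax(dim=-1)
    return preds
```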
- AdAdaGrad: Adaptive Batch Size Schemes for Adaptive Gradient Methods [17.043034606088234]
We introduce AdAdaGrad's scalar variant AdAdaGradNorm, which increases batch sizes during training.
We also perform image classification experiments, highlighting the merits of our proposed strategies.
arXiv Detail & Related papers (2024-02-17T07:49:50Z)
- Pruning Large Language Models via Accuracy Predictor [0.0]
Large language models (LLMs) containing tens of billions of parameters (or even more) have demonstrated impressive capabilities in various NLP tasks.
We propose a novel pruning approach: first, a training set of architecture-accuracy pairs is established, and then a non-neural model is trained on it as an accuracy predictor.
arXiv Detail & Related papers (2023-09-18T06:38:24Z)
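The accuracy-predictor idea above amounts to supervised regression from an encoding of a pruned architecture to its measured accuracy, so that a search procedure can score candidates without fine-tuning them. The sketch below shows that pattern on synthetic data with a gradient-boosted regressor; the feature encoding, model choice, and numbers are illustrative assumptions, not the paper's setup.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Hypothetical training set: each row encodes one candidate pruning
# configuration (e.g., per-layer sparsity ratios) with its measured accuracy.
configs = rng.uniform(0.0, 0.9, size=(200, 12))
accuracies = 0.75 - 0.2 * configs.mean(axis=1) + rng.normal(0.0, 0.01, 200)

# The "non-neural" accuracy predictor: a gradient-boosted regressor.
predictor = GradientBoostingRegressor().fit(configs, accuracies)

# New candidates can now be scored cheaply instead of being fine-tuned.
candidate = rng.uniform(0.0, 0.9, size=(1, 12))
print(predictor.predict(candidate))
```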
- Combined Scaling for Zero-shot Transfer Learning [146.0851484769142]
We present a combined scaling method, named BASIC, that achieves 85.7% top-1 accuracy on the ImageNet ILSVRC-2012 validation set.
This accuracy surpasses the best published similar models, CLIP and ALIGN, by 9.3%.
Our model also shows significant improvements in robustness benchmarks.
arXiv Detail & Related papers (2021-11-19T05:25:46Z)
- Effective Model Sparsification by Scheduled Grow-and-Prune Methods [73.03533268740605]
We propose a novel scheduled grow-and-prune (GaP) methodology without pre-training the dense models.
Experiments have shown that such models can match or beat the quality of highly optimized dense models at 80% sparsity on a variety of tasks.
arXiv Detail & Related papers (2021-06-18T01:03:13Z)
- Concurrent Adversarial Learning for Large-Batch Training [83.55868483681748]
Adversarial learning is a natural choice for smoothing the decision surface and biasing towards a flat region.
We propose a novel Concurrent Adversarial Learning (ConAdv) method that decouples the sequential gradient computations in adversarial learning by using stale parameters.
This is the first work to successfully scale the ResNet-50 training batch size to 96K.
arXiv Detail & Related papers (2021-06-01T04:26:02Z)
- AdaScale SGD: A User-Friendly Algorithm for Distributed Training [29.430153773234363]
We propose AdaScale SGD, an algorithm that reliably adapts learning rates to large-batch training.
By continually adapting to the gradient's variance, AdaScale achieves speed-ups for a wide range of batch sizes.
This includes large-batch training with no model degradation for machine translation, image classification, object detection, and speech recognition tasks.
arXiv Detail & Related papers (2020-07-09T23:26:13Z)
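AdaScale's central quantity is a gain ratio r ∈ [1, S] that measures how much averaging gradients over S workers actually reduces noise: the learning rate is multiplied by r, and the schedule advances by r "scale-invariant" iterations per step. The snippet below sketches one way to estimate such a gain from per-worker gradients; the paper uses moving-average estimators and further safeguards, so treat this as a simplified, assumption-laden illustration rather than the published algorithm.

```python
import numpy as np

def adascale_gain(worker_grads, eps=1e-8):
    """Estimate a gain ratio r in [1, S] from per-worker gradients.

    Simplified sketch of the gain-ratio idea (no moving averages, not the
    paper's exact estimator): if gradient noise dominates, averaging S
    workers helps a lot and r approaches S; if the workers agree, r stays
    near 1. worker_grads has shape (S, d), one flattened gradient per worker.
    """
    S = worker_grads.shape[0]
    g_bar = worker_grads.mean(axis=0)

    # Estimated per-worker gradient variance and squared true-gradient norm.
    var = worker_grads.var(axis=0, ddof=1).sum()
    mu_sq = max(float((g_bar ** 2).sum()) - var / S, eps)

    gain = (var + mu_sq) / (var / S + mu_sq)
    return float(np.clip(gain, 1.0, S))

# Usage idea: lr_t = gain * base_lr, and advance the learning-rate schedule
# by `gain` iterations rather than 1.
```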
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
- Training with Quantization Noise for Extreme Model Compression [57.51832088938618]
We tackle the problem of producing compact models, maximizing their accuracy for a given model size.
A standard solution is to train networks with Quantization Aware Training, where the weights are quantized during training and the gradients approximated with the Straight-Through Estimator.
In this paper, we extend this approach to work beyond int8 fixed-point quantization with extreme compression methods.
arXiv Detail & Related papers (2020-04-15T20:10:53Z)
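The Straight-Through Estimator mentioned above is simple to state in code: quantize the weights in the forward pass, but let the backward pass treat the quantizer as the identity. The PyTorch sketch below is a minimal symmetric fake-quantization helper meant only to illustrate QAT with STE; it is not Quant-Noise itself, which quantizes only a random subset of weights at each iteration so the rest receive unbiased gradients.

```python
import torch

def fake_quantize_ste(w, num_bits=8):
    """Quantization-aware training building block: quantize in the forward
    pass, pass gradients through unchanged in the backward pass.
    Minimal symmetric int-N sketch for illustration, not Quant-Noise."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = w.detach().abs().max().clamp(min=1e-8) / qmax
    w_q = torch.round(w / scale).clamp(-qmax - 1, qmax) * scale

    # Straight-through estimator: forward value is w_q, gradient is identity.
    return w + (w_q - w).detach()

# Usage inside a layer's forward: y = x @ fake_quantize_ste(self.weight).t()
```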
- Large Batch Training Does Not Need Warmup [111.07680619360528]
Training deep neural networks using a large batch size has shown promising results and benefits many real-world applications.
In this paper, we propose a novel Complete Layer-wise Adaptive Rate Scaling (CLARS) algorithm for large-batch training.
Based on our analysis, we bridge the gap and illustrate the theoretical insights for three popular large-batch training techniques.
arXiv Detail & Related papers (2020-02-04T23:03:12Z)
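The CLARS name echoes layer-wise adaptive rate scaling (LARS), in which each layer's learning rate is multiplied by a trust ratio of the form ||w|| / (||∇w|| + λ||w||). The sketch below computes such LARS-style per-layer rates; it illustrates the family of techniques CLARS refines, not the CLARS algorithm itself.

```python
import torch

def layerwise_adaptive_lrs(model, base_lr=1.0, weight_decay=1e-4, eps=1e-8):
    """LARS-style per-layer learning rates (the family CLARS builds on,
    not CLARS itself). Returns a dict mapping parameter names to rates
    proportional to ||w|| / (||grad|| + weight_decay * ||w||)."""
    rates = {}
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        w_norm = p.detach().norm()
        g_norm = p.grad.detach().norm()
        trust = w_norm / (g_norm + weight_decay * w_norm + eps)
        rates[name] = base_lr * float(trust)
    return rates
```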
This list is automatically generated from the titles and abstracts of the papers on this site.