Q-Ensemble for Offline RL: Don't Scale the Ensemble, Scale the Batch Size
- URL: http://arxiv.org/abs/2211.11092v1
- Date: Sun, 20 Nov 2022 21:48:25 GMT
- Title: Q-Ensemble for Offline RL: Don't Scale the Ensemble, Scale the Batch Size
- Authors: Alexander Nikulin, Vladislav Kurenkov, Denis Tarasov, Dmitry Akimov,
Sergey Kolesnikov
- Abstract summary: We show that scaling mini-batch sizes with appropriate learning rate adjustments can speed up the training process by orders of magnitude.
We show that scaling the mini-batch size and naively adjusting the learning rate allows for (1) a reduced size of the Q-ensemble, (2) stronger penalization of out-of-distribution actions, and (3) improved convergence time.
- Score: 58.762959061522736
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Training large neural networks is known to be time-consuming, with the
learning duration taking days or even weeks. To address this problem,
large-batch optimization was introduced. This approach demonstrated that
scaling mini-batch sizes with appropriate learning rate adjustments can speed
up the training process by orders of magnitude. While long training time was
not typically a major issue for model-free deep offline RL algorithms, recently
introduced Q-ensemble methods achieving state-of-the-art performance made this
issue more relevant, notably extending the training duration. In this work, we
demonstrate how this class of methods can benefit from large-batch
optimization, which is commonly overlooked by the deep offline RL community. We
show that scaling the mini-batch size and naively adjusting the learning rate
allows for (1) a reduced size of the Q-ensemble, (2) stronger penalization of
out-of-distribution actions, and (3) improved convergence time, effectively
shortening the training duration by 3-4x on average.
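
The core recipe in the abstract, enlarging the offline mini-batch and naively adjusting the learning rate so that a smaller Q-ensemble still trains well, can be sketched in a few lines of PyTorch. The sketch below is a minimal illustration under assumed values: the linear scaling rule, the 16x multiplier, the 3e-4 base Adam learning rate, the toy critic architecture, and the state/action dimensions are placeholders, not the paper's reported setup.

```python
# Minimal sketch of the large-batch recipe from the abstract: draw larger
# mini-batches and adjust the learning rate linearly with the batch-size
# multiplier. All sizes and coefficients below are illustrative assumptions.
import torch

BASE_BATCH, BASE_LR, SCALE = 256, 3e-4, 16             # assumed baseline values
batch_size, lr = BASE_BATCH * SCALE, BASE_LR * SCALE   # naive linear scaling rule

# Hypothetical Q-ensemble: a small number of independent critics.
STATE_DIM, ACTION_DIM, N_CRITICS = 17, 6, 5            # toy dimensions
critics = [torch.nn.Linear(STATE_DIM + ACTION_DIM, 1) for _ in range(N_CRITICS)]
optimizer = torch.optim.Adam(
    [p for q in critics for p in q.parameters()], lr=lr
)

def critic_update(states, actions, td_targets):
    """One large-batch gradient step on every critic in the ensemble."""
    sa = torch.cat([states, actions], dim=-1)
    loss = sum(((q(sa) - td_targets) ** 2).mean() for q in critics)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)

# Usage with random placeholder data of the scaled batch size.
s = torch.randn(batch_size, STATE_DIM)
a = torch.randn(batch_size, ACTION_DIM)
y = torch.randn(batch_size, 1)
critic_update(s, a, y)
```

Linear scaling is used here as one common reading of "naively adjusting the learning rate"; square-root scaling is another standard choice for large batches.
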
Related papers
- Lancet: Accelerating Mixture-of-Experts Training via Whole Graph Computation-Communication Overlapping [14.435637320909663]
The MoE technique plays a crucial role in expanding the size of DNN model parameters.
Existing methods attempt to mitigate the resulting communication overhead by overlapping all-to-all with expert computation.
In our study, we extend the scope of this challenge by considering overlap at the broader training graph level.
We implement these techniques in Lancet, a system using compiler-based optimization to automatically enhance MoE model training.
arXiv Detail & Related papers (2024-04-30T10:17:21Z)
- Learning to Optimize Permutation Flow Shop Scheduling via Graph-based Imitation Learning [70.65666982566655]
Permutation flow shop scheduling (PFSS) is widely used in manufacturing systems.
We propose to train the model via expert-driven imitation learning, which accelerates convergence more stably and accurately.
Our model's network parameters are reduced to only 37% of theirs, and the solution gap of our model towards the expert solutions decreases from 6.8% to 1.3% on average.
arXiv Detail & Related papers (2022-10-31T09:46:26Z)
- Online Convolutional Re-parameterization [51.97831675242173]
We present online convolutional re-parameterization (OREPA), a two-stage pipeline, aiming to reduce the huge training overhead by squeezing the complex training-time block into a single convolution.
Compared with the state-of-the-art re-param models, OREPA is able to save the training-time memory cost by about 70% and accelerate the training speed by around 2x.
We also conduct experiments on object detection and semantic segmentation and show consistent improvements on the downstream tasks.
arXiv Detail & Related papers (2022-04-02T09:50:19Z)
- Curriculum Learning: A Regularization Method for Efficient and Stable Billion-Scale GPT Model Pre-Training [18.640076155697415]
We present a study of a curriculum learning based approach, which helps improve the pre-training convergence speed of autoregressive models.
Our evaluations demonstrate that curriculum learning enables training GPT-2 models with 8x larger batch size and 4x larger learning rate.
arXiv Detail & Related papers (2021-08-13T06:32:53Z)
- Automated Learning Rate Scheduler for Large-batch Training [24.20872850681828]
Large-batch training has been essential in leveraging large-scale datasets and models in deep learning.
It often requires a specially designed learning rate (LR) schedule to achieve a comparable level of performance as in smaller batch training.
We propose an automated LR scheduling algorithm which is effective for neural network training with a large batch size under the given epoch budget.
arXiv Detail & Related papers (2021-07-13T05:23:13Z)
- Concurrent Adversarial Learning for Large-Batch Training [83.55868483681748]
Adversarial learning is a natural choice for smoothing the decision surface and biasing towards a flat region.
We propose a novel Concurrent Adversarial Learning (ConAdv) method that decouples the sequential gradient computations in adversarial learning by utilizing stale parameters.
This is the first work that successfully scales the ResNet-50 training batch size to 96K.
arXiv Detail & Related papers (2021-06-01T04:26:02Z)
- AdaScale SGD: A User-Friendly Algorithm for Distributed Training [29.430153773234363]
We propose AdaScale SGD, an algorithm that reliably adapts learning rates to large-batch training.
By continually adapting to the gradient's variance, AdaScale achieves speed-ups for a wide range of batch sizes.
This includes large-batch training with no model degradation for machine translation, image classification, object detection, and speech recognition tasks.
arXiv Detail & Related papers (2020-07-09T23:26:13Z)
- Accelerated Large Batch Optimization of BERT Pretraining in 54 minutes [9.213729275749452]
We propose an accelerated gradient method called LANS to improve the efficiency of using large mini-batches for training.
It takes 54 minutes on 192 AWS EC2 P3dn.24xlarge instances to achieve a target F1 score of 90.5 or higher on SQuAD v1.1, achieving the fastest BERT training time in the cloud.
arXiv Detail & Related papers (2020-06-24T05:00:41Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
- Large Batch Training Does Not Need Warmup [111.07680619360528]
Training deep neural networks using a large batch size has shown promising results and benefits many real-world applications.
In this paper, we propose a novel Complete Layer-wise Adaptive Rate Scaling (CLARS) algorithm for large-batch training.
Based on our analysis, we bridge the gap and illustrate the theoretical insights for three popular large-batch training techniques; a generic layer-wise scaling sketch appears after this list.
arXiv Detail & Related papers (2020-02-04T23:03:12Z)
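
Several related entries above (CLARS, LANS, AdaScale) revolve around adaptive learning-rate rules for large-batch training. As a generic illustration only, the sketch below implements a LARS-style layer-wise scaling step (You et al., 2017), where each layer's update is scaled by the ratio of its weight norm to its gradient norm. It is not the CLARS algorithm or any other method listed above, and the coefficients are assumptions.

```python
# A generic LARS-style layer-wise adaptive rate scaling step, shown only to
# illustrate the idea behind layer-wise scaling for large batches. This is
# NOT the CLARS algorithm from the entry above; trust_coef, base_lr, and
# weight_decay are illustrative assumptions.
import torch

def lars_style_step(params, base_lr=0.1, trust_coef=1e-3, weight_decay=1e-4):
    """Scale each layer's update by trust_coef * ||w|| / (||g|| + wd * ||w||)."""
    with torch.no_grad():
        for w in params:
            if w.grad is None:
                continue
            grad = w.grad
            w_norm, g_norm = w.norm(), grad.norm()
            if w_norm > 0 and g_norm > 0:
                local_lr = trust_coef * w_norm / (g_norm + weight_decay * w_norm)
            else:
                local_lr = 1.0
            w -= base_lr * local_lr * (grad + weight_decay * w)

# Usage on a toy model with a large synthetic batch.
model = torch.nn.Sequential(torch.nn.Linear(8, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1))
x, y = torch.randn(4096, 8), torch.randn(4096, 1)
loss = ((model(x) - y) ** 2).mean()
loss.backward()
lars_style_step(model.parameters())
```

The per-layer ratio keeps each update roughly proportional to the layer's weight norm, which is the property these large-batch methods build on.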