Q-Ensemble for Offline RL: Don't Scale the Ensemble, Scale the Batch Size
- URL: http://arxiv.org/abs/2211.11092v1
- Date: Sun, 20 Nov 2022 21:48:25 GMT
- Title: Q-Ensemble for Offline RL: Don't Scale the Ensemble, Scale the Batch Size
- Authors: Alexander Nikulin, Vladislav Kurenkov, Denis Tarasov, Dmitry Akimov,
Sergey Kolesnikov
- Abstract summary: We show that scaling mini-batch sizes with appropriate learning rate adjustments can speed up the training process by orders of magnitude.
We show that scaling the mini-batch size and naively adjusting the learning rate allows for (1) a reduced size of the Q-ensemble, (2) stronger penalization of out-of-distribution actions, and (3) improved convergence time.
- Score: 58.762959061522736
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Training large neural networks is known to be time-consuming, with the
learning duration taking days or even weeks. To address this problem,
large-batch optimization was introduced. This approach demonstrated that
scaling mini-batch sizes with appropriate learning rate adjustments can speed
up the training process by orders of magnitude. While long training time was
not typically a major issue for model-free deep offline RL algorithms, recently
introduced Q-ensemble methods achieving state-of-the-art performance made this
issue more relevant, notably extending the training duration. In this work, we
demonstrate how this class of methods can benefit from large-batch
optimization, which is commonly overlooked by the deep offline RL community. We
show that scaling the mini-batch size and naively adjusting the learning rate
allows for (1) a reduced size of the Q-ensemble, (2) stronger penalization of
out-of-distribution actions, and (3) improved convergence time, effectively
shortening the training duration by 3-4x on average.
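
The core recipe in the abstract, enlarging the offline mini-batch and naively adjusting the learning rate so that a smaller Q-ensemble still trains well, can be sketched in a few lines of PyTorch. The sketch below is a minimal illustration under assumed values: the linear scaling rule, the 16x multiplier, the 3e-4 base Adam learning rate, the toy critic architecture, and the state/action dimensions are placeholders, not the paper's reported setup.

```python
# Minimal sketch of the large-batch recipe from the abstract: draw larger
# mini-batches and adjust the learning rate linearly with the batch-size
# multiplier. All sizes and coefficients below are illustrative assumptions.
import torch

BASE_BATCH, BASE_LR, SCALE = 256, 3e-4, 16             # assumed baseline values
batch_size, lr = BASE_BATCH * SCALE, BASE_LR * SCALE   # naive linear scaling rule

# Hypothetical Q-ensemble: a small number of independent critics.
STATE_DIM, ACTION_DIM, N_CRITICS = 17, 6, 5            # toy dimensions
critics = [torch.nn.Linear(STATE_DIM + ACTION_DIM, 1) for _ in range(N_CRITICS)]
optimizer = torch.optim.Adam(
    [p for q in critics for p in q.parameters()], lr=lr
)

def critic_update(states, actions, td_targets):
    """One large-batch gradient step on every critic in the ensemble."""
    sa = torch.cat([states, actions], dim=-1)
    loss = sum(((q(sa) - td_targets) ** 2).mean() for q in critics)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)

# Usage with random placeholder data of the scaled batch size.
s = torch.randn(batch_size, STATE_DIM)
a = torch.randn(batch_size, ACTION_DIM)
y = torch.randn(batch_size, 1)
critic_update(s, a, y)
```

Linear scaling is used here as one common reading of "naively adjusting the learning rate"; square-root scaling is another standard choice for large batches.
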
Related papers
- Lancet: Accelerating Mixture-of-Experts Training via Whole Graph Computation-Communication Overlapping [14.435637320909663]
The MoE technique plays a crucial role in expanding the size of DNN model parameters.
Existing methods attempt to mitigate the resulting communication overhead by overlapping all-to-all with expert computation.
In our study, we extend the scope of this challenge by considering overlap at the broader training graph level.
We implement these techniques in Lancet, a system using compiler-based optimization to automatically enhance MoE model training.
arXiv Detail & Related papers (2024-04-30T10:17:21Z)
- Learning to Optimize Permutation Flow Shop Scheduling via Graph-based Imitation Learning [70.65666982566655]
Permutation flow shop scheduling (PFSS) is widely used in manufacturing systems.
We propose to train the model via expert-driven imitation learning, which accelerates convergence more stably and accurately.
Our model's network parameters are reduced to only 37% of theirs, and the solution gap of our model towards the expert solutions decreases from 6.8% to 1.3% on average.
arXiv Detail & Related papers (2022-10-31T09:46:26Z)
- Online Convolutional Re-parameterization [51.97831675242173]
We present online convolutional re-parameterization (OREPA), a two-stage pipeline, aiming to reduce the huge training overhead by squeezing the complex training-time block into a single convolution.
Compared with the state-of-the-art re-param models, OREPA is able to save the training-time memory cost by about 70% and accelerate the training speed by around 2x.
We also conduct experiments on object detection and semantic segmentation and show consistent improvements on the downstream tasks.
arXiv Detail & Related papers (2022-04-02T09:50:19Z)
- Curriculum Learning: A Regularization Method for Efficient and Stable Billion-Scale GPT Model Pre-Training [18.640076155697415]
We present a study of a curriculum learning based approach, which helps improve the pre-training convergence speed of autoregressive models.
Our evaluations demonstrate that curriculum learning enables training GPT-2 models with 8x larger batch size and 4x larger learning rate.
arXiv Detail & Related papers (2021-08-13T06:32:53Z)
- Automated Learning Rate Scheduler for Large-batch Training [24.20872850681828]
Large-batch training has been essential in leveraging large-scale datasets and models in deep learning.
It often requires a specially designed learning rate (LR) schedule to achieve a comparable level of performance as in smaller batch training.
We propose an automated LR scheduling algorithm which is effective for neural network training with a large batch size under the given epoch budget.
arXiv Detail & Related papers (2021-07-13T05:23:13Z)
- Concurrent Adversarial Learning for Large-Batch Training [83.55868483681748]
Adversarial learning is a natural choice for smoothing the decision surface and biasing towards a flat region.
We propose a novel Concurrent Adversarial Learning (ConAdv) method that decouples the sequential gradient computations in adversarial learning by utilizing stale parameters.
This is the first work that successfully scales the ResNet-50 training batch size to 96K.
arXiv Detail & Related papers (2021-06-01T04:26:02Z)
- AdaScale SGD: A User-Friendly Algorithm for Distributed Training [29.430153773234363]
We propose AdaScale SGD, an algorithm that reliably adapts learning rates to large-batch training.
By continually adapting to the gradient's variance, AdaScale achieves speed-ups for a wide range of batch sizes.
This includes large-batch training with no model degradation for machine translation, image classification, object detection, and speech recognition tasks.
arXiv Detail & Related papers (2020-07-09T23:26:13Z)
- Accelerated Large Batch Optimization of BERT Pretraining in 54 minutes [9.213729275749452]
We propose an accelerated gradient method called LANS to improve the efficiency of using large mini-batches for training.
It takes 54 minutes on 192 AWS EC2 P3dn.24xlarge instances to achieve a target F1 score of 90.5 or higher on SQuAD v1.1, achieving the fastest BERT training time in the cloud.
arXiv Detail & Related papers (2020-06-24T05:00:41Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
- Large Batch Training Does Not Need Warmup [111.07680619360528]
Training deep neural networks using a large batch size has shown promising results and benefits many real-world applications.
In this paper, we propose a novel Complete Layer-wise Adaptive Rate Scaling (CLARS) algorithm for large-batch training.
Based on our analysis, we bridge the gap and illustrate the theoretical insights for three popular large-batch training techniques; a generic layer-wise scaling sketch appears after this list.
arXiv Detail & Related papers (2020-02-04T23:03:12Z)
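
Several related entries above (CLARS, LANS, AdaScale) revolve around adaptive learning-rate rules for large-batch training. As a generic illustration only, the sketch below implements a LARS-style layer-wise scaling step (You et al., 2017), where each layer's update is scaled by the ratio of its weight norm to its gradient norm. It is not the CLARS algorithm or any other method listed above, and the coefficients are assumptions.

```python
# A generic LARS-style layer-wise adaptive rate scaling step, shown only to
# illustrate the idea behind layer-wise scaling for large batches. This is
# NOT the CLARS algorithm from the entry above; trust_coef, base_lr, and
# weight_decay are illustrative assumptions.
import torch

def lars_style_step(params, base_lr=0.1, trust_coef=1e-3, weight_decay=1e-4):
    """Scale each layer's update by trust_coef * ||w|| / (||g|| + wd * ||w||)."""
    with torch.no_grad():
        for w in params:
            if w.grad is None:
                continue
            grad = w.grad
            w_norm, g_norm = w.norm(), grad.norm()
            if w_norm > 0 and g_norm > 0:
                local_lr = trust_coef * w_norm / (g_norm + weight_decay * w_norm)
            else:
                local_lr = 1.0
            w -= base_lr * local_lr * (grad + weight_decay * w)

# Usage on a toy model with a large synthetic batch.
model = torch.nn.Sequential(torch.nn.Linear(8, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1))
x, y = torch.randn(4096, 8), torch.randn(4096, 1)
loss = ((model(x) - y) ** 2).mean()
loss.backward()
lars_style_step(model.parameters())
```

The per-layer ratio keeps each update roughly proportional to the layer's weight norm, which is the property these large-batch methods build on.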