Existence and Estimation of Critical Batch Size for Training Generative
Adversarial Networks with Two Time-Scale Update Rule
- URL: http://arxiv.org/abs/2201.11989v6
- Date: Mon, 5 Jun 2023 13:20:53 GMT
- Title: Existence and Estimation of Critical Batch Size for Training Generative
Adversarial Networks with Two Time-Scale Update Rule
- Authors: Naoki Sato and Hideaki Iiduka
- Abstract summary: Previous results have shown that a two time-scale update rule (TTUR) using different learning rates is useful for training generative adversarial networks (GANs) in theory and in practice.
This paper studies the relationship between batch size and the number of steps needed for training GANs with TTURs based on constant learning rates.
- Score: 0.2741266294612775
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Previous results have shown that a two time-scale update rule (TTUR) using
different learning rates, such as different constant rates or different
decaying rates, is useful for training generative adversarial networks (GANs)
in theory and in practice. Moreover, not only the learning rate but also the
batch size is important for training GANs with TTURs and they both affect the
number of steps needed for training. This paper studies the relationship
between batch size and the number of steps needed for training GANs with TTURs
based on constant learning rates. We theoretically show that, for a TTUR with
constant learning rates, the number of steps needed to find stationary points
of the loss functions of both the discriminator and generator decreases as the
batch size increases and that there exists a critical batch size minimizing the
stochastic first-order oracle (SFO) complexity. Then, we use the Fréchet
inception distance (FID) as the performance measure for training and provide
numerical results indicating that the number of steps needed to achieve a low
FID score decreases as the batch size increases and that the SFO complexity
increases once the batch size exceeds the measured critical batch size.
Moreover, we show that measured critical batch sizes are close to the sizes
estimated from our theoretical results.
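The core trade-off can be illustrated with a small numerical sketch. The functional form K(b) = A*b / (eps^2*b - C) used below is only an assumed model of the kind analyzed in the paper, and the constants A, C, and eps are hypothetical placeholders; under this assumption the number of steps decreases monotonically in the batch size b, while the SFO complexity b*K(b) attains an interior minimum at a critical batch size.
```python
# Minimal sketch (not the paper's code): assume the number of TTUR steps
# K(b) needed to reach an eps-stationary point follows
#   K(b) = A * b / (eps**2 * b - C),
# which decreases as the batch size b grows, while the SFO complexity
# b * K(b) is minimized at an interior "critical" batch size.
# All constants (A, C, eps) are hypothetical placeholders.
import numpy as np

A, C, eps = 1.0e3, 4.0e-2, 0.1   # hypothetical constants, not values from the paper

def steps_needed(b):
    """Assumed number of TTUR steps to reach an eps-stationary point."""
    return A * b / (eps**2 * b - C)

def sfo_complexity(b):
    """SFO complexity: total stochastic gradient computations = steps * batch size."""
    return b * steps_needed(b)

# The model is only defined for eps**2 * b > C (b > 4 with these constants).
batch_sizes = np.arange(int(np.ceil(C / eps**2)) + 1, 65)
k = steps_needed(batch_sizes)
sfo = sfo_complexity(batch_sizes)

# Steps needed decrease monotonically as the batch size grows ...
assert np.all(np.diff(k) < 0)

# ... while the SFO complexity has an interior minimizer: the critical batch size.
b_critical = batch_sizes[np.argmin(sfo)]
print(f"critical batch size (grid search):          {b_critical}")
print(f"critical batch size (closed form 2C/eps^2): {2 * C / eps**2:.1f}")
```
Under this assumed model the closed-form minimizer of b*K(b) is b* = 2C/eps^2, which the brute-force grid search recovers; the paper's experiments play the analogous role of checking measured critical batch sizes against the theoretically estimated ones.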
Related papers
- How Does Critical Batch Size Scale in Pre-training? [23.284171845875985]
Critical batch size (CBS) is the threshold beyond which greater data parallelism leads to diminishing returns.
We propose a measure of CBS and pre-train a series of auto-regressive language models on the C4 dataset.
Our results demonstrate that CBS scales primarily with data size rather than model size.
arXiv Detail & Related papers (2024-10-29T02:54:06Z) - Online Learning and Information Exponents: On The Importance of Batch size, and Time/Complexity Tradeoffs [24.305423716384272]
We study the impact of the batch size on the iteration time $T$ of training two-layer neural networks with one-pass stochastic gradient descent (SGD).
We show that performing gradient updates with large batches minimizes the training time without changing the total sample complexity.
We show that one can track the training progress by a system of low-dimensional ordinary differential equations (ODEs).
arXiv Detail & Related papers (2024-06-04T09:44:49Z) - Rethinking Resource Management in Edge Learning: A Joint Pre-training and Fine-tuning Design Paradigm [87.47506806135746]
In some applications, edge learning is shifting its focus from conventional learning from scratch to a new two-stage paradigm of pre-training and task-specific fine-tuning.
This paper considers the problem of joint communication and computation resource management in a two-stage edge learning system.
It is shown that the proposed joint resource management over the pre-training and fine-tuning stages effectively balances the system performance trade-off.
arXiv Detail & Related papers (2024-04-01T00:21:11Z) - Iteration and Stochastic First-order Oracle Complexities of Stochastic
Gradient Descent using Constant and Decaying Learning Rates [0.8158530638728501]
We show that the performance of stochastic gradient descent (SGD) depends on not only the learning rate but also the batch size.
We show that measured critical batch sizes are close to the sizes estimated from our theoretical results.
arXiv Detail & Related papers (2024-02-23T14:24:45Z) - Relationship between Batch Size and Number of Steps Needed for Nonconvex
Optimization of Stochastic Gradient Descent using Armijo Line Search [0.8158530638728501]
We show that SGD using the Armijo line search performs better than other deep learning optimizers.
The results indicate that the number of steps needed and the SFO complexity can be estimated as the batch size grows.
arXiv Detail & Related papers (2023-07-25T21:59:17Z) - BatchGFN: Generative Flow Networks for Batch Active Learning [80.73649229919454]
BatchGFN is a novel approach for pool-based active learning that uses generative flow networks to sample sets of data points proportional to a batch reward.
We show that our approach enables principled sampling of near-optimal-utility batches at inference time, with a single forward pass per point in the batch, on toy regression problems.
arXiv Detail & Related papers (2023-06-26T20:41:36Z) - Adaptive Cross Batch Normalization for Metric Learning [75.91093210956116]
Metric learning is a fundamental problem in computer vision.
We show that it is equally important to ensure that the accumulated embeddings are up to date.
In particular, it is necessary to circumvent the representational drift between the accumulated embeddings and the feature embeddings at the current training iteration.
arXiv Detail & Related papers (2023-03-30T03:22:52Z) - Q-Ensemble for Offline RL: Don't Scale the Ensemble, Scale the Batch
Size [58.762959061522736]
We show that scaling mini-batch sizes with appropriate learning rate adjustments can speed up the training process by orders of magnitude.
We show that scaling the mini-batch size and naively adjusting the learning rate allows for (1) a reduced size of the Q-ensemble, (2) stronger penalization of out-of-distribution actions, and (3) improved convergence time.
arXiv Detail & Related papers (2022-11-20T21:48:25Z) - Mini-Batch Learning Strategies for modeling long term temporal
dependencies: A study in environmental applications [20.979235183394994]
In environmental applications, recurrent neural networks (RNNs) are often used to model physical variables with long temporal dependencies.
Due to mini-batch training, temporal relationships between training segments within the batch (intra-batch) as well as between batches (inter-batch) are not considered.
We propose two strategies to enforce both intra- and inter-batch temporal dependency.
arXiv Detail & Related papers (2022-10-15T17:44:21Z) - Parameter-Efficient Sparsity for Large Language Models Fine-Tuning [63.321205487234074]
We propose a Parameter-efficient Sparse Training (PST) method to reduce the number of trainable parameters during sparse-aware training.
Experiments with diverse networks (i.e., BERT, RoBERTa and GPT-2) demonstrate PST performs on par or better than previous sparsity methods.
arXiv Detail & Related papers (2022-05-23T02:43:45Z) - AdaS: Adaptive Scheduling of Stochastic Gradients [50.80697760166045]
We introduce the notions of "knowledge gain" and "mapping condition" and propose a new algorithm called Adaptive Scheduling (AdaS).
Experimentation reveals that, using the derived metrics, AdaS exhibits: (a) faster convergence and superior generalization over existing adaptive learning methods; and (b) lack of dependence on a validation set to determine when to stop training.
arXiv Detail & Related papers (2020-06-11T16:36:31Z)