Accelerating Large Batch Training via Gradient Signal to Noise Ratio
(GSNR)
- URL: http://arxiv.org/abs/2309.13681v1
- Date: Sun, 24 Sep 2023 16:08:21 GMT
- Title: Accelerating Large Batch Training via Gradient Signal to Noise Ratio
(GSNR)
- Authors: Guo-qing Jiang, Jinlong Liu, Zixiang Ding, Lin Guo, Wei Lin
- Abstract summary: We develop the variance reduced gradient descent technique (VRGD) based on the gradient signal to noise ratio (GSNR)
VRGD can accelerate training ($1\sim 2\times$), narrow the generalization gap and improve final accuracy.
We improve ImageNet Top-1 accuracy at a batch size of 96k by $0.52pp$ over LARS.
- Score: 16.351871316985598
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As models for natural language processing (NLP), computer vision (CV) and
recommendation systems (RS) require surging computation, a large number of
GPUs/TPUs are run in parallel as a large batch (LB) to improve training throughput.
However, such LB training often suffers from a large generalization gap and
degraded final accuracy, which limits further enlarging the batch size. In this
work, we develop the variance reduced gradient descent technique (VRGD) based
on the gradient signal to noise ratio (GSNR) and apply it to popular
optimizers such as SGD/Adam/LARS/LAMB. We carry out a theoretical analysis of
the convergence rate to explain its fast training dynamics, and a generalization
analysis to demonstrate its smaller generalization gap on LB training.
Comprehensive experiments demonstrate that VRGD can accelerate training ($1\sim
2\times$), narrow the generalization gap and improve final accuracy. We push the
batch size limit of BERT pretraining up to 128k/64k and DLRM to 512k without
noticeable accuracy loss. We improve ImageNet Top-1 accuracy at a batch size of
96k by $0.52pp$ over LARS. The generalization gap of BERT and ImageNet training
is significantly reduced by over $65\%$.
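The abstract does not spell out the exact VRGD update rule, so the following is only a minimal NumPy sketch of the underlying idea: estimate a per-parameter GSNR from micro-batch gradients within one large batch and use it to damp noise-dominated coordinates of an SGD step. The function names and the shrinkage rule gsnr/(1+gsnr) are illustrative assumptions, not the authors' algorithm.

```python
# Minimal sketch (not the paper's implementation): per-parameter GSNR from
# micro-batch gradients, used to damp noisy coordinates of an SGD step.
import numpy as np

def gsnr(micro_batch_grads, eps=1e-12):
    """micro_batch_grads: shape (m, d), one gradient per micro-batch.
    GSNR_j = mean_j^2 / var_j, estimated across the m micro-batches."""
    mean = micro_batch_grads.mean(axis=0)
    var = micro_batch_grads.var(axis=0) + eps
    return mean ** 2 / var, mean

def vr_sgd_step(w, micro_batch_grads, lr=0.1):
    """Illustrative variance-reduced step: low-GSNR (noise-dominated)
    coordinates are shrunk, high-GSNR coordinates pass through.
    The gsnr/(1+gsnr) shrinkage is an assumption for illustration."""
    g, mean_grad = gsnr(micro_batch_grads)
    shrink = g / (1.0 + g)
    return w - lr * shrink * mean_grad

# Toy usage: 8 micro-batch gradients of a 4-parameter model.
rng = np.random.default_rng(0)
grads = rng.normal(loc=[1.0, 0.0, -0.5, 0.0], scale=0.2, size=(8, 4))
w = np.zeros(4)
w = vr_sgd_step(w, grads)
print(w)
```

In this sketch the second and fourth coordinates, whose gradient signal is pure noise, receive much smaller updates than the first and third, which is the qualitative behaviour a GSNR-based variance reduction aims for.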
Related papers
- Towards Accurate and Efficient Sub-8-Bit Integer Training [24.853958178296587]
Quantization enables low-bitwidth formats in neural network training.
Recent methods have developed new data formats and additional pre-processing operations on quantizers.
It remains quite challenging to achieve high accuracy and efficiency simultaneously.
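For context, the basic building block of low-bit integer training is "fake quantization" of tensors onto a small integer grid. The sketch below shows a generic symmetric per-tensor variant; it is not the data formats or pre-processing operations proposed in the paper above.

```python
# Generic symmetric fake quantization to a low-bit signed integer grid.
import numpy as np

def fake_quantize(x, num_bits=8):
    """Quantize to a signed integer grid and immediately dequantize, so the
    tensor keeps float dtype but only takes 2**num_bits distinct levels."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax + 1e-12      # per-tensor scale (assumption)
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

x = np.random.randn(4, 4).astype(np.float32)
print(fake_quantize(x, num_bits=6))               # sub-8-bit example
```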
arXiv Detail & Related papers (2024-11-17T03:32:36Z)
- Taming 3DGS: High-Quality Radiance Fields with Limited Resources [50.92437599516609]
3D Gaussian Splatting (3DGS) has transformed novel-view synthesis with its fast, interpretable, and high-fidelity rendering.
We tackle the challenges of training and rendering 3DGS models on a budget.
We derive faster, numerically equivalent solutions for gradient computation and attribute updates.
arXiv Detail & Related papers (2024-06-21T20:44:23Z)
- Breaking MLPerf Training: A Case Study on Optimizing BERT [9.486916730173661]
We present novel approaches for fast large-scale training of BERT model.
Load balancing is imperative in distributed BERT training since its training data are characterized by samples of varying lengths.
We propose two new ideas: (1) local presorting based on dataset stratification for load balancing, and (2) bucket-wise gradient clipping before allreduce.
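As a rough illustration of idea (1), the sketch below greedily assigns length-sorted samples to the currently lightest worker so that per-worker token counts stay balanced. This is a generic load-balancing recipe, not necessarily the paper's presorting procedure.

```python
# Greedy length-based sharding: longest samples first, each assigned to the
# worker with the smallest accumulated token count.
import numpy as np

def balanced_shards(sample_lengths, num_workers):
    order = np.argsort(sample_lengths)[::-1]      # longest first
    shards = [[] for _ in range(num_workers)]
    loads = np.zeros(num_workers)
    for idx in order:
        w = int(np.argmin(loads))                  # lightest worker so far
        shards[w].append(int(idx))
        loads[w] += sample_lengths[idx]
    return shards, loads

lengths = np.random.randint(16, 512, size=32)
shards, loads = balanced_shards(lengths, num_workers=4)
print(loads)   # per-worker token counts should be nearly equal
```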
arXiv Detail & Related papers (2024-02-04T11:12:17Z)
- ACT-Diffusion: Efficient Adversarial Consistency Training for One-step Diffusion Models [59.90959789767886]
We show that optimizing consistency training loss minimizes the Wasserstein distance between target and generated distributions.
By incorporating a discriminator into the consistency training framework, our method achieves improved FID scores on the CIFAR10, ImageNet 64$\times$64 and LSUN Cat 256$\times$256 datasets.
arXiv Detail & Related papers (2023-11-23T16:49:06Z)
- Boosting Distributed Full-graph GNN Training with Asynchronous One-bit Communication [23.883543151975136]
Training Graph Neural Networks (GNNs) on large graphs is challenging due to the conflict between the high memory demand and limited GPU memory.
We propose an efficient distributed GNN training framework, Sylvie, which employs a one-bit quantization technique in GNNs.
In detail, Sylvie provides a lightweight Low-bit Module to quantize the sent data and dequantize the received data back to full-precision values in each layer.
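A minimal sketch of one-bit compression of communicated tensors is given below, assuming a sign-plus-mean-magnitude code; the scaling rule and names are illustrative and need not match Sylvie's Low-bit Module.

```python
# One-bit quantization with a per-tensor scale: transmit only sign bits plus
# one float, and reconstruct an approximation on the receiving side.
import numpy as np

def quantize_1bit(x):
    scale = np.mean(np.abs(x))          # single float sent alongside the signs
    signs = np.signbit(x)               # 1 bit per element
    return signs, scale

def dequantize_1bit(signs, scale):
    return np.where(signs, -scale, scale).astype(np.float32)

x = np.random.randn(1024).astype(np.float32)
signs, scale = quantize_1bit(x)
x_hat = dequantize_1bit(signs, scale)
print(np.mean((x - x_hat) ** 2))        # reconstruction error of the 1-bit code
```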
arXiv Detail & Related papers (2023-03-02T14:02:39Z)
- Quantized Neural Networks for Low-Precision Accumulation with Guaranteed Overflow Avoidance [68.8204255655161]
We introduce a quantization-aware training algorithm that guarantees avoiding numerical overflow when reducing the precision of accumulators during inference.
We evaluate our algorithm across multiple quantized models that we train for different tasks, showing that our approach can reduce the precision of accumulators while maintaining model accuracy with respect to a floating-point baseline.
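For intuition, the standard conservative bound such a guarantee must respect is sketched below: an accumulator needs roughly input_bits + weight_bits + log2(dot_length) bits to be overflow-free. The paper's actual criterion may be tighter; this is only the worst-case arithmetic.

```python
# Worst-case accumulator width for summing dot_length products of signed
# input_bits x weight_bits integers without overflow.
import math

def accumulator_bits(input_bits, weight_bits, dot_length):
    max_prod = (2 ** (input_bits - 1)) * (2 ** (weight_bits - 1))  # |product| max
    max_sum = dot_length * max_prod                                # |sum| max
    return math.ceil(math.log2(max_sum)) + 1                       # +1 sign bit

print(accumulator_bits(input_bits=8, weight_bits=8, dot_length=4096))  # -> 27
```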
arXiv Detail & Related papers (2023-01-31T02:46:57Z)
- Accelerated Large Batch Optimization of BERT Pretraining in 54 minutes [9.213729275749452]
We propose an accelerated gradient method called LANS to improve the efficiency of using large mini-batches for training.
It takes 54 minutes on 192 AWS EC2 P3dn.24xlarge instances to achieve a target F1 score of 90.5 or higher on SQuAD v1.1, achieving the fastest BERT training time in the cloud.
arXiv Detail & Related papers (2020-06-24T05:00:41Z)
- The Limit of the Batch Size [79.8857712299211]
Large-batch training is an efficient approach for current distributed deep learning systems.
In this paper, we focus on studying the limit of the batch size.
We provide detailed numerical optimization instructions for step-by-step comparison.
arXiv Detail & Related papers (2020-06-15T16:18:05Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
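As a generic member of the extrapolation family such a unified framework covers, the sketch below applies an extra-gradient step: evaluate the gradient at a lookahead point and apply it from the original iterate. The step sizes and the paper's specific variants are assumptions, not reproduced from the paper.

```python
# Generic extra-gradient (extrapolation) step on a toy quadratic objective.
import numpy as np

def extragradient_step(w, grad_fn, lr=0.1, extrapolation_lr=0.1):
    w_lookahead = w - extrapolation_lr * grad_fn(w)   # extrapolated point
    return w - lr * grad_fn(w_lookahead)              # update from original w

grad_fn = lambda w: w        # grad of f(w) = 0.5 * ||w||^2
w = np.ones(3)
for _ in range(5):
    w = extragradient_step(w, grad_fn)
print(w)
```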
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
- Large Batch Training Does Not Need Warmup [111.07680619360528]
Training deep neural networks using a large batch size has shown promising results and benefits many real-world applications.
In this paper, we propose a novel Complete Layer-wise Adaptive Rate Scaling (CLARS) algorithm for large-batch training.
Based on our analysis, we bridge the gap and illustrate the theoretical insights for three popular large-batch training techniques.
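For reference, a layer-wise trust-ratio update in the LARS/CLARS spirit is sketched below: each layer's step is rescaled by ||w|| / ||g|| so that no single layer's update is disproportionately large. The warmup-free schedule and other CLARS specifics are not reproduced here.

```python
# Layer-wise adaptive rate scaling sketch: per-layer trust ratio ||w|| / ||g||.
import numpy as np

def layerwise_scaled_step(weights, grads, base_lr=1.0, eps=1e-8):
    new_weights = []
    for w, g in zip(weights, grads):
        trust = np.linalg.norm(w) / (np.linalg.norm(g) + eps)
        new_weights.append(w - base_lr * trust * g)
    return new_weights

weights = [np.random.randn(16, 16), np.random.randn(16)]
grads = [0.01 * np.random.randn(16, 16), 0.01 * np.random.randn(16)]
weights = layerwise_scaled_step(weights, grads, base_lr=0.01)
print([w.shape for w in weights])
```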
arXiv Detail & Related papers (2020-02-04T23:03:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.