Better Schedules for Low Precision Training of Deep Neural Networks
- URL: http://arxiv.org/abs/2403.02243v1
- Date: Mon, 4 Mar 2024 17:33:39 GMT
- Title: Better Schedules for Low Precision Training of Deep Neural Networks
- Authors: Cameron R. Wolfe and Anastasios Kyrillidis
- Abstract summary: Cyclic precision training (CPT) dynamically adjusts precision throughout training according to a cyclic schedule.
CPT achieves particularly impressive improvements in training efficiency, while actually improving DNN performance.
- Score: 13.88763215392452
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Low precision training can significantly reduce the computational overhead of
training deep neural networks (DNNs). Though many such techniques exist, cyclic
precision training (CPT), which dynamically adjusts precision throughout
training according to a cyclic schedule, achieves particularly impressive
improvements in training efficiency, while actually improving DNN performance.
Existing CPT implementations take common learning rate schedules (e.g.,
cyclical cosine schedules) and use them for low precision training without
adequate comparisons to alternative scheduling options. We define a diverse
suite of CPT schedules and analyze their performance across a variety of DNN
training regimes, some of which are unexplored in the low precision training
literature (e.g., node classification with graph neural networks). From these
experiments, we discover alternative CPT schedules that offer further
improvements in training efficiency and model performance, as well as derive a
set of best practices for choosing CPT schedules. Going further, we find that a
correlation exists between model performance and training cost, and that
changing the underlying CPT schedule can control the tradeoff between these two
variables. To explain the direct correlation between model performance and
training cost, we draw a connection between quantized training and critical
learning periods, suggesting that aggressive quantization is a form of learning
impairment that can permanently damage model performance.
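As an illustration of the kind of schedule the abstract describes, the sketch below maps a training step to a quantization bit-width under a cyclic cosine or triangular schedule. This is a minimal sketch, not the paper's implementation: the function name, the 3-to-8-bit range, and the specific schedule shapes are illustrative assumptions.

```python
import math

def cyclic_precision(step, cycle_len, min_bits=3, max_bits=8, shape="cosine"):
    """Map a training step to an integer bit-width under a cyclic schedule.

    Illustrative sketch only: the defaults and shapes are assumptions chosen
    to mirror the cyclic schedules described in the abstract.
    """
    t = (step % cycle_len) / cycle_len  # position within the current cycle, in [0, 1)
    if shape == "cosine":
        # Cyclical cosine: precision rises from min_bits to max_bits over each cycle.
        frac = 0.5 * (1.0 - math.cos(math.pi * t))
    elif shape == "triangular":
        # Triangular: linear ramp up, then back down, within each cycle.
        frac = 1.0 - abs(2.0 * t - 1.0)
    else:
        raise ValueError(f"unknown schedule shape: {shape}")
    return min_bits + round(frac * (max_bits - min_bits))

# Example: bit-widths at a few steps of a 100-step cycle.
for s in (0, 25, 50, 75, 99):
    print(s, cyclic_precision(s, cycle_len=100))
```

In an actual low precision training loop, the returned bit-width would determine how activations, weights, and gradients are quantized at that iteration.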
Related papers
- Always-Sparse Training by Growing Connections with Guided Stochastic
Exploration [46.4179239171213]
We propose an efficient always-sparse training algorithm with excellent scaling to larger and sparser models.
We evaluate our method on CIFAR-10/100 and ImageNet using VGG and ViT models, and compare it against a range of sparsification methods.
arXiv Detail & Related papers (2024-01-12T21:32:04Z) - Online Training Through Time for Spiking Neural Networks [66.7744060103562]
Spiking neural networks (SNNs) are promising brain-inspired energy-efficient models.
Recent progress in training methods has enabled successful deep SNNs on large-scale tasks with low latency.
We propose online training through time (OTTT) for SNNs, which is derived from BPTT to enable forward-in-time learning.
arXiv Detail & Related papers (2022-10-09T07:47:56Z) - Effective and Efficient Training for Sequential Recommendation using
Recency Sampling [91.02268704681124]
We propose a novel Recency-based Sampling of Sequences training objective.
We show that models enhanced with our method can achieve performance exceeding or very close to that of state-of-the-art BERT4Rec.
arXiv Detail & Related papers (2022-07-06T13:06:31Z) - LDP: Learnable Dynamic Precision for Efficient Deep Neural Network
Training and Inference [24.431074439663437]
Learnable Dynamic Precision (LDP) is a framework that automatically learns a temporally and spatially dynamic precision schedule during training.
LDP consistently outperforms state-of-the-art (SOTA) low precision DNN training techniques in terms of training efficiency and achieved accuracy trade-offs.
arXiv Detail & Related papers (2022-03-15T08:01:46Z) - AC/DC: Alternating Compressed/DeCompressed Training of Deep Neural
Networks [78.62086125399831]
We present a general approach called Alternating Compressed/DeCompressed (AC/DC) training of deep neural networks (DNNs).
AC/DC outperforms existing sparse training methods in accuracy at similar computational budgets.
An important property of AC/DC is that it allows co-training of dense and sparse models, yielding accurate sparse-dense model pairs at the end of the training process.
arXiv Detail & Related papers (2021-06-23T13:23:00Z) - Exploration and Exploitation: Two Ways to Improve Chinese Spelling
Correction Models [51.744357472072416]
We propose a method that continually identifies the weak spots of a model to generate more valuable training instances.
Experimental results show that such an adversarial training method combined with the pretraining strategy can improve both the generalization and robustness of multiple CSC models.
arXiv Detail & Related papers (2021-05-31T09:17:33Z) - CPT: Efficient Deep Neural Network Training via Cyclic Precision [19.677029887330036]
Low-precision deep neural network (DNN) training has gained tremendous attention as reducing precision is one of the most effective knobs for boosting DNNs' training time/energy efficiency.
We conjecture that DNNs' precision might have a similar effect as the learning rate during DNN training, and advocate dynamic precision along the training trajectory for further boosting the time/energy efficiency of DNN training.
arXiv Detail & Related papers (2021-01-25T02:56:18Z) - FracTrain: Fractionally Squeezing Bit Savings Both Temporally and
Spatially for Efficient DNN Training [81.85361544720885]
We propose FracTrain that integrates progressive fractional quantization which gradually increases the precision of activations, weights, and gradients.
FracTrain reduces computational cost and hardware-quantified energy/latency of DNN training while achieving comparable or better (-0.12% to +1.87%) accuracy.
arXiv Detail & Related papers (2020-12-24T05:24:10Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z) - Regularized Evolutionary Population-Based Training [11.624954122221562]
This paper presents an algorithm called Evolutionary Population-Based Training (EPBT) that interleaves the training of a DNN's weights with the meta-learning of loss functions.
EPBT results in faster, more accurate learning on image classification benchmarks.
arXiv Detail & Related papers (2020-02-11T06:28:13Z)