Unit Scaling: Out-of-the-Box Low-Precision Training
- URL: http://arxiv.org/abs/2303.11257v2
- Date: Tue, 30 May 2023 22:05:40 GMT
- Title: Unit Scaling: Out-of-the-Box Low-Precision Training
- Authors: Charlie Blake, Douglas Orr, Carlo Luschi
- Abstract summary: Unit scaling is a paradigm for designing deep learning models that simplifies the use of low-precision number formats.
Training in FP16 or the recently proposed FP8 formats offers substantial efficiency gains, but can lack sufficient range for out-of-the-box training.
Unit scaling addresses this by introducing a principled approach to model numerics: seeking unit variance of all weights, activations and gradients at initialisation.
- Score: 1.7188280334580197
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We present unit scaling, a paradigm for designing deep learning models that
simplifies the use of low-precision number formats. Training in FP16 or the
recently proposed FP8 formats offers substantial efficiency gains, but can lack
sufficient range for out-of-the-box training. Unit scaling addresses this by
introducing a principled approach to model numerics: seeking unit variance of
all weights, activations and gradients at initialisation. Unlike alternative
methods, this approach neither requires multiple training runs to find a
suitable scale nor has significant computational overhead. We demonstrate the
efficacy of unit scaling across a range of models and optimisers. We further
show that existing models can be adapted to be unit-scaled, training BERT-Large
in FP16 and then FP8 with no degradation in accuracy.
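The core idea in the abstract, unit variance for weights, activations and gradients at initialisation, can be illustrated for a single linear layer. The following is a minimal sketch, not the authors' implementation: it assumes unit-variance Gaussian inputs, weights and incoming gradients, and applies a fixed scale factor to the forward product and to each backward product (1/sqrt(fan_in), 1/sqrt(fan_out) and 1/sqrt(batch) respectively). The helper names (`matmul`, `randn`, `unit_scaled_linear_demo`) and the tensor sizes are hypothetical and chosen only for illustration.

```python
import random
import statistics


def matmul(a, b):
    """Plain-Python matrix product of two matrices stored as lists of rows."""
    b_cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def randn(rows, cols, rng):
    """Matrix of independent samples from a unit-variance Gaussian."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def scale(a, s):
    return [[s * v for v in row] for row in a]


def variance(a):
    return statistics.pvariance([v for row in a for v in row])


def unit_scaled_linear_demo(batch=32, fan_in=64, fan_out=48, seed=0):
    rng = random.Random(seed)
    x = randn(batch, fan_in, rng)         # activations entering the layer
    w = randn(fan_in, fan_out, rng)       # weights drawn with unit variance
    grad_y = randn(batch, fan_out, rng)   # gradient arriving from the next layer

    y_unscaled = matmul(x, w)                        # variance grows with fan_in
    y = scale(y_unscaled, fan_in ** -0.5)            # forward output
    grad_x = scale(matmul(grad_y, transpose(w)), fan_out ** -0.5)  # input gradient
    grad_w = scale(matmul(transpose(x), grad_y), batch ** -0.5)    # weight gradient

    return {
        "y_unscaled": variance(y_unscaled),
        "y": variance(y),
        "grad_x": variance(grad_x),
        "grad_w": variance(grad_w),
    }


if __name__ == "__main__":
    for name, value in unit_scaled_linear_demo().items():
        print(f"{name:>10s} variance: {value:.3f}")
```

Under these assumptions the three scaled tensors come out with variance close to 1, while the unscaled product has variance close to `fan_in`. Keeping values near unit scale is what leaves headroom on both sides in narrow-range formats such as FP16 and FP8.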
Related papers
- Compact Language Models via Pruning and Knowledge Distillation [61.56557874432008]
Minitron models exhibit up to a 16% improvement in MMLU scores compared to training from scratch.
Deriving 8B and 4B models from an already pretrained 15B model using our approach requires up to 40x fewer training tokens per model compared to training from scratch.
arXiv Detail & Related papers (2024-07-19T21:47:57Z)
- Test-Time Model Adaptation with Only Forward Passes [68.11784295706995]
Test-time adaptation has proven effective in adapting a given trained model to unseen test samples with potential distribution shifts.
We propose a test-time Forward-Optimization Adaptation (FOA) method.
FOA runs on quantized 8-bit ViT, outperforms gradient-based TENT on full-precision 32-bit ViT, and achieves an up to 24-fold memory reduction on ImageNet-C.
arXiv Detail & Related papers (2024-04-02T05:34:33Z)
- FP8-BERT: Post-Training Quantization for Transformer [20.51143486483669]
Transformer-based models, such as BERT, require massive memory storage and inference cost when deployed in production.
The new FP8 numeric format has been proposed and is supported on commercial AI computing platforms such as the H100.
We empirically validate the effectiveness of FP8 as a way to do Post-Training Quantization without significant loss of accuracy.
arXiv Detail & Related papers (2023-12-10T02:14:34Z)
- Initializing Models with Larger Ones [76.41561758293055]
We introduce weight selection, a method for initializing smaller models by selecting a subset of weights from a pretrained larger model.
Our experiments demonstrate that weight selection can significantly enhance the performance of small models and reduce their training time.
arXiv Detail & Related papers (2023-11-30T18:58:26Z)
- FP8-LM: Training FP8 Large Language Models [47.17804713425323]
In this paper, we propose a new FP8 automatic mixed-precision framework for training large language models.
Experiment results show that, during the training of GPT-175B model on H100 GPU platform, our FP8 mixed-precision training framework not only achieved a remarkable 39% reduction in real memory usage but also ran 75% faster than the widely adopted BF16 framework.
arXiv Detail & Related papers (2023-10-27T17:59:51Z)
- Training and inference of large language models using 8-bit floating point [3.689110902209004]
This paper presents a methodology to select the scalings for FP8 linear layers, based on dynamically updating per-tensor scales for the weights, gradients and activations.
We apply this methodology to train and validate large language models of the type of GPT and Llama 2 using FP8, for model sizes ranging from 111M to 70B.
arXiv Detail & Related papers (2023-09-29T13:24:33Z)
- Pruning Large Language Models via Accuracy Predictor [0.0]
Large language models (LLMs) containing tens of billions of parameters (or even more) have demonstrated impressive capabilities in various NLP tasks.
We propose a novel pruning approach: firstly, a training set of a certain number of architecture-accuracy pairs is established, and then a non-neural model is trained as an accuracy predictor.
arXiv Detail & Related papers (2023-09-18T06:38:24Z)
- All-You-Can-Fit 8-Bit Flexible Floating-Point Format for Accurate and Memory-Efficient Inference of Deep Neural Networks [2.294014185517203]
This paper introduces an extremely flexible 8-bit floating-point (FFP8) format.
It achieves an extremely low accuracy loss of $0.1\% \sim 0.3\%$ for several representative image classification models.
It is easy to turn a classical floating-point processing unit into an FFP8-compliant one, and the extra hardware cost is minor.
arXiv Detail & Related papers (2021-04-15T09:37:23Z)
- Training with Quantization Noise for Extreme Model Compression [57.51832088938618]
We tackle the problem of producing compact models, maximizing their accuracy for a given model size.
A standard solution is to train networks with Quantization Aware Training, where the weights are quantized during training and the gradients approximated with the Straight-Through Estimator.
In this paper, we extend this approach to work beyond int8 fixed-point quantization with extreme compression methods.
arXiv Detail & Related papers (2020-04-15T20:10:53Z)
- Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers [94.43313684188819]
We study the impact of model size in this setting, focusing on Transformer models for NLP tasks that are limited by compute.
We first show that even though smaller Transformer models execute faster per iteration, wider and deeper models converge in significantly fewer steps.
This leads to an apparent trade-off between the training efficiency of large Transformer models and the inference efficiency of small Transformer models.
arXiv Detail & Related papers (2020-02-26T21:17:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.