Revisiting BFloat16 Training
- URL: http://arxiv.org/abs/2010.06192v2
- Date: Sun, 7 Mar 2021 06:24:07 GMT
- Title: Revisiting BFloat16 Training
- Authors: Pedram Zamirai, Jian Zhang, Christopher R. Aberger, Christopher De Sa
- Abstract summary: State-of-the-art generic low-precision training algorithms use a mix of 16-bit and 32-bit precision.
Deep learning accelerators are forced to support both 16-bit and 32-bit floating-point units.
- Score: 30.99618783594963
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: State-of-the-art generic low-precision training algorithms use a mix of
16-bit and 32-bit precision, creating the folklore that 16-bit hardware compute
units alone are not enough to maximize model accuracy. As a result, deep
learning accelerators are forced to support both 16-bit and 32-bit
floating-point units (FPUs), which is more costly than only using 16-bit FPUs
for hardware design. We ask: can we train deep learning models only with 16-bit
floating-point units, while still matching the model accuracy attained by
32-bit training? Towards this end, we study 16-bit-FPU training on the widely
adopted BFloat16 unit. While these units conventionally use nearest rounding to
cast output to 16-bit precision, we show that nearest rounding for model weight
updates often cancels small updates, which degrades the convergence and model
accuracy. Motivated by this, we study two simple techniques well-established in
numerical analysis, stochastic rounding and Kahan summation, to remedy the
model accuracy degradation in 16-bit-FPU training. We demonstrate that these
two techniques can enable up to 7% absolute validation accuracy gain in
16-bit-FPU training. This leads to 0.1% lower to 0.2% higher validation
accuracy compared to 32-bit training across seven deep learning applications.
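For intuition, the following is a minimal PyTorch sketch of the two remedies the abstract refers to, stochastic rounding and Kahan-compensated weight updates. The bit-manipulation trick and the function names are illustrative assumptions, not the authors' implementation.

```python
import torch

def stochastic_round_bf16(x_fp32: torch.Tensor) -> torch.Tensor:
    """Cast float32 to bfloat16 with stochastic rounding instead of round-to-nearest.

    BFloat16 keeps the top 16 bits of the float32 bit pattern, so adding a
    uniform random integer in [0, 2^16) to the low bits before truncation
    rounds up with probability proportional to the discarded fraction.
    """
    bits = x_fp32.contiguous().view(torch.int32)
    noise = torch.randint(0, 1 << 16, bits.shape,
                          dtype=torch.int32, device=x_fp32.device)
    truncated = (bits + noise) & -(1 << 16)   # zero out the low 16 bits
    return truncated.view(torch.float32).to(torch.bfloat16)

def kahan_update(weight: torch.Tensor, comp: torch.Tensor, update: torch.Tensor):
    """One Kahan-compensated weight update with all buffers in bfloat16.

    `comp` is an auxiliary bfloat16 buffer that carries the rounding error of
    previous additions, so small updates are not silently cancelled.
    """
    y = update - comp            # re-apply the error lost in earlier steps
    t = weight + y               # 16-bit add; low-order bits of y are lost here
    comp = (t - weight) - y      # recover (the negation of) what was lost
    return t, comp
```

In a 16-bit-FPU training loop one would either keep a bfloat16 compensation buffer alongside each weight tensor and apply kahan_update every step, or apply stochastic_round_bf16 when writing each updated weight back to 16 bits.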
Related papers
- FP8-LM: Training FP8 Large Language Models [47.17804713425323]
In this paper, we propose a new FP8 automatic mixed-precision framework for training large language models.
Experiment results show that, during the training of GPT-175B model on H100 GPU platform, our FP8 mixed-precision training framework not only achieved a remarkable 39% reduction in real memory usage but also ran 75% faster than the widely adopted BF16 framework.
arXiv Detail & Related papers (2023-10-27T17:59:51Z)
- Standalone 16-bit Training: Missing Study for Hardware-Limited Deep Learning Practitioners [2.075190620803526]
Mixed precision techniques leverage different numerical precisions during model training and inference to optimize resource usage.
For many with limited resources, the available options are restricted to using 32-bit, 16-bit, or a combination of the two.
This study fills a critical gap, proving for the first time that standalone 16-bit precision neural networks match 32-bit and mixed-precision in accuracy.
arXiv Detail & Related papers (2023-05-18T13:09:45Z)
- Stable and low-precision training for large-scale vision-language models [108.62077651227607]
We introduce new methods for accelerating and stabilizing training for large language-vision models.
For acceleration, we introduce SwitchBack, a linear layer for int8 quantized training which provides a speed-up of 13-25%.
For stability, we analyze loss spikes and find they consistently occur 1-8 iterations after the squared gradients become under-estimated.
arXiv Detail & Related papers (2023-04-25T17:38:18Z)
- DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup table based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
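As a toy illustration of the lookup-table idea (a NumPy sketch, not DeepGEMM's SIMD kernels; the 2-bit codebook and scales below are assumptions), every multiply in an ultra-low-precision dot product can be replaced by an indexed table read:

```python
import numpy as np

# Assumed 2-bit weight codebook and 8-bit unsigned activation quantization.
weight_values = np.array([-1.0, -0.33, 0.33, 1.0], dtype=np.float32)  # codes 0..3
act_scale = 0.05                                                      # dequantization scale

# Precompute all (weight code, activation code) products once: a 4 x 256 table.
lut = weight_values[:, None] * (np.arange(256, dtype=np.float32)[None, :] * act_scale)

def lut_dot(w_codes: np.ndarray, a_codes: np.ndarray) -> float:
    """Dot product of 2-bit weight codes and 8-bit activation codes via lookups only."""
    return float(lut[w_codes, a_codes].sum())

w = np.random.randint(0, 4, size=64)      # 2-bit weight codes
a = np.random.randint(0, 256, size=64)    # 8-bit activation codes
assert np.isclose(lut_dot(w, a), np.dot(weight_values[w], a * act_scale), rtol=1e-4)
```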
arXiv Detail & Related papers (2023-04-18T15:13:10Z)
- The case for 4-bit precision: k-bit Inference Scaling Laws [75.4335600212427]
Quantization methods reduce the number of bits required to represent each parameter in a model.
The final model size depends on both the number of parameters of the original model and the rate of compression.
We run more than 35,000 zero-shot experiments with 16-bit inputs and k-bit parameters to examine which quantization methods improve scaling for 3 to 8-bit precision.
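A quick back-of-the-envelope version of the size relation described above (the parameter count is only an example):

```python
def model_size_gb(num_params: float, bits_per_param: int) -> float:
    """Storage for the parameters alone: parameter count times bits, converted to gigabytes."""
    return num_params * bits_per_param / 8 / 1e9

# e.g. a 7e9-parameter model stored with 16-bit vs 4-bit parameters
print(model_size_gb(7e9, 16))  # 14.0 GB
print(model_size_gb(7e9, 4))   # 3.5 GB
```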
arXiv Detail & Related papers (2022-12-19T18:48:33Z)
- FP8 Formats for Deep Learning [49.54015320992368]
We propose an 8-bit floating point (FP8) binary interchange format consisting of two encodings.
E4M3's dynamic range is extended by not representing infinities and having only one mantissa bit-pattern for NaNs.
We demonstrate the efficacy of the FP8 format on a variety of image and language tasks, effectively matching the result quality achieved by 16-bit training sessions.
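Concretely, with 4 exponent bits (bias 7) and 3 mantissa bits, reserving only the all-ones pattern for NaN leaves 1.75 * 2^8 = 448 as the largest finite E4M3 value; a quick check assuming those standard E4M3 parameters:

```python
exp_bits, man_bits, bias = 4, 3, 7

# Largest finite magnitude when only S.1111.111 is reserved for NaN:
# the top exponent field 0b1111 contributes 2**(15 - bias), and the
# largest usable mantissa is 0b110, i.e. 1 + 6/8.
max_finite = (1 + (2**man_bits - 2) / 2**man_bits) * 2 ** (2**exp_bits - 1 - bias)
print(max_finite)  # 448.0
```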
arXiv Detail & Related papers (2022-09-12T17:39:55Z)
- BMPQ: Bit-Gradient Sensitivity Driven Mixed-Precision Quantization of DNNs from Scratch [11.32458063021286]
This paper presents BMPQ, a training method that uses bit gradients to analyze layer sensitivities and yield mixed-precision quantized models.
It requires a single training iteration but does not need a pre-trained baseline.
Compared to the baseline FP-32 models, BMPQ can yield models that have 15.4x fewer parameter bits with a negligible drop in accuracy.
arXiv Detail & Related papers (2021-12-24T03:16:58Z)
- 8-bit Optimizers via Block-wise Quantization [57.25800395197516]
Stateful optimizers maintain statistics over time, e.g., the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past gradient values.
This state can be used to accelerate optimization compared to plain gradient descent but uses memory that might otherwise be allocated to model parameters.
In this paper, we develop the first optimizers that use 8-bit statistics while maintaining the performance levels of using 32-bit optimizer states.
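A minimal sketch of the block-wise idea behind such 8-bit state (one absmax scale per block of values, so a single outlier only affects its own block); the block size, helper names, and the uniform int8 mapping here are assumptions rather than the paper's exact scheme:

```python
import torch

BLOCK = 2048  # assumed block size

def blockwise_quantize(state: torch.Tensor, block: int = BLOCK):
    """Quantize a float32 optimizer-state tensor to int8 with per-block absmax scales."""
    flat = state.flatten()
    pad = (-flat.numel()) % block
    flat = torch.nn.functional.pad(flat, (0, pad))           # pad to a whole number of blocks
    blocks = flat.view(-1, block)
    scales = blocks.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    q = torch.round(blocks / scales * 127).to(torch.int8)
    return q, scales, state.shape, pad

def blockwise_dequantize(q, scales, shape, pad):
    """Recover an approximate float32 state tensor from its int8 blocks."""
    flat = (q.to(torch.float32) / 127) * scales
    flat = flat.flatten()
    if pad:
        flat = flat[:-pad]
    return flat.view(shape)
```

The paper's optimizers additionally use a non-uniform (dynamic) quantization map; the sketch above only shows the block-wise scaling part.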
arXiv Detail & Related papers (2021-10-06T15:43:20Z)
- PositNN: Training Deep Neural Networks with Mixed Low-Precision Posit [5.534626267734822]
The presented research aims to evaluate the feasibility of training deep convolutional neural networks using posits.
A software framework was developed to use simulated posits and quires in end-to-end training and inference.
Results suggest that 8-bit posits can substitute 32-bit floats during training with no negative impact on the resulting loss and accuracy.
arXiv Detail & Related papers (2021-04-30T19:30:37Z)
- FBGEMM: Enabling High-Performance Low-Precision Deep Learning Inference [1.1292678337479967]
fbgemm is a high-performance kernel library for quantized inference on current-generation CPUs.
fbgemm achieves efficiency by fusing common quantization operations with a high-performance gemm implementation and by shape- and size-specific kernel code generation at runtime.
The library has been deployed at Facebook, where it delivers greater than 2x performance gains with respect to our current production baseline.
arXiv Detail & Related papers (2021-01-13T00:34:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.