Accurate INT8 Training Through Dynamic Block-Level Fallback
- URL: http://arxiv.org/abs/2503.08040v2
- Date: Wed, 12 Mar 2025 03:20:28 GMT
- Title: Accurate INT8 Training Through Dynamic Block-Level Fallback
- Authors: Pengle Zhang, Jia Wei, Jintao Zhang, Jun Zhu, Jianfei Chen
- Abstract summary: Transformer models have achieved remarkable success across various AI applications but face significant training costs. We propose Fallback Quantization, implementing mixed-precision GEMM that dynamically falls back from 8-bit to 16-bit for activation blocks containing outliers. Experiments show that our approach is robustly competent in both fine-tuning and pretraining settings.
- Score: 21.808835887740543
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer models have achieved remarkable success across various AI applications but face significant training costs. Low-bit training, such as INT8 training, can leverage computational units with higher throughput and has already demonstrated its effectiveness on GPT2 models with block-level quantization. However, it struggles with modern Transformer variants incorporating GLU units, because those variants exhibit complex distributions of activation outliers. To address this challenge, we propose Fallback Quantization, implementing mixed-precision GEMM that dynamically falls back from 8-bit to 16-bit for activation blocks containing outliers. Experiments show that our approach is robustly competent in both fine-tuning and pretraining settings. Moreover, our method achieves a 1.57x end-to-end training speedup on RTX4090 GPUs.
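To make the fallback idea concrete, below is a minimal PyTorch sketch of block-level INT8 quantization with per-block 16-bit fallback: the activation tensor is split into square tiles, tiles whose absolute maximum exceeds a threshold are kept in 16-bit, the rest get a symmetric per-tile INT8 scale, and the GEMM is emulated tile by tile. The block size, the outlier threshold, and the reference (non-kernel) matmul loop are illustrative assumptions, not the paper's exact algorithm or CUDA kernels.

```python
import torch

def blockwise_int8_with_fallback(x, block=128, outlier_thresh=6.0):
    # Split a 2-D activation tensor into (block x block) tiles, quantize each
    # tile to INT8 with a symmetric per-tile scale, and mark tiles whose
    # absolute maximum exceeds `outlier_thresh` for 16-bit fallback.
    # NOTE: block size and threshold rule are illustrative assumptions.
    M, K = x.shape
    assert M % block == 0 and K % block == 0
    tiles = x.reshape(M // block, block, K // block, block).permute(0, 2, 1, 3)
    amax = tiles.abs().amax(dim=(-1, -2))            # per-tile abs-max
    fallback = amax > outlier_thresh                 # tiles containing outliers
    scale = (amax / 127.0).clamp(min=1e-8)
    q = torch.clamp(torch.round(tiles / scale[..., None, None]), -127, 127).to(torch.int8)
    return q, scale, fallback, tiles.to(torch.bfloat16)

def mixed_precision_gemm(q, scale, fallback, tiles_bf16, w):
    # Reference emulation of the mixed-precision GEMM: INT8 tiles are
    # dequantized and multiplied in float here for clarity; a real kernel
    # would run them on INT8 tensor cores and only fallback tiles in 16-bit.
    n_mb, n_kb, b, _ = q.shape
    out = torch.zeros(n_mb * b, w.shape[1])
    for i in range(n_mb):
        for j in range(n_kb):
            w_tile = w[j * b:(j + 1) * b].float()
            if fallback[i, j]:
                a = tiles_bf16[i, j].float()         # 16-bit fallback tile
            else:
                a = q[i, j].float() * scale[i, j]    # dequantized INT8 tile
            out[i * b:(i + 1) * b] += a @ w_tile
    return out

# Toy usage: inject one outlier so exactly one tile falls back to 16-bit.
x = torch.randn(256, 512)
x[3, 7] = 40.0
w = torch.randn(512, 64)
q, s, fb, t16 = blockwise_int8_with_fallback(x)
y = mixed_precision_gemm(q, s, fb, t16, w)
print(fb.sum().item(), "fallback tile(s); max error:",
      (y - x @ w).abs().max().item())
```

The point of deciding per block is that 16-bit work stays limited to the small fraction of blocks that actually contain outliers, so most of the GEMM can still run through the higher-throughput INT8 path.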
Related papers
- Continual Quantization-Aware Pre-Training: When to transition from 16-bit to 1.58-bit pre-training for BitNet language models? [5.67099529296254]
Large language models (LLMs) require immense resources for training and inference.
Recent research suggests that training LLMs with 1.58 bits per weight parameter from scratch can maintain model accuracy.
arXiv Detail & Related papers (2025-02-17T15:21:11Z) - µnit Scaling: Simple and Scalable FP8 LLM Training [6.447975505471247]
Large Language Model training with 8-bit floating point (FP8) formats promises significant efficiency improvements, but reduced numerical precision makes training challenging.
We demonstrate simple, scalable FP8 training that requires no dynamic scaling factors, even at large model sizes.
We validate our method by training models from 1B to 13B parameters, performing all hidden linear layer computations in FP8.
arXiv Detail & Related papers (2025-02-09T17:31:09Z) - Optimizing Large Language Model Training Using FP4 Quantization [73.55459961002371]
Quantized training presents a promising solution by enabling low-bit arithmetic operations to reduce costs.
This work introduces the first FP4 training framework for large language models (LLMs).
arXiv Detail & Related papers (2025-01-28T18:04:50Z) - "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization [67.3213104337679]
Quantization is a powerful tool for accelerating large language model (LLM) inference, but the accuracy-performance trade-offs across different formats remain unclear.
We conduct the most comprehensive empirical study to date, evaluating FP8, INT8, and INT4 quantization across academic benchmarks and real-world tasks.
arXiv Detail & Related papers (2024-11-04T18:21:59Z) - COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training [47.07768822212081]
COAT (Compressing Optimizer states and Activations for FP8 Training) is a novel FP8 training framework designed to significantly reduce memory footprint when training large models.
COAT effectively reduces end-to-end training memory footprint by 1.54x compared to BF16.
COAT also achieves a 1.43x end-to-end training speedup compared to BF16.
arXiv Detail & Related papers (2024-10-25T05:59:30Z) - Scaling FP8 training to trillion-token LLMs [26.195547788434908]
We train large language models using FP8 precision on datasets up to 2 trillion tokens.
We uncover critical instabilities in FP8 training that were not observable in earlier works with shorter durations.
We introduce Smooth-SwiGLU, a novel modification that ensures stable FP8 training without altering the function.
arXiv Detail & Related papers (2024-09-19T07:15:58Z) - FP8-BERT: Post-Training Quantization for Transformer [20.51143486483669]
Transformer-based models, such as BERT, incur massive memory storage and inference costs when deployed in production.
The new FP8 numeric format has been proposed and is supported on commercial AI computing platforms such as the H100.
We empirically validate the effectiveness of FP8 as a way to do Post-Training Quantization without significant loss of accuracy.
arXiv Detail & Related papers (2023-12-10T02:14:34Z) - Training Transformers with 4-bit Integers [21.861232105539933]
Quantizing activations, weights, and gradients to 4 bits is promising for accelerating neural network training.
Existing 4-bit training methods require custom numerical formats which are not supported by contemporary hardware.
In this work, we propose a training method for transformers with all matrix multiplications implemented with INT4 arithmetic.
arXiv Detail & Related papers (2023-06-21T02:45:01Z) - Stable and low-precision training for large-scale vision-language models [108.62077651227607]
We introduce new methods for accelerating and stabilizing training for large language-vision models.
For acceleration, we introduce SwitchBack, a linear layer for int8 quantized training which provides a speed-up of 13-25%.
For stability, we analyze loss spikes and find that they consistently occur 1-8 iterations after the squared gradients become under-estimated.
arXiv Detail & Related papers (2023-04-25T17:38:18Z) - FP8 Formats for Deep Learning [49.54015320992368]
We propose an 8-bit floating point (FP8) binary interchange format consisting of two encodings.
E4M3's dynamic range is extended by not representing infinities and having only one mantissa bit-pattern for NaNs (a quick arithmetic check of the resulting range is sketched after this list).
We demonstrate the efficacy of the FP8 format on a variety of image and language tasks, effectively matching the result quality achieved by 16-bit training sessions.
arXiv Detail & Related papers (2022-09-12T17:39:55Z) - FracTrain: Fractionally Squeezing Bit Savings Both Temporally and Spatially for Efficient DNN Training [62.932299614630985]
We propose FracTrain, which integrates progressive fractional quantization that gradually increases the precision of activations, weights, and gradients.
FracTrain reduces the computational cost and hardware-quantified energy/latency of DNN training while achieving comparable or better (-0.12% to +1.87%) accuracy.
arXiv Detail & Related papers (2020-12-24T05:24:10Z) - Towards Unified INT8 Training for Convolutional Neural Network [83.15673050981624]
We build a unified 8-bit (INT8) training framework for common convolutional neural networks.
First, we empirically find four distinctive characteristics of gradients, which provide insightful clues for gradient quantization.
We propose two universal techniques, including Direction Sensitive Gradient Clipping, which reduces the direction deviation of gradients.
arXiv Detail & Related papers (2019-12-29T08:37:53Z)
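As a side note on the "FP8 Formats for Deep Learning" entry above, the following back-of-the-envelope check shows how dropping infinities and keeping a single NaN mantissa pattern stretches E4M3's largest finite value. This is only a sketch: the exponent bias of 7 and the resulting 240/448 figures are commonly stated properties of E4M3, not numbers taken from the text above.

```python
# E4M3: 1 sign bit, 4 exponent bits (bias assumed to be 7), 3 mantissa bits.
bias, mant_bits, top_exp = 7, 3, 0b1111

# IEEE-style reservation of the whole top exponent for Inf/NaN would cap the
# format at (2 - 2^-3) * 2^(14 - 7) = 240.
ieee_style_max = (2 - 2 ** -mant_bits) * 2 ** (top_exp - 1 - bias)

# E4M3 instead reserves only the single pattern exponent=1111, mantissa=111
# for NaN (and drops infinities), so the top exponent still carries finite
# values and the largest one is (2 - 2 * 2^-3) * 2^(15 - 7) = 448.
e4m3_max = (2 - 2 * 2 ** -mant_bits) * 2 ** (top_exp - bias)

print(ieee_style_max, e4m3_max)   # 240.0 448.0
```

Recovering the top exponent for finite values is what "extends the dynamic range" in that entry: 448 versus 240, roughly 1.9x more representable magnitude.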