Training LLMs with MXFP4
- URL: http://arxiv.org/abs/2502.20586v1
- Date: Thu, 27 Feb 2025 23:01:31 GMT
- Title: Training LLMs with MXFP4
- Authors: Albert Tseng, Tao Yu, Youngsuk Park
- Abstract summary: We present the first near-lossless training recipe that uses MXFP4 GEMMs, which are $2\times$ faster than FP8 on supported hardware. Our recipe computes $>1/2$ the training FLOPs in MXFP4, enabling an estimated speedup of $>1.3\times$ over FP8 and $>1.7\times$ over BF16 during backpropagation.
- Score: 15.084813381461903
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Low precision (LP) datatypes such as MXFP4 can accelerate matrix multiplications (GEMMs) and reduce training costs. However, directly using MXFP4 instead of BF16 during training significantly degrades model quality. In this work, we present the first near-lossless training recipe that uses MXFP4 GEMMs, which are $2\times$ faster than FP8 on supported hardware. Our key insight is to compute unbiased gradient estimates with stochastic rounding (SR), resulting in more accurate model updates. However, directly applying SR to MXFP4 can result in high variance from block-level outliers, harming convergence. To overcome this, we use the random Hadamard transform to theoretically bound the variance of SR. We train GPT models up to 6.7B parameters and find that our method induces minimal degradation over mixed-precision BF16 training. Our recipe computes $>1/2$ the training FLOPs in MXFP4, enabling an estimated speedup of $>1.3\times$ over FP8 and $>1.7\times$ over BF16 during backpropagation.
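For intuition, the sketch below illustrates the two ingredients the abstract describes on a single tensor: block-scaled FP4 quantization with stochastic rounding (unbiased in expectation), and a random Hadamard rotation applied before quantization to spread block-level outliers. The E2M1 value grid and 32-element blocks follow the OCP Microscaling (MX) convention; the function names, the NumPy implementation, and where the rotation is applied are illustrative assumptions, not the paper's actual kernels.

```python
import numpy as np

# FP4 (E2M1) magnitudes and the 32-element block size from the OCP Microscaling (MX) spec.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
BLOCK = 32


def sr_to_grid(y, rng):
    """Stochastically round non-negative values onto FP4_GRID (unbiased in expectation)."""
    y = np.clip(y, 0.0, FP4_GRID[-1])
    hi = np.clip(np.searchsorted(FP4_GRID, y, side="left"), 1, len(FP4_GRID) - 1)
    lo = hi - 1
    g_lo, g_hi = FP4_GRID[lo], FP4_GRID[hi]
    p_up = (y - g_lo) / (g_hi - g_lo)          # P(round up) = fractional position in the gap
    return np.where(rng.random(y.shape) < p_up, g_hi, g_lo)


def mxfp4_sr_quantize(x, rng):
    """Quantize a 1-D array (length a multiple of 32) to MXFP4 with stochastic rounding.

    Returns dequantized values; a real kernel would keep the 4-bit codes plus the
    per-block power-of-two scale and feed them to an MXFP4 GEMM.
    """
    blocks = x.reshape(-1, BLOCK)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    # Shared power-of-two scale per block, chosen so the block max fits under the FP4 max (6.0).
    scale = 2.0 ** np.ceil(np.log2(np.maximum(amax, 1e-30) / FP4_GRID[-1]))
    q = np.sign(blocks) * sr_to_grid(np.abs(blocks) / scale, rng)
    return (q * scale).reshape(x.shape)


def random_hadamard(n, rng):
    """Orthonormal Hadamard matrix with random column sign flips (n must be a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H * rng.choice([-1.0, 1.0], size=n) / np.sqrt(n)


# Usage sketch: rotating a gradient vector before quantization spreads outliers across each
# block, which is the mechanism the paper relies on to bound the variance of stochastic rounding.
rng = np.random.default_rng(0)
g = rng.standard_normal(128)
Q = random_hadamard(128, rng)
g_mx = mxfp4_sr_quantize(g @ Q, rng)           # quantized, rotated operand for the GEMM
g_est = g_mx @ Q.T                             # rotate back: E[g_est] == g element-wise
```

Because stochastic rounding is unbiased, averaging many such quantizations recovers the original tensor; the rotation only controls the variance of each individual estimate.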
Related papers
- ML-SpecQD: Multi-Level Speculative Decoding with Quantized Drafts [79.62448915248926]
Speculative decoding (SD) has emerged as a method to accelerate LLM inference without sacrificing accuracy relative to 16-bit model inference.
We propose using MXFP4 models as drafts in a plug-and-play fashion since the MXFP4 Weight-Only-Quantization (WOQ) merely direct-casts the BF16 target model weights to MXFP4.
In practice, our plug-and-play solution gives speedups up to 2x over the BF16 baseline.
arXiv Detail & Related papers (2025-03-17T08:38:45Z) - Oscillation-Reduced MXFP4 Training for Vision Transformers [19.642508885867375]
Pre-training Transformers in FP4 precision comes with a considerable loss of accuracy. Training with the MXFP4 data format still results in significant degradation. We propose a novel training method, TetraJet, for more accurate FP4 training.
arXiv Detail & Related papers (2025-02-28T08:51:55Z) - Stable-SPAM: How to Train in 4-Bit More Stably than 16-Bit Adam [94.00189300897694]
Low-bit precision amplifies sensitivity to learning rates and often causes unstable gradient norms. We propose Stable-SPAM, which incorporates enhanced gradient normalization and clipping techniques. Experiments show that Stable-SPAM effectively stabilizes gradient norms in 4-bit training, delivering superior performance compared to Adam and SPAM.
arXiv Detail & Related papers (2025-02-24T11:09:15Z) - Towards Efficient Pre-training: Exploring FP4 Precision in Large Language Models [25.700481606604647]
Experimental results demonstrate that our FP4 training scheme achieves accuracy comparable to BF16 and FP8, with smaller theoretical computational cost. With the advent of next-generation hardware supporting FP4, our method sets the foundation for efficient ultra-low precision training.
arXiv Detail & Related papers (2025-02-17T05:33:11Z) - Optimizing Large Language Model Training Using FP4 Quantization [73.55459961002371]
Quantized training presents a promising solution by enabling low-bit arithmetic operations to reduce costs. This work introduces the first FP4 training framework for large language models (LLMs).
arXiv Detail & Related papers (2025-01-28T18:04:50Z) - Stretching Each Dollar: Diffusion Training from Scratch on a Micro-Budget [53.311109531586844]
We demonstrate very low-cost training of large-scale T2I diffusion transformer models.
We train a 1.16 billion parameter sparse transformer at an economical cost of only $1,890 and achieve a 12.7 FID in zero-shot generation.
We aim to release our end-to-end training pipeline to further democratize the training of large-scale diffusion models on micro-budgets.
arXiv Detail & Related papers (2024-07-22T17:23:28Z) - To FP8 and Back Again: Quantifying the Effects of Reducing Precision on LLM Training Stability [7.115739465137031]
BrainFloat16 (BF16) precision has become the de facto standard for large language model pretraining.
However, prior experience with FP16, which was found to be less stable than BF16, raises concerns as to whether FP8 can be a cost-effective option for LLM training.
We propose new evaluation techniques and a new metric for quantifying loss landscape sharpness in autoregressive language models.
arXiv Detail & Related papers (2024-05-29T02:42:23Z) - FP8-LM: Training FP8 Large Language Models [47.17804713425323]
In this paper, we propose a new FP8 automatic mixed-precision framework for training large language models.
Experiment results show that, during the training of GPT-175B model on H100 GPU platform, our FP8 mixed-precision training framework not only achieved a remarkable 39% reduction in real memory usage but also ran 75% faster than the widely adopted BF16 framework.
arXiv Detail & Related papers (2023-10-27T17:59:51Z) - Instant Complexity Reduction in CNNs using Locality-Sensitive Hashing [50.79602839359522]
We propose HASTE (Hashing for Tractable Efficiency), a parameter-free and data-free module that acts as a plug-and-play replacement for any regular convolution module.
We are able to drastically compress latent feature maps without sacrificing much accuracy by using locality-sensitive hashing (LSH).
In particular, we are able to instantly drop 46.72% of FLOPs while only losing 1.25% accuracy by just swapping the convolution modules in a ResNet34 on CIFAR-10 for our HASTE module.
arXiv Detail & Related papers (2023-09-29T13:09:40Z) - Stable and low-precision training for large-scale vision-language models [108.62077651227607]
We introduce new methods for accelerating and stabilizing training for large language-vision models.
For acceleration, we introduce SwitchBack, a linear layer for int8 quantized training which provides a speed-up of 13-25%.
For stability, we analyze loss spikes and find they consistently occur 1-8 iterations after the squared gradients become under-estimated.
arXiv Detail & Related papers (2023-04-25T17:38:18Z)