Recipes for Pre-training LLMs with MXFP8
- URL: http://arxiv.org/abs/2506.08027v1
- Date: Fri, 30 May 2025 21:08:15 GMT
- Title: Recipes for Pre-training LLMs with MXFP8
- Authors: Asit Mishra, Dusan Stosic, Simon Layton
- Abstract summary: Precision scaling has emerged as a compelling technique for improving GPU efficiency without sacrificing accuracy. MX formats offer improved numeric stability compared to other reduced-precision representations. We show that an improved rounding mode, which uses round-to-infinity to compute scaling factors, enables successful pre-training in MXFP8 for an 8B model on 15T tokens.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Precision scaling - using fewer bits to represent model parameters and related tensors during pre-training - has emerged as a compelling technique for improving GPU efficiency without sacrificing accuracy. Microscaling (MX) formats in NVIDIA's latest Blackwell GPUs represent a major step toward enabling this kind of precision scaling. These formats combine narrow floating-point data types with per-block scaling factors, offering a fine-grained approach to quantizing tensors. Although MX formats promise improved numeric stability compared to other reduced-precision representations, in practice they must be used carefully in order to successfully converge an LLM on a multi-trillion-token dataset. In this paper, we show that the rounding mode suggested in the OCP specification can lead to divergence when pre-training an LLM. We show that an improved rounding mode, which uses round-to-infinity to compute scaling factors, enables successful pre-training in MXFP8 for an 8B model on 15T tokens.
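Since the abstract's central point is how the per-block scaling factor is rounded, the following minimal NumPy sketch illustrates MXFP8-style block quantization under stated assumptions: 32-element blocks, E4M3 elements with a maximum magnitude of 448, and power-of-two block scales. The floor-versus-ceil choice below is a simplified stand-in for the OCP-specified rounding versus the round-to-infinity recipe, not the paper's actual kernel.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite magnitude of an E4M3 element
BLOCK = 32            # MX block size per the OCP MX specification

def mx_scale(block_amax: float, round_to_infinity: bool) -> float:
    """Pick a power-of-two block scale from the block's absolute maximum.

    round_to_infinity=False rounds the scale exponent down (floor-style, a
    simplified stand-in for the OCP rule); True rounds it up, mirroring the
    round-to-infinity recipe described in the abstract.
    """
    if block_amax == 0.0:
        return 1.0
    e = np.log2(block_amax / FP8_E4M3_MAX)  # exponent needed so amax/scale fits in E4M3
    e = np.ceil(e) if round_to_infinity else np.floor(e)
    return float(2.0 ** e)

def quantize_block(x: np.ndarray, round_to_infinity: bool = True):
    """Quantize one 32-element block to (scale, E4M3-representable values)."""
    assert x.size == BLOCK
    scale = mx_scale(float(np.max(np.abs(x))), round_to_infinity)
    # Stand-in for a real E4M3 cast: clamp to the representable range
    # (a faithful cast would also round the mantissa to 3 bits).
    q = np.clip(x / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return scale, q

# A block whose amax slightly exceeds the representable range under a
# floor-style scale gets clipped; rounding the scale exponent up avoids it.
block = np.full(BLOCK, 0.01)
block[0] = 500.0
for mode in (False, True):
    s, q = quantize_block(block, round_to_infinity=mode)
    print(f"round_to_infinity={mode}: scale={s}, dequantized amax={np.max(np.abs(q * s))}")
```

With the floor-style scale the block's largest value saturates at 448 after dequantization, while rounding the scale exponent up keeps it representable, which is the intuition behind the improved rounding mode.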
Related papers
- Characterization and Mitigation of Training Instabilities in Microscaling Formats [6.025438902954768]
Training large language models is an expensive, compute-bound process. Next-generation hardware accelerators increasingly support lower-precision arithmetic formats. We investigate the challenges and viability of block-scaled precision formats during model training.
arXiv Detail & Related papers (2025-06-25T18:25:08Z) - Towards Fully FP8 GEMM LLM Training at Scale [77.39425361120466]
Existing approaches often rely on suboptimal fine-grained FP8 kernels or fall back to higher-precision matrix multiplications. We introduce a new class of LLM architectures that, for the first time, support FP8 computation for all GEMMs within transformer blocks during both forward and backward passes. This enables unprecedented throughput gains, particularly at scale, while matching the downstream performance of standard BF16 training.
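As a generic illustration of the FP8-GEMM idea summarized above (cast inputs with per-tensor scales, accumulate in higher precision, rescale the output), and not the specific architecture from that paper, a small NumPy sketch:

```python
import numpy as np

E4M3_MAX = 448.0  # assumed FP8 (E4M3) maximum magnitude

def to_fp8_simulated(x: np.ndarray):
    """Per-tensor scaled cast, simulated in float32 (NumPy has no FP8 dtype)."""
    scale = float(np.max(np.abs(x))) / E4M3_MAX + 1e-12
    q = np.clip(x / scale, -E4M3_MAX, E4M3_MAX)  # mantissa rounding omitted
    return q.astype(np.float32), np.float32(scale)

def fp8_gemm(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """GEMM on (simulated) FP8 operands with the accumulation kept in float32."""
    qa, sa = to_fp8_simulated(a)
    qb, sb = to_fp8_simulated(b)
    return (qa @ qb) * (sa * sb)  # undo both per-tensor scales once per GEMM

a = np.random.randn(64, 128).astype(np.float32)
b = np.random.randn(128, 32).astype(np.float32)
print(np.max(np.abs(fp8_gemm(a, b) - a @ b)))  # small residual error
```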
arXiv Detail & Related papers (2025-05-26T21:04:14Z) - LESA: Learnable LLM Layer Scaling-Up [57.0510934286449]
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive. Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones. We propose LESA, a novel learnable method for depth scaling-up.
arXiv Detail & Related papers (2025-02-19T14:58:48Z) - Scaling Laws for Floating Point Quantization Training [47.174957621592775]
This paper explores the effects of FP quantization targets, exponent bits, mantissa bits, and the calculation of the scaling factor on the FP quantization training performance of LLM models. We provide the optimal exponent-mantissa bit ratio for different bit counts, which is available for future reference by hardware manufacturers.
arXiv Detail & Related papers (2025-01-05T02:30:41Z) - Computation-Aware Gaussian Processes: Model Selection And Linear-Time Inference [55.150117654242706]
We show that model selection for computation-aware GPs trained on 1.8 million data points can be done within a few hours on a single GPU.
As a result of this work, Gaussian processes can be trained on large-scale datasets without significantly compromising their ability to quantify uncertainty.
arXiv Detail & Related papers (2024-11-01T21:11:48Z) - Scalify: scale propagation for efficient low-precision LLM training [1.4999444543328293]
Low-precision formats such as float8 have been introduced in machine learning accelerator hardware to improve the computational efficiency of large language model training and inference.
We present Scalify, an end-to-end scale propagation paradigm for computational graphs.
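The scale-propagation idea can be pictured as pairing each tensor with an explicit scale that combines algebraically through operations, so the stored data can stay in a narrow format. The toy sketch below illustrates that general concept only; it is not Scalify's actual API.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ScaledTensor:
    """A tensor represented as data * scale; data could live in a narrow format."""
    data: np.ndarray  # would be float8/float16 on low-precision hardware
    scale: float      # kept in full precision

def scaled_matmul(a: ScaledTensor, b: ScaledTensor) -> ScaledTensor:
    # Scales propagate symbolically; the low-precision data is never rescaled here.
    return ScaledTensor(a.data @ b.data, a.scale * b.scale)

x = ScaledTensor(np.random.randn(4, 8), 0.25)
w = ScaledTensor(np.random.randn(8, 2), 2.0)
y = scaled_matmul(x, w)
print(np.allclose(y.data * y.scale, (x.data * x.scale) @ (w.data * w.scale)))  # True
```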
arXiv Detail & Related papers (2024-07-24T15:26:01Z) - Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment [56.44025052765861]
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks.
We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs.
We show a total speedup of up to 8.6x on CPUs for sparse-quantized LLaMA models.
arXiv Detail & Related papers (2024-05-06T16:03:32Z) - Connecting the Dots: Collaborative Fine-tuning for Black-Box Vision-Language Models [121.0693322732454]
This paper proposes CraFT, an approach for fine-tuning black-box vision-language models on downstream tasks.
CraFT comprises two modules, a prompt generation module for learning text prompts and a prediction refinement module for enhancing output predictions in residual style.
Experiments on few-shot classification over 15 datasets demonstrate the superiority of CraFT.
arXiv Detail & Related papers (2024-02-06T14:53:19Z) - Microscaling Data Formats for Deep Learning [29.70183999642415]
Narrow bit-width data formats are key to reducing the computational and storage costs of modern deep learning applications.
This paper evaluates Microscaling (MX) data formats that combine a per-block scaling factor with narrow floating-point and integer types for individual elements.
arXiv Detail & Related papers (2023-10-16T16:07:41Z) - All-You-Can-Fit 8-Bit Flexible Floating-Point Format for Accurate and
Memory-Efficient Inference of Deep Neural Networks [2.294014185517203]
This paper introduces an extremely flexible 8-bit floating-point (FFP8) format.
It achieves an extremely low accuracy loss of 0.1% to 0.3% for several representative image classification models.
It is easy to turn a classical floating-point processing unit into an FFP8-compliant one, and the extra hardware cost is minor.
arXiv Detail & Related papers (2021-04-15T09:37:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.