Microscaling Data Formats for Deep Learning
- URL: http://arxiv.org/abs/2310.10537v3
- Date: Thu, 19 Oct 2023 16:38:33 GMT
- Title: Microscaling Data Formats for Deep Learning
- Authors: Bita Darvish Rouhani, Ritchie Zhao, Ankit More, Mathew Hall, Alireza
Khodamoradi, Summer Deng, Dhruv Choudhary, Marius Cornea, Eric Dellinger,
Kristof Denolf, Stosic Dusan, Venmugil Elango, Maximilian Golub, Alexander
Heinecke, Phil James-Roxby, Dharmesh Jani, Gaurav Kolhe, Martin Langhammer,
Ada Li, Levi Melnick, Maral Mesmakhosroshahi, Andres Rodriguez, Michael
Schulte, Rasoul Shafipour, Lei Shao, Michael Siu, Pradeep Dubey, Paulius
Micikevicius, Maxim Naumov, Colin Verrilli, Ralph Wittig, Doug Burger, Eric
Chung
- Abstract summary: Narrow bit-width data formats are key to reducing the computational and storage costs of modern deep learning applications.
This paper evaluates Microscaling (MX) data formats that combine a per-block scaling factor with narrow floating-point and integer types for individual elements.
- Score: 29.70183999642415
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Narrow bit-width data formats are key to reducing the computational and
storage costs of modern deep learning applications. This paper evaluates
Microscaling (MX) data formats that combine a per-block scaling factor with
narrow floating-point and integer types for individual elements. MX formats
balance the competing needs of hardware efficiency, model accuracy, and user
friction. Empirical results on over two dozen benchmarks demonstrate
the practicality of MX data formats as a drop-in replacement for baseline FP32 for
AI inference and training with low user friction. We also show the first
instance of training generative language models at sub-8-bit weights,
activations, and gradients with minimal accuracy loss and no modifications to
the training recipe.
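To make the block-scaling idea in the abstract concrete, below is a minimal NumPy sketch: the tensor is split into small blocks (a block size of 32 is assumed here), each block shares one power-of-two scale derived from its absolute maximum, and the individual elements are stored in a narrow type (a signed INT8 element is assumed; the MX family also defines narrow floating-point element types). The block size, element type, scale encoding, and function names are illustrative assumptions, not the normative MX specification.

```python
import numpy as np

BLOCK_SIZE = 32   # assumed block size: one shared scale per 32 consecutive elements
ELEM_MAX = 127    # assumed element type: signed INT8, values in [-128, 127]


def mx_like_quantize(x: np.ndarray):
    """Illustrative block quantization: FP32 -> (per-block power-of-two scales, INT8 elements)."""
    x = np.asarray(x, dtype=np.float32)
    pad = (-x.size) % BLOCK_SIZE
    blocks = np.pad(x, (0, pad)).reshape(-1, BLOCK_SIZE)

    # Shared scale per block: the smallest power of two such that the block's
    # absolute maximum maps inside the element type's representable range.
    absmax = np.abs(blocks).max(axis=1, keepdims=True)
    absmax = np.where(absmax == 0.0, 1.0, absmax)
    shared_exp = np.ceil(np.log2(absmax / ELEM_MAX))
    scales = np.exp2(shared_exp).astype(np.float32)

    elems = np.clip(np.rint(blocks / scales), -ELEM_MAX - 1, ELEM_MAX).astype(np.int8)
    return scales, elems, x.size


def mx_like_dequantize(scales, elems, numel):
    """Reconstruct an FP32 approximation from the block-scaled representation."""
    return (elems.astype(np.float32) * scales).reshape(-1)[:numel]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(1000).astype(np.float32)
    scales, elems, n = mx_like_quantize(x)
    x_hat = mx_like_dequantize(scales, elems, n)
    print("max abs reconstruction error:", float(np.abs(x - x_hat).max()))
```

In storage terms this representation carries one scale per block plus one narrow element per value, which is where the bandwidth and memory savings over FP32 come from.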
Related papers
- Characterization and Mitigation of Training Instabilities in Microscaling Formats [6.025438902954768]
Training large language models is an expensive, compute-bound process.
Next-generation hardware accelerators increasingly support lower-precision arithmetic formats.
We investigate the challenges and viability of block-scaled precision formats during model training.
arXiv Detail & Related papers (2025-06-25T18:25:08Z)
- Recipes for Pre-training LLMs with MXFP8 [0.0]
Precision scaling has emerged as a compelling technique for improving GPU efficiency without sacrificing accuracy.
MX-formats offer improved numeric stability compared to other reduced-precision representations.
We show that an improved rounding mode, which uses round-to-infinity to compute scaling factors, enables successful pre-training in MXFP8 for an 8B model on 15T tokens (a toy sketch contrasting this rounding choice with exponent truncation appears after this list).
arXiv Detail & Related papers (2025-05-30T21:08:15Z)
- PRIOT: Pruning-Based Integer-Only Transfer Learning for Embedded Systems [1.4779899760345436]
We propose a new training method named PRIOT, which optimizes the network by pruning selected edges rather than updating weights.
We implement PRIOT and PRIOT-S on the Raspberry Pi Pico and evaluate their accuracy and computational costs.
Our results demonstrate that PRIOT improves accuracy by 8.08 to 33.75 percentage points over existing methods, while PRIOT-S reduces memory footprint with minimal accuracy loss.
arXiv Detail & Related papers (2025-03-21T05:07:57Z)
- Scaling Laws for Floating Point Quantization Training [47.174957621592775]
This paper explores the effects of FP quantization targets, exponent bits, mantissa bits, and the calculation of the scaling factor on the FP quantization training performance of LLM models.
We provide the optimal exponent-mantissa bit ratio for different bit widths, which is available for future reference by hardware manufacturers.
arXiv Detail & Related papers (2025-01-05T02:30:41Z)
- Direct Quantized Training of Language Models with Stochastic Rounding [12.028887152979046]
Experimental results on LLaMA-structured models of various sizes indicate that training with low-precision weights is feasible even when constrained to ternary values.
Our models remain robust to precision scaling and memory reduction, showing minimal performance degradation when moving from FP32 to lower-memory environments.
arXiv Detail & Related papers (2024-12-06T05:41:11Z)
- What Do Learning Dynamics Reveal About Generalization in LLM Reasoning? [83.83230167222852]
We find that a model's generalization behavior can be effectively characterized by a training metric we call pre-memorization train accuracy.
By connecting a model's learning behavior to its generalization, pre-memorization train accuracy can guide targeted improvements to training strategies.
arXiv Detail & Related papers (2024-11-12T09:52:40Z)
- Accelerating Large Language Model Pretraining via LFR Pedagogy: Learn, Focus, and Review [50.78587571704713]
Large Language Model (LLM) pretraining traditionally relies on autoregressive language modeling on randomly sampled data blocks from web-scale datasets.
We take inspiration from human learning techniques like spaced repetition and hypothesize that random data sampling for LLMs leads to high training cost and low-quality models that tend to forget data.
In order to effectively commit web-scale information to long-term memory, we propose the LFR (Learn, Focus, and Review) pedagogy.
arXiv Detail & Related papers (2024-09-10T00:59:18Z)
- Efficient Online Data Mixing For Language Model Pre-Training [101.45242332613944]
Existing data selection methods suffer from slow and computationally expensive processes.
Data mixing, on the other hand, reduces the complexity of data selection by grouping data points together.
We develop an efficient algorithm for Online Data Mixing (ODM) that combines elements from both data selection and data mixing.
arXiv Detail & Related papers (2023-12-05T00:42:35Z)
- Incrementally-Computable Neural Networks: Efficient Inference for Dynamic Inputs [75.40636935415601]
Deep learning often faces the challenge of efficiently processing dynamic inputs, such as sensor data or user inputs.
We take an incremental computing approach, looking to reuse calculations as the inputs change.
We apply this approach to the Transformer architecture, creating an efficient incremental inference algorithm with complexity proportional to the fraction of modified inputs.
arXiv Detail & Related papers (2023-07-27T16:30:27Z)
- Scaling Data-Constrained Language Models [137.17302576977346]
We investigate scaling language models in data-constrained regimes.
We find that with constrained data for a fixed compute budget, training with up to 4 epochs of repeated data yields negligible changes to loss compared to having unique data.
We propose and empirically validate a scaling law for compute optimality that accounts for the decreasing value of repeated tokens and excess parameters.
arXiv Detail & Related papers (2023-05-25T17:18:55Z)
- 8-bit Numerical Formats for Deep Neural Networks [1.304892050913381]
We present an in-depth study on the use of 8-bit floating-point number formats for activations, weights, and gradients for both training and inference.
Experiments demonstrate that a suitable choice of these low-precision formats enables faster training and reduced power consumption without any degradation in accuracy for a range of deep learning models for image classification and language processing.
arXiv Detail & Related papers (2022-06-06T21:31:32Z)
- FairIF: Boosting Fairness in Deep Learning via Influence Functions with Validation Set Sensitive Attributes [51.02407217197623]
We propose a two-stage training algorithm named FAIRIF.
It minimizes the loss over a reweighted data set whose sample weights are computed via influence functions on a validation set with sensitive attributes.
We show that FAIRIF yields models with better fairness-utility trade-offs against various types of bias.
arXiv Detail & Related papers (2022-01-15T05:14:48Z)
- PositNN: Training Deep Neural Networks with Mixed Low-Precision Posit [5.534626267734822]
The presented research aims to evaluate the feasibility of training deep convolutional neural networks using posits.
A software framework was developed to use simulated posits and quires in end-to-end training and inference.
Results suggest that 8-bit posits can substitute 32-bit floats during training with no negative impact on the resulting loss and accuracy.
arXiv Detail & Related papers (2021-04-30T19:30:37Z)
- All-You-Can-Fit 8-Bit Flexible Floating-Point Format for Accurate and Memory-Efficient Inference of Deep Neural Networks [2.294014185517203]
This paper introduces an extremely flexible 8-bit floating-point (FFP8) format.
It achieves an extremely low accuracy loss of $0.1\%\sim 0.3\%$ for several representative image classification models.
It is easy to turn a classical floating-point processing unit into an FFP8-compliant one, and the extra hardware cost is minor.
arXiv Detail & Related papers (2021-04-15T09:37:23Z)
- Revisiting BFloat16 Training [30.99618783594963]
State-of-the-art generic low-precision training algorithms use a mix of 16-bit and 32-bit precision.
Deep learning accelerators are forced to support both 16-bit and 32-bit floating-point units.
arXiv Detail & Related papers (2020-10-13T05:38:07Z)
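One of the related items above ("Recipes for Pre-training LLMs with MXFP8") attributes part of its training stability to computing the shared scaling factor with round-to-infinity. Interpreting that as rounding the shared exponent up rather than truncating it, the toy comparison below shows why the choice matters under the same simplified INT8-style block scaling used in the earlier sketch; the function name and element range are assumptions, not that paper's implementation.

```python
import numpy as np

ELEM_MAX = 127.0  # assumed INT8-like element range, matching the earlier sketch


def shared_scale(block: np.ndarray, mode: str) -> float:
    """Per-block power-of-two scale derived from the block's absolute maximum."""
    absmax = float(np.abs(block).max()) or 1.0
    raw_exp = np.log2(absmax / ELEM_MAX)
    # "floor" truncates the exponent; "ceil" rounds it toward +infinity.
    exp = np.floor(raw_exp) if mode == "floor" else np.ceil(raw_exp)
    return float(np.exp2(exp))


if __name__ == "__main__":
    block = np.linspace(-3.0, 3.0, 32, dtype=np.float32)
    for mode in ("floor", "ceil"):
        s = shared_scale(block, mode)
        q = np.clip(np.rint(block / s), -ELEM_MAX - 1, ELEM_MAX)
        err = float(np.abs(block - q * s).max())
        # With "floor" the largest-magnitude values overflow the element range
        # and saturate; with "ceil" every value fits and the error stays small.
        print(f"{mode:>5}: scale={s:.6g}  max reconstruction error={err:.4f}")
```

Running this toy shows the truncated exponent saturating the block's largest values while the rounded-up exponent keeps every element representable, which is the intuition behind preferring round-to-infinity for scale computation.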