Nanoscaling Floating-Point (NxFP): NanoMantissa, Adaptive Microexponents, and Code Recycling for Direct-Cast Compression of Large Language Models
- URL: http://arxiv.org/abs/2412.19821v1
- Date: Sun, 15 Dec 2024 22:18:20 GMT
- Title: Nanoscaling Floating-Point (NxFP): NanoMantissa, Adaptive Microexponents, and Code Recycling for Direct-Cast Compression of Large Language Models
- Authors: Yun-Chen Lo, Gu-Yeon Wei, David Brooks
- Abstract summary: Nanoscaling (NxFP) proposes three techniques to enable better accuracy and a smaller memory footprint than state-of-the-art MxFP. NxFP reduces memory footprint by up to 16% while achieving perplexity comparable to MxFP.
- Score: 5.680646377552021
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As cutting-edge large language models (LLMs) continue to transform various industries, their fast-growing model size and sequence length have led to memory traffic and capacity challenges. Recently, AMD, Arm, Intel, Meta, Microsoft, NVIDIA, and Qualcomm have proposed a Microscaling standard (Mx), which augments block floating-point with microexponents to achieve promising perplexity-to-footprint trade-offs. However, Microscaling suffers from significant perplexity degradation on modern LLMs at fewer than six bits. This paper profiles modern LLMs and identifies three main challenges of the low-bit Microscaling format, i.e., inaccurate tracking of outliers, vacant quantization levels, and wasted binary code. In response, Nanoscaling (NxFP) proposes three techniques, i.e., NanoMantissa, Adaptive Microexponent, and Code Recycling, to enable better accuracy and a smaller memory footprint than state-of-the-art MxFP. Experimental results on direct-cast inference across various modern LLMs demonstrate that our proposed methods outperform state-of-the-art MxFP by up to 0.64 in perplexity and by up to 30% in accuracy on MMLU benchmarks. Furthermore, NxFP reduces memory footprint by up to 16% while achieving perplexity comparable to that of MxFP.
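To make the baseline concrete, the sketch below shows a simplified MXFP4-style direct cast in NumPy: each block of 32 values shares one power-of-two scale and every element is rounded to the 4-bit (E2M1) floating-point grid. This illustrates only the Microscaling scheme the paper improves upon, not the NxFP techniques themselves (NanoMantissa, Adaptive Microexponent, Code Recycling); the function name, rounding rule, and scale choice are illustrative assumptions.

```python
import numpy as np

# Representable FP4 (E2M1) magnitudes; the sign bit is handled separately.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_mx_block(block, grid=FP4_GRID, elem_emax=2):
    """Direct-cast one block to an MX-style format: a single shared
    power-of-two scale plus a low-bit FP value per element (sketch only)."""
    amax = np.max(np.abs(block))
    if amax == 0.0:
        return np.zeros_like(block)
    # Shared scale (the block's "microexponent"): align the block max with
    # the largest exponent of the element format (6 = 1.5 * 2**elem_emax).
    scale = 2.0 ** (np.floor(np.log2(amax)) - elem_emax)
    scaled = block / scale
    # Round every element to the nearest representable FP4 magnitude.
    idx = np.abs(np.abs(scaled)[:, None] - grid[None, :]).argmin(axis=1)
    return np.sign(scaled) * grid[idx] * scale

x = np.random.randn(32).astype(np.float32)  # MX groups values into blocks of 32
print("max abs error:", np.abs(x - quantize_mx_block(x)).max())
```

The three failure modes the paper identifies (outliers the shared scale cannot track, quantization levels that go unused, and redundant binary codes) all arise in this kind of direct cast.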
Related papers
- NanoCockpit: Performance-optimized Application Framework for AI-based Autonomous Nanorobotics [50.594459728605734]
The small form factor, i.e., a few tens of grams, severely limits onboard computational resources to sub-100-milliwatt microcontroller units (MCUs). Our framework achieves ideal end-to-end latency, i.e., zero overhead from serialized tasks, delivering quantifiable improvements in closed-loop control performance.
arXiv Detail & Related papers (2026-01-12T12:29:38Z) - ARCQuant: Boosting NVFP4 Quantization with Augmented Residual Channels for LLMs [4.431548809730958]
ARCQuant is a framework that boosts NVFP4 performance via Augmented Residual Channels. We show that ARCQuant achieves state-of-the-art accuracy, comparable to full-precision baselines in perplexity and downstream tasks.
arXiv Detail & Related papers (2026-01-12T12:27:22Z) - SafeCiM: Investigating Resilience of Hybrid Floating-Point Compute-in-Memory Deep Learning Accelerators [2.5725493704343063]
We study the vulnerability of FP-CiM to hardware faults, which pose a major reliability concern in mission-critical settings. We propose a fault-resilient design, SafeCiM, that mitigates fault impact far better than a naive FP-CiM design with a pre-alignment stage.
arXiv Detail & Related papers (2025-11-23T01:06:01Z) - INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats [51.72056104795248]
Modern AI hardware, such as Nvidia's Blackwell architecture, is increasingly embracing low-precision floating-point (FP) formats. This paper systematically investigates the trade-offs between FP and integer (INT) formats. We reveal a critical performance crossover: while FP excels in coarse-grained quantization, the comparison at fine-grained (block-wise) levels is more nuanced.
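To illustrate the kind of fine-grained (block-wise) comparison such a study makes, the hedged NumPy sketch below quantizes a synthetic tensor per 32-element block with an INT4-style uniform grid and with an FP4 (E2M1) grid and reports the mean squared error. The data, block size, and absmax scaling rule are assumptions and do not reproduce the paper's methodology.

```python
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])   # E2M1 magnitudes
INT4_GRID = np.arange(0, 8, dtype=np.float64)                    # 0..7 magnitudes

def block_quant_mse(x, grid, block=32):
    """Per-block absmax scaling onto `grid`, then nearest-neighbor rounding."""
    x = x.reshape(-1, block)
    amax = np.abs(x).max(axis=1, keepdims=True)
    scale = np.where(amax > 0, amax / grid[-1], 1.0)
    s = x / scale
    idx = np.abs(np.abs(s)[..., None] - grid).argmin(axis=-1)
    x_hat = np.sign(s) * grid[idx] * scale
    return float(((x - x_hat) ** 2).mean())

w = np.random.randn(4096 * 32)   # synthetic Gaussian weights; real LLM tensors differ
print("INT4 block MSE:", block_quant_mse(w, INT4_GRID))
print("FP4  block MSE:", block_quant_mse(w, FP4_GRID))
```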
arXiv Detail & Related papers (2025-10-29T15:11:53Z) - MX+: Pushing the Limits of Microscaling Formats for Efficient Large Language Model Serving [4.176741972965246]
Reduced-precision data formats are crucial for cost-effective serving of large language models (LLMs). In this paper, we focus on recent industry-driven variants of block floating-point (BFP) formats. We propose MX+, a cost-effective and non-intrusive extension designed for seamless integration into the microscaling (MX) formats.
arXiv Detail & Related papers (2025-10-16T11:05:54Z) - Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization [77.67818998672516]
We present the first comprehensive study of MXFP4 and NVFP4 for post-training quantization. We introduce Micro-Rotated-GPTQ (MR-GPTQ), a variant of the classic GPTQ quantization algorithm. We show that MR-GPTQ matches or outperforms state-of-the-art accuracy.
arXiv Detail & Related papers (2025-09-27T09:22:21Z) - MicroMix: Efficient Mixed-Precision Quantization with Microscaling Formats for Large Language Models [3.305409455598179]
Quantization significantly accelerates inference in large language models (LLMs). Recent advances in weight-activation quantization have primarily focused on mapping both weights and activations to the INT4 format. We propose MicroMix, a co-designed mixed-precision quantization algorithm and matrix multiplication kernel based on Microscaling (MX) data formats.
arXiv Detail & Related papers (2025-08-04T12:22:39Z) - Accelerating LLM Inference with Flexible N:M Sparsity via A Fully Digital Compute-in-Memory Accelerator [5.985414012866983]
Large language model (LLM) pruning with fixed N:M structured sparsity limits the expressivity of the sparse model.
We present a flexible layer-wise outlier-density-aware N:M sparsity (FLOW) selection method.
We then introduce a flexible, low-overhead digital compute-in-memory architecture (FlexCiM).
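For context, plain N:M structured sparsity keeps only the N largest-magnitude weights in every group of M consecutive weights; the minimal NumPy sketch below shows that baseline operation. FLOW's layer-wise, outlier-density-aware choice of (N, M) is not modeled here, and the function name and defaults are illustrative.

```python
import numpy as np

def nm_prune(w, n=2, m=4):
    """Keep the n largest-magnitude weights in every group of m consecutive
    weights (generic N:M structured sparsity; a sketch, not FLOW itself)."""
    flat = w.reshape(-1, m)
    # The (m - n) smallest magnitudes in each group are zeroed out.
    drop = np.argsort(np.abs(flat), axis=1)[:, : m - n]
    mask = np.ones_like(flat, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)
    return (flat * mask).reshape(w.shape)

w = np.random.randn(8, 16)
w_sparse = nm_prune(w, n=2, m=4)     # 2 of every 4 weights kept -> 50% sparsity
print("sparsity:", (w_sparse == 0).mean())
```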
arXiv Detail & Related papers (2025-04-19T17:47:01Z) - Quantizing Large Language Models for Code Generation: A Differentiated Replication [51.85505914274633]
Large Language Models (LLMs) have shown an impressive capability in code generation and, specifically, to automatically implement requirements described in natural language.
LLMs pose significant challenges related to their memory (and, consequently, carbon) footprint.
The new frontier for LLM quantization is 4-bit precision, resulting in an average memory footprint reduction of 70%.
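As a rough back-of-the-envelope (assuming an FP16 baseline, 4-bit weights, and one FP16 scale per group of 128 weights; not the paper's accounting), the arithmetic behind a figure near 70% looks like this:

```python
# Hypothetical footprint arithmetic for 4-bit weight-only quantization.
bits_fp16 = 16
bits_q = 4 + 16 / 128            # 4-bit weight + amortized per-group FP16 scale
reduction = 1 - bits_q / bits_fp16
print(f"{reduction:.0%}")        # ~74%; extra metadata and unquantized layers pull this toward ~70%
```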
arXiv Detail & Related papers (2025-03-10T09:26:08Z) - The Power of Negative Zero: Datatype Customization for Quantized Large Language Models [5.503925076208333]
Post-training quantization serves as one of the most hardware-efficient methods to mitigate the memory and computational demands of large language models (LLMs).
In this paper, we extend the basic FP datatype to perform Redundant Zero Remapping (RaZeR).
RaZeR remaps the negative zero FP encoding to a set of pre-defined special values to maximally utilize FP quantization encodings and to better fit numerical distributions.
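The sketch below illustrates the general idea with a 4-bit E2M1-style signed codebook: the redundant negative-zero code is reassigned to one extra level before nearest-neighbor quantization. The replacement value (-0.75) and the absence of per-group scaling are illustrative assumptions; RaZeR's actual pre-defined special values and selection procedure are described in the paper.

```python
import numpy as np

# Signed FP4 (E2M1) codebook: +0 and -0 are two codes for the same number.
FP4_POS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
fp4_codebook = np.concatenate([FP4_POS, -FP4_POS])   # 16 codes, one wasted on -0

def razer_style_codebook(special_value=-0.75):
    """Reclaim the redundant -0 code by mapping it to an extra useful level.
    The special value here is arbitrary; RaZeR chooses from a pre-defined set."""
    cb = fp4_codebook.copy()
    cb[8] = special_value                             # index 8 held -0.0
    return cb

def quantize(x, codebook):
    idx = np.abs(x[:, None] - codebook[None, :]).argmin(axis=1)
    return codebook[idx]

x = np.random.randn(1024) * 0.8
for cb in (fp4_codebook, razer_style_codebook()):
    mse = np.mean((x - quantize(x, cb)) ** 2)
    print(len(np.unique(cb)), "distinct levels, MSE:", mse)
```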
arXiv Detail & Related papers (2025-01-06T22:40:40Z) - MicroScopiQ: Accelerating Foundational Models through Outlier-Aware Microscaling Quantization [6.456189487006878]
Quantization of foundational models (FMs) is challenging due to the emergence of large-magnitude features called outliers. Existing outlier-aware algorithm/architecture co-design techniques either use mixed precision, retaining outliers at high precision but compromising hardware efficiency, or quantize inliers and outliers at the same precision. We propose MicroScopiQ, a novel co-design technique that leverages pruning to complement outlier-aware quantization.
arXiv Detail & Related papers (2024-11-08T02:25:45Z) - Progressive Mixed-Precision Decoding for Efficient LLM Inference [49.05448842542558]
We introduce Progressive Mixed-Precision Decoding (PMPD) to address the memory-boundedness of decoding.
PMPD achieves a 1.4-12.2x speedup in matrix-vector multiplications over fp16 models.
Our approach delivers a throughput gain of 3.8-8.0x over fp16 models and up to 1.54x over uniform quantization approaches.
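The mechanism the name suggests, wider arithmetic early in generation and progressively narrower precision later, can be written as a simple schedule; the switch points and bit-widths below are purely illustrative assumptions, not PMPD's actual phase-aware allocation.

```python
def precision_schedule(step, switch_points=(128, 512), widths=(8, 4, 3)):
    """Toy progressive-precision schedule: return the weight bit-width to use
    at a given decoding step (hypothetical numbers, for illustration only)."""
    for point, width in zip(switch_points, widths):
        if step < point:
            return width
    return widths[-1]

for step in (0, 200, 1000):
    print(f"decoding step {step}: {precision_schedule(step)}-bit weights")
```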
arXiv Detail & Related papers (2024-10-17T11:46:33Z) - OPAL: Outlier-Preserved Microscaling Quantization Accelerator for Generative Large Language Models [0.562479170374811]
We present a hardware-software co-design method that results in an energy-efficient LLM accelerator, named OPAL, for generation tasks.
OPAL uses log2-based approximation on softmax operations that only requires shift and subtraction to maximize power efficiency.
As a result, we are able to improve energy efficiency by 1.6-2.2x and reduce area by 2.4-3.1x with negligible accuracy loss.
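A hedged sketch of the general log2 idea follows: logits are shifted by their maximum (a subtraction), rescaled to base 2, and 2**t is approximated piecewise-linearly so that the integer part of t becomes a bit shift in hardware. OPAL's exact formulation, including how the normalizing division is handled, is not given in the summary and is not reproduced here.

```python
import numpy as np

def pow2_pwl(t):
    """Piecewise-linear 2**t: split t into integer part k and fraction f and
    approximate 2**f by (1 + f); in hardware, multiplying by 2**k is a shift."""
    k = np.floor(t)
    f = t - k
    return (1.0 + f) * np.exp2(k)

def softmax_log2_approx(x):
    """Softmax with exp replaced by a base-2 piecewise-linear approximation
    (a generic sketch of the shift-and-subtract idea, not OPAL's design)."""
    t = (x - x.max()) * np.log2(np.e)    # max subtraction for numerical safety
    p = pow2_pwl(t)
    return p / p.sum()

x = np.random.randn(16) * 3
exact = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
print("max abs error:", np.abs(softmax_log2_approx(x) - exact).max())
```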
arXiv Detail & Related papers (2024-09-06T02:33:20Z) - Fast Matrix Multiplications for Lookup Table-Quantized LLMs [58.11584672945781]
FLUTE is a flexible lookup table engine for LUT-quantized LLMs.
At batch sizes below 32 and a quantization group size of 128, the FLUTE kernel can be 2-4x faster than existing GEMM kernels.
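As background on what a LUT-quantized matmul computes, the NumPy sketch below stores each weight group as 4-bit indices into a small per-group table and dequantizes through the table before an ordinary matmul. FLUTE fuses these steps into an optimized GPU kernel; the uniform table construction and group size here are illustrative stand-ins.

```python
import numpy as np

def lut_quantize(w, bits=4, group=128):
    """Quantize each group of weights to indices into a per-group lookup table
    (uniformly spaced levels here; real LUT methods often learn the table)."""
    w = w.reshape(-1, group)
    lo, hi = w.min(axis=1, keepdims=True), w.max(axis=1, keepdims=True)
    tables = lo + (hi - lo) * np.linspace(0.0, 1.0, 2 ** bits)[None, :]
    idx = np.abs(w[..., None] - tables[:, None, :]).argmin(axis=-1)
    return idx.astype(np.uint8), tables

def lut_matmul(x, idx, tables, out_features):
    """Dequantize through the lookup table, then do a plain matmul
    (a reference for what a fused LUT kernel computes)."""
    w_hat = np.take_along_axis(tables, idx.astype(np.int64), axis=1)
    return x @ w_hat.reshape(out_features, -1).T

out_f, in_f = 64, 256
w = np.random.randn(out_f, in_f).astype(np.float32)
idx, tables = lut_quantize(w)
x = np.random.randn(32, in_f).astype(np.float32)
print("max abs error vs. FP matmul:", np.abs(lut_matmul(x, idx, tables, out_f) - x @ w.T).max())
```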
arXiv Detail & Related papers (2024-07-15T17:55:42Z) - DB-LLM: Accurate Dual-Binarization for Efficient LLMs [83.70686728471547]
Large language models (LLMs) have significantly advanced the field of natural language processing.
Existing ultra-low-bit quantization always causes severe accuracy drops.
We propose a novel Dual-Binarization method for LLMs, namely DB-LLM.
arXiv Detail & Related papers (2024-02-19T09:04:30Z) - SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
The main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, for single-batch inference.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
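A minimal NumPy sketch of the Dense-and-Sparse idea follows, assuming a simple magnitude threshold for outliers and uniform 3-bit levels for the dense remainder; SqueezeLLM's sensitivity-based non-uniform (k-means) levels and its exact outlier criterion are not reproduced.

```python
import numpy as np

def dense_and_sparse_split(w, outlier_pct=0.5, bits=3):
    """Keep the largest-magnitude weights in a full-precision COO-style sparse
    structure and quantize the dense remainder to a uniform low-bit grid."""
    thresh = np.percentile(np.abs(w), 100 - outlier_pct)
    mask = np.abs(w) > thresh
    rows, cols = np.nonzero(mask)            # sparse outlier coordinates
    vals = w[rows, cols]                     # outliers kept at full precision
    dense = np.where(mask, 0.0, w)
    scale = np.abs(dense).max() / (2 ** (bits - 1) - 1)
    dense_q = np.round(dense / scale).astype(np.int8)
    return dense_q, scale, (rows, cols, vals)

w = np.random.randn(512, 512)
dense_q, scale, (rows, cols, vals) = dense_and_sparse_split(w)
w_hat = dense_q.astype(np.float64) * scale
w_hat[rows, cols] += vals                    # add the outliers back in
print("reconstruction MSE:", float(np.mean((w - w_hat) ** 2)))
```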
arXiv Detail & Related papers (2023-06-13T08:57:54Z) - Accelerating RNN-based Speech Enhancement on a Multi-Core MCU with Mixed FP16-INT8 Post-Training Quantization [0.0]
Speech Enhancement (SE) algorithms based on Recurrent Neural Networks (RNNs) are deployed on a state-of-the-art MicroController Unit (MCU).
We propose an optimized software pipeline interleaving parallel computation of LSTM or GRU recurrent blocks with manually-managed memory transfers.
Experiments are conducted on multiple LSTM and GRU based SE models trained on the Valentini dataset, featuring up to 1.24M parameters.
arXiv Detail & Related papers (2022-10-14T10:32:05Z) - MicroNet: Towards Image Recognition with Extremely Low FLOPs [117.96848315180407]
MicroNet is an efficient convolutional neural network with extremely low computational cost.
A family of MicroNets achieves a significant performance gain over the state of the art in the low-FLOP regime.
For instance, MicroNet-M1 achieves 61.1% top-1 accuracy on ImageNet classification with 12 MFLOPs, outperforming MobileNetV3 by 11.3%.
arXiv Detail & Related papers (2020-11-24T18:59:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.