Related papers: Unleashing Low-Bit Inference on Ascend NPUs: A Comprehensive Evaluation of HiFloat Formats

URL: http://arxiv.org/abs/2602.12635v1
Date: Fri, 13 Feb 2026 05:41:31 GMT
Title: Unleashing Low-Bit Inference on Ascend NPUs: A Comprehensive Evaluation of HiFloat Formats
Authors: Pengxiang Zhao, Hui-Ling Zhen, Xing Li, Han Bao, Weizhe Lin, Zhiyuan Yang, Ziwei Yu, Xin Wang, Mingxuan Yuan, Xianzhi Yu, Zhenhua Dong
Abstract summary: We evaluate HiFloat (HiF8 and HiF4), a family of formats tailored for Ascend NPUs. HiFloat is fully compatible with state-of-the-art post-training quantization frameworks.
Score: 42.6259787270868
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As LLMs scale, low-bit floating-point formats like MXFP and NVFP4 offer new opportunities for precision and efficiency. In this work, we evaluate HiFloat (HiF8 and HiF4), a family of formats tailored for Ascend NPUs. Through rigorous comparison across weight-activation and KV-cache tasks, we provide three key insights: (1) INT8 suits narrow-range data, while floating-point formats excel with high-variance data; (2) in 4-bit regimes, HiF4's hierarchical scaling prevents the accuracy collapse seen in integer formats; and (3) HiFloat is fully compatible with state-of-the-art post-training quantization frameworks. Overall, HiFloat provides a solution for high-efficiency LLM inference on NPUs.

Related papers

HiFloat4 Format for Language Model Inference [25.863121704892734]
This paper introduces HiFloat4 (HiF4), a block floating-point data format tailored for deep learning. Each HiF4 unit packs 64 4-bit elements with 32 bits of shared scaling metadata, averaging 4.5 bits per value. Results show that HiF4 achieves higher average accuracy than the state-of-the-art NVFP4 format across multiple models and diverse downstream tasks.
arXiv Detail & Related papers (2026-02-11T19:07:36Z)
Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling [13.357423392911036]
We introduce Four Over Six (4/6), a modification to the NVFP4 quantization algorithm that evaluates two potential scale factors for each block of values. We find that for some blocks, scaling to smaller FP4 values makes the distribution of representable values more uniform. We also find that 4/6 can be easily incorporated into many different post-training quantization methods and generally improves downstream accuracy.
arXiv Detail & Related papers (2025-12-01T18:59:45Z)
INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats [51.72056104795248]
Modern AI hardware, such as Nvidia's Blackwell architecture, is increasingly embracing low-precision floating-point (FP) formats. This paper systematically investigates the trade-offs between FP and integer (INT) formats. We reveal a critical performance crossover: while FP excels in coarse-grained quantization, the comparison at fine-grained (block-wise) levels is more nuanced.
arXiv Detail & Related papers (2025-10-29T15:11:53Z)
Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization [77.67818998672516]
We present the first comprehensive study of MXFP4 and NVFP4 for post-training quantization. We introduce Micro-Rotated-GPTQ (MR-GPTQ), a variant of the classic GPTQ quantization algorithm. We show that MR-GPTQ matches or outperforms state-of-the-art accuracy.
arXiv Detail & Related papers (2025-09-27T09:22:21Z)
FP4 All the Way: Fully Quantized Training of LLMs [26.195547788434908]
We demonstrate fully quantized training (FQT) of large language models (LLMs) using predominantly 4-bit floating-point (FP4) precision. We investigate key design choices for FP4, including block sizes, scaling formats, and rounding methods.
arXiv Detail & Related papers (2025-05-25T12:14:25Z)
Over-the-Air Fair Federated Learning via Multi-Objective Optimization [52.295563400314094]
We propose an over-the-air fair federated learning algorithm (OTA-FFL) to train fair FL models. Experiments demonstrate the superiority of OTA-FFL in achieving fairness and robust performance.
arXiv Detail & Related papers (2025-01-06T21:16:51Z)
"Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization [67.3213104337679]
Quantization is a powerful tool for accelerating large language model (LLM) inference, but the accuracy-performance trade-offs across different formats remain unclear. We conduct the most comprehensive empirical study to date, evaluating FP8, INT8, and INT4 quantization across academic benchmarks and real-world tasks.
arXiv Detail & Related papers (2024-11-04T18:21:59Z)
QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models [57.04178959678024]
We show that the majority of inference computations for large generative models can be performed with both weights and activations being cast to 4 bits. We achieve this via a hybrid quantization strategy called QUIK, which compresses most of the weights and activations to 4-bit. We provide GPU kernels matching the QUIK format with highly-efficient layer-wise runtimes, which lead to practical end-to-end throughput improvements of up to 3.4x.
arXiv Detail & Related papers (2023-10-13T17:15:05Z)
ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats [25.543571445739936]
This study explores the viability of floating-point (FP) quantization for large language models (LLMs). For LLMs, FP8 activation consistently outshines its integer (INT8) equivalent, with the performance edge becoming more noticeable in models possessing parameters beyond one billion. For weight quantization, our findings indicate that FP4 exhibits comparable, if not superior, performance to INT4, simplifying deployment on FP-supported hardware like H100.
arXiv Detail & Related papers (2023-07-19T06:58:03Z)
EFloat: Entropy-coded Floating Point Format for Deep Learning [2.3204178451683264]
EFloat format encodes frequent exponent values with Huffman codes to minimize the average exponent field width. The proposed encoding concept may be beneficial to low-precision formats including 8-bit floats.
arXiv Detail & Related papers (2021-02-04T15:58:01Z)
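
The HiFloat4 entry above states that a unit of 64 4-bit elements plus 32 bits of shared scaling metadata averages 4.5 bits per value. The arithmetic can be checked with a minimal sketch; the helper name `bits_per_value` is hypothetical, and the MXFP4 and NVFP4 block layouts used for comparison are commonly cited figures assumed here, not taken from this page.

```python
def bits_per_value(elements: int, element_bits: int, metadata_bits: int) -> float:
    """Average storage cost per element for a block format with shared metadata."""
    return (elements * element_bits + metadata_bits) / elements


# HiF4 figures as stated in the listing: 64 elements, 4 bits each, 32 metadata bits.
hif4 = bits_per_value(elements=64, element_bits=4, metadata_bits=32)

# Assumed layouts for comparison (not from this page): MXFP4 with 32-element
# blocks and an 8-bit shared scale, NVFP4 with 16-element blocks and an 8-bit
# block scale. Any per-tensor scale is ignored.
mxfp4 = bits_per_value(elements=32, element_bits=4, metadata_bits=8)
nvfp4 = bits_per_value(elements=16, element_bits=4, metadata_bits=8)

print(f"HiF4:  {hif4} bits/value")
print(f"MXFP4: {mxfp4} bits/value")
print(f"NVFP4: {nvfp4} bits/value")
```

Under these assumptions, HiF4 and NVFP4 carry the same per-value overhead (4.5 bits), while HiF4 spreads a larger metadata budget over a larger block.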

This list is automatically generated from the titles and abstracts of the papers in this site.