HiFloat4 Format for Language Model Inference
- URL: http://arxiv.org/abs/2602.11287v2
- Date: Fri, 13 Feb 2026 05:28:01 GMT
- Title: HiFloat4 Format for Language Model Inference
- Authors: Yuanyong Luo, Jing Huang, Yu Cheng, Ziwei Yu, Kaihua Tang, Xinda Ma, Xin Wang, Anping Tong, Guipeng Hu, Yun Xu, Mehran Taghian, Peng Wu, Guanglin Li, Yunke Peng, Tianchi Hu, Minqi Chen, Michael Bi Mi, Hu Liu, Xiping Zhou, Junsong Wang, Qiang Lin, Heng Liao
- Abstract summary: This paper introduces HiFloat4 (HiF4), a block floating-point data format tailored for deep learning. Each HiF4 unit packs 64 4-bit elements with 32 bits of shared scaling metadata, averaging 4.5 bits per value. Results show that HiF4 achieves higher average accuracy than the state-of-the-art NVFP4 format across multiple models and diverse downstream tasks.
- Score: 25.863121704892734
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces HiFloat4 (HiF4), a block floating-point data format tailored for deep learning. Each HiF4 unit packs 64 4-bit elements with 32 bits of shared scaling metadata, averaging 4.5 bits per value. The metadata specifies a three-level scaling hierarchy, capturing inter- and intra-group dynamic range while improving the utilization of the representational space. In addition, the large 64-element group size enables matrix multiplications to be executed in a highly fixed-point manner, significantly reducing hardware area and power consumption. To evaluate the proposed format, we conducted inference experiments on several language models, including LLaMA, Qwen, Mistral, DeepSeek-V3.1 and LongCat. Results show that HiF4 achieves higher average accuracy than the state-of-the-art NVFP4 format across multiple models and diverse downstream tasks.
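The 4.5 bits-per-value figure in the abstract follows directly from the block layout: 64 elements of 4 bits each plus 32 bits of shared metadata, spread over 64 values. The short Python sketch below reproduces that arithmetic and estimates packed storage for a tensor. The function names `bits_per_value` and `packed_size_bytes` are hypothetical helpers introduced here for illustration, and the padding of a partial final block to a full unit is an assumption, not something the abstract specifies.

```python
from fractions import Fraction


def bits_per_value(elements: int, element_bits: int, metadata_bits: int) -> Fraction:
    """Average storage cost per element of one block, shared metadata included."""
    total_bits = elements * element_bits + metadata_bits
    return Fraction(total_bits, elements)


def packed_size_bytes(num_values: int, elements: int, element_bits: int,
                      metadata_bits: int) -> int:
    """Bytes needed to store num_values elements in whole blocks.

    Assumption (not from the paper): a partial last block is padded
    to a full block.
    """
    blocks = -(-num_values // elements)  # ceiling division
    block_bits = elements * element_bits + metadata_bits
    return (blocks * block_bits + 7) // 8  # round up to whole bytes


# HiF4 unit as described in the abstract: 64 x 4-bit elements + 32 metadata bits.
hif4 = bits_per_value(elements=64, element_bits=4, metadata_bits=32)
print(float(hif4))  # 4.5

# Packed size of a 4096 x 4096 weight matrix under this layout.
print(packed_size_bytes(4096 * 4096, 64, 4, 32))
```

Using `Fraction` keeps the ratio exact, so the result can be compared against 9/2 without floating-point rounding.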
Related papers
- Unleashing Low-Bit Inference on Ascend NPUs: A Comprehensive Evaluation of HiFloat Formats [42.6259787270868]
We evaluate HiFloat (HiF8 and HiF4), a family of formats tailored for Ascend NPUs.
HiFloat is fully compatible with state-of-the-art post-training quantization frameworks.
arXiv Detail & Related papers (2026-02-13T05:41:31Z)
- Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling [13.357423392911036]
We introduce Four Over Six (4/6), a modification to the NVFP4 quantization algorithm that evaluates two potential scale factors for each block of values.
We find that for some blocks, scaling to smaller FP4 values makes the distribution of representable values more uniform.
We also find that 4/6 can be easily incorporated into many different post-training quantization methods and generally improves downstream accuracy.
arXiv Detail & Related papers (2025-12-01T18:59:45Z)
- MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe [68.04078852416248]
MiniCPM-V 4.5 is an 8B parameter model designed for high efficiency and strong performance.
We introduce three core improvements in model architecture, data strategy and training method.
MiniCPM-V 4.5 achieves state-of-the-art performance among models under 30B size.
arXiv Detail & Related papers (2025-09-16T19:41:48Z)
- Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs [195.24565517943802]
We introduce Phi-4-Mini and Phi-4-Multimodal, compact yet highly capable language and multimodal models.
Phi-4-Mini is a 3.8-billion-parameter language model trained on high-quality web and synthetic data.
Phi-4-Multimodal is a multimodal model that integrates text, vision, and speech/audio input modalities into a single model.
arXiv Detail & Related papers (2025-03-03T17:05:52Z)
- ShareGPT4V: Improving Large Multi-Modal Models with Better Captions [81.95879920888716]
We introduce ShareGPT4V, a dataset featuring 1.2 million descriptive captions.
This dataset surpasses existing datasets in diversity and information content, covering world knowledge, object properties, spatial relationships, and aesthetic evaluations.
We further incorporate ShareGPT4V data into both the pre-training and SFT phases, obtaining ShareGPT4V-7B, a superior LMM.
arXiv Detail & Related papers (2023-11-21T18:58:11Z)
- Efficient Post-training Quantization with FP8 Formats [14.543387418837154]
We study the advantages of FP8 data formats for post-training quantization across 75 unique network architectures.
E4M3 is better suited for NLP models, whereas E3M4 performs marginally better than E4M3 on computer vision tasks.
arXiv Detail & Related papers (2023-09-26T00:58:36Z)
- A Multi-dimensional Deep Structured State Space Approach to Speech Enhancement Using Small-footprint Models [45.90759340302879]
We explore several S4-based deep architectures in time (T) and time-frequency (TF) domains.
The proposed TF-domain S4-based model is 78.6% smaller in size, yet it still achieves competitive results with a PESQ score of 3.15 with data augmentation.
arXiv Detail & Related papers (2023-06-01T04:19:57Z)
- FP8 Formats for Deep Learning [49.54015320992368]
We propose an 8-bit floating point (FP8) binary interchange format consisting of two encodings.
E4M3's dynamic range is extended by not representing infinities and having only one mantissa bit-pattern for NaNs.
We demonstrate the efficacy of the FP8 format on a variety of image and language tasks, effectively matching the result quality achieved by 16-bit training sessions.
arXiv Detail & Related papers (2022-09-12T17:39:55Z)
- EFloat: Entropy-coded Floating Point Format for Deep Learning [2.3204178451683264]
EFloat format encodes frequent exponent values with Huffman codes to minimize the average exponent field width.
The proposed encoding concept may be beneficial to low-precision formats including 8-bit floats.
arXiv Detail & Related papers (2021-02-04T15:58:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.