BitNet v2: Native 4-bit Activations with Hadamard Transformation for 1-bit LLMs
- URL: http://arxiv.org/abs/2504.18415v1
- Date: Fri, 25 Apr 2025 15:17:52 GMT
- Title: BitNet v2: Native 4-bit Activations with Hadamard Transformation for 1-bit LLMs
- Authors: Hongyu Wang, Shuming Ma, Furu Wei
- Abstract summary: BitNet v2 is a framework enabling native 4-bit activation quantization for 1-bit Large Language Models. H-BitLinear is a module applying an online Hadamard transformation prior to activation quantization. Experiments show BitNet v2 trained from scratch with 8-bit activations matches BitNet b1.58 performance.
- Score: 95.73339037243105
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Efficient deployment of 1-bit Large Language Models (LLMs) is hindered by activation outliers, which complicate quantization to low bit-widths. We introduce BitNet v2, a novel framework enabling native 4-bit activation quantization for 1-bit LLMs. To tackle outliers in attention and feed-forward network activations, we propose H-BitLinear, a module applying an online Hadamard transformation prior to activation quantization. This transformation smooths sharp activation distributions into more Gaussian-like forms, suitable for low-bit representation. Experiments show BitNet v2 trained from scratch with 8-bit activations matches BitNet b1.58 performance. Crucially, BitNet v2 achieves minimal performance degradation when trained with native 4-bit activations, significantly reducing memory footprint and computational cost for batched inference.
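The abstract's core mechanism, an online Hadamard transform applied to attention and feed-forward activations right before low-bit quantization on top of ternary (1.58-bit) weights, can be illustrated with a short PyTorch sketch. The layer name, the absmax/absmean quantizers, and the straight-through estimator below are assumptions based only on the abstract, not the paper's released implementation.

```python
# Illustrative sketch only: names and quantizer details are assumptions,
# not BitNet v2's released code.
import torch
import torch.nn as nn

def hadamard_transform(x: torch.Tensor) -> torch.Tensor:
    """Fast Walsh-Hadamard transform over the last dim (size must be a power of 2)."""
    n = x.shape[-1]
    h = x.clone()
    step = 1
    while step < n:
        h = h.view(*x.shape[:-1], n // (2 * step), 2, step)
        a, b = h[..., 0, :], h[..., 1, :]
        h = torch.stack((a + b, a - b), dim=-2)  # butterfly: pairwise sums and differences
        step *= 2
    return h.view_as(x) / n ** 0.5  # orthonormal scaling

def quantize_activations_int4(x: torch.Tensor) -> torch.Tensor:
    """Per-token absmax fake-quantization to signed 4-bit levels."""
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-5) / 7.0
    return (x / scale).round().clamp(-8, 7) * scale

def quantize_weights_ternary(w: torch.Tensor) -> torch.Tensor:
    """Absmean ternary fake-quantization to {-1, 0, +1} * scale, as in BitNet b1.58."""
    scale = w.abs().mean().clamp(min=1e-5)
    return (w / scale).round().clamp(-1, 1) * scale

class HBitLinearSketch(nn.Linear):
    """Linear layer with a Hadamard transform ahead of activation quantization."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = hadamard_transform(x)  # smooth outlier-heavy activations toward a Gaussian-like shape
        x_q = x + (quantize_activations_int4(x) - x).detach()  # straight-through estimator
        w_q = self.weight + (quantize_weights_ternary(self.weight) - self.weight).detach()
        return nn.functional.linear(x_q, w_q, self.bias)
```

In this reading, such a layer would stand in for the linear projections of attention and feed-forward blocks during quantization-aware training, so the activations each projection consumes are reshaped by the Hadamard transform before being reduced to 4 bits.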
Related papers
- Bitnet.cpp: Efficient Edge Inference for Ternary LLMs [71.5759603658299]
We introduce Bitnet.cpp, an inference system optimized for BitNet b1.58 and ternary LLMs. Bitnet.cpp incorporates a novel mpGEMM library to facilitate sub-2-bits-per-weight, efficient and lossless inference (see the ternary-packing sketch after this list for one way ternary weights fit in under 2 bits each). Our experiments show that Bitnet.cpp achieves up to a 6.25x increase in speed over full-precision baselines and up to 2.32x over low-bit baselines.
arXiv Detail & Related papers (2025-02-17T15:06:28Z)
- BitNet a4.8: 4-bit Activations for 1-bit LLMs [95.73339037243105]
We introduce BitNet a4.8, enabling 4-bit activations for 1-bit Large Language Models.
We demonstrate that BitNet a4.8 achieves performance comparable to BitNet b1.58 with equivalent training costs.
arXiv Detail & Related papers (2024-11-07T18:41:50Z)
- OneBit: Towards Extremely Low-bit Large Language Models [66.29839811207617]
This paper boldly quantizes the weight matrices of LLMs to 1-bit, paving the way for the extremely low bit-width deployment of LLMs. Experiments indicate that OneBit achieves good performance (at least 81% of the non-quantized performance on LLaMA models) with robust training processes.
arXiv Detail & Related papers (2024-02-17T14:26:57Z)
- BitNet: Scaling 1-bit Transformers for Large Language Models [119.18692348616845]
We introduce BitNet, a scalable and stable 1-bit Transformer architecture designed for large language models.
Experimental results on language modeling show that BitNet achieves competitive performance while substantially reducing memory footprint and energy consumption.
arXiv Detail & Related papers (2023-10-17T17:59:15Z)
- DyBit: Dynamic Bit-Precision Numbers for Efficient Quantized Neural Network Inference [28.912023025671868]
This work targets an adaptive data representation with variable-length encoding called DyBit.
We also propose a hardware-aware quantization framework with a mixed-precision accelerator to trade-off the inference accuracy and speedup.
Experimental results demonstrate that the inference accuracy via DyBit is 1.997% higher than the state-of-the-art at 4-bit quantization.
arXiv Detail & Related papers (2023-02-24T08:46:01Z)
- Sub 8-Bit Quantization of Streaming Keyword Spotting Models for Embedded Chipsets [7.5195830365852085]
We propose a novel sub 8-bit quantization aware training algorithm for all components of a 250K parameter feedforward, streaming, state-free keyword spotting model.
We conduct large scale experiments, training on 26,000 hours of de-identified production, far-field and near-field audio data.
arXiv Detail & Related papers (2022-07-13T17:46:08Z)
- Post-Training Sparsity-Aware Quantization [2.2530496464901106]
Quantization is a technique used in deep neural networks (DNNs) to increase execution performance and hardware efficiency.
We propose a sparsity-aware quantization (SPARQ) method, in which the unstructured and dynamic activation sparsity is leveraged in different representation granularities.
SPARQ achieves minor accuracy degradation, 2x speedup over widely used hardware architectures, and a practical hardware implementation.
arXiv Detail & Related papers (2021-05-23T20:12:35Z)
- ActNN: Reducing Training Memory Footprint via 2-Bit Activation Compressed Training [68.63354877166756]
ActNN is a memory-efficient training framework that stores randomly quantized activations for back propagation.
ActNN reduces the memory footprint of the activation by 12x, and it enables training with a 6.6x to 14x larger batch size.
arXiv Detail & Related papers (2021-04-29T05:50:54Z)
- Exploring the Potential of Low-bit Training of Convolutional Neural Networks [16.72709290595995]
We propose a low-bit training framework for convolutional neural networks.
Our framework is built around a novel multi-level scaling (MLS) tensor format.
Experiments show that our framework achieves a superior trade-off between accuracy and bit-width.
arXiv Detail & Related papers (2020-06-04T12:09:35Z)
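The Bitnet.cpp entry above claims lossless inference at under 2 bits per weight for ternary models. Since each ternary weight carries log2(3) ≈ 1.585 bits of information, five weights fit exactly into one byte (3^5 = 243 ≤ 256), i.e. 1.6 bits per weight. The sketch below shows one generic base-3 packing of this kind; it illustrates the arithmetic only and is not the actual mpGEMM layout used by Bitnet.cpp.

```python
# Generic base-3 packing of ternary weights: 5 values per byte = 1.6 bits/weight.
# Illustrative only; not Bitnet.cpp's storage format.
import numpy as np

def pack_ternary(w: np.ndarray) -> np.ndarray:
    """Pack a 1-D array of ternary values {-1, 0, 1} into bytes, 5 values per byte."""
    assert w.ndim == 1 and np.isin(w, (-1, 0, 1)).all()
    trits = (w + 1).astype(np.uint8)                 # map {-1, 0, 1} -> {0, 1, 2}
    trits = np.pad(trits, (0, (-len(trits)) % 5))    # pad to a multiple of 5
    groups = trits.reshape(-1, 5)
    powers = np.array([1, 3, 9, 27, 81], dtype=np.uint8)
    return (groups * powers).sum(axis=1).astype(np.uint8)  # base-3 digits -> one byte

def unpack_ternary(packed: np.ndarray, n: int) -> np.ndarray:
    """Recover the first n ternary values from the packed byte array."""
    digits = packed.astype(np.int32)[:, None] // np.array([1, 3, 9, 27, 81]) % 3
    return (digits.reshape(-1)[:n] - 1).astype(np.int8)    # map {0, 1, 2} -> {-1, 0, 1}

# Round trip check: exact reconstruction at 1.6 bits per weight.
w = np.random.randint(-1, 2, size=1000).astype(np.int8)
assert (unpack_ternary(pack_ternary(w), len(w)) == w).all()
```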
This list is automatically generated from the titles and abstracts of the papers on this site.