BitNet a4.8: 4-bit Activations for 1-bit LLMs
- URL: http://arxiv.org/abs/2411.04965v1
- Date: Thu, 07 Nov 2024 18:41:50 GMT
- Title: BitNet a4.8: 4-bit Activations for 1-bit LLMs
- Authors: Hongyu Wang, Shuming Ma, Furu Wei
- Abstract summary: We introduce BitNet a4.8, enabling 4-bit activations for 1-bit Large Language Models.
We demonstrate that BitNet a4.8 achieves performance comparable to BitNet b1.58 with equivalent training costs.
- Score: 95.73339037243105
- Abstract: Recent research on 1-bit Large Language Models (LLMs), such as BitNet b1.58, presents a promising direction for reducing the inference cost of LLMs while maintaining their performance. In this work, we introduce BitNet a4.8, which enables 4-bit activations for 1-bit LLMs. BitNet a4.8 employs a hybrid quantization and sparsification strategy to mitigate the quantization errors introduced by outlier channels. Specifically, we use 4-bit activations for the inputs to the attention and feed-forward network layers, while sparsifying intermediate states followed by 8-bit quantization. Extensive experiments demonstrate that BitNet a4.8 achieves performance comparable to BitNet b1.58 at equivalent training cost, while being faster at inference by enabling 4-bit (INT4/FP4) kernels. Additionally, BitNet a4.8 activates only 55% of its parameters and supports a 3-bit KV cache, further enhancing the efficiency of large-scale LLM deployment and inference.
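As a rough illustration of the hybrid scheme described in the abstract, the sketch below shows symmetric absmax quantization of layer inputs to 4 bits and magnitude-based sparsification of an intermediate state followed by 8-bit quantization. This is a minimal sketch under assumed conventions (per-tensor scales, a 55% keep ratio, illustrative function names), not the paper's exact recipe.

```python
# Illustrative sketch only: absmax quantization to INT4 for layer inputs and
# top-magnitude sparsification followed by INT8 quantization for intermediate
# states. The paper's actual scaling granularity and sparsification rule may differ.
import numpy as np

def absmax_quant(x, bits):
    """Symmetric absmax quantization of a tensor to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1                   # 7 for INT4, 127 for INT8
    scale = np.max(np.abs(x)) / qmax + 1e-8      # per-tensor scale (an assumption)
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def sparsify_then_quant8(x, keep_ratio=0.55):
    """Zero all but the largest-magnitude entries, then quantize the result to INT8."""
    k = max(1, int(keep_ratio * x.size))
    thresh = np.partition(np.abs(x).ravel(), -k)[-k]
    return absmax_quant(np.where(np.abs(x) >= thresh, x, 0.0), bits=8)

x = np.random.randn(4, 16).astype(np.float32)    # toy "activation" tensor
q4, s4 = absmax_quant(x, bits=4)                 # 4-bit path: attention / FFN inputs
q8, s8 = sparsify_then_quant8(x)                 # sparsified 8-bit path: intermediate states
print(q4.min(), q4.max(), float((q8 == 0).mean()))
```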
Related papers
- 1-bit AI Infra: Part 1.1, Fast and Lossless BitNet b1.58 Inference on CPUs [81.7388752468953]
We introduce bitnet.cpp, a tailored software stack designed to unlock the full potential of 1-bit Large Language Models.
In experiments, bitnet.cpp achieves significant speedups ranging from 2.37x to 6.17x on x86 CPUs and from 1.37x to 5.07x on ARM CPUs.
arXiv Detail & Related papers (2024-10-21T16:14:57Z)
- OneBit: Towards Extremely Low-bit Large Language Models [66.29839811207617]
This paper boldly quantizes the weight matrices of LLMs to 1-bit, paving the way for the extremely low bit-width deployment of LLMs.
Experiments indicate that OneBit achieves good performance (at least 81% of the non-quantized performance on LLaMA models) with robust training processes.
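One generic way to realize a 1-bit weight matrix is to keep only the signs plus a low-rank scale recovered from the weight magnitudes, sketched below. This is a hedged, assumed construction in the spirit of extreme low-bit weight quantization; OneBit's actual decomposition and training recipe are more involved, and the helper names here are made up for illustration.

```python
# Hedged sketch: store only sign(W) (1 bit per weight) plus a rank-1 scale taken
# from |W|. Generic illustration, not OneBit's exact method.
import numpy as np

def one_bit_decompose(W):
    signs = np.sign(W).astype(np.int8)               # the 1-bit part
    U, S, Vt = np.linalg.svd(np.abs(W), full_matrices=False)
    a = U[:, 0] * np.sqrt(S[0])                      # per-output-row scale
    b = Vt[0, :] * np.sqrt(S[0])                     # per-input-column scale
    return signs, a, b

def one_bit_matmul(x, signs, a, b):
    """y = x @ (signs * outer(a, b)).T without materializing the full-precision W."""
    return ((x * b) @ signs.T) * a                   # fold the scales into input/output

W = np.random.randn(64, 128).astype(np.float32)
x = np.random.randn(2, 128).astype(np.float32)
signs, a, b = one_bit_decompose(W)
print(np.abs(x @ W.T - one_bit_matmul(x, signs, a, b)).mean())  # rank-1 approximation error
```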
arXiv Detail & Related papers (2024-02-17T14:26:57Z)
- BitNet: Scaling 1-bit Transformers for Large Language Models [119.18692348616845]
We introduce BitNet, a scalable and stable 1-bit Transformer architecture designed for large language models.
Experimental results on language modeling show that BitNet achieves competitive performance while substantially reducing memory footprint and energy consumption.
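A minimal forward-pass sketch of the kind of layer this implies is shown below, assuming sign-binarized weights with a single absmean scale and 8-bit absmax activations; the actual BitLinear layer, its normalization, and straight-through-estimator training are described in the paper.

```python
# Minimal forward-pass sketch of a BitLinear-style layer: binarize the
# zero-centered weights with one absmean scale and quantize activations to
# 8 bits. Training details (e.g. straight-through gradients) are omitted.
import numpy as np

def bitlinear_forward(x, W):
    beta = np.mean(np.abs(W))                      # single scaling factor for the binary weights
    Wb = np.sign(W - W.mean()) * beta              # {-beta, +beta} weights
    qmax = 127
    s = np.max(np.abs(x)) / qmax + 1e-8            # 8-bit absmax activation scale
    xq = np.clip(np.round(x / s), -qmax, qmax)
    return (xq @ Wb.T) * s                         # dequantize the output

x = np.random.randn(2, 128).astype(np.float32)
W = np.random.randn(64, 128).astype(np.float32)
print(bitlinear_forward(x, W).shape)               # (2, 64)
```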
arXiv Detail & Related papers (2023-10-17T17:59:15Z)
- Sub 8-Bit Quantization of Streaming Keyword Spotting Models for Embedded Chipsets [7.5195830365852085]
We propose a novel sub-8-bit quantization-aware training algorithm for all components of a 250K-parameter feedforward, streaming, state-free keyword spotting model.
We conduct large scale experiments, training on 26,000 hours of de-identified production, far-field and near-field audio data.
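The core building block of quantization-aware training of this kind is fake quantization, sketched below at a generic 6-bit setting; the paper's per-component sub-8-bit recipe and its training algorithm are specific to the keyword-spotting model and not reproduced here.

```python
# Generic fake-quantization forward pass used in quantization-aware training:
# values are rounded to a low-bit grid and immediately dequantized, so the
# network trains against the quantization error. The backward pass usually
# treats this op as identity (straight-through estimator), which is not shown.
import numpy as np

def fake_quant(x, bits=6):
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax + 1e-8
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

w = np.random.randn(250, 40).astype(np.float32)         # toy weight block
print(float(np.abs(w - fake_quant(w, bits=6)).max()))   # error the training loop must absorb
```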
arXiv Detail & Related papers (2022-07-13T17:46:08Z)
- Post-Training Sparsity-Aware Quantization [2.2530496464901106]
Quantization is a technique used in deep neural networks (DNNs) to increase execution performance and hardware efficiency.
We propose a sparsity-aware quantization (SPARQ) method, in which the unstructured and dynamic activation sparsity is leveraged in different representation granularities.
SPARQ incurs only minor accuracy degradation while achieving a 2x speedup over widely used hardware architectures, and it has a practical hardware implementation.
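As a toy illustration of spending bits only where activations are nonzero (the pairing rule and formats below are assumptions made for the sketch; SPARQ's actual bit-level and value-level granularities are described in the paper):

```python
# Toy sparsity-aware quantization: activations are processed in pairs; a zero
# value costs nothing, and its partner may then use a wider integer format.
# Illustrative only, not SPARQ's exact scheme.
import numpy as np

def sparsity_aware_quant(x, bits=4, wide_bits=8):
    flat = x.ravel().astype(np.float32)
    assert flat.size % 2 == 0, "toy version assumes an even number of activations"
    max_abs = np.max(np.abs(flat)) + 1e-8
    out = np.zeros_like(flat)
    for i in range(0, flat.size, 2):
        pair = flat[i:i + 2]
        for j in range(2):
            v, other = pair[j], pair[1 - j]
            if v == 0.0:
                continue                              # zeros are represented for free
            qmax = (2 ** (wide_bits - 1) - 1) if other == 0.0 else (2 ** (bits - 1) - 1)
            scale = max_abs / qmax
            out[i + j] = np.clip(np.round(v / scale), -qmax, qmax) * scale
    return out.reshape(x.shape)

x = np.array([0.0, 0.81, -1.30, 0.07, 0.0, 0.0], dtype=np.float32)
print(sparsity_aware_quant(x))
```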
arXiv Detail & Related papers (2021-05-23T20:12:35Z)
- HAWQV3: Dyadic Neural Network Quantization [73.11579145354801]
Current low-precision quantization algorithms often have the hidden cost of conversion back and forth from floating point to quantized integer values.
We present HAWQV3, a novel mixed-precision integer-only quantization framework.
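The "dyadic" part refers to integer-only rescaling: the floating-point requantization factor is replaced by an integer multiply and a bit shift, roughly as sketched below. The shift width, names, and values are assumptions for illustration; HAWQ-V3's full mixed-precision pipeline is in the paper.

```python
# Hedged sketch of dyadic requantization: the real-valued rescaling factor
# between an INT32 accumulator and the next layer's INT8 input is approximated
# by b / 2**shift with integer b, so inference uses only integer multiplies and shifts.
import numpy as np

def dyadic_requantize(acc_int32, real_scale, shift=16):
    """Rescale an INT32 accumulator to INT8 with an integer multiply and a right shift."""
    b = int(round(real_scale * (1 << shift)))           # dyadic numerator; denominator is 2**shift
    out = (acc_int32.astype(np.int64) * b) >> shift     # pure integer arithmetic
    return np.clip(out, -128, 127).astype(np.int8)

acc = np.array([12345, -6789, 400], dtype=np.int32)
real_scale = 0.0123                                      # e.g. s_x * s_w / s_out from calibration
print(dyadic_requantize(acc, real_scale))                # close to np.round(acc * real_scale)
```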
arXiv Detail & Related papers (2020-11-20T23:51:43Z)
- Fast Implementation of 4-bit Convolutional Neural Networks for Mobile Devices [0.8362190332905524]
We show an efficient implementation of 4-bit matrix multiplication for quantized neural networks.
We also demonstrate a 4-bit quantized neural network for OCR recognition on the MIDV-500 dataset.
The results show that 4-bit quantization is well suited to mobile devices, yielding sufficient accuracy and low inference time.
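The storage trick behind such kernels is packing two signed 4-bit values into each byte and accumulating products in a wider integer. The sketch below is purely illustrative of the data layout (real mobile kernels fuse the unpack, multiply, and accumulate with SIMD instructions, which is not shown).

```python
# Illustrative INT4 packing for a quantized dot product: two signed 4-bit
# values per byte, unpacked and accumulated in INT32.
import numpy as np

def pack_int4(q):
    """Pack signed 4-bit values in [-8, 7] (even count) into one byte per pair."""
    u = (q.astype(np.int16) & 0x0F).astype(np.uint8)    # two's-complement low nibbles
    return u[0::2] | (u[1::2] << 4)

def unpack_int4(packed):
    lo = (packed & 0x0F).astype(np.int8)
    hi = (packed >> 4).astype(np.int8)
    lo = np.where(lo > 7, lo - 16, lo).astype(np.int8)  # sign-extend the nibbles
    hi = np.where(hi > 7, hi - 16, hi).astype(np.int8)
    out = np.empty(packed.size * 2, dtype=np.int8)
    out[0::2], out[1::2] = lo, hi
    return out

w4 = np.array([3, -2, 7, -8, 1, 0], dtype=np.int8)       # toy 4-bit weights
x8 = np.random.randint(-128, 128, size=w4.size).astype(np.int32)
packed = pack_int4(w4)
assert (unpack_int4(packed) == w4).all()
acc = int(np.dot(unpack_int4(packed).astype(np.int32), x8))  # INT32 accumulator
print(packed.nbytes, acc)                                 # 3 bytes store 6 weights
```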
arXiv Detail & Related papers (2020-09-14T14:48:40Z)
- BitPruning: Learning Bitlengths for Aggressive and Accurate Quantization [57.14179747713731]
We introduce a training method for minimizing inference bitlength at any granularity while maintaining accuracy.
On ImageNet, the method produces average per-layer bitlengths of 4.13, 3.76, and 4.36 bits.
arXiv Detail & Related papers (2020-02-08T04:58:33Z)