BitNet a4.8: 4-bit Activations for 1-bit LLMs
- URL: http://arxiv.org/abs/2411.04965v1
- Date: Thu, 07 Nov 2024 18:41:50 GMT
- Title: BitNet a4.8: 4-bit Activations for 1-bit LLMs
- Authors: Hongyu Wang, Shuming Ma, Furu Wei
- Abstract summary: We introduce BitNet a4.8, enabling 4-bit activations for 1-bit Large Language Models.
We demonstrate that BitNet a4.8 achieves performance comparable to BitNet b1.58 with equivalent training costs.
- Score: 95.73339037243105
- Abstract: Recent research on 1-bit Large Language Models (LLMs), such as BitNet b1.58, presents a promising direction for reducing the inference cost of LLMs while maintaining their performance. In this work, we introduce BitNet a4.8, which enables 4-bit activations for 1-bit LLMs. BitNet a4.8 employs a hybrid quantization and sparsification strategy to mitigate the quantization errors introduced by outlier channels. Specifically, we use 4-bit activations for the inputs to the attention and feed-forward network layers, while sparsifying intermediate states followed by 8-bit quantization. Extensive experiments demonstrate that BitNet a4.8 achieves performance comparable to BitNet b1.58 at equivalent training cost, while being faster at inference by enabling 4-bit (INT4/FP4) kernels. Additionally, BitNet a4.8 activates only 55% of its parameters and supports a 3-bit KV cache, further enhancing the efficiency of large-scale LLM deployment and inference.
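As a rough illustration of the hybrid scheme described in the abstract, the sketch below shows symmetric absmax quantization of layer inputs to 4 bits and magnitude-based sparsification of an intermediate state followed by 8-bit quantization. This is a minimal sketch under assumed conventions (per-tensor scales, a 55% keep ratio, illustrative function names), not the paper's exact recipe.

```python
# Illustrative sketch only: absmax quantization to INT4 for layer inputs and
# top-magnitude sparsification followed by INT8 quantization for intermediate
# states. The paper's actual scaling granularity and sparsification rule may differ.
import numpy as np

def absmax_quant(x, bits):
    """Symmetric absmax quantization of a tensor to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1                   # 7 for INT4, 127 for INT8
    scale = np.max(np.abs(x)) / qmax + 1e-8      # per-tensor scale (an assumption)
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def sparsify_then_quant8(x, keep_ratio=0.55):
    """Zero all but the largest-magnitude entries, then quantize the result to INT8."""
    k = max(1, int(keep_ratio * x.size))
    thresh = np.partition(np.abs(x).ravel(), -k)[-k]
    return absmax_quant(np.where(np.abs(x) >= thresh, x, 0.0), bits=8)

x = np.random.randn(4, 16).astype(np.float32)    # toy "activation" tensor
q4, s4 = absmax_quant(x, bits=4)                 # 4-bit path: attention / FFN inputs
q8, s8 = sparsify_then_quant8(x)                 # sparsified 8-bit path: intermediate states
print(q4.min(), q4.max(), float((q8 == 0).mean()))
```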
Related papers
- 1-bit AI Infra: Part 1.1, Fast and Lossless BitNet b1.58 Inference on CPUs [81.7388752468953]
We introduce bitnet.cpp, a tailored software stack designed to unlock the full potential of 1-bit Large Language Models.
In experiments, bitnet.cpp achieves significant speedups ranging from 2.37x to 6.17x on x86 CPUs and from 1.37x to 5.07x on ARM CPUs.
arXiv Detail & Related papers (2024-10-21T16:14:57Z)
- OneBit: Towards Extremely Low-bit Large Language Models [66.29839811207617]
This paper boldly quantizes the weight matrices of LLMs to 1-bit, paving the way for the extremely low bit-width deployment of LLMs.
Experiments indicate that OneBit achieves good performance (at least 81% of the non-quantized performance on LLaMA models) with robust training processes.
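One generic way to realize a 1-bit weight matrix is to keep only the signs plus a low-rank scale recovered from the weight magnitudes, sketched below. This is a hedged, assumed construction in the spirit of extreme low-bit weight quantization; OneBit's actual decomposition and training recipe are more involved, and the helper names here are made up for illustration.

```python
# Hedged sketch: store only sign(W) (1 bit per weight) plus a rank-1 scale taken
# from |W|. Generic illustration, not OneBit's exact method.
import numpy as np

def one_bit_decompose(W):
    signs = np.sign(W).astype(np.int8)               # the 1-bit part
    U, S, Vt = np.linalg.svd(np.abs(W), full_matrices=False)
    a = U[:, 0] * np.sqrt(S[0])                      # per-output-row scale
    b = Vt[0, :] * np.sqrt(S[0])                     # per-input-column scale
    return signs, a, b

def one_bit_matmul(x, signs, a, b):
    """y = x @ (signs * outer(a, b)).T without materializing the full-precision W."""
    return ((x * b) @ signs.T) * a                   # fold the scales into input/output

W = np.random.randn(64, 128).astype(np.float32)
x = np.random.randn(2, 128).astype(np.float32)
signs, a, b = one_bit_decompose(W)
print(np.abs(x @ W.T - one_bit_matmul(x, signs, a, b)).mean())  # rank-1 approximation error
```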
arXiv Detail & Related papers (2024-02-17T14:26:57Z)
- BitNet: Scaling 1-bit Transformers for Large Language Models [119.18692348616845]
We introduce BitNet, a scalable and stable 1-bit Transformer architecture designed for large language models.
Experimental results on language modeling show that BitNet achieves competitive performance while substantially reducing memory footprint and energy consumption.
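A minimal forward-pass sketch of the kind of layer this implies is shown below, assuming sign-binarized weights with a single absmean scale and 8-bit absmax activations; the actual BitLinear layer, its normalization, and straight-through-estimator training are described in the paper.

```python
# Minimal forward-pass sketch of a BitLinear-style layer: binarize the
# zero-centered weights with one absmean scale and quantize activations to
# 8 bits. Training details (e.g. straight-through gradients) are omitted.
import numpy as np

def bitlinear_forward(x, W):
    beta = np.mean(np.abs(W))                      # single scaling factor for the binary weights
    Wb = np.sign(W - W.mean()) * beta              # {-beta, +beta} weights
    qmax = 127
    s = np.max(np.abs(x)) / qmax + 1e-8            # 8-bit absmax activation scale
    xq = np.clip(np.round(x / s), -qmax, qmax)
    return (xq @ Wb.T) * s                         # dequantize the output

x = np.random.randn(2, 128).astype(np.float32)
W = np.random.randn(64, 128).astype(np.float32)
print(bitlinear_forward(x, W).shape)               # (2, 64)
```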
arXiv Detail & Related papers (2023-10-17T17:59:15Z)
- Sub 8-Bit Quantization of Streaming Keyword Spotting Models for Embedded Chipsets [7.5195830365852085]
We propose a novel sub-8-bit quantization-aware training algorithm for all components of a 250K-parameter feedforward, streaming, state-free keyword spotting model.
We conduct large scale experiments, training on 26,000 hours of de-identified production, far-field and near-field audio data.
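The core building block of quantization-aware training of this kind is fake quantization, sketched below at a generic 6-bit setting; the paper's per-component sub-8-bit recipe and its training algorithm are specific to the keyword-spotting model and not reproduced here.

```python
# Generic fake-quantization forward pass used in quantization-aware training:
# values are rounded to a low-bit grid and immediately dequantized, so the
# network trains against the quantization error. The backward pass usually
# treats this op as identity (straight-through estimator), which is not shown.
import numpy as np

def fake_quant(x, bits=6):
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax + 1e-8
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

w = np.random.randn(250, 40).astype(np.float32)         # toy weight block
print(float(np.abs(w - fake_quant(w, bits=6)).max()))   # error the training loop must absorb
```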
arXiv Detail & Related papers (2022-07-13T17:46:08Z)
- Post-Training Sparsity-Aware Quantization [2.2530496464901106]
Quantization is a technique used in deep neural networks (DNNs) to increase execution performance and hardware efficiency.
We propose a sparsity-aware quantization (SPARQ) method, in which the unstructured and dynamic activation sparsity is leveraged in different representation granularities.
SPARQ incurs only minor accuracy degradation while achieving a 2x speedup over widely used hardware architectures, and it has a practical hardware implementation.
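As a toy illustration of spending bits only where activations are nonzero (the pairing rule and formats below are assumptions made for the sketch; SPARQ's actual bit-level and value-level granularities are described in the paper):

```python
# Toy sparsity-aware quantization: activations are processed in pairs; a zero
# value costs nothing, and its partner may then use a wider integer format.
# Illustrative only, not SPARQ's exact scheme.
import numpy as np

def sparsity_aware_quant(x, bits=4, wide_bits=8):
    flat = x.ravel().astype(np.float32)
    assert flat.size % 2 == 0, "toy version assumes an even number of activations"
    max_abs = np.max(np.abs(flat)) + 1e-8
    out = np.zeros_like(flat)
    for i in range(0, flat.size, 2):
        pair = flat[i:i + 2]
        for j in range(2):
            v, other = pair[j], pair[1 - j]
            if v == 0.0:
                continue                              # zeros are represented for free
            qmax = (2 ** (wide_bits - 1) - 1) if other == 0.0 else (2 ** (bits - 1) - 1)
            scale = max_abs / qmax
            out[i + j] = np.clip(np.round(v / scale), -qmax, qmax) * scale
    return out.reshape(x.shape)

x = np.array([0.0, 0.81, -1.30, 0.07, 0.0, 0.0], dtype=np.float32)
print(sparsity_aware_quant(x))
```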
arXiv Detail & Related papers (2021-05-23T20:12:35Z)
- HAWQV3: Dyadic Neural Network Quantization [73.11579145354801]
Current low-precision quantization algorithms often have the hidden cost of conversion back and forth from floating point to quantized integer values.
We present HAWQV3, a novel mixed-precision integer-only quantization framework.
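The "dyadic" part refers to integer-only rescaling: the floating-point requantization factor is replaced by an integer multiply and a bit shift, roughly as sketched below. The shift width, names, and values are assumptions for illustration; HAWQ-V3's full mixed-precision pipeline is in the paper.

```python
# Hedged sketch of dyadic requantization: the real-valued rescaling factor
# between an INT32 accumulator and the next layer's INT8 input is approximated
# by b / 2**shift with integer b, so inference uses only integer multiplies and shifts.
import numpy as np

def dyadic_requantize(acc_int32, real_scale, shift=16):
    """Rescale an INT32 accumulator to INT8 with an integer multiply and a right shift."""
    b = int(round(real_scale * (1 << shift)))           # dyadic numerator; denominator is 2**shift
    out = (acc_int32.astype(np.int64) * b) >> shift     # pure integer arithmetic
    return np.clip(out, -128, 127).astype(np.int8)

acc = np.array([12345, -6789, 400], dtype=np.int32)
real_scale = 0.0123                                      # e.g. s_x * s_w / s_out from calibration
print(dyadic_requantize(acc, real_scale))                # close to np.round(acc * real_scale)
```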
arXiv Detail & Related papers (2020-11-20T23:51:43Z)
- Fast Implementation of 4-bit Convolutional Neural Networks for Mobile Devices [0.8362190332905524]
We show an efficient implementation of 4-bit matrix multiplication for quantized neural networks.
We also demonstrate a 4-bit quantized neural network for OCR recognition on the MIDV-500 dataset.
The results show that 4-bit quantization is well suited to mobile devices, yielding sufficient accuracy and low inference time.
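The storage trick behind such kernels is packing two signed 4-bit values into each byte and accumulating products in a wider integer. The sketch below is purely illustrative of the data layout (real mobile kernels fuse the unpack, multiply, and accumulate with SIMD instructions, which is not shown).

```python
# Illustrative INT4 packing for a quantized dot product: two signed 4-bit
# values per byte, unpacked and accumulated in INT32.
import numpy as np

def pack_int4(q):
    """Pack signed 4-bit values in [-8, 7] (even count) into one byte per pair."""
    u = (q.astype(np.int16) & 0x0F).astype(np.uint8)    # two's-complement low nibbles
    return u[0::2] | (u[1::2] << 4)

def unpack_int4(packed):
    lo = (packed & 0x0F).astype(np.int8)
    hi = (packed >> 4).astype(np.int8)
    lo = np.where(lo > 7, lo - 16, lo).astype(np.int8)  # sign-extend the nibbles
    hi = np.where(hi > 7, hi - 16, hi).astype(np.int8)
    out = np.empty(packed.size * 2, dtype=np.int8)
    out[0::2], out[1::2] = lo, hi
    return out

w4 = np.array([3, -2, 7, -8, 1, 0], dtype=np.int8)       # toy 4-bit weights
x8 = np.random.randint(-128, 128, size=w4.size).astype(np.int32)
packed = pack_int4(w4)
assert (unpack_int4(packed) == w4).all()
acc = int(np.dot(unpack_int4(packed).astype(np.int32), x8))  # INT32 accumulator
print(packed.nbytes, acc)                                 # 3 bytes store 6 weights
```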
arXiv Detail & Related papers (2020-09-14T14:48:40Z)
- BitPruning: Learning Bitlengths for Aggressive and Accurate Quantization [57.14179747713731]
We introduce a training method for minimizing inference bitlength at any granularity while maintaining accuracy.
On ImageNet, the method produces average per-layer bitlengths of 4.13, 3.76, and 4.36 bits.
arXiv Detail & Related papers (2020-02-08T04:58:33Z)