BitNet b1.58 2B4T Technical Report
- URL: http://arxiv.org/abs/2504.12285v2
- Date: Fri, 25 Apr 2025 03:07:55 GMT
- Title: BitNet b1.58 2B4T Technical Report
- Authors: Shuming Ma, Hongyu Wang, Shaohan Huang, Xingxing Zhang, Ying Hu, Ting Song, Yan Xia, Furu Wei
- Abstract summary: We introduce BitNet b1.58 2B4T, the first open-source, native 1-bit Large Language Model (LLM) at the 2-billion parameter scale. Trained on a corpus of 4 trillion tokens, the model has been rigorously evaluated across benchmarks covering language understanding, mathematical reasoning, coding proficiency, and conversational ability.
- Score: 118.78752947128682
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce BitNet b1.58 2B4T, the first open-source, native 1-bit Large Language Model (LLM) at the 2-billion parameter scale. Trained on a corpus of 4 trillion tokens, the model has been rigorously evaluated across benchmarks covering language understanding, mathematical reasoning, coding proficiency, and conversational ability. Our results demonstrate that BitNet b1.58 2B4T achieves performance on par with leading open-weight, full-precision LLMs of similar size, while offering significant advantages in computational efficiency, including substantially reduced memory footprint, energy consumption, and decoding latency. To facilitate further research and adoption, the model weights are released via Hugging Face along with open-source inference implementations for both GPU and CPU architectures.
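To make the "native 1-bit" claim concrete: in the BitNet b1.58 line of work, each weight takes one of three values {-1, 0, +1}, i.e. log2(3) ≈ 1.58 bits of information per weight, which is where the name and the reduced memory footprint come from. The sketch below is a minimal NumPy illustration of the absmean ternary quantizer described in that line of work; it is not the released BitNet b1.58 2B4T implementation, and the function name is ours.

```python
import numpy as np

def absmean_ternary_quantize(W: np.ndarray, eps: float = 1e-5):
    """Quantize a weight matrix to the ternary values {-1, 0, +1}.

    Follows the absmean scheme described for BitNet b1.58-style models:
    scale by the mean absolute weight, then round and clip. Returns the
    ternary matrix and the scale needed to dequantize. (Illustrative
    sketch only, not the released implementation.)
    """
    scale = np.abs(W).mean() + eps                          # per-tensor absmean scale
    W_ternary = np.clip(np.round(W / scale), -1, 1).astype(np.int8)
    return W_ternary, scale

# Toy usage: each ternary weight carries log2(3) ~ 1.58 bits of information,
# versus 16 bits for the bfloat16 weight it replaces.
W = np.random.randn(4, 8).astype(np.float32)
W_q, s = absmean_ternary_quantize(W)
print(W_q)
print(f"bits per weight (information-theoretic): {np.log2(3):.2f}")
```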
Related papers
- Bitnet.cpp: Efficient Edge Inference for Ternary LLMs [71.5759603658299]
We introduce Bitnet.cpp, an inference system optimized for BitNet b1.58 and ternary LLMs. Bitnet.cpp incorporates a novel mpGEMM library to facilitate sub-2-bits-per-weight, efficient and lossless inference. Our experiments show that Bitnet.cpp achieves up to a 6.25x increase in speed over full-precision baselines and up to 2.32x over low-bit baselines.
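As a rough illustration of how "sub-2-bits-per-weight" storage is possible at all: five ternary values fit losslessly into a single byte via base-3 packing (3^5 = 243 ≤ 256), i.e. 1.6 bits per weight. The sketch below shows that generic encoding; it is not necessarily the layout used by Bitnet.cpp's mpGEMM lookup-table kernels, and the function names are ours.

```python
import numpy as np

def pack_ternary_base3(w_ternary: np.ndarray) -> np.ndarray:
    """Pack ternary weights {-1, 0, +1} at 1.6 bits/weight.

    Five base-3 digits fit in one byte (3**5 = 243 <= 256), so every
    group of 5 ternary weights is stored as a single uint8. Generic
    lossless encoding for illustration, not Bitnet.cpp's kernel format.
    """
    digits = (w_ternary + 1).astype(np.uint8)          # map {-1,0,1} -> {0,1,2}
    digits = digits.reshape(-1, 5)                     # assumes length % 5 == 0
    powers = 3 ** np.arange(5, dtype=np.uint32)        # [1, 3, 9, 27, 81]
    return (digits * powers).sum(axis=1).astype(np.uint8)

def unpack_ternary_base3(packed: np.ndarray) -> np.ndarray:
    """Inverse of pack_ternary_base3."""
    codes = packed.astype(np.uint32)[:, None]
    digits = (codes // (3 ** np.arange(5, dtype=np.uint32))) % 3
    return digits.astype(np.int8).reshape(-1) - 1      # back to {-1, 0, +1}

w = np.random.randint(-1, 2, size=40).astype(np.int8)
assert np.array_equal(w, unpack_ternary_base3(pack_ternary_base3(w)))
```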
arXiv Detail & Related papers (2025-02-17T15:06:28Z)
- Unlocking Efficient Large Inference Models: One-Bit Unrolling Tips the Scales [13.846014191157405]
We introduce a novel approach that leverages one-bit algorithm unrolling, effectively integrating information from the physical world in the model architecture. Our method achieves a bit-per-link rate significantly lower than the 1.58 bits reported in prior work. We demonstrate that the proposed one-bit algorithm unrolling scheme can improve both training and test outcomes.
arXiv Detail & Related papers (2025-02-04T00:53:10Z)
- 1-bit AI Infra: Part 1.1, Fast and Lossless BitNet b1.58 Inference on CPUs [81.7388752468953]
We introduce bitnet.cpp, a tailored software stack designed to unlock the full potential of 1-bit Large Language Models.
In experiments, bitnet.cpp achieves significant speedups ranging from 2.37x to 6.17x on x86 CPUs and from 1.37x to 5.07x on ARM CPUs.
arXiv Detail & Related papers (2024-10-21T16:14:57Z)
- BitNet: Scaling 1-bit Transformers for Large Language Models [119.18692348616845]
We introduce BitNet, a scalable and stable 1-bit Transformer architecture designed for large language models.
Experimental results on language modeling show that BitNet achieves competitive performance while substantially reducing memory footprint and energy consumption.
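The core building block described in the BitNet paper is BitLinear, a drop-in replacement for a standard linear layer in which weights are binarized to ±1 around their mean and rescaled by a per-tensor factor. The sketch below is a simplified NumPy version of that idea; it omits the paper's activation quantization and the straight-through estimator used during training, and the names are ours.

```python
import numpy as np

def binarize_weights(W: np.ndarray):
    """Binarize a weight matrix to {-1, +1} with a per-tensor scale.

    Simplified from the BitLinear description in the BitNet paper:
    center the weights, take the sign, and keep the mean absolute value
    as a scale so the output magnitude is roughly preserved.
    """
    alpha = W.mean()                       # centering term
    beta = np.abs(W).mean()                # per-tensor scale
    W_bin = np.where(W - alpha >= 0, 1, -1).astype(np.int8)
    return W_bin, beta

def bitlinear_forward(x: np.ndarray, W: np.ndarray) -> np.ndarray:
    """1-bit linear layer: y = (x @ W_bin.T) * beta (bias omitted)."""
    W_bin, beta = binarize_weights(W)
    return (x @ W_bin.T.astype(np.float32)) * beta

x = np.random.randn(2, 16).astype(np.float32)
W = np.random.randn(32, 16).astype(np.float32)
print(bitlinear_forward(x, W).shape)       # (2, 32)
```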
arXiv Detail & Related papers (2023-10-17T17:59:15Z)
- QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models [57.04178959678024]
We show that the majority of inference computations for large generative models can be performed with both weights and activations being cast to 4 bits.
We achieve this via a hybrid quantization strategy called QUIK, which compresses most of the weights and activations to 4-bit.
We provide GPU kernels matching the QUIK format with highly-efficient layer-wise runtimes, which lead to practical end-to-end throughput improvements of up to 3.4x.
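For context, the compressed portion of such a scheme typically relies on plain symmetric round-to-nearest 4-bit quantization; QUIK's hybrid strategy additionally keeps outlier weights and activations in higher precision, which is omitted here. The sketch below is a minimal NumPy illustration with names of our own choosing, not the QUIK kernels themselves.

```python
import numpy as np

def quantize_int4_symmetric(W: np.ndarray, eps: float = 1e-8):
    """Symmetric round-to-nearest 4-bit quantization, per output row.

    Values are mapped to integers in [-7, 7] with one float scale per
    row. QUIK itself also keeps a small set of outlier columns in
    higher precision; that part is omitted in this sketch.
    """
    scale = np.abs(W).max(axis=1, keepdims=True) / 7.0 + eps
    W_int4 = np.clip(np.round(W / scale), -7, 7).astype(np.int8)
    return W_int4, scale

def dequantize(W_int4: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return W_int4.astype(np.float32) * scale

W = np.random.randn(8, 64).astype(np.float32)
W_q, s = quantize_int4_symmetric(W)
err = np.abs(W - dequantize(W_q, s)).mean()
print(f"mean abs quantization error: {err:.4f}")
```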
arXiv Detail & Related papers (2023-10-13T17:15:05Z)
- Compacting Binary Neural Networks by Sparse Kernel Selection [58.84313343190488]
This paper is motivated by a previously revealed phenomenon that the binary kernels in successful BNNs are nearly power-law distributed.
We develop the Permutation Straight-Through Estimator (PSTE) that is able to not only optimize the selection process end-to-end but also maintain the non-repetitive occupancy of selected codewords.
Experiments verify that our method reduces both the model size and bit-wise computational costs, and achieves accuracy improvements compared with state-of-the-art BNNs under comparable budgets.
arXiv Detail & Related papers (2023-03-25T13:53:02Z)
- MeliusNet: Can Binary Neural Networks Achieve MobileNet-level Accuracy? [12.050205584630922]
Binary Neural Networks (BNNs) are neural networks which use binary weights and activations instead of the typical 32-bit floating point values.
In this paper, we present an architectural approach: MeliusNet. It consists of alternating a DenseBlock, which increases the feature capacity, and our proposed ImprovementBlock, which increases the feature quality.
arXiv Detail & Related papers (2020-01-16T16:56:10Z)