Bitnet.cpp: Efficient Edge Inference for Ternary LLMs
- URL: http://arxiv.org/abs/2502.11880v1
- Date: Mon, 17 Feb 2025 15:06:28 GMT
- Title: Bitnet.cpp: Efficient Edge Inference for Ternary LLMs
- Authors: Jinheng Wang, Hansong Zhou, Ting Song, Shijie Cao, Yan Xia, Ting Cao, Jianyu Wei, Shuming Ma, Hongyu Wang, Furu Wei
- Abstract summary: We introduce Bitnet.cpp, an inference system optimized for BitNet b1.58 and ternary LLMs.
Bitnet.cpp incorporates a novel mpGEMM library to facilitate sub-2-bits-per-weight, efficient and lossless inference.
Our experiments show that Bitnet.cpp achieves up to a 6.25x increase in speed over full-precision baselines and up to 2.32x over low-bit baselines.
- Score: 71.5759603658299
- Abstract: The advent of 1-bit large language models (LLMs), led by BitNet b1.58, has spurred interest in ternary LLMs. Despite this, research and practical applications focusing on efficient edge inference for ternary LLMs remain scarce. To bridge this gap, we introduce Bitnet.cpp, an inference system optimized for BitNet b1.58 and ternary LLMs. Given that mixed-precision matrix multiplication (mpGEMM) constitutes the bulk of inference time in ternary LLMs, Bitnet.cpp incorporates a novel mpGEMM library to facilitate sub-2-bits-per-weight, efficient and lossless inference. The library features two core solutions: Ternary Lookup Table (TL), which addresses spatial inefficiencies of previous bit-wise methods, and Int2 with a Scale (I2_S), which ensures lossless edge inference, both enabling high-speed inference. Our experiments show that Bitnet.cpp achieves up to a 6.25x increase in speed over full-precision baselines and up to 2.32x over low-bit baselines, setting new benchmarks in the field. Additionally, we expand TL to element-wise lookup table (ELUT) for low-bit LLMs in the appendix, presenting both theoretical and empirical evidence of its considerable potential. Bitnet.cpp is publicly available at https://github.com/microsoft/BitNet/tree/paper , offering a sophisticated solution for the efficient and practical deployment of edge LLMs.
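As a rough, illustrative sketch of the ideas in the abstract (not the actual Bitnet.cpp kernels), the C++ snippet below packs ternary weights {-1, 0, +1} into 2 bits each and runs a scalar GEMV with integer accumulation and a single floating-point scale, in the spirit of the I2_S (Int2 with a Scale) format described above. The function names, the int8 activation type, and the per-tensor (rather than per-group) scale are assumptions made for this example.

```cpp
// Illustrative sketch only: packs ternary weights into 2 bits per weight and
// runs a scalar GEMV with one per-tensor scale (an I2_S-like layout).
// Names, the int8 activation type, and per-tensor scaling are assumptions,
// not the actual Bitnet.cpp kernels.
#include <cstddef>
#include <cstdint>
#include <vector>

// Encode one ternary weight (-1, 0, +1) as a 2-bit code: 0 -> 0, +1 -> 1, -1 -> 2.
static inline uint8_t encode_ternary(int8_t w) {
    return w == 0 ? 0u : (w > 0 ? 1u : 2u);
}

// Pack ternary weights, 4 per byte.
std::vector<uint8_t> pack_i2(const std::vector<int8_t>& w) {
    std::vector<uint8_t> packed((w.size() + 3) / 4, 0);
    for (std::size_t i = 0; i < w.size(); ++i)
        packed[i / 4] |= encode_ternary(w[i]) << (2 * (i % 4));
    return packed;
}

// y = scale * (W x): W is a rows x cols ternary matrix (packed), x holds int8 activations.
// Integer accumulation keeps the ternary part exact; the float scale is applied once per row.
void gemv_i2s(const std::vector<uint8_t>& packed, float scale,
              const std::vector<int8_t>& x, int rows, int cols,
              std::vector<float>& y) {
    y.assign(rows, 0.0f);
    for (int r = 0; r < rows; ++r) {
        int32_t acc = 0;
        for (int c = 0; c < cols; ++c) {
            std::size_t idx = static_cast<std::size_t>(r) * cols + c;
            uint8_t code = (packed[idx / 4] >> (2 * (idx % 4))) & 0x3;
            int8_t w = code == 0 ? 0 : (code == 1 ? 1 : -1);
            acc += static_cast<int32_t>(w) * x[c];
        }
        y[r] = scale * static_cast<float>(acc);
    }
}
```

Because the ternary part is accumulated exactly in integers and the scale is applied once at the end, the result matches what full dequantization would give, which is the sense in which such a format can be lossless; production kernels additionally avoid the per-element unpacking in the inner loop, for example via the grouped lookup tables of TL.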
Related papers
- 1-bit AI Infra: Part 1.1, Fast and Lossless BitNet b1.58 Inference on CPUs [81.7388752468953]
We introduce bitnet.cpp, a tailored software stack designed to unlock the full potential of 1-bit Large Language Models.
In experiments, bitnet.cpp achieves significant speedups ranging from 2.37x to 6.17x on x86 CPUs and from 1.37x to 5.07x on ARM CPUs.
arXiv Detail & Related papers (2024-10-21T16:14:57Z)
- LUT Tensor Core: Lookup Table Enables Efficient Low-Bit LLM Inference Acceleration [10.608817382813786]
Mixed-precision matrix multiplication (mpGEMM) is a crucial yet under-explored operation that involves multiplying lower-precision weights with higher-precision activations.
Current hardware does not support mpGEMM, resulting in indirect and inefficient dequantization-based implementations.
We introduce LUT Tensor Core, a software-hardware co-design optimized for low-bit LLM inference (an illustrative lookup-table sketch appears after this list).
arXiv Detail & Related papers (2024-08-12T08:52:14Z)
- CLAQ: Pushing the Limits of Low-Bit Post-Training Quantization for LLMs [44.03692512352445]
Column-Level Adaptive weight Quantization (CLAQ) is a novel and effective framework for Large Language Model (LLM) quantization.
The framework introduces three different types of adaptive strategies for LLM quantization.
Experiments on various mainstream open-source LLMs, including LLaMA-1, LLaMA-2 and Yi, demonstrate that our methods achieve state-of-the-art results across different bit settings.
arXiv Detail & Related papers (2024-05-27T14:49:39Z)
- The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits [129.6765656933016]
We introduce a 1-bit Large Language Model (LLM) variant, namely BitNet b1.58.
The 1.58-bit LLM defines a new scaling law and recipe for training new generations of LLMs.
It enables a new paradigm and opens the door for designing specific hardware optimized for 1-bit LLMs.
arXiv Detail & Related papers (2024-02-27T18:56:19Z)
- OneBit: Towards Extremely Low-bit Large Language Models [66.29839811207617]
This paper boldly quantizes the weight matrices of LLMs to 1-bit, paving the way for the extremely low bit-width deployment of LLMs.
Experiments indicate that OneBit achieves good performance (at least 81% of the non-quantized performance on LLaMA models) with robust training processes.
arXiv Detail & Related papers (2024-02-17T14:26:57Z)
- BiLLM: Pushing the Limit of Post-Training Quantization for LLMs [53.31402059062365]
BiLLM is a groundbreaking 1-bit post-training quantization scheme tailored for pretrained large language models.
It achieves, for the first time, high-accuracy inference (e.g., 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLM families.
arXiv Detail & Related papers (2024-02-06T09:26:34Z)
- A Win-win Deal: Towards Sparse and Robust Pre-trained Language Models [53.87983344862402]
Large-scale pre-trained language models (PLMs) are inefficient in terms of memory footprint and computation.
PLMs tend to rely on dataset bias and struggle to generalize to out-of-distribution (OOD) data.
Recent studies show that dense networks can be replaced with sparse subnetworks without hurting performance.
arXiv Detail & Related papers (2022-10-11T07:26:34Z)
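As referenced in the LUT Tensor Core entry above, the following toy C++ sketch illustrates the lookup-table flavor of mpGEMM used by TL/ELUT-style methods: instead of dequantizing each weight and multiplying, it precomputes the partial dot products of every pair of activations against all 3^2 = 9 ternary patterns and then accumulates table entries selected by the stored weight indices. The group size of 2, the table layout, and all names are assumptions for illustration; they are not taken from Bitnet.cpp, TL1/TL2, ELUT, or LUT Tensor Core.

```cpp
// Illustrative sketch of lookup-table (LUT) mpGEMM for ternary weights:
// precompute the dot product of every 2-activation group with all 9 possible
// ternary patterns, then accumulate table entries selected by stored weight indices.
// Group size (2), table layout, and names are assumptions for illustration.
#include <cstddef>
#include <cstdint>
#include <vector>

// Build the lookup table for one group of 2 activations:
// table[3*i + j] = T[i] * x0 + T[j] * x1, where T = {-1, 0, +1}.
static void build_group_table(int8_t x0, int8_t x1, int32_t table[9]) {
    static const int8_t T[3] = {-1, 0, 1};
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            table[i * 3 + j] = T[i] * x0 + T[j] * x1;
}

// y = scale * (W x), with W stored as one index in [0, 9) per 2-weight group.
// cols must be even; indices[r * (cols / 2) + g] selects the pattern for group g of row r.
void gemv_lut(const std::vector<uint8_t>& indices, float scale,
              const std::vector<int8_t>& x, int rows, int cols,
              std::vector<float>& y) {
    const int groups = cols / 2;
    // Precompute one 9-entry table per activation group (shared by all rows).
    std::vector<int32_t> tables(static_cast<std::size_t>(groups) * 9);
    for (int g = 0; g < groups; ++g)
        build_group_table(x[2 * g], x[2 * g + 1], &tables[static_cast<std::size_t>(g) * 9]);

    y.assign(rows, 0.0f);
    for (int r = 0; r < rows; ++r) {
        int32_t acc = 0;
        for (int g = 0; g < groups; ++g)
            acc += tables[static_cast<std::size_t>(g) * 9 +
                          indices[static_cast<std::size_t>(r) * groups + g]];
        y[r] = scale * static_cast<float>(acc);
    }
}
```

With groups of 2 weights, each index needs ceil(log2 9) = 4 bits, i.e. 2 bits per weight; grouping 3 weights into a 27-entry table needs 5 bits per group, roughly 1.67 bits per weight, which is how such lookup-table layouts can reach the sub-2-bit storage the abstract mentions.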