1-bit AI Infra: Part 1.1, Fast and Lossless BitNet b1.58 Inference on CPUs
- URL: http://arxiv.org/abs/2410.16144v2
- Date: Wed, 23 Oct 2024 11:17:42 GMT
- Title: 1-bit AI Infra: Part 1.1, Fast and Lossless BitNet b1.58 Inference on CPUs
- Authors: Jinheng Wang, Hansong Zhou, Ting Song, Shaoguang Mao, Shuming Ma, Hongyu Wang, Yan Xia, Furu Wei
- Abstract summary: We introduce bitnet.cpp, a tailored software stack designed to unlock the full potential of 1-bit Large Language Models (LLMs).
In experiments, bitnet.cpp achieves significant speedups ranging from 2.37x to 6.17x on x86 CPUs and from 1.37x to 5.07x on ARM CPUs.
- Score: 81.7388752468953
- Abstract: Recent advances in 1-bit Large Language Models (LLMs), such as BitNet and BitNet b1.58, present a promising approach to enhancing the efficiency of LLMs in terms of speed and energy consumption. These developments also enable local LLM deployment across a broad range of devices. In this work, we introduce bitnet.cpp, a tailored software stack designed to unlock the full potential of 1-bit LLMs. Specifically, we develop a set of kernels to support fast and lossless inference of ternary BitNet b1.58 LLMs on CPUs. Extensive experiments demonstrate that bitnet.cpp achieves significant speedups, ranging from 2.37x to 6.17x on x86 CPUs and from 1.37x to 5.07x on ARM CPUs, across various model sizes. The code is available at https://github.com/microsoft/BitNet.
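The abstract does not spell out what the ternary kernels actually compute, so the following is a minimal, hypothetical C++ sketch of the core operation a BitNet b1.58 kernel accelerates: a matrix-vector product whose weights are restricted to {-1, 0, +1}, so every multiplication degenerates into an add, a subtract, or a skip. This is an illustration of the arithmetic only, not the optimized bitnet.cpp kernels, which use packed weight layouts and SIMD tricks not shown here; the names and the per-tensor `weight_scale` parameter are assumptions for the sketch.

```cpp
// Sketch of a ternary-weight matrix-vector product (weights in {-1, 0, +1}).
// Illustrates the arithmetic BitNet b1.58 inference reduces to, not the
// packed/SIMD kernels shipped in bitnet.cpp.
#include <cstddef>
#include <cstdint>
#include <vector>

// y = weight_scale * (W * x), with W stored row-major as int8 values in {-1, 0, +1}.
std::vector<float> ternary_matvec(const std::vector<int8_t>& W,   // rows * cols entries
                                  const std::vector<float>& x,    // cols activations
                                  std::size_t rows, std::size_t cols,
                                  float weight_scale) {           // per-tensor scale from quantization
    std::vector<float> y(rows, 0.0f);
    for (std::size_t r = 0; r < rows; ++r) {
        float acc = 0.0f;
        for (std::size_t c = 0; c < cols; ++c) {
            const int8_t w = W[r * cols + c];
            if (w == 1)       acc += x[c];   // +1: add the activation
            else if (w == -1) acc -= x[c];   // -1: subtract the activation
            // 0: contributes nothing, so it is skipped entirely
        }
        y[r] = acc * weight_scale;           // rescale to the original weight range
    }
    return y;
}
```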
Related papers
- BitNet a4.8: 4-bit Activations for 1-bit LLMs [95.73339037243105]
We introduce BitNet a4.8, enabling 4-bit activations for 1-bit Large Language Models.
We demonstrate that BitNet a4.8 achieves performance comparable to BitNet b1.58 with equivalent training costs.
arXiv Detail & Related papers (2024-11-07T18:41:50Z) - The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits [129.6765656933016]
We introduce a 1-bit Large Language Model (LLM) variant, namely BitNet b1.58.
The 1.58-bit LLM defines a new scaling law and recipe for training new generations of LLMs.
It enables a new paradigm and opens the door for designing specific hardware optimized for 1-bit LLMs.
arXiv Detail & Related papers (2024-02-27T18:56:19Z) - BitNet: Scaling 1-bit Transformers for Large Language Models [119.18692348616845]
We introduce BitNet, a scalable and stable 1-bit Transformer architecture designed for large language models.
Experimental results on language modeling show that BitNet achieves competitive performance while substantially reducing memory footprint and energy consumption.
arXiv Detail & Related papers (2023-10-17T17:59:15Z) - DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup-table-based approach for executing ultra-low-precision convolutional neural networks on SIMD hardware (a simplified sketch of the lookup-table idea appears after this list).
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
arXiv Detail & Related papers (2023-04-18T15:13:10Z) - Run, Don't Walk: Chasing Higher FLOPS for Faster Neural Networks [15.519170283930276]
We propose a novel partial convolution (PConv) that extracts spatial features more efficiently, by cutting down redundant computation and memory access simultaneously.
Building upon our PConv, we further propose FasterNet, a new family of neural networks, which attains substantially higher running speed than others on a wide range of devices.
Our large FasterNet-L achieves an impressive 83.5% top-1 accuracy, on par with the emerging Swin-B, while having 36% higher inference throughput on GPU.
arXiv Detail & Related papers (2023-03-07T06:05:30Z) - BEANNA: A Binary-Enabled Architecture for Neural Network Acceleration [0.0]
This paper proposes and evaluates a neural network hardware accelerator capable of processing both floating point and binary network layers.
Running at a clock speed of 100 MHz, BEANNA achieves a peak throughput of 52.8 GigaOps/second.
arXiv Detail & Related papers (2021-08-04T23:17:34Z) - MCUNet: Tiny Deep Learning on IoT Devices [62.752899523628066]
We propose a framework that jointly designs the efficient neural architecture (TinyNAS) and the lightweight inference engine (TinyEngine)
TinyNAS adopts a two-stage neural architecture search approach that first optimizes the search space to fit the resource constraints and then specializes the network architecture within the optimized search space.
TinyEngine adapts memory scheduling to the overall network topology rather than optimizing layer by layer, reducing memory usage by 4.8x.
arXiv Detail & Related papers (2020-07-20T17:59:01Z)
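As referenced in the DeepGEMM item above, lookup tables can replace low-precision arithmetic entirely. The sketch below, again hypothetical and in plain C++ rather than SIMD intrinsics, shows the general idea for ternary weights: partial sums for every possible pair of ternary values are precomputed once per input vector, after which each weight pair costs a single table lookup. The grouping of two weights per code, the code layout, and the function name are assumptions made for the illustration; real table-driven kernels (DeepGEMM, or the bitnet.cpp CPU kernels) use hardware shuffle instructions and different packing.

```cpp
// Hypothetical sketch of a lookup-table (LUT) based ternary mat-vec, in the
// spirit of the table-driven low-precision kernels listed above; group size,
// weight packing, and scaling are simplified.
#include <cstddef>
#include <cstdint>
#include <vector>

// Weights are stored as one code per *pair* of ternary values:
// code = (w0 + 1) * 3 + (w1 + 1), i.e. 0..8 for (w0, w1) in {-1, 0, +1}^2.
std::vector<float> lut_ternary_matvec(const std::vector<uint8_t>& codes, // rows * (cols/2) codes
                                      const std::vector<float>& x,       // cols activations
                                      std::size_t rows, std::size_t cols) {
    const std::size_t groups = cols / 2;              // assume cols is even
    // Step 1: build a 9-entry table of partial sums w0*x0 + w1*x1 for every
    // activation pair. This is done once per input vector and reused by every
    // row, which is where the win over naive multiply-accumulate comes from.
    std::vector<float> table(groups * 9);
    for (std::size_t g = 0; g < groups; ++g) {
        const float x0 = x[2 * g], x1 = x[2 * g + 1];
        for (int w0 = -1; w0 <= 1; ++w0)
            for (int w1 = -1; w1 <= 1; ++w1)
                table[g * 9 + (w0 + 1) * 3 + (w1 + 1)] = w0 * x0 + w1 * x1;
    }
    // Step 2: each row's dot product reduces to one table lookup per weight pair.
    std::vector<float> y(rows, 0.0f);
    for (std::size_t r = 0; r < rows; ++r) {
        float acc = 0.0f;
        for (std::size_t g = 0; g < groups; ++g)
            acc += table[g * 9 + codes[r * groups + g]];
        y[r] = acc;
    }
    return y;
}
```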
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.