1-bit AI Infra: Part 1.1, Fast and Lossless BitNet b1.58 Inference on CPUs
- URL: http://arxiv.org/abs/2410.16144v2
- Date: Wed, 23 Oct 2024 11:17:42 GMT
- Title: 1-bit AI Infra: Part 1.1, Fast and Lossless BitNet b1.58 Inference on CPUs
- Authors: Jinheng Wang, Hansong Zhou, Ting Song, Shaoguang Mao, Shuming Ma, Hongyu Wang, Yan Xia, Furu Wei
- Abstract summary: We introduce bitnet.cpp, a tailored software stack designed to unlock the full potential of 1-bit Large Language Models (LLMs).
In experiments, bitnet.cpp achieves significant speedups ranging from 2.37x to 6.17x on x86 CPUs and from 1.37x to 5.07x on ARM CPUs.
- Score: 81.7388752468953
- Abstract: Recent advances in 1-bit Large Language Models (LLMs), such as BitNet and BitNet b1.58, present a promising approach to enhancing the efficiency of LLMs in terms of speed and energy consumption. These developments also enable local LLM deployment across a broad range of devices. In this work, we introduce bitnet.cpp, a tailored software stack designed to unlock the full potential of 1-bit LLMs. Specifically, we develop a set of kernels to support fast and lossless inference of ternary BitNet b1.58 LLMs on CPUs. Extensive experiments demonstrate that bitnet.cpp achieves significant speedups, ranging from 2.37x to 6.17x on x86 CPUs and from 1.37x to 5.07x on ARM CPUs, across various model sizes. The code is available at https://github.com/microsoft/BitNet.
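The abstract does not spell out what the ternary kernels actually compute, so the following is a minimal, hypothetical C++ sketch of the core operation a BitNet b1.58 kernel accelerates: a matrix-vector product whose weights are restricted to {-1, 0, +1}, so every multiplication degenerates into an add, a subtract, or a skip. This is an illustration of the arithmetic only, not the optimized bitnet.cpp kernels, which use packed weight layouts and SIMD tricks not shown here; the names and the per-tensor `weight_scale` parameter are assumptions for the sketch.

```cpp
// Sketch of a ternary-weight matrix-vector product (weights in {-1, 0, +1}).
// Illustrates the arithmetic BitNet b1.58 inference reduces to, not the
// packed/SIMD kernels shipped in bitnet.cpp.
#include <cstddef>
#include <cstdint>
#include <vector>

// y = weight_scale * (W * x), with W stored row-major as int8 values in {-1, 0, +1}.
std::vector<float> ternary_matvec(const std::vector<int8_t>& W,   // rows * cols entries
                                  const std::vector<float>& x,    // cols activations
                                  std::size_t rows, std::size_t cols,
                                  float weight_scale) {           // per-tensor scale from quantization
    std::vector<float> y(rows, 0.0f);
    for (std::size_t r = 0; r < rows; ++r) {
        float acc = 0.0f;
        for (std::size_t c = 0; c < cols; ++c) {
            const int8_t w = W[r * cols + c];
            if (w == 1)       acc += x[c];   // +1: add the activation
            else if (w == -1) acc -= x[c];   // -1: subtract the activation
            // 0: contributes nothing, so it is skipped entirely
        }
        y[r] = acc * weight_scale;           // rescale to the original weight range
    }
    return y;
}
```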
Related papers
- BitNet a4.8: 4-bit Activations for 1-bit LLMs [95.73339037243105]
We introduce BitNet a4.8, enabling 4-bit activations for 1-bit Large Language Models.
We demonstrate that BitNet a4.8 achieves performance comparable to BitNet b1.58 with equivalent training costs.
arXiv Detail & Related papers (2024-11-07T18:41:50Z) - The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits [129.6765656933016]
We introduce a 1-bit Large Language Model (LLM) variant, namely BitNet b1.58.
The 1.58-bit LLM defines a new scaling law and recipe for training new generations of LLMs.
It enables a new paradigm and opens the door for designing specific hardware optimized for 1-bit LLMs.
arXiv Detail & Related papers (2024-02-27T18:56:19Z) - BitNet: Scaling 1-bit Transformers for Large Language Models [119.18692348616845]
We introduce BitNet, a scalable and stable 1-bit Transformer architecture designed for large language models.
Experimental results on language modeling show that BitNet achieves competitive performance while substantially reducing memory footprint and energy consumption.
arXiv Detail & Related papers (2023-10-17T17:59:15Z) - DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup-table-based approach for executing ultra-low-precision convolutional neural networks on SIMD hardware (a simplified sketch of the lookup-table idea appears after this list).
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
arXiv Detail & Related papers (2023-04-18T15:13:10Z) - Run, Don't Walk: Chasing Higher FLOPS for Faster Neural Networks [15.519170283930276]
We propose a novel partial convolution (PConv) that extracts spatial features more efficiently, by cutting down redundant computation and memory access simultaneously.
Building upon our PConv, we further propose FasterNet, a new family of neural networks, which attains substantially higher running speed than others on a wide range of devices.
Our large FasterNet-L achieves an impressive 83.5% top-1 accuracy, on par with the emerging Swin-B, while having 36% higher inference throughput on GPU.
arXiv Detail & Related papers (2023-03-07T06:05:30Z) - BEANNA: A Binary-Enabled Architecture for Neural Network Acceleration [0.0]
This paper proposes and evaluates a neural network hardware accelerator capable of processing both floating point and binary network layers.
Running at a clock speed of 100 MHz, BEANNA achieves a peak throughput of 52.8 GigaOps/second.
arXiv Detail & Related papers (2021-08-04T23:17:34Z) - MCUNet: Tiny Deep Learning on IoT Devices [62.752899523628066]
We propose a framework that jointly designs the efficient neural architecture (TinyNAS) and the lightweight inference engine (TinyEngine)
TinyNAS adopts a two-stage neural architecture search approach that first optimizes the search space to fit the resource constraints and then specializes the network architecture within the optimized search space.
TinyEngine adapts memory scheduling to the overall network topology rather than optimizing layer by layer, reducing memory usage by 4.8x.
arXiv Detail & Related papers (2020-07-20T17:59:01Z)
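As referenced in the DeepGEMM item above, lookup tables can replace low-precision arithmetic entirely. The sketch below, again hypothetical and in plain C++ rather than SIMD intrinsics, shows the general idea for ternary weights: partial sums for every possible pair of ternary values are precomputed once per input vector, after which each weight pair costs a single table lookup. The grouping of two weights per code, the code layout, and the function name are assumptions made for the illustration; real table-driven kernels (DeepGEMM, or the bitnet.cpp CPU kernels) use hardware shuffle instructions and different packing.

```cpp
// Hypothetical sketch of a lookup-table (LUT) based ternary mat-vec, in the
// spirit of the table-driven low-precision kernels listed above; group size,
// weight packing, and scaling are simplified.
#include <cstddef>
#include <cstdint>
#include <vector>

// Weights are stored as one code per *pair* of ternary values:
// code = (w0 + 1) * 3 + (w1 + 1), i.e. 0..8 for (w0, w1) in {-1, 0, +1}^2.
std::vector<float> lut_ternary_matvec(const std::vector<uint8_t>& codes, // rows * (cols/2) codes
                                      const std::vector<float>& x,       // cols activations
                                      std::size_t rows, std::size_t cols) {
    const std::size_t groups = cols / 2;              // assume cols is even
    // Step 1: build a 9-entry table of partial sums w0*x0 + w1*x1 for every
    // activation pair. This is done once per input vector and reused by every
    // row, which is where the win over naive multiply-accumulate comes from.
    std::vector<float> table(groups * 9);
    for (std::size_t g = 0; g < groups; ++g) {
        const float x0 = x[2 * g], x1 = x[2 * g + 1];
        for (int w0 = -1; w0 <= 1; ++w0)
            for (int w1 = -1; w1 <= 1; ++w1)
                table[g * 9 + (w0 + 1) * 3 + (w1 + 1)] = w0 * x0 + w1 * x1;
    }
    // Step 2: each row's dot product reduces to one table lookup per weight pair.
    std::vector<float> y(rows, 0.0f);
    for (std::size_t r = 0; r < rows; ++r) {
        float acc = 0.0f;
        for (std::size_t g = 0; g < groups; ++g)
            acc += table[g * 9 + codes[r * groups + g]];
        y[r] = acc;
    }
    return y;
}
```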
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.