The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
- URL: http://arxiv.org/abs/2402.17764v1
- Date: Tue, 27 Feb 2024 18:56:19 GMT
- Title: The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
- Authors: Shuming Ma, Hongyu Wang, Lingxiao Ma, Lei Wang, Wenhui Wang, Shaohan
Huang, Li Dong, Ruiping Wang, Jilong Xue, Furu Wei
- Abstract summary: We introduce a 1-bit Large Language Models (LLMs) variant, namely BitNet b1.58.
The 1.58-bit LLM defines a new scaling law and recipe for training new generations of LLMs.
It enables a new paradigm and opens the door for designing specific hardware optimized for 1-bit LLMs.
- Score: 129.6765656933016
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent research, such as BitNet, is paving the way for a new era of 1-bit
Large Language Models (LLMs). In this work, we introduce a 1-bit LLM variant,
namely BitNet b1.58, in which every single parameter (or weight) of the LLM is
ternary {-1, 0, 1}. It matches the full-precision (i.e., FP16 or BF16)
Transformer LLM with the same model size and training tokens in terms of both
perplexity and end-task performance, while being significantly more
cost-effective in terms of latency, memory, throughput, and energy consumption.
More profoundly, the 1.58-bit LLM defines a new scaling law and recipe for
training new generations of LLMs that are both high-performance and
cost-effective. Furthermore, it enables a new computation paradigm and opens
the door for designing specific hardware optimized for 1-bit LLMs.
Related papers
- Bitnet.cpp: Efficient Edge Inference for Ternary LLMs [71.5759603658299]
We introduce Bitnet, an inference system optimized for BitNet b1.58 and ternary LLMs.
Bitnet incorporates a novel mpGEMM library to facilitate sub-2-bits-per-weight, efficient and lossless inference.
Our experiments show that Bitnet achieves up to a 6.25x increase in speed over full-precision baselines and up to 2.32x over low-bit baselines.
arXiv Detail & Related papers (2025-02-17T15:06:28Z) - Matmul or No Matmal in the Era of 1-bit LLMs [0.48212500317840945]
1-bit large language models (LLMs) have attracted considerable attention and opened up new research opportunities.
However, 1-bit LLMs only improve a fraction of models by applying extreme quantization to the projection layers.
In this work, we present an adaptation of Amdahl's Law tailored for the 1-bit LLM context.
arXiv Detail & Related papers (2024-08-21T18:44:21Z) - Q-Sparse: All Large Language Models can be Fully Sparsely-Activated [93.45300714803429]
We introduce Q-Sparse, a simple yet effective approach to training sparsely-activated large language models (LLMs)
Q-Sparse enables full sparsity of activations in LLMs which can bring significant efficiency gains in inference.
We also introduce Block Q-Sparse for batch training and inference.
arXiv Detail & Related papers (2024-07-15T17:59:29Z) - FBI-LLM: Scaling Up Fully Binarized LLMs from Scratch via Autoregressive Distillation [32.01836613286288]
This work presents a Fully BInarized Large Language Model (FBI-LLM)
It demonstrates for the first time how to train a large-scale binary language model from scratch.
arXiv Detail & Related papers (2024-07-09T17:59:48Z) - Scalable MatMul-free Language Modeling [8.672867887354977]
We show that MatMul operations can be completely eliminated from large language models.
Our proposed MatMul-free models achieve performance on-par with state-of-the-art Transformers.
arXiv Detail & Related papers (2024-06-04T17:50:34Z) - OneBit: Towards Extremely Low-bit Large Language Models [66.29839811207617]
This paper boldly quantizes the weight matrices of LLMs to 1-bit, paving the way for the extremely low bit-width deployment of LLMs.
Experiments indicate that OneBit achieves good performance (at least 81% of the non-quantized performance on LLaMA models) with robust training processes.
arXiv Detail & Related papers (2024-02-17T14:26:57Z) - BiLLM: Pushing the Limit of Post-Training Quantization for LLMs [53.31402059062365]
BiLLM is a groundbreaking 1-bit post-training quantization scheme tailored for pretrained large language models.
It achieves for the first time high-accuracy inference (e.g. 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLMs families.
arXiv Detail & Related papers (2024-02-06T09:26:34Z) - LLM-Pruner: On the Structural Pruning of Large Language Models [65.02607075556742]
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation.
We tackle the compression of LLMs within the bound of two constraints: being task-agnostic and minimizing the reliance on the original training dataset.
Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures.
arXiv Detail & Related papers (2023-05-19T12:10:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.