The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
- URL: http://arxiv.org/abs/2402.17764v1
- Date: Tue, 27 Feb 2024 18:56:19 GMT
- Title: The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
- Authors: Shuming Ma, Hongyu Wang, Lingxiao Ma, Lei Wang, Wenhui Wang, Shaohan
Huang, Li Dong, Ruiping Wang, Jilong Xue, Furu Wei
- Abstract summary: We introduce a 1-bit Large Language Model (LLM) variant, namely BitNet b1.58.
The 1.58-bit LLM defines a new scaling law and recipe for training new generations of LLMs.
It enables a new paradigm and opens the door for designing specific hardware optimized for 1-bit LLMs.
- Score: 129.6765656933016
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent research, such as BitNet, is paving the way for a new era of 1-bit
Large Language Models (LLMs). In this work, we introduce a 1-bit LLM variant,
namely BitNet b1.58, in which every single parameter (or weight) of the LLM is
ternary {-1, 0, 1}. It matches the full-precision (i.e., FP16 or BF16)
Transformer LLM with the same model size and training tokens in terms of both
perplexity and end-task performance, while being significantly more
cost-effective in terms of latency, memory, throughput, and energy consumption.
More profoundly, the 1.58-bit LLM defines a new scaling law and recipe for
training new generations of LLMs that are both high-performance and
cost-effective. Furthermore, it enables a new computation paradigm and opens
the door for designing specific hardware optimized for 1-bit LLMs.
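As a concrete illustration of the ternary weight scheme, below is a minimal sketch of absmean weight quantization in the spirit of what the paper describes (round each weight to {-1, 0, +1} after scaling by the mean absolute value). The class name and the straight-through training detail are illustrative assumptions, and activation quantization (which the paper also uses) is omitted.

```python
import torch

def absmean_ternary_quant(w: torch.Tensor, eps: float = 1e-5):
    """Quantize a weight tensor to {-1, 0, +1} with an absmean scale.

    Returns the ternary tensor and the scale needed to dequantize,
    following the absmean recipe described for BitNet b1.58.
    """
    scale = w.abs().mean().clamp(min=eps)         # gamma = mean(|W|)
    w_ternary = (w / scale).round().clamp(-1, 1)  # RoundClip(W / gamma, -1, 1)
    return w_ternary, scale

class TernaryLinear(torch.nn.Linear):
    """Linear layer whose forward pass uses ternary weights.

    The straight-through estimator keeps gradients flowing to the
    latent full-precision weights during training (an assumption
    about training details, not taken verbatim from the paper).
    """
    def forward(self, x):
        w_q, scale = absmean_ternary_quant(self.weight)
        # Straight-through: forward uses the quantized weights,
        # backward sees the identity w.r.t. self.weight.
        w_ste = self.weight + (w_q * scale - self.weight).detach()
        return torch.nn.functional.linear(x, w_ste, self.bias)

# Usage: x = torch.randn(2, 16); y = TernaryLinear(16, 32)(x)
```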
Related papers
- Matmul or No Matmul in the Era of 1-bit LLMs [0.48212500317840945]
1-bit large language models (LLMs) have attracted considerable attention and opened up new research opportunities.
However, 1-bit LLMs improve only a fraction of the model, since extreme quantization is applied only to the projection layers.
In this work, we present an adaptation of Amdahl's Law tailored for the 1-bit LLM context; a generic version of the law is sketched after this entry.
arXiv Detail & Related papers (2024-08-21T18:44:21Z)
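The entry above mentions an adaptation of Amdahl's Law; below is a generic Amdahl's-Law-style estimate of end-to-end speedup when only a fraction of a model's runtime (e.g., the projection layers) benefits from 1-bit quantization. The formula and the example numbers are illustrative, not the paper's exact formulation.

```python
def amdahl_speedup(quantized_fraction: float, local_speedup: float) -> float:
    """End-to-end speedup when only `quantized_fraction` of the runtime
    is accelerated by `local_speedup` (classic Amdahl's Law)."""
    return 1.0 / ((1.0 - quantized_fraction) + quantized_fraction / local_speedup)

# Example: if projection layers account for 70% of runtime and 1-bit
# kernels make them 8x faster, the whole model speeds up only ~2.6x.
print(amdahl_speedup(0.7, 8.0))  # ~2.58
```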
- Q-Sparse: All Large Language Models can be Fully Sparsely-Activated [93.45300714803429]
We introduce Q-Sparse, a simple yet effective approach to training sparsely-activated large language models (LLMs).
Q-Sparse enables full sparsity of activations in LLMs, which can bring significant efficiency gains during inference; a top-K sparsification sketch follows this entry.
We also introduce Block Q-Sparse for batch training and inference.
arXiv Detail & Related papers (2024-07-15T17:59:29Z)
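As a rough illustration of fully sparse activations, here is a minimal top-K activation sparsification sketch with a straight-through estimator; Q-Sparse's exact formulation (e.g., its choice of K and any rescaling) may differ, so treat the function below as an assumption.

```python
import torch

def topk_sparsify(x: torch.Tensor, k: int) -> torch.Tensor:
    """Keep only the k largest-magnitude entries of each activation
    row and zero out the rest (straight-through in the backward pass)."""
    # Indices of the top-k activations by absolute value, per row.
    idx = x.abs().topk(k, dim=-1).indices
    mask = torch.zeros_like(x).scatter_(-1, idx, 1.0)
    # Forward uses the sparse activations; backward treats the mask as constant.
    return x + (x * mask - x).detach()

# Usage: keep 25% of a 4096-dim activation.
h = torch.randn(2, 4096, requires_grad=True)
h_sparse = topk_sparsify(h, k=1024)
```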
- FBI-LLM: Scaling Up Fully Binarized LLMs from Scratch via Autoregressive Distillation [32.01836613286288]
This work presents a Fully BInarized Large Language Model (FBI-LLM).
It demonstrates for the first time how to train a large-scale binary language model from scratch; a generic distillation-loss sketch follows this entry.
arXiv Detail & Related papers (2024-07-09T17:59:48Z)
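The title above mentions autoregressive distillation; below is a generic sketch of a token-level distillation loss from a full-precision teacher to a binarized student. This is only an assumption about what such a loss looks like, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def autoregressive_distill_loss(student_logits, teacher_logits, temperature=1.0):
    """KL divergence between teacher and student next-token distributions,
    averaged over every position in the sequence.

    student_logits, teacher_logits: (batch, seq_len, vocab) tensors.
    """
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    # batchmean averages over (batch * seq_len) next-token predictions.
    return F.kl_div(s.flatten(0, 1), t.flatten(0, 1), reduction="batchmean") * temperature ** 2

# Usage with dummy logits:
stu = torch.randn(2, 8, 32000, requires_grad=True)
tea = torch.randn(2, 8, 32000)
loss = autoregressive_distill_loss(stu, tea)
```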
- Scalable MatMul-free Language Modeling [8.672867887354977]
We show that MatMul operations can be completely eliminated from large language models.
Our proposed MatMul-free models achieve performance on par with state-of-the-art Transformers; a sketch of an addition-only ternary layer follows this entry.
arXiv Detail & Related papers (2024-06-04T17:50:34Z)
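To make concrete why ternary weights let matrix multiplication be replaced with additions, here is a small sketch: with weights restricted to {-1, 0, +1}, each output element is just a sum and difference of selected inputs. This illustrates the general idea only, not the paper's actual kernels or its MatMul-free attention replacement.

```python
import torch

def ternary_matvec(w_ternary: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    """Compute W @ x for ternary W in {-1, 0, +1} using only additions
    and subtractions (no multiplications)."""
    out = torch.zeros(w_ternary.shape[0], dtype=x.dtype)
    for i, row in enumerate(w_ternary):
        out[i] = x[row == 1].sum() - x[row == -1].sum()
    return out

# Sanity check against an ordinary matrix-vector product.
W = torch.randint(-1, 2, (4, 8)).float()
x = torch.randn(8)
assert torch.allclose(ternary_matvec(W, x), W @ x, atol=1e-5)
```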
- OneBit: Towards Extremely Low-bit Large Language Models [66.29839811207617]
This paper boldly quantizes the weight matrices of LLMs to 1-bit, paving the way for the extremely low bit-width deployment of LLMs.
Experiments indicate that OneBit achieves good performance (at least 81% of the non-quantized performance on LLaMA models) with robust training processes; a simplified sign-and-scale sketch follows this entry.
arXiv Detail & Related papers (2024-02-17T14:26:57Z)
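As a rough picture of 1-bit weight quantization, here is a generic per-row sign-plus-scale approximation (each row keeps a single floating-point scale and a sign matrix). OneBit's own decomposition and training recipe are richer, so treat this only as a simplified sketch.

```python
import torch

def binarize_per_row(w: torch.Tensor):
    """Approximate W with sign(W) scaled by the per-row mean absolute value,
    i.e. W ~ diag(alpha) @ sign(W). Returns the sign matrix and the scales."""
    alpha = w.abs().mean(dim=1, keepdim=True)   # one FP scale per output row
    w_sign = torch.sign(w)
    w_sign[w_sign == 0] = 1.0                   # avoid zero signs
    return w_sign, alpha

# Reconstruction error of the 1-bit approximation on a random matrix.
W = torch.randn(256, 512)
W_sign, alpha = binarize_per_row(W)
err = (W - alpha * W_sign).norm() / W.norm()
print(f"relative error: {err.item():.3f}")
```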
- BiLLM: Pushing the Limit of Post-Training Quantization for LLMs [53.31402059062365]
BiLLM is a groundbreaking 1-bit post-training quantization scheme tailored for pretrained large language models.
It achieves for the first time high-accuracy inference (e.g., 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLM families.
arXiv Detail & Related papers (2024-02-06T09:26:34Z)
- Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs [67.38165028487242]
We introduce Dynamic Sparse No Training (DSnoT), a training-free approach to fine-tuning sparse large language models (LLMs).
Inspired by Dynamic Sparse Training, DSnoT minimizes the reconstruction error between the dense and sparse LLMs.
Our paper offers fresh insights into how to fine-tune sparse LLMs in an efficient, training-free manner and opens new avenues for scaling the great potential of sparsity to LLMs; the reconstruction-error objective is sketched after this entry.
arXiv Detail & Related papers (2023-10-13T07:38:52Z)
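The entry above refers to minimizing the reconstruction error between the dense and sparse model; the sketch below shows that objective for a single linear layer (output of the masked weights versus the dense ones on calibration inputs). DSnoT's actual training-free pruning-and-growing updates are not reproduced here.

```python
import torch

def layer_reconstruction_error(w_dense: torch.Tensor,
                               mask: torch.Tensor,
                               calib_x: torch.Tensor) -> torch.Tensor:
    """Squared Frobenius gap between the dense layer's output and the
    masked (sparse) layer's output on calibration activations.

    w_dense: (out, in) weights; mask: binary (out, in); calib_x: (n, in).
    """
    dense_out = calib_x @ w_dense.t()
    sparse_out = calib_x @ (w_dense * mask).t()
    return (dense_out - sparse_out).pow(2).sum()

# Example: 50% unstructured sparsity by magnitude, then measure the error.
W = torch.randn(128, 256)
thresh = W.abs().flatten().median()
M = (W.abs() >= thresh).float()
X = torch.randn(64, 256)
print(layer_reconstruction_error(W, M, X))
```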
- LLM-Pruner: On the Structural Pruning of Large Language Models [65.02607075556742]
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation.
We tackle the compression of LLMs within the bound of two constraints: being task-agnostic and minimizing the reliance on the original training dataset.
Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures; a simplified group-pruning sketch follows this entry.
arXiv Detail & Related papers (2023-05-19T12:10:53Z)
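As a rough illustration of structural pruning of coupled structures, the sketch below scores groups of coupled weight rows (e.g., an attention head's rows across several matrices) and zeroes out the lowest-scoring groups. LLM-Pruner's actual dependency detection and gradient-based importance estimation are more involved, so this is a simplified, assumed version.

```python
import torch

def prune_coupled_groups(weights: list[torch.Tensor], num_groups: int, keep: int):
    """Split each weight's output dimension into `num_groups` coupled groups,
    score every group by the summed magnitude of its rows across all coupled
    matrices, and zero out the lowest-scoring groups in-place."""
    scores = torch.zeros(num_groups)
    for w in weights:
        rows_per_group = w.shape[0] // num_groups
        scores += w.abs().reshape(num_groups, rows_per_group, -1).sum(dim=(1, 2))
    drop = scores.argsort()[: num_groups - keep]   # lowest-importance groups
    for w in weights:
        rows_per_group = w.shape[0] // num_groups
        for g in drop.tolist():
            w[g * rows_per_group:(g + 1) * rows_per_group].zero_()
    return drop

# Example: two coupled projections sharing 8 "head" groups; keep the best 6.
q_proj, k_proj = torch.randn(512, 512), torch.randn(512, 512)
dropped = prune_coupled_groups([q_proj, k_proj], num_groups=8, keep=6)
```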