BitNet: Scaling 1-bit Transformers for Large Language Models
- URL: http://arxiv.org/abs/2310.11453v1
- Date: Tue, 17 Oct 2023 17:59:15 GMT
- Title: BitNet: Scaling 1-bit Transformers for Large Language Models
- Authors: Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang,
Lingxiao Ma, Fan Yang, Ruiping Wang, Yi Wu, Furu Wei
- Abstract summary: We introduce BitNet, a scalable and stable 1-bit Transformer architecture designed for large language models.
Experimental results on language modeling show that BitNet achieves competitive performance while substantially reducing memory footprint and energy consumption.
- Score: 119.18692348616845
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The increasing size of large language models has posed challenges for
deployment and raised concerns about environmental impact due to high energy
consumption. In this work, we introduce BitNet, a scalable and stable 1-bit
Transformer architecture designed for large language models. Specifically, we
introduce BitLinear as a drop-in replacement of the nn.Linear layer in order to
train 1-bit weights from scratch. Experimental results on language modeling
show that BitNet achieves competitive performance while substantially reducing
memory footprint and energy consumption, compared to state-of-the-art 8-bit
quantization methods and FP16 Transformer baselines. Furthermore, BitNet
exhibits a scaling law akin to full-precision Transformers, suggesting its
potential for effective scaling to even larger language models while
maintaining efficiency and performance benefits.
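As a rough illustration of the core idea, below is a minimal PyTorch sketch of what a BitLinear-style drop-in replacement for nn.Linear could look like: weights are binarized to ±1 and rescaled by their mean absolute value, activations are absmax-quantized to 8 bits, and straight-through estimators let gradients reach the latent full-precision weights. This is a simplified sketch based on the abstract, not the authors' implementation, which handles normalization and scaling details omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BitLinearSketch(nn.Module):
    """Illustrative 1-bit linear layer in the spirit of BitNet's BitLinear.

    Weights are binarized to {-1, +1} and rescaled by their mean absolute
    value; activations are absmax-quantized to `a_bits` bits. Straight-through
    estimators pass gradients to the latent full-precision weights.
    Simplified sketch only -- not the authors' implementation.
    """

    def __init__(self, in_features: int, out_features: int, a_bits: int = 8):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.norm = nn.LayerNorm(in_features)
        self.q_max = 2 ** (a_bits - 1) - 1  # 127 for 8-bit activations

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.norm(x)

        # Absmax quantization of activations (per tensor, for simplicity).
        gamma = x.abs().max().clamp(min=1e-5)
        x_q = torch.round(x * self.q_max / gamma).clamp(-self.q_max, self.q_max)
        x_q = x + (x_q - x).detach()  # straight-through estimator

        # Binarize weights around their mean; beta restores the scale.
        w = self.weight
        beta = w.abs().mean()
        w_b = torch.sign(w - w.mean())
        w_b = w + (w_b - w).detach()  # straight-through estimator

        # Matmul with quantized operands, then rescale the output.
        return F.linear(x_q, w_b) * (beta * gamma / self.q_max)


# Usage: swap nn.Linear for BitLinearSketch inside a Transformer block.
layer = BitLinearSketch(512, 2048)
y = layer(torch.randn(4, 16, 512))
print(y.shape)  # torch.Size([4, 16, 2048])
```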
Related papers
- Layer-Condensed KV Cache for Efficient Inference of Large Language Models [44.24593677113768]
We propose a novel method that only computes and caches the KVs of a small number of layers.
Our method achieves up to 26x higher throughput than standard transformers.
arXiv Detail & Related papers (2024-05-17T08:59:46Z)
- Binary and Ternary Natural Language Generation [24.295815261826153]
Ternary and binary neural networks enable multiplication-free computation.
They promise multiple orders of magnitude efficiency gains over full-precision networks.
However, such networks have proven very difficult to optimize.
We show the first ternary and binary transformer models on the downstream tasks of summarization and machine translation.
arXiv Detail & Related papers (2023-06-02T18:01:02Z)
- MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers [78.85346970193518]
Megabyte is a multi-scale decoder architecture that enables end-to-end differentiable modeling of sequences of over one million bytes.
Experiments show that Megabyte allows byte-level models to perform competitively with subword models on long context language modeling.
Results establish the viability of tokenization-free autoregressive sequence modeling at scale.
arXiv Detail & Related papers (2023-05-12T00:55:41Z)
- Binarized Neural Machine Translation [43.488431560851204]
We propose a novel binarization technique for Transformers applied to machine translation (BMT).
We identify and address the problem of inflated dot-product variance when using one-bit weights and activations.
Experiments on the WMT dataset show that a one-bit weight-only Transformer can achieve the same quality as a float one while being 16x smaller in size.
arXiv Detail & Related papers (2023-02-09T19:27:34Z)
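The variance issue this paper points to can be seen in a toy experiment. The snippet below is an illustration only, not the paper's analysis: naively replacing standard-initialized float weights with unscaled ±1 values inflates the dot-product variance by roughly the fan-in, which is why binarized layers need explicit scaling.

```python
import torch

torch.manual_seed(0)
n, trials = 512, 10_000              # fan-in and number of sampled dot products

x = torch.randn(trials, n)           # activations ~ N(0, 1)
w = torch.randn(trials, n) / n**0.5  # float weights with the usual 1/sqrt(n) scale

float_dots = (x * w).sum(dim=-1)          # ordinary dot products, variance ~1
binary_dots = (x * w.sign()).sum(dim=-1)  # unscaled one-bit weights, variance ~n

print(f"float variance:    {float_dots.var():.2f}")             # ~1
print(f"one-bit variance:  {binary_dots.var():.2f}")            # ~512
print(f"rescaled variance: {(binary_dots / n**0.5).var():.2f}") # ~1 again
```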
- The case for 4-bit precision: k-bit Inference Scaling Laws [75.4335600212427]
Quantization methods reduce the number of bits required to represent each parameter in a model.
The final model size depends on both the number of parameters of the original model and the rate of compression.
We run more than 35,000 zero-shot experiments with 16-bit inputs and k-bit parameters to examine which quantization methods improve scaling for 3 to 8-bit precision.
arXiv Detail & Related papers (2022-12-19T18:48:33Z)
- 8-bit Optimizers via Block-wise Quantization [57.25800395197516]
Stateful optimizers maintain gradient statistics over time, e.g., the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past gradient values.
This state can be used to accelerate optimization compared to plain gradient descent but uses memory that might otherwise be allocated to model parameters.
In this paper, we develop the first optimizers that use 8-bit statistics while maintaining the performance levels of using 32-bit optimizer states.
arXiv Detail & Related papers (2021-10-06T15:43:20Z)
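To make the memory trade-off concrete, here is a rough sketch of block-wise absmax quantization of an optimizer-state tensor to int8. It illustrates the block-wise idea only; per its abstract, the paper combines this with further techniques (dynamic quantization and a stable embedding layer) not shown here.

```python
import torch
import torch.nn.functional as F

def blockwise_quantize(state: torch.Tensor, block_size: int = 256):
    """Quantize a flat optimizer-state tensor to int8 in independent blocks.

    Each block keeps its own absmax scale, so one outlier only hurts
    precision inside its own block. Illustrative linear quantization only.
    """
    flat = state.flatten()
    pad = (-flat.numel()) % block_size                 # pad to whole blocks
    blocks = F.pad(flat, (0, pad)).view(-1, block_size)
    scales = blocks.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    q = torch.round(blocks / scales * 127).to(torch.int8)
    return q, scales

def blockwise_dequantize(q: torch.Tensor, scales: torch.Tensor, like: torch.Tensor):
    """Recover a float tensor with the shape of `like` from int8 blocks."""
    blocks = q.to(torch.float32) / 127 * scales
    return blocks.flatten()[: like.numel()].view(like.shape)

# Round-trip something like Adam's second-moment state through 8-bit storage.
state = torch.rand(10_000) * 1e-3
q, s = blockwise_quantize(state)
recovered = blockwise_dequantize(q, s, state)
print(f"max abs reconstruction error: {(state - recovered).abs().max():.2e}")
print(f"memory: {state.numel() * 4} bytes fp32 -> {q.numel() + s.numel() * 4} bytes int8+scales")
```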
- Scaling Vision Transformers [82.08465256393514]
We study how Vision Transformers scale and characterize the relationships between error rate, data, and compute.
We train a ViT model with two billion parameters, which attains a new state-of-the-art on ImageNet of 90.45% top-1 accuracy.
The model also performs well on few-shot learning, for example, attaining 84.86% top-1 accuracy on ImageNet with only 10 examples per class.
arXiv Detail & Related papers (2021-06-08T17:47:39Z)
- Bottleneck Transformers for Visual Recognition [97.16013761605254]
We present BoTNet, a powerful backbone architecture that incorporates self-attention for vision tasks.
We present models that achieve a strong performance of 84.7% top-1 accuracy on the ImageNet benchmark.
We hope our simple and effective approach will serve as a strong baseline for future research in self-attention models for vision.
arXiv Detail & Related papers (2021-01-27T18:55:27Z)
- Towards Fully 8-bit Integer Inference for the Transformer Model [39.22272841663168]
We show that, after a principled modification of the Transformer architecture, dubbed Integer Transformer, an (almost) fully 8-bit integer inference algorithm can be derived.
Our experiments on the WMT16 En->Ro, WMT14 En->De and En->Fr translation tasks, as well as the WikiText-103 language modelling task, show that the fully 8-bit Transformer achieves performance comparable to the floating-point baseline while requiring nearly 4x less memory.
arXiv Detail & Related papers (2020-09-17T03:09:10Z)
- Extremely Low Bit Transformer Quantization for On-Device Neural Machine Translation [9.770173256808844]
We propose a mixed precision quantization strategy to represent Transformer weights by an extremely low number of bits.
Our model achieves an 11.8x smaller model size than the baseline, with less than a 0.5 BLEU drop.
We achieve an 8.3x reduction in run-time memory footprint and a 3.5x speedup.
arXiv Detail & Related papers (2020-09-16T03:58:01Z)
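As a rough illustration of the mixed-precision idea, the sketch below quantizes different weight groups to different bit widths with simple symmetric absmax quantization and reports the resulting compression versus FP16. The bit budget and module names are hypothetical and are not the paper's recipe.

```python
import torch

def quantize_uniform(w: torch.Tensor, bits: int):
    """Symmetric absmax quantization of a weight tensor to the given bit width."""
    levels = 2 ** (bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-12) / levels
    q = torch.round(w / scale).clamp(-levels, levels)
    return q, scale

# Hypothetical per-group bit budget: large embedding tables get very few bits,
# more sensitive projections keep a little more precision.
bit_budget = {"embedding": 2, "attention": 3, "ffn": 4}
weights = {
    "embedding": torch.randn(32_000, 512),
    "attention": torch.randn(512, 512),
    "ffn": torch.randn(2_048, 512),
}

fp16_bits, mixed_bits = 0, 0
for name, w in weights.items():
    bits = bit_budget[name]
    q, scale = quantize_uniform(w, bits)
    err = (q * scale - w).abs().mean()
    fp16_bits += w.numel() * 16
    mixed_bits += w.numel() * bits
    print(f"{name}: {bits}-bit, mean abs error {err:.3f}")

print(f"compression vs FP16: {fp16_bits / mixed_bits:.1f}x")
```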