When are 1.58 bits enough? A Bottom-up Exploration of BitNet Quantization
- URL: http://arxiv.org/abs/2411.05882v1
- Date: Fri, 08 Nov 2024 07:24:49 GMT
- Title: When are 1.58 bits enough? A Bottom-up Exploration of BitNet Quantization
- Authors: Jacob Nielsen, Lukas Galke, Peter Schneider-Kamp
- Abstract summary: Decoder-only language models can be trained to a competitive state with ternary weights (1.58 bits per weight). We extend this exploration to multi-layer perceptrons, graph neural networks, and encoder-only and encoder-decoder transformers.
Our results show that in all of these settings, 1.58-bit training is on par with or sometimes even better than the standard 32/16-bit models.
- Score: 5.67099529296254
- License:
- Abstract: Contemporary machine learning models, such as language models, are powerful, but come with immense resource requirements both at training and inference time. It has been shown that decoder-only language models can be trained to a competitive state with ternary weights (1.58 bits per weight), facilitating efficient inference. Here, we start our exploration with non-transformer model architectures, investigating 1.58-bit training for multi-layer perceptrons and graph neural networks. Then, we explore 1.58-bit training in other transformer-based language models, namely encoder-only and encoder-decoder models. Our results show that in all of these settings, 1.58-bit training is on par with or sometimes even better than the standard 32/16-bit models.
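At the heart of 1.58-bit training is a forward pass that quantizes each weight matrix to the ternary set {-1, 0, +1}, while full-precision latent weights are kept for the optimizer and gradients flow through the rounding step via a straight-through estimator. Below is a minimal PyTorch sketch of this scheme, assuming the absmean quantization described for BitNet b1.58; the class and function names are illustrative and not taken from the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def absmean_ternary(w: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Quantize a weight tensor to scale * {-1, 0, +1} (absmean scheme).

    The scale is the mean absolute value of the weights; the weights are
    divided by it, rounded, and clipped to the ternary range, then rescaled
    so the layer keeps roughly its original magnitude.
    """
    scale = w.abs().mean().clamp(min=eps)
    w_q = (w / scale).round().clamp(-1, 1)
    return w_q * scale


class BitLinear158(nn.Linear):
    """Linear layer whose forward pass uses ternary weights.

    Full-precision latent weights remain the optimizer's parameters; the
    straight-through estimator makes the rounding step transparent to the
    backward pass.
    """

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        # Forward uses the quantized weights; backward treats the
        # quantization as the identity w.r.t. the latent weights.
        w_q = w + (absmean_ternary(w) - w).detach()
        return F.linear(x, w_q, self.bias)
```

Swapping nn.Linear for a layer like this in an MLP, GNN, or transformer block is enough to experiment with 1.58-bit training; the stored weights stay in 16/32 bits during training and can be packed into ternary values for inference. The full BitNet b1.58 recipe also quantizes activations to 8 bits and normalizes inputs before the linear projection; those steps are omitted from this sketch.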
Related papers
- Scaling Laws of Decoder-Only Models on the Multilingual Machine Translation Task [1.9107347888374506]
We study the scaling laws of decoder-only models on the multilingual and multidomain translation task.
We show that the loss of decoder-only models can be estimated using a scaling law similar to the one discovered for large language models.
We also show that scaling the depth and the width of a model leads to similar test-loss improvements, but with a different impact on the model's efficiency.
arXiv Detail & Related papers (2024-09-23T14:26:01Z)
- BitNet b1.58 Reloaded: State-of-the-art Performance Also on Smaller Networks [2.2300270962881075]
In this work, we investigate 1.58-bit quantization for small language and vision models ranging from 100K to 48M parameters.
We introduce a variant of BitNet b1.58 that relies on the median rather than the mean in the quantization process (a sketch of this median-based scale follows the related-papers list below).
arXiv Detail & Related papers (2024-06-24T20:55:36Z)
- What Language Model to Train if You Have One Million GPU Hours? [54.32062236748831]
We study different modeling practices and their impact on zero-shot generalization.
We also study the performance of a multilingual model and how it compares to the English-only one.
All our models and code are open-sourced at https://huggingface.co/bigscience.
arXiv Detail & Related papers (2022-10-27T13:43:27Z)
- Language Models are General-Purpose Interfaces [109.45478241369655]
We propose to use language models as a general-purpose interface to various foundation models.
A collection of pretrained encoders perceives diverse modalities (such as vision and language).
We propose a semi-causal language modeling objective to jointly pretrain the interface and the modular encoders.
arXiv Detail & Related papers (2022-06-13T17:34:22Z)
- A Study on Transformer Configuration and Training Objective [33.7272660870026]
We propose Bamboo, the idea of using deeper and narrower transformer configurations for masked autoencoder training.
On ImageNet, with this simple change in configuration, the re-designed model achieves 87.1% top-1 accuracy.
On language tasks, the re-designed model outperforms BERT with the default configuration by 1.1 points on average.
arXiv Detail & Related papers (2022-05-21T05:17:11Z)
- OPT: Open Pre-trained Transformer Language Models [99.60254017109551]
We present Open Pre-trained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters.
We show that OPT-175B is comparable to GPT-3, while requiring only 1/7th the carbon footprint to develop.
arXiv Detail & Related papers (2022-05-02T17:49:50Z)
- ED2LM: Encoder-Decoder to Language Model for Faster Document Re-ranking Inference [70.36083572306839]
This paper proposes a new training and inference paradigm for re-ranking.
We finetune a pretrained encoder-decoder model on document-to-query generation.
We show that this encoder-decoder architecture can be decomposed into a decoder-only language model during inference.
arXiv Detail & Related papers (2022-04-25T06:26:29Z)
- What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization? [50.84738303888189]
We present a large-scale evaluation of modeling choices and their impact on zero-shot generalization.
We train models with over 5 billion parameters for more than 170 billion tokens.
We find that pretrained causal decoder models can be efficiently adapted into non-causal decoder models.
arXiv Detail & Related papers (2022-04-12T14:19:49Z)
- Very Deep Transformers for Neural Machine Translation [100.51465892354234]
We show that it is feasible to build standard Transformer-based models with up to 60 encoder layers and 12 decoder layers.
These deep models outperform their baseline 6-layer counterparts by as much as 2.5 BLEU.
arXiv Detail & Related papers (2020-08-18T07:14:54Z)
- Attention Is All You Need [36.87735219227719]
We propose a new simple network architecture, the Transformer, based solely on attention mechanisms.
Experiments on two machine translation tasks show these models to be superior in quality.
arXiv Detail & Related papers (2017-06-12T17:57:34Z)
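Returning to the "BitNet b1.58 Reloaded" entry above: that work swaps the mean for the median when computing the scaling factor used in the ternary rounding step. The sketch below shows one plausible way to express that swap on top of the absmean scheme from the earlier snippet; it is an assumption about the exact formulation, not code from that paper.

```python
import torch


def ternary_quantize(w: torch.Tensor, use_median: bool = True,
                     eps: float = 1e-5) -> torch.Tensor:
    """Ternary quantization with a mean- or median-based scale.

    use_median=False corresponds to the absmean scale of BitNet b1.58;
    use_median=True sketches the median-based variant described in
    'BitNet b1.58 Reloaded' (the paper's exact formulation may differ).
    """
    a = w.abs()
    scale = (a.median() if use_median else a.mean()).clamp(min=eps)
    w_q = (w / scale).round().clamp(-1, 1)
    return w_q * scale
```

Compared with the mean, the median of the absolute weights is less sensitive to a few large outliers, which is one plausible reason such a variant can behave differently on the small networks studied there.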