BitNet b1.58 Reloaded: State-of-the-art Performance Also on Smaller Networks
- URL: http://arxiv.org/abs/2407.09527v1
- Date: Mon, 24 Jun 2024 20:55:36 GMT
- Title: BitNet b1.58 Reloaded: State-of-the-art Performance Also on Smaller Networks
- Authors: Jacob Nielsen, Peter Schneider-Kamp
- Abstract summary: In this work, we investigate 1.58-bit quantization for small language and vision models ranging from 100K to 48M parameters.
We introduce a variant of BitNet b1.58 that relies on the median rather than the mean in the quantization process.
- Score: 2.2300270962881075
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently proposed methods for 1-bit and 1.58-bit quantization-aware training investigate the performance and behavior of these methods in the context of large language models, finding state-of-the-art performance for models with more than 3B parameters. In this work, we investigate 1.58-bit quantization for small language and vision models ranging from 100K to 48M parameters. We introduce a variant of BitNet b1.58 that relies on the median rather than the mean in the quantization process. Through extensive experiments, we investigate the performance of 1.58-bit models obtained through quantization-aware training. We further investigate the robustness of 1.58-bit quantization-aware training to changes in the learning rate and regularization through weight decay, finding different patterns for small language and vision models than previously reported for large language models. Our results show that 1.58-bit quantization-aware training provides state-of-the-art performance for small language models when doubling hidden layer sizes, and reaches or even surpasses state-of-the-art performance for small vision models of identical size. Ultimately, we demonstrate that 1.58-bit quantization-aware training is a viable and promising approach for training smaller deep learning networks as well, facilitating the deployment of such models in low-resource use cases and encouraging future research.
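The quantization step described above maps each weight to {-1, 0, +1} after scaling by a statistic of the absolute weights. The sketch below is a minimal illustration, not the paper's exact formulation: it assumes the absmean scaling published for BitNet b1.58 and contrasts it with a median-based alternative as described in the abstract; the function name and epsilon constant are illustrative choices.

```python
# Minimal sketch of 1.58-bit (ternary) weight quantization.
# The mean branch assumes the absmean recipe of BitNet b1.58; the median
# branch reflects the variant described in the abstract. Names and the
# epsilon constant are illustrative assumptions, not taken from the paper.
import torch

def quantize_ternary(w: torch.Tensor, use_median: bool = False, eps: float = 1e-5):
    abs_w = w.abs()
    # Per-tensor scale: mean of |w| (BitNet b1.58) or the more
    # outlier-robust median of |w| (the variant studied in this work).
    scale = abs_w.median() if use_median else abs_w.mean()
    w_ternary = torch.clamp(torch.round(w / (scale + eps)), -1, 1)
    return w_ternary, scale

w = torch.randn(256, 256)
w_q, s = quantize_ternary(w, use_median=True)
print(torch.unique(w_q))  # tensor([-1., 0., 1.])
```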
Related papers
- When are 1.58 bits enough? A Bottom-up Exploration of BitNet Quantization [5.67099529296254]
We show that decoder-only language models can be trained to a competitive state with ternary weights (1.58 bits per weight).
Our results show that 1.58-bit training is on par with or sometimes even better than the standard 32/16-bit models.
arXiv Detail & Related papers (2024-11-08T07:24:49Z)
- A Hitchhiker's Guide to Scaling Law Estimation [56.06982415792523]
Scaling laws predict the loss of a target machine learning model by extrapolating from easier-to-train models with fewer parameters or smaller training sets.
We estimate more than 1000 scaling laws, then derive a set of best practices for estimating scaling laws in new model families.
arXiv Detail & Related papers (2024-10-15T17:59:10Z)
- BitsFusion: 1.99 bits Weight Quantization of Diffusion Model [43.11229823281721]
We develop a novel weight quantization method that quantizes the UNet from Stable Diffusion v1.5 to 1.99 bits, achieving a model that is 7.9x smaller.
We extensively evaluate our quantized model across various benchmark datasets and through human evaluation to demonstrate its superior generation quality.
arXiv Detail & Related papers (2024-06-06T17:59:23Z)
- Observational Scaling Laws and the Predictability of Language Model Performance [51.2336010244645]
We propose an observational approach that bypasses model training and instead builds scaling laws from 100 publicly available models.
We show that several emergent phenomena follow a smooth, sigmoidal behavior and are predictable from small models.
We show how to predict the impact of post-training interventions like Chain-of-Thought and Self-Consistency as language model capabilities continue to improve.
arXiv Detail & Related papers (2024-05-17T17:49:44Z)
- The effectiveness of MAE pre-pretraining for billion-scale pretraining [65.98338857597935]
We introduce an additional pre-pretraining stage that is simple and uses the self-supervised MAE technique to initialize the model.
We measure the effectiveness of pre-pretraining on 10 different visual recognition tasks spanning image classification, video recognition, object detection, low-shot classification and zero-shot recognition.
arXiv Detail & Related papers (2023-03-23T17:56:12Z)
- Scaling Vision-Language Models with Sparse Mixture of Experts [128.0882767889029]
We show that mixture-of-experts (MoE) techniques can achieve state-of-the-art performance on a range of benchmarks over dense models of equivalent computational cost.
Our research offers valuable insights into stabilizing the training of MoE models, understanding the impact of MoE on model interpretability, and balancing the trade-offs between compute and performance when scaling vision-language models.
arXiv Detail & Related papers (2023-03-13T16:00:31Z)
- Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z)
- Scaling Instruction-Finetuned Language Models [126.4789306516927]
Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance.
We find that instruction finetuning dramatically improves performance on a variety of model classes.
arXiv Detail & Related papers (2022-10-20T16:58:32Z)
- On Data Scaling in Masked Image Modeling [36.00347416479826]
Masked image modeling (MIM) is suspected to be unable to benefit from larger data.
We study data scales ranging from 10% of ImageNet-1K to full ImageNet-22K, model sizes ranging from 49 million to 1 billion parameters, and training lengths ranging from 125K to 500K iterations.
We find that the validation loss in pre-training is a good indicator of how well the model performs when fine-tuned on multiple tasks.
arXiv Detail & Related papers (2022-06-09T17:58:24Z)
- On Resource-Efficient Bayesian Network Classifiers and Deep Neural Networks [14.540226579203207]
We present two methods to reduce the complexity of Bayesian network (BN) classifiers.
First, we introduce quantization-aware training using the straight-through gradient estimator to quantize the parameters of BNs to a few bits (a generic version of this estimator is sketched after this list).
Second, we extend a recently proposed differentiable tree-augmented naive Bayes (TAN) structure learning approach by also considering the model size.
arXiv Detail & Related papers (2020-10-22T14:47:55Z)
- Scalable and Practical Natural Gradient for Large-Scale Deep Learning [19.220930193896404]
SP-NGD scales to large mini-batch sizes with a negligible computational overhead as compared to first-order methods.
We demonstrate convergence to a top-1 validation accuracy of 75.4% in 5.5 minutes using a mini-batch size of 32,768 with 1,024 GPUs, as well as an accuracy of 74.9% with an extremely large mini-batch size of 131,072 in 873 steps of SP-NGD.
arXiv Detail & Related papers (2020-02-13T11:55:37Z)
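The straight-through gradient estimator mentioned in the Bayesian-network entry above is the same mechanism commonly used for 1.58-bit quantization-aware training: the forward pass uses quantized weights, while gradients flow to the latent full-precision weights as if quantization were the identity. The following is a minimal, generic sketch under that assumption; the layer, shapes, and constants are illustrative and not taken from any of the papers listed here.

```python
# Generic quantization-aware training with a straight-through estimator (STE).
# Forward uses ternary weights; backward treats quantization as the identity,
# so gradients update the latent full-precision weights. Illustrative only.
import torch
import torch.nn as nn

class TernaryLinear(nn.Module):
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        scale = w.abs().median()  # median-based scale, as in the variant above
        w_q = torch.clamp(torch.round(w / (scale + 1e-5)), -1, 1) * scale
        # STE: use w_q in the forward pass, but detach the difference so the
        # gradient flows to the full-precision latent weight w.
        w_ste = w + (w_q - w).detach()
        return x @ w_ste.t()

layer = TernaryLinear(16, 4)
loss = layer(torch.randn(8, 16)).sum()
loss.backward()
print(layer.weight.grad.shape)  # torch.Size([4, 16]): latent weights get gradients
```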
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.