When are 1.58 bits enough? A Bottom-up Exploration of BitNet Quantization
- URL: http://arxiv.org/abs/2411.05882v1
- Date: Fri, 08 Nov 2024 07:24:49 GMT
- Title: When are 1.58 bits enough? A Bottom-up Exploration of BitNet Quantization
- Authors: Jacob Nielsen, Lukas Galke, Peter Schneider-Kamp
- Abstract summary: Decoder-only language models can be trained to a competitive state with ternary weights (1.58 bits per weight). We extend this exploration to multi-layer perceptrons, graph neural networks, and encoder-only and encoder-decoder transformers.
Our results show that in all of these settings, 1.58-bit training is on par with or sometimes even better than the standard 32/16-bit models.
- Score: 5.67099529296254
- License:
- Abstract: Contemporary machine learning models, such as language models, are powerful, but come with immense resource requirements both at training and inference time. It has been shown that decoder-only language models can be trained to a competitive state with ternary weights (1.58 bits per weight), facilitating efficient inference. Here, we start our exploration with non-transformer model architectures, investigating 1.58-bit training for multi-layer perceptrons and graph neural networks. Then, we explore 1.58-bit training in other transformer-based language models, namely encoder-only and encoder-decoder models. Our results show that in all of these settings, 1.58-bit training is on par with or sometimes even better than the standard 32/16-bit models.
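At the heart of 1.58-bit training is a forward pass that quantizes each weight matrix to the ternary set {-1, 0, +1}, while full-precision latent weights are kept for the optimizer and gradients flow through the rounding step via a straight-through estimator. Below is a minimal PyTorch sketch of this scheme, assuming the absmean quantization described for BitNet b1.58; the class and function names are illustrative and not taken from the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def absmean_ternary(w: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Quantize a weight tensor to scale * {-1, 0, +1} (absmean scheme).

    The scale is the mean absolute value of the weights; the weights are
    divided by it, rounded, and clipped to the ternary range, then rescaled
    so the layer keeps roughly its original magnitude.
    """
    scale = w.abs().mean().clamp(min=eps)
    w_q = (w / scale).round().clamp(-1, 1)
    return w_q * scale


class BitLinear158(nn.Linear):
    """Linear layer whose forward pass uses ternary weights.

    Full-precision latent weights remain the optimizer's parameters; the
    straight-through estimator makes the rounding step transparent to the
    backward pass.
    """

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        # Forward uses the quantized weights; backward treats the
        # quantization as the identity w.r.t. the latent weights.
        w_q = w + (absmean_ternary(w) - w).detach()
        return F.linear(x, w_q, self.bias)
```

Swapping nn.Linear for a layer like this in an MLP, GNN, or transformer block is enough to experiment with 1.58-bit training; the stored weights stay in 16/32 bits during training and can be packed into ternary values for inference. The full BitNet b1.58 recipe also quantizes activations to 8 bits and normalizes inputs before the linear projection; those steps are omitted from this sketch.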
Related papers
- Scaling Laws of Decoder-Only Models on the Multilingual Machine Translation Task [1.9107347888374506]
We study the scaling laws of decoder-only models on the multilingual and multidomain translation task.
We show that the loss of decoder-only models can be estimated using a scaling law similar to the one discovered for large language models.
We also show that scaling the depth and the width of a model leads to similar test-loss improvements, but with a different impact on the model's efficiency.
arXiv Detail & Related papers (2024-09-23T14:26:01Z)
- BitNet b1.58 Reloaded: State-of-the-art Performance Also on Smaller Networks [2.2300270962881075]
In this work, we investigate 1.58-bit quantization for small language and vision models ranging from 100K to 48M parameters.
We introduce a variant of BitNet b1.58 that relies on the median rather than the mean in the quantization process (a sketch of this median-based scale follows the related-papers list below).
arXiv Detail & Related papers (2024-06-24T20:55:36Z)
- What Language Model to Train if You Have One Million GPU Hours? [54.32062236748831]
We study different modeling practices and their impact on zero-shot generalization.
We also study the performance of a multilingual model and how it compares to the English-only one.
All our models and code are open-sourced at https://huggingface.co/bigscience.
arXiv Detail & Related papers (2022-10-27T13:43:27Z)
- Language Models are General-Purpose Interfaces [109.45478241369655]
We propose to use language models as a general-purpose interface to various foundation models.
A collection of pretrained encoders perceives diverse modalities (such as vision and language).
We propose a semi-causal language modeling objective to jointly pretrain the interface and the modular encoders.
arXiv Detail & Related papers (2022-06-13T17:34:22Z)
- A Study on Transformer Configuration and Training Objective [33.7272660870026]
We propose Bamboo, the idea of using deeper and narrower transformer configurations for masked autoencoder training.
On ImageNet, with this simple change in configuration, the re-designed model achieves 87.1% top-1 accuracy.
On language tasks, the re-designed model outperforms BERT with the default configuration by 1.1 points on average.
arXiv Detail & Related papers (2022-05-21T05:17:11Z)
- OPT: Open Pre-trained Transformer Language Models [99.60254017109551]
We present Open Pre-trained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters.
We show that OPT-175B is comparable to GPT-3, while requiring only 1/7th the carbon footprint to develop.
arXiv Detail & Related papers (2022-05-02T17:49:50Z)
- ED2LM: Encoder-Decoder to Language Model for Faster Document Re-ranking Inference [70.36083572306839]
This paper proposes a new training and inference paradigm for re-ranking.
We finetune a pretrained encoder-decoder model on document-to-query generation.
We show that this encoder-decoder architecture can be decomposed into a decoder-only language model during inference.
arXiv Detail & Related papers (2022-04-25T06:26:29Z)
- What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization? [50.84738303888189]
We present a large-scale evaluation of modeling choices and their impact on zero-shot generalization.
We train models with over 5 billion parameters for more than 170 billion tokens.
We find that pretrained causal decoder models can be efficiently adapted into non-causal decoder models.
arXiv Detail & Related papers (2022-04-12T14:19:49Z)
- Very Deep Transformers for Neural Machine Translation [100.51465892354234]
We show that it is feasible to build standard Transformer-based models with up to 60 encoder layers and 12 decoder layers.
These deep models outperform their baseline 6-layer counterparts by as much as 2.5 BLEU.
arXiv Detail & Related papers (2020-08-18T07:14:54Z)
- Attention Is All You Need [36.87735219227719]
We propose a new simple network architecture, the Transformer, based solely on attention mechanisms.
Experiments on two machine translation tasks show these models to be superior in quality.
arXiv Detail & Related papers (2017-06-12T17:57:34Z)
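Returning to the "BitNet b1.58 Reloaded" entry above: that work swaps the mean for the median when computing the scaling factor used in the ternary rounding step. The sketch below shows one plausible way to express that swap on top of the absmean scheme from the earlier snippet; it is an assumption about the exact formulation, not code from that paper.

```python
import torch


def ternary_quantize(w: torch.Tensor, use_median: bool = True,
                     eps: float = 1e-5) -> torch.Tensor:
    """Ternary quantization with a mean- or median-based scale.

    use_median=False corresponds to the absmean scale of BitNet b1.58;
    use_median=True sketches the median-based variant described in
    'BitNet b1.58 Reloaded' (the paper's exact formulation may differ).
    """
    a = w.abs()
    scale = (a.median() if use_median else a.mean()).clamp(min=eps)
    w_q = (w / scale).round().clamp(-1, 1)
    return w_q * scale
```

Compared with the mean, the median of the absolute weights is less sensitive to a few large outliers, which is one plausible reason such a variant can behave differently on the small networks studied there.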