Related papers: The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

URL: http://arxiv.org/abs/2402.17764v1
Date: Tue, 27 Feb 2024 18:56:19 GMT
Title: The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
Authors: Shuming Ma, Hongyu Wang, Lingxiao Ma, Lei Wang, Wenhui Wang, Shaohan Huang, Li Dong, Ruiping Wang, Jilong Xue, Furu Wei
Abstract summary: We introduce a 1-bit Large Language Model (LLM) variant, namely BitNet b1.58. The 1.58-bit LLM defines a new scaling law and recipe for training new generations of LLMs. It enables a new computation paradigm and opens the door for designing specific hardware optimized for 1-bit LLMs.
Score: 129.6765656933016
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent research, such as BitNet, is paving the way for a new era of 1-bit Large Language Models (LLMs). In this work, we introduce a 1-bit LLM variant, namely BitNet b1.58, in which every single parameter (or weight) of the LLM is ternary {-1, 0, 1}. It matches the full-precision (i.e., FP16 or BF16) Transformer LLM with the same model size and training tokens in terms of both perplexity and end-task performance, while being significantly more cost-effective in terms of latency, memory, throughput, and energy consumption. More profoundly, the 1.58-bit LLM defines a new scaling law and recipe for training new generations of LLMs that are both high-performance and cost-effective. Furthermore, it enables a new computation paradigm and opens the door for designing specific hardware optimized for 1-bit LLMs.
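
The "1.58" in the name follows from the ternary weights: a value drawn from three states carries log2(3) ≈ 1.58 bits. The sketch below illustrates that arithmetic along with one way to map full-precision weights onto {-1, 0, 1}. The abstract above does not specify the quantization function; the absmean scaling used here (divide by the mean absolute weight, then round and clip) is an assumption for illustration, and the function name `absmean_ternarize`, the `eps` parameter, and the sample weights are hypothetical.

```python
import math


def absmean_ternarize(weights, eps=1e-8):
    """Map a list of floats to {-1, 0, 1}.

    Assumed scheme (not stated in the abstract): scale by the mean
    absolute value, round to the nearest integer, clip to [-1, 1].
    """
    gamma = sum(abs(w) for w in weights) / len(weights)
    ternary = [max(-1, min(1, round(w / (gamma + eps)))) for w in weights]
    return ternary, gamma


# Information content of one three-state weight, in bits.
bits_per_weight = math.log2(3)

# Hypothetical full-precision weights, for illustration only.
weights = [0.9, -0.05, 0.4, -1.2, 0.02, -0.6]
ternary, gamma = absmean_ternarize(weights)

print(f"bits per ternary weight: {bits_per_weight:.2f}")
print(f"scale (mean |w|): {gamma:.4f}")
print(f"ternary weights: {ternary}")
```

With ternary weights, a matrix-vector product needs only additions and subtractions of activations (zeros are skipped), which is one plausible reading of the "new computation paradigm" mentioned in the abstract.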

Related papers

Bitnet.cpp: Efficient Edge Inference for Ternary LLMs [71.5759603658299]
We introduce Bitnet.cpp, an inference system optimized for BitNet b1.58 and ternary LLMs. Bitnet.cpp incorporates a novel mpGEMM library to facilitate sub-2-bits-per-weight, efficient and lossless inference. Our experiments show that Bitnet.cpp achieves up to a 6.25x increase in speed over full-precision baselines and up to 2.32x over low-bit baselines.
arXiv Detail & Related papers (2025-02-17T15:06:28Z)
BitMoD: Bit-serial Mixture-of-Datatype LLM Acceleration [7.774285511386959]
Large language models (LLMs) have demonstrated remarkable performance across various machine learning tasks. Yet the substantial memory footprint of LLMs significantly hinders their deployment. We improve the accessibility of LLMs through BitMoD, an algorithm-hardware co-design solution.
arXiv Detail & Related papers (2024-11-18T17:16:58Z)
Matmul or No Matmal in the Era of 1-bit LLMs [0.48212500317840945]
1-bit large language models (LLMs) have attracted considerable attention and opened up new research opportunities. However, 1-bit LLMs only improve a fraction of models by applying extreme quantization to the projection layers. In this work, we present an adaptation of Amdahl's Law tailored for the 1-bit LLM context.
arXiv Detail & Related papers (2024-08-21T18:44:21Z)
Q-Sparse: All Large Language Models can be Fully Sparsely-Activated [93.45300714803429]
We introduce Q-Sparse, a simple yet effective approach to training sparsely-activated large language models (LLMs). Q-Sparse enables full sparsity of activations in LLMs, which can bring significant efficiency gains in inference. We also introduce Block Q-Sparse for batch training and inference.
arXiv Detail & Related papers (2024-07-15T17:59:29Z)
FBI-LLM: Scaling Up Fully Binarized LLMs from Scratch via Autoregressive Distillation [32.01836613286288]
This work presents a Fully BInarized Large Language Model (FBI-LLM). It demonstrates for the first time how to train a large-scale binary language model from scratch.
arXiv Detail & Related papers (2024-07-09T17:59:48Z)
Scalable MatMul-free Language Modeling [8.672867887354977]
We show that MatMul operations can be completely eliminated from large language models. Our proposed MatMul-free models achieve performance on par with state-of-the-art Transformers.
arXiv Detail & Related papers (2024-06-04T17:50:34Z)
OneBit: Towards Extremely Low-bit Large Language Models [66.29839811207617]
This paper boldly quantizes the weight matrices of LLMs to 1-bit, paving the way for the extremely low bit-width deployment of LLMs. Experiments indicate that OneBit achieves good performance (at least 81% of the non-quantized performance on LLaMA models) with robust training processes.
arXiv Detail & Related papers (2024-02-17T14:26:57Z)
BiLLM: Pushing the Limit of Post-Training Quantization for LLMs [53.31402059062365]
BiLLM is a groundbreaking 1-bit post-training quantization scheme tailored for pretrained large language models. For the first time, it achieves high-accuracy inference (e.g. 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLM families.
arXiv Detail & Related papers (2024-02-06T09:26:34Z)
Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs [67.38165028487242]
We introduce Dynamic Sparse No Training (DSnoT), a training-free approach to fine-tuning large language models (LLMs). Inspired by Dynamic Sparse Training, DSnoT minimizes the reconstruction error between the dense and sparse LLMs. Our paper offers fresh insights into how to fine-tune sparse LLMs in an efficient, training-free manner and opens new avenues for scaling the great potential of sparsity to LLMs.
arXiv Detail & Related papers (2023-10-13T07:38:52Z)
LLM-Pruner: On the Structural Pruning of Large Language Models [65.02607075556742]
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation. We tackle the compression of LLMs within the bound of two constraints: being task-agnostic and minimizing the reliance on the original training dataset. Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures.
arXiv Detail & Related papers (2023-05-19T12:10:53Z)

This list is automatically generated from the titles and abstracts of the papers on this site.