Related papers: BitMoD: Bit-serial Mixture-of-Datatype LLM Acceleration

BitMoD: Bit-serial Mixture-of-Datatype LLM Acceleration

URL: http://arxiv.org/abs/2411.11745v1
Date: Mon, 18 Nov 2024 17:16:58 GMT
Title: BitMoD: Bit-serial Mixture-of-Datatype LLM Acceleration
Authors: Yuzong Chen, Ahmed F. AbouElhamayed, Xilai Dai, Yang Wang, Marta Andronic, George A. Constantinides, Mohamed S. Abdelfattah,
Abstract summary: Large language models (LLMs) have demonstrated remarkable performance across various machine learning tasks. Yet the substantial memory footprint of LLMs significantly hinders their deployment. We improve the accessibility of LLMs through BitMoD, an algorithm- hardware co-design solution.
Score: 7.774285511386959
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Large language models (LLMs) have demonstrated remarkable performance across various machine learning tasks. Yet the substantial memory footprint of LLMs significantly hinders their deployment. In this paper, we improve the accessibility of LLMs through BitMoD, an algorithm-hardware co-design solution that enables efficient LLM acceleration at low weight precision. On the algorithm side, BitMoD introduces fine-grained data type adaptation that uses a different numerical data type to quantize a group of (e.g., 128) weights. Through the careful design of these new data types, BitMoD is able to quantize LLM weights to very low precision (e.g., 4 bits and 3 bits) while maintaining high accuracy. On the hardware side, BitMoD employs a bit-serial processing element to easily support multiple numerical precisions and data types; our hardware design includes two key innovations: First, it employs a unified representation to process different weight data types, thus reducing the hardware cost. Second, it adopts a bit-serial dequantization unit to rescale the per-group partial sum with minimal hardware overhead. Our evaluation on six representative LLMs demonstrates that BitMoD significantly outperforms state-of-the-art LLM quantization and acceleration methods. For discriminative tasks, BitMoD can quantize LLM weights to 4-bit with $<\!0.5\%$ accuracy loss on average. For generative tasks, BitMoD is able to quantize LLM weights to 3-bit while achieving better perplexity than prior LLM quantization scheme. Combining the superior model performance with an efficient accelerator design, BitMoD achieves an average of $1.69\times$ and $1.48\times$ speedups compared to prior LLM accelerators ANT and OliVe, respectively.

Related papers

Quantizing Large Language Models for Code Generation: A Differentiated Replication [51.85505914274633]
Large Language Models (LLMs) have shown an impressive capability in code generation and, specifically, to automatically implement requirements described in natural language. LLMs pose significant challenges related to their memory (and, consequently, carbon) footprint. New frontier for LLM quantization is 4-bit precision, resulting in an average memory footprint reduction of 70%.
arXiv Detail & Related papers (2025-03-10T09:26:08Z)
Highly Optimized Kernels and Fine-Grained Codebooks for LLM Inference on Arm CPUs [0.8217552831952]
Large language models (LLMs) have transformed the way we think about language understanding and generation. Group quantization formats commonly used for LLM quantization have significant compute overheads and a resource-intensive dequantization process. We present a groupwise non-uniform codebook-based quantization method for ultra-low-precision quantization of LLMs to better match non-uniform patterns in their weight distributions.
arXiv Detail & Related papers (2024-12-23T03:44:29Z)
ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization [13.622268474310918]
ShiftAddLLM is an efficient multiplication-free model for large language models. It achieves perplexity improvements of 5.6 and 22.7 points at comparable or lower latency. Experiments on five LLM families and eight tasks consistently validate the effectiveness of ShiftAddLLM.
arXiv Detail & Related papers (2024-06-10T02:47:55Z)
I-LLM: Efficient Integer-Only Inference for Fully-Quantized Low-Bit Large Language Models [20.070306492164427]
Post-training quantization serves as a potent technique to accelerate the inference of large language models. Existing works still necessitate a considerable number of floating-point (FP) operations during inference. This limitation hinders the deployment of large language models on the edge and cloud devices. We propose I-LLM, a novel integer-only fully-quantized PTQ framework tailored for large language models.
arXiv Detail & Related papers (2024-05-28T05:56:11Z)
SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models [67.67135738642547]
Post-training quantization (PTQ) is a powerful compression technique investigated in large language models (LLMs) Existing PTQ methods are not ideal in terms of accuracy and efficiency, especially with below 4 bit-widths. This paper presents a Salience-Driven Mixed-Precision Quantization scheme for LLMs, namely SliM-LLM.
arXiv Detail & Related papers (2024-05-23T16:21:48Z)
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits [129.6765656933016]
We introduce a 1-bit Large Language Models (LLMs) variant, namely BitNet b1.58. The 1.58-bit LLM defines a new scaling law and recipe for training new generations of LLMs. It enables a new paradigm and opens the door for designing specific hardware optimized for 1-bit LLMs.
arXiv Detail & Related papers (2024-02-27T18:56:19Z)
OneBit: Towards Extremely Low-bit Large Language Models [66.29839811207617]
This paper boldly quantizes the weight matrices of LLMs to 1-bit, paving the way for the extremely low bit-width deployment of LLMs. Experiments indicate that OneBit achieves good performance (at least 81% of the non-quantized performance on LLaMA models) with robust training processes.
arXiv Detail & Related papers (2024-02-17T14:26:57Z)
BiLLM: Pushing the Limit of Post-Training Quantization for LLMs [53.31402059062365]
BiLLM is a groundbreaking 1-bit post-training quantization scheme tailored for pretrained large language models. It achieves for the first time high-accuracy inference (e.g. 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLMs families.
arXiv Detail & Related papers (2024-02-06T09:26:34Z)
SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression [76.73007709690306]
We introduce the Sparse-Quantized Representation (SpQR), a new compressed format and quantization technique. SpQR achieves relative accuracy losses of less than 1% in perplexity for highly-accurate LLaMA and Falcon LLMs. This makes it possible to run 33B parameter LLM on a single 24 GB consumer GPU without any performance degradation at 15% speedup.
arXiv Detail & Related papers (2023-06-05T17:53:28Z)
LLM-Pruner: On the Structural Pruning of Large Language Models [65.02607075556742]
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation. We tackle the compression of LLMs within the bound of two constraints: being task-agnostic and minimizing the reliance on the original training dataset. Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures.
arXiv Detail & Related papers (2023-05-19T12:10:53Z)
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models [14.929695160346276]
Large language models (LLMs) show excellent performance but are compute- and memory-intensive. We propose SmoothQuant, a training-free, accuracy-preserving, and general-purpose post-training quantization solution. We demonstrate up to 1.56x speedup and 2x memory reduction for LLMs with negligible loss in accuracy.
arXiv Detail & Related papers (2022-11-18T18:59:33Z)

This list is automatically generated from the titles and abstracts of the papers in this site.