Foundations of Large Language Model Compression -- Part 1: Weight Quantization
- URL: http://arxiv.org/abs/2409.02026v2
- Date: Thu, 3 Oct 2024 16:31:59 GMT
- Title: Foundations of Large Language Model Compression -- Part 1: Weight Quantization
- Authors: Sean I. Young
- Abstract summary: Compression of large language models (LLMs) has emerged as an important problem to enable language model deployment on resource-constrained devices.
We lay a foundation for LLM quantization from a convex optimization perspective and propose a quantization technique that builds on this foundation for optimum quantization outcomes.
Our framework, CVXQ, scales to models containing hundreds of billions of weight parameters and provides users with the flexibility to compress models to any specified model size.
- Score: 6.719003232695071
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, compression of large language models (LLMs) has emerged as an important problem to enable language model deployment on resource-constrained devices, reduce computational costs, and mitigate the environmental footprint of large-scale AI infrastructure. In this paper, we lay down the foundation for LLM quantization from a convex optimization perspective and propose a quantization technique that builds on this foundation for optimum quantization outcomes. Our quantization framework, CVXQ, scales to models containing hundreds of billions of weight parameters and provides users with the flexibility to compress models to any specified model size, post-training. A reference implementation of CVXQ can be obtained from github.com/seannz/cvxq.
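To make the idea of post-training weight quantization concrete, the sketch below applies plain uniform per-row quantization to a weight matrix at a given bit-width. It is a minimal, hedged illustration of what weight quantization does, not the CVXQ algorithm itself (which is derived from convex optimization); the function names and NumPy-based setup are assumptions made for this example.

```python
import numpy as np

def quantize_uniform(W, bits=4):
    """Uniform per-row post-training quantization of a weight matrix.

    Each row of W is mapped to integers in [-2^(bits-1), 2^(bits-1) - 1]
    with its own scale. A generic sketch, not the CVXQ method.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax   # per-row scale
    scale = np.where(scale == 0, 1.0, scale)               # guard against all-zero rows
    Q = np.clip(np.round(W / scale), -qmax - 1, qmax)      # integer codes
    return Q.astype(np.int8), scale

def dequantize(Q, scale):
    return Q.astype(np.float32) * scale

W = np.random.randn(8, 16).astype(np.float32)
Q, s = quantize_uniform(W, bits=4)
print("quantization MSE:", np.mean((W - dequantize(Q, s)) ** 2))
```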
Related papers
- LatentLLM: Attention-Aware Joint Tensor Compression [50.33925662486034]
Large language models (LLMs) and large multi-modal models (LMMs) require a massive amount of computational and memory resources. We propose a new framework to convert such LLMs/LMMs into a reduced-dimension latent structure.
arXiv Detail & Related papers (2025-05-23T22:39:54Z) - Radio: Rate-Distortion Optimization for Large Language Model Compression [6.719003232695071]
Compression of large language models (LLMs) has emerged as a key problem in facilitating LLM deployment on resource-limited devices. We propose a quantization technique based on simple rate-distortion optimization. Our technique scales to models containing hundreds of billions of weight parameters and offers users the flexibility to compress models, post-training, to a model size or accuracy specified by the user.
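Rate-distortion optimization here means trading off the bits spent on each weight group against the distortion introduced by quantizing it. The toy sketch below is an assumption-laden illustration, not the Radio algorithm: it allocates per-group bit-widths by minimizing distortion plus a Lagrange multiplier times rate, bisecting on the multiplier until the average bit-width meets a target; the distortion model `variance * 2^(-2b)` is the standard high-rate approximation.

```python
import numpy as np

def allocate_bits(variances, target_avg_bits, candidate_bits=(2, 3, 4, 6, 8)):
    """Toy rate-distortion bit allocation across weight groups.

    Per-group distortion is approximated as variance * 2^(-2b); a
    bisection over the Lagrange multiplier picks per-group bit-widths
    whose average meets the target rate.
    """
    variances = np.asarray(variances, dtype=np.float64)
    bits = np.asarray(candidate_bits, dtype=np.float64)

    def choose(lmbda):
        # cost = distortion + lambda * rate, minimized independently per group
        cost = variances[:, None] * 2.0 ** (-2 * bits[None, :]) + lmbda * bits[None, :]
        return bits[np.argmin(cost, axis=1)]

    lo, hi = 0.0, variances.max()
    for _ in range(50):                      # bisection on the multiplier
        mid = 0.5 * (lo + hi)
        if choose(mid).mean() > target_avg_bits:
            lo = mid                         # too many bits -> raise the price of rate
        else:
            hi = mid
    return choose(hi)

print(allocate_bits([4.0, 1.0, 0.25, 0.01], target_avg_bits=4))
```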
arXiv Detail & Related papers (2025-05-05T21:17:14Z) - Binary Neural Networks for Large Language Model: A Survey [6.8834621543726815]
Low-bit quantization, as a key technique, reduces memory usage and computational demands by decreasing the bit-width of model parameters.
The BitNet team proposed a radically different approach, where quantization is performed from the start of model training, utilizing low-precision binary weights.
This paper provides a comprehensive review of these binary quantization techniques.
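As a rough illustration of binary weight quantization (a sketch of the generic sign-and-scale formulation, not the BitNet training procedure, which applies such quantization throughout training), each weight can be replaced by its sign times a shared scale chosen to minimize the squared approximation error:

```python
import numpy as np

def binarize_weights(W):
    """Binarize a weight matrix: W ~ alpha * sign(W).

    alpha = mean(|W|) minimizes ||W - alpha * sign(W)||^2 for a single
    shared scale. An illustrative approximation, not the BitNet recipe.
    """
    alpha = np.abs(W).mean()
    B = np.where(W >= 0, 1.0, -1.0).astype(np.float32)
    return B, alpha

W = np.random.randn(4, 8).astype(np.float32)
B, alpha = binarize_weights(W)
print("approximation error:", np.mean((W - alpha * B) ** 2))
```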
arXiv Detail & Related papers (2025-02-26T10:14:19Z) - Pushing the Limits of Large Language Model Quantization via the Linearity Theorem [71.3332971315821]
We present a "line theoremarity" establishing a direct relationship between the layer-wise $ell$ reconstruction error and the model perplexity increase due to quantization.
This insight enables two novel applications: (1) a simple data-free LLM quantization method using Hadamard rotations and MSE-optimal grids, dubbed HIGGS, and (2) an optimal solution to the problem of finding non-uniform per-layer quantization levels.
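A hedged sketch of the rotate-then-quantize idea follows: rows of the weight matrix are multiplied by a normalized Hadamard matrix to spread outliers across coordinates, quantized on a grid, and rotated back. The grid below is plain uniform rather than the MSE-optimal grid HIGGS uses, and the helper names are invented for illustration.

```python
import numpy as np

def hadamard(n):
    """Normalized Hadamard matrix of size n (n must be a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def rotate_and_quantize(W, bits=4):
    """Rotate weight rows with a Hadamard matrix, quantize, rotate back.

    Only a sketch of the rotation idea: HIGGS pairs the rotation with
    MSE-optimal grids, whereas a plain uniform grid is used here.
    """
    n = W.shape[1]
    H = hadamard(n)
    R = W @ H.T                                   # rotate each row
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(R).max(axis=1, keepdims=True) / qmax
    Q = np.clip(np.round(R / scale), -qmax - 1, qmax)
    return (Q * scale) @ H                        # dequantize and undo the rotation

W = np.random.randn(8, 16).astype(np.float32)
print("reconstruction MSE:", np.mean((W - rotate_and_quantize(W)) ** 2))
```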
arXiv Detail & Related papers (2024-11-26T15:35:44Z) - GWQ: Gradient-Aware Weight Quantization for Large Language Models [61.17678373122165]
Gradient-aware weight quantization (GWQ) is the first low-bit weight quantization approach that leverages gradients to localize outliers.
GWQ preferentially retains the weights corresponding to the top 1% of outliers at FP16 precision, while the remaining non-outlier weights are stored in a low-bit format.
In the zero-shot task, GWQ quantized models have higher accuracy compared to other quantization methods.
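The sketch below illustrates the general outlier-preserving mixed-precision idea: keep the most salient ~1% of weights in full precision and quantize the rest to a low bit-width. GWQ derives saliency from gradients; this example accepts any saliency score and uses |W| in the demo, so it is only a stand-in for the actual method.

```python
import numpy as np

def mixed_precision_quantize(W, saliency, bits=3, keep_frac=0.01):
    """Keep the top keep_frac most salient weights in full precision and
    quantize the rest with a single low-bit scale.

    `saliency` is any score with the same shape as W (e.g. |gradient|
    in a GWQ-style scheme, or |W| as a crude placeholder).
    """
    flat = saliency.ravel()
    k = max(1, int(keep_frac * flat.size))
    thresh = np.partition(flat, -k)[-k]            # saliency cutoff for the top-k
    outlier_mask = saliency >= thresh

    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W[~outlier_mask]).max() / qmax
    Q = np.clip(np.round(W / scale), -qmax - 1, qmax) * scale

    return np.where(outlier_mask, W, Q)            # outliers stay in full precision

W = np.random.randn(64, 64).astype(np.float32)
W_hat = mixed_precision_quantize(W, saliency=np.abs(W))
print("MSE with outlier protection:", np.mean((W - W_hat) ** 2))
```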
arXiv Detail & Related papers (2024-10-30T11:16:04Z) - Advancing Multimodal Large Language Models with Quantization-Aware Scale Learning for Efficient Adaptation [70.22782550540714]
We introduce a Quantization-aware Scale LeArning method based on multimodal Warmup, termed QSLAW.
arXiv Detail & Related papers (2024-08-07T12:42:09Z) - Comprehensive Study on Performance Evaluation and Optimization of Model Compression: Bridging Traditional Deep Learning and Large Language Models [0.0]
An increase in the number of connected devices around the world warrants compressed models that can be easily deployed on local devices with low compute capacity and limited power availability.
We implemented both compression techniques, quantization and pruning, on popular deep learning models used in image classification, object detection, language modeling, and generative modeling.
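As a minimal, hedged illustration of combining the two techniques the study evaluates (not its actual experimental pipeline), the sketch below applies magnitude pruning followed by uniform quantization to a single weight tensor:

```python
import numpy as np

def prune_and_quantize(W, sparsity=0.5, bits=8):
    """Magnitude pruning followed by uniform quantization.

    Zero out the smallest-magnitude weights, then quantize the
    survivors to `bits`-bit integers with a single per-tensor scale.
    """
    k = int(sparsity * W.size)
    thresh = np.partition(np.abs(W).ravel(), k)[k]        # pruning threshold
    W_pruned = np.where(np.abs(W) < thresh, 0.0, W)

    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W_pruned).max() / qmax
    Q = np.clip(np.round(W_pruned / scale), -qmax - 1, qmax)
    return Q.astype(np.int8), scale

W = np.random.randn(32, 32).astype(np.float32)
Q, s = prune_and_quantize(W)
print("nonzero fraction after pruning:", np.count_nonzero(Q) / Q.size)
```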
arXiv Detail & Related papers (2024-07-22T14:20:53Z) - LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit [55.73370804397226]
Quantization, a key compression technique, can effectively mitigate these demands by compressing and accelerating large language models.
We present LLMC, a plug-and-play compression toolkit, to fairly and systematically explore the impact of quantization.
Powered by this versatile toolkit, our benchmark covers three key aspects: calibration data, algorithms (three strategies), and data formats.
arXiv Detail & Related papers (2024-05-09T11:49:05Z) - WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More [55.0856305773081]
Large Language Models (LLMs) face significant deployment challenges due to their substantial memory requirements and the computational demands of the auto-regressive text generation process.
This paper addresses these challenges by focusing on the quantization of LLMs, a technique that reduces memory consumption by converting model parameters and activations into low-bit integers.
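For orientation, the sketch below shows a generic symmetric per-token integer quantization of a key/value cache tensor; it is not the WKVQuant scheme, and the shapes and function names are assumptions for the example.

```python
import numpy as np

def quantize_kv_cache(kv, bits=8):
    """Symmetric per-token quantization of a key/value cache tensor.

    kv has shape (num_tokens, num_heads, head_dim); each token's cache
    entry gets its own scale. A generic sketch of cache quantization.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(kv).max(axis=(1, 2), keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)
    q = np.clip(np.round(kv / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

kv = np.random.randn(128, 16, 64).astype(np.float32)    # tokens x heads x head_dim
q, s = quantize_kv_cache(kv)
print("bytes before:", kv.nbytes, "bytes after:", q.nbytes + s.nbytes)
```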
arXiv Detail & Related papers (2024-02-19T11:33:21Z) - Retrieval-based Knowledge Transfer: An Effective Approach for Extreme Large Language Model Compression [64.07696663255155]
Large-scale pre-trained language models (LLMs) have demonstrated exceptional performance in various natural language processing (NLP) tasks.
However, the massive size of these models poses huge challenges for their deployment in real-world applications.
We introduce a novel compression paradigm called Retrieval-based Knowledge Transfer (RetriKT) which effectively transfers the knowledge of LLMs to extremely small-scale models.
arXiv Detail & Related papers (2023-10-24T07:58:20Z) - EQ-Net: Elastic Quantization Neural Networks [15.289359357583079]
Elastic Quantization Neural Networks (EQ-Net) aims to train a robust weight-sharing quantization supernet.
We propose an elastic quantization space (including elastic bit-width, granularity, and symmetry) to adapt to various mainstream quantization forms.
We incorporate genetic algorithms and the proposed Conditional Quantization-Aware Accuracy Predictor (CQAP) as an estimator to quickly search for mixed-precision quantized neural networks within the supernet.
arXiv Detail & Related papers (2023-08-15T08:57:03Z) - Scaling Pre-trained Language Models to Deeper via Parameter-efficient Architecture [68.13678918660872]
We design a more capable parameter-sharing architecture based on the matrix product operator (MPO).
MPO decomposition can reorganize and factorize the information of a parameter matrix into two parts.
Our architecture shares the central tensor across all layers to reduce the model size.
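As a loose illustration of factorizing a parameter matrix into a central part plus thin auxiliary parts (using truncated SVD as a stand-in for the MPO decomposition described above), one might write:

```python
import numpy as np

def factorize(W, rank):
    """Factor a weight matrix into a small central core and thin side factors
    via truncated SVD (a stand-in for the MPO decomposition, not the method itself).

    In a parameter-sharing architecture, the central part could be shared
    across layers while the thin factors remain layer-specific.
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    left = U[:, :rank]
    central = np.diag(S[:rank])
    right = Vt[:rank, :]
    return left, central, right

W = np.random.randn(256, 256).astype(np.float32)
L_, C, R = factorize(W, rank=32)
print("params before:", W.size, "after:", L_.size + C.size + R.size)
print("relative error:", np.linalg.norm(W - L_ @ C @ R) / np.linalg.norm(W))
```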
arXiv Detail & Related papers (2023-03-27T02:34:09Z) - Direction is what you need: Improving Word Embedding Compression in Large Language Models [7.736463504706344]
This paper presents a novel loss objective to compress token embeddings in Transformer-based models by leveraging an AutoEncoder architecture.
Our method significantly outperforms the commonly used SVD-based matrix-factorization approach in terms of initial language model perplexity.
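For reference, the SVD-based matrix-factorization baseline the paper compares against can be sketched as follows (this is the baseline, not the proposed autoencoder method):

```python
import numpy as np

def compress_embeddings_svd(E, rank):
    """SVD-based embedding matrix factorization: E ~ A @ B.

    E is (vocab_size, dim); A is (vocab_size, rank), B is (rank, dim).
    This is the matrix-factorization baseline, not the autoencoder
    approach proposed in the paper.
    """
    U, S, Vt = np.linalg.svd(E, full_matrices=False)
    A = U[:, :rank] * S[:rank]
    B = Vt[:rank, :]
    return A, B

E = np.random.randn(1000, 128).astype(np.float32)        # toy embedding table
A, B = compress_embeddings_svd(E, rank=32)
print("compression ratio:", E.size / (A.size + B.size))
```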
arXiv Detail & Related papers (2021-06-15T14:28:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.