LSAQ: Layer-Specific Adaptive Quantization for Large Language Model Deployment
- URL: http://arxiv.org/abs/2412.18135v1
- Date: Tue, 24 Dec 2024 03:43:15 GMT
- Title: LSAQ: Layer-Specific Adaptive Quantization for Large Language Model Deployment
- Authors: Binrui Zeng, Bin Ji, Xiaodong Liu, Jie Yu, Shasha Li, Jun Ma, Xiaopeng Li, Shangwen Wang, Xinran Hong,
- Abstract summary: We propose LSAQ (Layer-Specific Adaptive Quantization), a system for adaptive quantization and dynamic deployment of large language models (LLMs) based on layer importance. The system adaptively adjusts quantization strategies in real time according to the resource availability of edge devices, assigning different precision levels to layers of varying importance.
- Score: 13.235417359529965
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As large language models (LLMs) demonstrate exceptional performance across various domains, the deployment of these models on edge devices has emerged as a new trend. Quantization techniques, which reduce the size and memory footprint of LLMs, are effective for enabling deployment on resource-constrained edge devices. However, existing one-size-fits-all quantization methods often fail to dynamically adjust the memory consumption of LLMs based on specific hardware characteristics and usage scenarios. To address this limitation, we propose LSAQ (Layer-Specific Adaptive Quantization), a system for adaptive quantization and dynamic deployment of LLMs based on layer importance. LSAQ evaluates layer importance by constructing top-k token sets from the inputs and outputs of each layer and calculating their Jaccard coefficient. Using this evaluation, the system adaptively adjusts quantization strategies in real time according to the resource availability of edge devices, assigning different precision levels to layers of varying importance. This approach significantly reduces the storage requirements of LLMs while maintaining model performance, enabling efficient deployment across diverse hardware platforms and usage scenarios.
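The abstract's two steps (scoring each layer by the Jaccard coefficient of its input and output top-k token sets, then assigning per-layer precision under a memory budget) can be sketched as follows. This is a minimal illustration, not the authors' implementation. The abstract does not say how the Jaccard coefficient maps to importance or how bits are allocated, so two assumptions are made here: importance is taken as 1 minus the Jaccard coefficient (a layer that changes the top-k set more counts as more important), and allocation is a simple greedy upgrade from 4-bit to 8-bit. All function names, the per-layer logits, and the parameter counts are hypothetical.

```python
from typing import List, Sequence, Set


def top_k_tokens(logits: Sequence[float], k: int) -> Set[int]:
    """Return the token ids (indices) of the k largest logits."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return set(ranked[:k])


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Jaccard coefficient |A & B| / |A | B|; defined as 1.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def layer_importance(in_logits: Sequence[float],
                     out_logits: Sequence[float], k: int) -> float:
    """ASSUMPTION: importance = 1 - Jaccard(top-k of input, top-k of output)."""
    return 1.0 - jaccard(top_k_tokens(in_logits, k), top_k_tokens(out_logits, k))


def assign_bits(importance: List[float], params_per_layer: List[int],
                budget_bytes: int, low_bits: int = 4,
                high_bits: int = 8) -> List[int]:
    """ASSUMPTION: greedy allocation. Start every layer at low_bits, then
    upgrade layers to high_bits in order of decreasing importance while
    the total weight size stays within budget_bytes."""
    bits = [low_bits] * len(importance)
    used = sum(p * low_bits // 8 for p in params_per_layer)
    if used > budget_bytes:
        raise ValueError("budget too small even at the lowest precision")
    order = sorted(range(len(importance)),
                   key=lambda i: importance[i], reverse=True)
    for i in order:
        extra = params_per_layer[i] * (high_bits - low_bits) // 8
        if used + extra <= budget_bytes:
            bits[i] = high_bits
            used += extra
    return bits


# Toy example: three layers, a five-token vocabulary, k = 2.
# Each pair holds hypothetical vocabulary logits for a layer's input and output.
layers = [
    ([0.1, 0.9, 0.8, 0.2, 0.0], [0.0, 0.9, 0.8, 0.1, 0.3]),  # top-2 unchanged
    ([0.9, 0.8, 0.1, 0.0, 0.2], [0.0, 0.1, 0.2, 0.9, 0.8]),  # top-2 fully replaced
    ([0.9, 0.8, 0.1, 0.0, 0.0], [0.9, 0.1, 0.8, 0.0, 0.0]),  # one token replaced
]
scores = [layer_importance(x, y, k=2) for x, y in layers]
plan = assign_bits(scores, params_per_layer=[1000, 1000, 1000], budget_bytes=2500)
print(scores, plan)
```

With a 2500-byte budget the two layers that change their top-k sets are kept at 8 bits and the unchanged layer drops to 4 bits; shrinking the budget to 2000 bytes leaves only the highest-scoring layer at 8 bits.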
Related papers
- Quantizing Large Language Models for Code Generation: A Differentiated Replication [51.85505914274633]
Large Language Models (LLMs) have shown an impressive capability in code generation and, specifically, to automatically implement requirements described in natural language.
LLMs pose significant challenges related to their memory (and, consequently, carbon) footprint.
The new frontier for LLM quantization is 4-bit precision, resulting in an average memory footprint reduction of 70%.
arXiv Detail & Related papers (2025-03-10T09:26:08Z) - DILEMMA: Joint LLM Quantization and Distributed LLM Inference Over Edge Computing Systems [1.14179290793997]
This paper introduces DILEMMA, a novel framework addressing the challenges of deploying Large Language Models in Edge Computing systems.
DILEMMA formulates a Linear Programming problem to minimize total delay while ensuring acceptable LLM performance levels.
It achieves a quantization ratio of up to 12.75% while preserving model loss, highlighting its effectiveness in resource-constrained environments.
arXiv Detail & Related papers (2025-03-03T16:16:33Z) - IMPROVE: Iterative Model Pipeline Refinement and Optimization Leveraging LLM Agents [17.301758094000125]
Large language model (LLM) agents have emerged as a promising solution to automate the development of computer vision models.
We introduce Iterative Refinement, a novel strategy for LLM-driven ML pipeline design.
Iterative Refinement improves stability, interpretability, and overall model performance.
arXiv Detail & Related papers (2025-02-25T01:52:37Z) - Adaptive Pruning for Large Language Models with Structural Importance Awareness [66.2690963378878]
Large language models (LLMs) have significantly improved language understanding and generation capabilities. LLMs are difficult to deploy on resource-constrained edge devices due to their high computational and storage resource demands. We propose structurally-aware adaptive pruning (SAAP) to significantly reduce the computational and memory costs while maintaining model performance.
arXiv Detail & Related papers (2024-12-19T18:08:04Z) - AmoebaLLM: Constructing Any-Shape Large Language Models for Efficient and Instant Deployment [13.977849745488339]
AmoebaLLM is a novel framework designed to enable the instant derivation of large language models of arbitrary shapes.
AmoebaLLM significantly facilitates rapid deployment tailored to various platforms and applications.
arXiv Detail & Related papers (2024-11-15T22:02:28Z) - Progressive Mixed-Precision Decoding for Efficient LLM Inference [49.05448842542558]
We introduce Progressive Mixed-Precision Decoding (PMPD) to address the memory-boundedness of decoding.
PMPD achieves a 1.4-12.2$\times$ speedup in matrix-vector multiplications over fp16 models.
Our approach delivers a throughput gain of 3.8-8.0$\times$ over fp16 models and up to 1.54$\times$ over uniform quantization approaches.
arXiv Detail & Related papers (2024-10-17T11:46:33Z) - Efficient Large Foundation Model Inference: A Perspective From Model and System Co-Design [35.40505841618305]
As Large Language Models (LLMs) become popular, the need for efficient design for ML models on LLMs grows. This paper focuses on modern efficient inference technologies on LLMs and illustrates them from two perspectives: model and system design.
arXiv Detail & Related papers (2024-09-03T15:35:01Z) - One Token Can Help! Learning Scalable and Pluggable Virtual Tokens for Retrieval-Augmented Large Language Models [67.49462724595445]
Retrieval-augmented generation (RAG) is a promising way to improve large language models (LLMs).
We propose a novel method that involves learning scalable and pluggable virtual tokens for RAG.
arXiv Detail & Related papers (2024-05-30T03:44:54Z) - CLAQ: Pushing the Limits of Low-Bit Post-Training Quantization for LLMs [44.03692512352445]
Column-Level Adaptive weight Quantization (CLAQ) is a novel and effective framework for Large Language Models (LLMs) quantization.
In this paper, we present a novel and effective CLAQ framework by introducing three different types of adaptive strategies for LLM quantization.
Experiments on various mainstream open source LLMs including LLaMA-1, LLaMA-2 and Yi demonstrate that our methods achieve the state-of-the-art results across different bit settings.
arXiv Detail & Related papers (2024-05-27T14:49:39Z) - QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models [85.02796681773447]
We propose a quantization-aware low-rank adaptation (QA-LoRA) algorithm.
The motivation lies in the imbalanced degrees of freedom of quantization and adaptation.
QA-LoRA is easily implemented with a few lines of code.
arXiv Detail & Related papers (2023-09-26T07:22:23Z) - AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration [54.692405042065815]
We propose Activation-aware Weight Quantization (AWQ), a hardware-friendly approach for LLM low-bit weight-only quantization.
AWQ protects only 1% salient weights and achieves excellent quantization performance for instruction-tuned LMs and, for the first time, multi-modal LMs.
We also implement TinyChat, an efficient and flexible inference framework tailored for 4-bit on-device LLM/VLMs.
arXiv Detail & Related papers (2023-06-01T17:59:10Z) - Memory-Efficient Fine-Tuning of Compressed Large Language Models via sub-4-bit Integer Quantization [27.79783067245817]
Large language models (LLMs) face challenges in fine-tuning and deployment due to their high memory demands and computational costs.
This paper presents Parameter-Efficient and Quantization-aware Adaptation (PEQA), a simple yet effective method that combines the advantages of PEFT with quantized LLMs.
arXiv Detail & Related papers (2023-05-23T15:20:01Z) - Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization for maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.