FlexQuant: Elastic Quantization Framework for Locally Hosted LLM on Edge Devices
- URL: http://arxiv.org/abs/2501.07139v1
- Date: Mon, 13 Jan 2025 08:58:00 GMT
- Title: FlexQuant: Elastic Quantization Framework for Locally Hosted LLM on Edge Devices
- Authors: Yuji Chai, Mujin Kwen, David Brooks, Gu-Yeon Wei,
- Abstract summary: Memory elasticity is crucial for edge devices with unified memory, where memory is shared and fluctuates dynamically.<n>We propose FlexQuant, a novel elasticity framework that generates an ensemble of quantized models.
- Score: 3.950064543723201
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deploying LLMs on edge devices presents serious technical challenges. Memory elasticity is crucial for edge devices with unified memory, where memory is shared and fluctuates dynamically. Existing solutions suffer from either poor transition granularity or high storage costs. We propose FlexQuant, a novel elasticity framework that generates an ensemble of quantized models, providing an elastic hosting solution with 15x granularity improvement and 10x storage reduction compared to SoTA methods. FlexQuant works with most quantization methods and creates a family of trade-off options under various storage limits through our pruning method. It brings great performance and flexibility to the edge deployment of LLMs.
Related papers
- FlexLoRA: Entropy-Guided Flexible Low-Rank Adaptation [33.208889745659825]
Large pre-trained models achieve remarkable success across diverse domains, yet fully fine-tuning incurs prohibitive computational and memory costs.<n>We propose FlexLoRA, an entropy-guided flexible low-rank adaptation framework that supports rank pruning and expansion under a global budget.<n>Experiments show that FlexLoRA consistently outperforms state-of-the-art baselines across benchmarks.
arXiv Detail & Related papers (2026-01-30T12:25:47Z) - LLM-MemCluster: Empowering Large Language Models with Dynamic Memory for Text Clustering [52.41664454251679]
Large Language Models (LLMs) are reshaping unsupervised learning by offering an unprecedented ability to perform text clustering.<n>Existing methods often rely on complex pipelines with external modules, sacrificing a truly end-to-end approach.<n>We introduce LLM-MemCluster, a novel framework that reconceptualizes clustering as a fully LLM-native task.
arXiv Detail & Related papers (2025-11-19T13:22:08Z) - XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization [58.92253769255316]
LLM inference is challenging due to substantial memory footprint and bandwidth requirements.<n>XQuant exploits the rapidly increasing compute capabilities of hardware platforms to eliminate the memory bottleneck.<n>XQuant-CL exploits the cross-layer similarity in the X embeddings for extreme compression.
arXiv Detail & Related papers (2025-08-14T06:52:38Z) - BucketServe: Bucket-Based Dynamic Batching for Smart and Efficient LLM Inference Serving [3.620158146761518]
BucketServe is a bucket-based dynamic framework designed to optimize inference performance.<n>It can handle 1.93x more request load load attainment of 80% compared with UELLM and demonstrates 1.975x higher system load capacity compared to the UELLM.
arXiv Detail & Related papers (2025-07-23T01:51:48Z) - LatentLLM: Attention-Aware Joint Tensor Compression [50.33925662486034]
Large language models (LLMs) and large multi-modal models (LMMs) require a massive amount of computational and memory resources.<n>We propose a new framework to convert such LLMs/LMMs into a reduced-dimension latent structure.
arXiv Detail & Related papers (2025-05-23T22:39:54Z) - FlexQuant: A Flexible and Efficient Dynamic Precision Switching Framework for LLM Quantization [18.041828697950812]
We propose FlexQuant, a dynamic precision-switching framework to optimize the trade-off between inference speed and accuracy.<n>Our work provides a comprehensive analysis of quantization strategies, introduces a precision requirement model for optimal switching, and implements efficient fine-grained precision management.<n> Experimental results demonstrate that FlexQuant achieves a 1.3x end-to-end speedup across diverse language tasks with negligible accuracy loss.
arXiv Detail & Related papers (2025-05-21T07:42:53Z) - Accelerating LLM Inference with Flexible N:M Sparsity via A Fully Digital Compute-in-Memory Accelerator [5.985414012866983]
Large language model (LLM) pruning with fixed N:M structured sparsity limits the expressivity of the sparse model.
We present a flexible layer-wise outlier-density-aware N:M sparsity (FLOW) selection method.
We then introduce a flexible, low-overhead digital compute-in-memory architecture (FlexCiM)
arXiv Detail & Related papers (2025-04-19T17:47:01Z) - FlexInfer: Breaking Memory Constraint via Flexible and Efficient Offloading for On-Device LLM Inference [10.755373001278402]
Large Language Models (LLMs) face challenges for on-device inference due to high memory demands.
We propose FlexInfer, an optimized offloading framework for on-device inference.
arXiv Detail & Related papers (2025-03-04T20:08:03Z) - Sparse Gradient Compression for Fine-Tuning Large Language Models [58.44973963468691]
Fine-tuning large language models (LLMs) for downstream tasks has become increasingly crucial due to their widespread use and the growing availability of open-source models.
High memory costs associated with fine-tuning remain a significant challenge, especially as models increase in size.
We propose sparse compression gradient (SGC) to address these limitations.
arXiv Detail & Related papers (2025-02-01T04:18:28Z) - LSAQ: Layer-Specific Adaptive Quantization for Large Language Model Deployment [13.235417359529965]
We propose LSAQ (Layer-Specific Adaptive Quantization), a system for adaptive quantization and dynamic deployment of large language models (LLMs) based on layer importance.<n>The system adaptively adjusts quantization strategies in real time according to the resource availability of edge devices, assigning different precision levels to layers of varying importance.
arXiv Detail & Related papers (2024-12-24T03:43:15Z) - Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design [59.00758127310582]
We propose a novel framework Read-ME that transforms pre-trained dense LLMs into smaller MoE models.
Our approach employs activation sparsity to extract experts.
Read-ME outperforms other popular open-source dense models of similar scales.
arXiv Detail & Related papers (2024-10-24T19:48:51Z) - InfiniPot: Infinite Context Processing on Memory-Constrained LLMs [17.111422610001227]
InfiniPot is a novel KV cache control framework designed to enable pre-trained Large Language Models to manage extensive sequences efficiently.
InfiniPot effectively maintains critical data even without access to future context.
This work represents a substantial advancement toward making Large Language Models applicable to a broader range of real-world scenarios.
arXiv Detail & Related papers (2024-10-02T13:09:41Z) - ULLME: A Unified Framework for Large Language Model Embeddings with Generation-Augmented Learning [72.90823351726374]
We introduce the Unified framework for Large Language Model Embedding (ULLME), a flexible, plug-and-play implementation that enables bidirectional attention across various LLMs.
We also propose Generation-augmented Representation Learning (GRL), a novel fine-tuning method to boost LLMs for text embedding tasks.
To showcase our framework's flexibility and effectiveness, we release three pre-trained models from ULLME with different backbone architectures.
arXiv Detail & Related papers (2024-08-06T18:53:54Z) - Fast Matrix Multiplications for Lookup Table-Quantized LLMs [58.11584672945781]
FLUTE is a flexible lookup table engine for LUT-quantized LLMs.
At batch sizes 32 and quantization group size of 128, the FLUTE kernel can be 2-4x faster than existing GEMM kernels.
arXiv Detail & Related papers (2024-07-15T17:55:42Z) - Flextron: Many-in-One Flexible Large Language Model [85.93260172698398]
We introduce Flextron, a network architecture and post-training model optimization framework supporting flexible model deployment.
We present a sample-efficient training method and associated routing algorithms for transforming an existing trained LLM into a Flextron model.
We demonstrate superior performance over multiple end-to-end trained variants and other state-of-the-art elastic networks, all with a single pretraining run that consumes a mere 7.63% tokens compared to original pretraining.
arXiv Detail & Related papers (2024-06-11T01:16:10Z) - Online Adaptation of Language Models with a Memory of Amortized Contexts [82.02369596879817]
Memory of Amortized Contexts (MAC) is an efficient and effective online adaptation framework for large language models.
We show how MAC can be combined with and improve the performance of popular alternatives such as retrieval augmented generations.
arXiv Detail & Related papers (2024-03-07T08:34:57Z) - On the Compressibility of Quantized Large Language Models [13.443384050034922]
Large Language Models (LLMs) are deployed on edge or mobile devices to offer enhanced data privacy and real-time processing capabilities.
LLMs may still be too big to fit entirely into the limited memory of edge or mobile devices and have to be partially loaded from the storage to complete the inference.
We study applying data compression techniques to reduce data movement and thus speed up the inference of quantized LLM on memory-constrained devices.
arXiv Detail & Related papers (2024-03-03T03:27:07Z) - Cloud-Device Collaborative Learning for Multimodal Large Language Models [24.65882336700547]
We introduce a Cloud-Device Collaborative Continual Adaptation framework to enhance the performance of compressed, device-deployed MLLMs.
Our framework is structured into three key components: a device-to-cloud uplink for efficient data transmission, cloud-based knowledge adaptation, and an optimized cloud-to-device downlink for model deployment.
arXiv Detail & Related papers (2023-12-26T18:46:14Z) - AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration [54.692405042065815]
We propose Activation-aware Weight Quantization (AWQ), a hardware-friendly approach for LLM low-bit weight-only quantization.
AWQ protects only 1% salient weights and achieves excellent quantization performance for instruction-tuned LMs and, for the first time, multi-modal LMs.
We also implement TinyChat, an efficient and flexible inference framework tailored for 4-bit on-device LLM/VLMs.
arXiv Detail & Related papers (2023-06-01T17:59:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.