Related papers: One QuantLLM for ALL: Fine-tuning Quantized LLMs Once for Efficient Deployments

One QuantLLM for ALL: Fine-tuning Quantized LLMs Once for Efficient Deployments

URL: http://arxiv.org/abs/2405.20202v1
Date: Thu, 30 May 2024 16:05:15 GMT
Title: One QuantLLM for ALL: Fine-tuning Quantized LLMs Once for Efficient Deployments
Authors: Ke Yi, Yuhui Xu, Heng Chang, Chen Tang, Yuan Meng, Tong Zhang, Jia Li,
Abstract summary: Large Language Models (LLMs) have advanced rapidly but face significant memory demands. Current methods typically require lengthy training to alleviate the performance degradation from quantization loss. We make an initial attempt to extend the once-for-all framework to large language models.
Score: 43.107261545706415
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) have advanced rapidly but face significant memory demands. While quantization has shown promise for LLMs, current methods typically require lengthy training to alleviate the performance degradation from quantization loss. However, deploying LLMs across diverse scenarios with different resource constraints, e.g., servers and personal computers, requires repeated training per application, which amplifies the lengthy training problem. Given that, it is advantageous to train a once-for-all (OFA) supernet capable of yielding diverse optimal subnets for downstream applications through one-shot training. Nonetheless, the scale of current language models impedes efficiency and amplifies interference from weight sharing between subnets. We make an initial attempt to extend the once-for-all framework to large language models. Specifically, we decouple shared weights to eliminate the interference and incorporate Low-Rank adapters for training efficiency. Furthermore, we observe the imbalance allocation of training resources from the traditional uniform sampling. A non-parametric scheduler is introduced to adjust the sampling rate for each quantization configuration, achieving a more balanced allocation among subnets with varying demands. We validate the approach on LLaMA2 families, and downstream evaluation confirms our ability to maintain high performance while significantly reducing deployment time faced with multiple scenarios.

Related papers

RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models [53.571195477043496]
We propose an algorithm named Rotated Straight-Through-Estimator (RoSTE) RoSTE combines quantization-aware supervised fine-tuning (QA-SFT) with an adaptive rotation strategy to reduce activation outliers. Our findings reveal that the prediction error is directly proportional to the quantization error of the converged weights, which can be effectively managed through an optimized rotation configuration.
arXiv Detail & Related papers (2025-02-13T06:44:33Z)
Federated Split Learning with Model Pruning and Gradient Quantization in Wireless Networks [7.439160287320074]
Federated split learning (FedSL) implements collaborative training across the edge devices and the server through model splitting. We propose a lightweight FedSL scheme, that further alleviates the training burden on resource-constrained edge devices. We conduct theoretical analysis to quantify the convergence performance of the proposed scheme.
arXiv Detail & Related papers (2024-12-09T11:43:03Z)
Multi-Bin Batching for Increasing LLM Inference Throughput [19.652542432683234]
Large language models (LL) grow in popularity improving the efficiency of their systems. requests is a critical step in scheduling jobs on servers. requests often have varying generation lengths, causing resource underutilization. We formalize this problem from a queueing-theoretic perspective, and aim to design a throughput control policy.
arXiv Detail & Related papers (2024-12-03T03:16:12Z)
Advancing Multimodal Large Language Models with Quantization-Aware Scale Learning for Efficient Adaptation [70.22782550540714]
Quantization-aware Scale LeArning method based on multimodal Warmup, termed QSLAW. We introduce a Quantization-aware Scale LeArning method based on multimodal Warmup, termed QSLAW.
arXiv Detail & Related papers (2024-08-07T12:42:09Z)
Save It All: Enabling Full Parameter Tuning for Federated Large Language Models via Cycle Block Gradient Descent [15.463595798992621]
Large language models (LLMs) have revolutionized the deep learning paradigm, yielding impressive results across a wide array of tasks. Existing solutions make the unrealistic assumption that the entire model is exchanged for training. We introduce a novel method for the efficient training and fine-tuning of LLMs in FL, with minimal resource consumption.
arXiv Detail & Related papers (2024-06-17T03:49:44Z)
Delta-CoMe: Training-Free Delta-Compression with Mixed-Precision for Large Language Models [79.46938238953916]
Fine-tuning large language models (LLMs) to diverse applications is crucial to meet complex demands. Recent studies suggest decomposing a fine-tuned LLM into a base model and corresponding delta weights, which are then compressed using low-rank or low-bit approaches to reduce costs. In this work, we observe that existing low-rank and low-bit compression methods can significantly harm the model performance for task-specific fine-tuned LLMs.
arXiv Detail & Related papers (2024-06-13T07:57:27Z)
Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs [67.38165028487242]
We introduce Dynamic Sparse No Training (DSnoT), a training-free fine-tuning approach to fine-tune large language models (LLMs) Inspired by the Dynamic Sparse Training, DSnoT minimizes the reconstruction error between the dense and sparse LLMs. Our paper offers fresh insights into how to fine-tune sparse LLMs in an efficient training-free manner and open new venues to scale the great potential of sparsity to LLMs.
arXiv Detail & Related papers (2023-10-13T07:38:52Z)
QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models [85.02796681773447]
We propose a quantization-aware low-rank adaptation (QA-LoRA) algorithm. The motivation lies in the imbalanced degrees of freedom of quantization and adaptation. QA-LoRA is easily implemented with a few lines of code.
arXiv Detail & Related papers (2023-09-26T07:22:23Z)
Efficient Parallel Split Learning over Resource-constrained Wireless Edge Networks [44.37047471448793]
In this paper, we advocate the integration of edge computing paradigm and parallel split learning (PSL) We propose an innovative PSL framework, namely, efficient parallel split learning (EPSL) to accelerate model training. We show that the proposed EPSL framework significantly decreases the training latency needed to achieve a target accuracy.
arXiv Detail & Related papers (2023-03-26T16:09:48Z)
Multimodal Parameter-Efficient Few-Shot Class Incremental Learning [1.9220716793379256]
Few-Shot Class Incremental Learning (FSCIL) is a challenging continual learning task, where limited training examples are available during several learning sessions. To succeed in this task, it is necessary to avoid over-fitting new classes caused by biased distributions in the few-shot training sets. CPE-CLIP significantly improves FSCIL performance compared to state-of-the-art proposals while also drastically reducing the number of learnable parameters and training costs.
arXiv Detail & Related papers (2023-03-08T17:34:15Z)
Towards Efficient Post-training Quantization of Pre-trained Language Models [85.68317334241287]
We study post-training quantization(PTQ) of PLMs, and propose module-wise quantization error minimization(MREM), an efficient solution to mitigate these issues. Experiments on GLUE and SQuAD benchmarks show that our proposed PTQ solution not only performs close to QAT, but also enjoys significant reductions in training time, memory overhead, and data consumption.
arXiv Detail & Related papers (2021-09-30T12:50:06Z)

This list is automatically generated from the titles and abstracts of the papers in this site.