GSQ-Tuning: Group-Shared Exponents Integer in Fully Quantized Training for LLMs On-Device Fine-tuning
- URL: http://arxiv.org/abs/2502.12913v2
- Date: Mon, 24 Feb 2025 06:46:54 GMT
- Title: GSQ-Tuning: Group-Shared Exponents Integer in Fully Quantized Training for LLMs On-Device Fine-tuning
- Authors: Sifan Zhou, Shuo Wang, Zhihang Yuan, Mingjia Shi, Yuzhang Shang, Dawei Yang,
- Abstract summary: We introduce a novel framework for on-device fine-tuning of Large Language Model (LLM) adapters.
At its core is the Group-Shared Exponents Integer format, which efficiently represents model parameters in integer format using shared exponents among parameter groups.
We demonstrate that our approach achieves accuracy comparable to BF16-based fine-tuning while reducing memory usage by 1.85x.
- Score: 13.043393441785392
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large Language Models (LLMs) fine-tuning technologies have achieved remarkable results. However, traditional LLM fine-tuning approaches face significant challenges: they require substantial Floating Point (FP) computation, raise privacy concerns when handling sensitive data, and are impractical for resource-constrained edge devices. While Parameter-Efficient Fine-Tuning (PEFT) techniques reduce trainable parameters, their reliance on floating-point arithmetic creates fundamental incompatibilities with edge hardware. In this work, we introduce GSQ-Tuning, a novel framework for on-device LLM fine-tuning that eliminates the need for floating-point operations in both inference and training. At its core is the Group-Shared Exponents Integer format, which efficiently represents model parameters in integer format using shared exponents among parameter groups. When combined with LoRA-like adapters, this enables fully integer-based fine-tuning that is both memory- and compute-efficient. We demonstrate that our approach achieves accuracy comparable to BF16-based fine-tuning while reducing memory usage by 1.85x. Moreover, compared to FP8, our method reduces power consumption by 5x and chip area by 11x at the same performance, making large-scale model adaptation feasible on edge devices.
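To make the Group-Shared Exponents (GSE) Integer idea concrete, below is a minimal NumPy sketch of group-shared-exponent quantization as described in the abstract: parameters are split into fixed-size groups, each group stores one power-of-two shared exponent, and the values are kept as low-bit integer mantissas. The group size, bit width, and function names are illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

def gse_quantize(w, group_size=128, bits=8):
    """Sketch of a group-shared-exponent integer quantizer (illustrative).

    Each contiguous group of `group_size` values shares one power-of-two
    exponent; values are stored as signed integer mantissas.
    int8 storage below assumes bits <= 8.
    """
    flat = w.astype(np.float32).ravel()
    pad = (-len(flat)) % group_size
    flat = np.pad(flat, (0, pad))
    groups = flat.reshape(-1, group_size)

    # One shared exponent per group, chosen so the largest magnitude
    # in the group fits into the signed integer range.
    qmax = 2 ** (bits - 1) - 1
    max_abs = np.maximum(np.abs(groups).max(axis=1, keepdims=True), 1e-12)
    exponents = np.ceil(np.log2(max_abs / qmax)).astype(np.int32)

    mantissas = np.clip(np.round(groups / (2.0 ** exponents)), -qmax, qmax).astype(np.int8)
    return mantissas, exponents, w.shape, pad

def gse_dequantize(mantissas, exponents, shape, pad):
    flat = (mantissas.astype(np.float32) * (2.0 ** exponents)).ravel()
    if pad:
        flat = flat[:-pad]
    return flat.reshape(shape)

w = np.random.randn(4, 256).astype(np.float32)
m, e, shape, pad = gse_quantize(w)
w_hat = gse_dequantize(m, e, shape, pad)
print("max abs error:", np.abs(w - w_hat).max())
```

Storing one exponent per group rather than per value is what keeps the representation integer-friendly: within a group, arithmetic reduces to integer operations on the mantissas, with the shared exponent applied as a single shift.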
Related papers
- Structure-Activation Synergy: A Dual Efficiency Framework for Parameter-Memory Optimized Transfer Learning [8.602744958104969]
We present Structure-Activation Synergy (S2A), an innovative framework achieving dual optimization of parameters and memory.
We show S2A's superior efficiency: it reduces GPU memory consumption by 75% (a 4.2x average reduction) while maintaining 98.7% of full fine-tuning accuracy with only 0.9% tunable parameters.
arXiv Detail & Related papers (2025-03-11T08:10:03Z) - Zero Token-Driven Deep Thinking in LLMs: Unlocking the Full Potential of Existing Parameters via Cyclic Refinement [43.548042892597536]
We propose the Zero Token Transformer (ZTT), which features a head-tail decoupled parameter cycling method.
We disentangle the first (head) and last (tail) layers from parameter cycling and iteratively refine only the intermediate layers.
Our approach achieves superior performance under tight parameter budgets, effectively reduces computational overhead via early exits, and can be readily applied to fine-tune existing pre-trained models.
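As a rough illustration of head-tail decoupled parameter cycling as we read this summary, here is a toy PyTorch sketch: the head and tail layers run once, while the same stack of middle layers is reapplied for several cycles. Layer types, dimensions, and the number of cycles are all hypothetical.

```python
import torch
import torch.nn as nn

class CyclicRefiner(nn.Module):
    """Toy sketch of head-tail decoupled parameter cycling (our reading of ZTT)."""
    def __init__(self, dim=64, n_middle=2, cycles=3):
        super().__init__()
        self.head = nn.Linear(dim, dim)
        self.middle = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_middle))
        self.tail = nn.Linear(dim, dim)
        self.cycles = cycles

    def forward(self, x):
        h = torch.relu(self.head(x))        # head runs once
        for _ in range(self.cycles):        # parameter cycling: same middle weights, reused
            for layer in self.middle:
                h = torch.relu(layer(h))
        return self.tail(h)                 # tail runs once

model = CyclicRefiner()
print(model(torch.randn(8, 64)).shape)  # torch.Size([8, 64])
```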
arXiv Detail & Related papers (2025-02-17T04:37:22Z) - Hyper Compressed Fine-Tuning of Large Foundation Models with Quantum Inspired Adapters [0.0]
We propose Quantum-Inspired Adapters, a PEFT approach inspired by Hamming-weight quantum circuits from the quantum machine learning literature.
We test our proposed adapters by adapting large language models and large vision transformers on benchmark datasets.
arXiv Detail & Related papers (2025-02-10T13:06:56Z) - IntLoRA: Integral Low-rank Adaptation of Quantized Diffusion Models [68.55148272295916]
We propose IntLoRA, to push the efficiency limits by using integer type (INT) low-rank parameters to adapt the quantized diffusion models.
IntLoRA offers three key advantages: (i) for fine-tuning, the pre-trained weights are quantized, reducing memory usage; (ii) for storage, both pre-trained and low-rank weights are in INT which consumes less disk space; (iii) for inference, IntLoRA weights can be naturally merged into quantized pre-trained weights through efficient integer multiplication or bit-shifting.
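To sketch advantage (iii), here is a schematic NumPy example of an integer-domain merge: when the base weights and the low-rank update live on compatible integer grids, merging reduces to an integer matrix multiply plus a bit shift. The variable names, ranges, and the power-of-two scale relationship are illustrative assumptions, not IntLoRA's exact scheme.

```python
import numpy as np

# Hypothetical integer-domain merge, assuming the adapter's quantization
# scale equals the base scale times 2**-shift.
q_base = np.random.randint(-8, 8, size=(4, 4), dtype=np.int32)  # INT4-range base weights
q_A = np.random.randint(-8, 8, size=(4, 2), dtype=np.int32)     # low-rank factors in INT
q_B = np.random.randint(-8, 8, size=(2, 4), dtype=np.int32)

shift = 4

# Integer matmul for the low-rank update, then an arithmetic right shift
# instead of a floating-point rescale; everything stays in integers.
# (NumPy's >> on negative ints rounds toward -inf, acceptable for a sketch.)
update = (q_A @ q_B) >> shift
q_merged = q_base + update
print(q_merged)
```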
arXiv Detail & Related papers (2024-10-29T05:50:17Z) - Enabling Efficient On-Device Fine-Tuning of LLMs Using Only Inference Engines [17.539008562641303]
Large Language Models (LLMs) are currently pre-trained and fine-tuned on large cloud servers.
The next frontier is LLM personalization, where a foundation model can be fine-tuned with user/task-specific data.
Fine-tuning on resource-constrained edge devices presents significant challenges due to substantial memory and computational demands.
arXiv Detail & Related papers (2024-09-23T20:14:09Z) - SpaFL: Communication-Efficient Federated Learning with Sparse Models and Low computational Overhead [75.87007729801304]
SpaFL, a communication-efficient FL framework, is proposed to optimize sparse model structures with low computational overhead.
To optimize the pruning process itself, only thresholds are communicated between a server and clients instead of parameters.
Global thresholds are used to update model parameters by extracting aggregated parameter importance.
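A minimal sketch of the threshold-only communication idea described above: each client sends a scalar pruning threshold rather than its full parameter vector, and the server aggregates the thresholds. The shapes and the simple averaging rule below are illustrative assumptions, not SpaFL's exact protocol.

```python
import numpy as np

def prune_with_threshold(weights, threshold):
    """Zero out parameters whose magnitude falls below the threshold."""
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

# Each client holds its own weights and a learnable pruning threshold.
clients = [{"w": np.random.randn(1000), "t": np.random.uniform(0.1, 0.5)}
           for _ in range(4)]

# Communication round: only the scalar thresholds travel to the server,
# not the 1000-dimensional parameter vectors.
global_t = np.mean([c["t"] for c in clients])

for c in clients:
    c["w"], mask = prune_with_threshold(c["w"], global_t)
    print(f"kept {mask.mean():.1%} of parameters")
```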
arXiv Detail & Related papers (2024-06-01T13:10:35Z) - FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency
Trade-off in Language Model Inference [57.119047493787185]
This paper shows how to reduce model size by 43.1% and bring a 1.25x to 1.56x wall-clock time speedup on different hardware with negligible accuracy drop.
arXiv Detail & Related papers (2024-01-08T17:29:16Z) - Federated Learning of Large Language Models with Parameter-Efficient
Prompt Tuning and Adaptive Optimization [71.87335804334616]
Federated learning (FL) is a promising paradigm to enable collaborative model training with decentralized data.
The training process of Large Language Models (LLMs) generally incurs updates to a significant number of parameters.
This paper proposes an efficient partial prompt tuning approach to improve performance and efficiency simultaneously.
arXiv Detail & Related papers (2023-10-23T16:37:59Z) - Boosting Inference Efficiency: Unleashing the Power of Parameter-Shared
Pre-trained Language Models [109.06052781040916]
We introduce a technique to enhance the inference efficiency of parameter-shared language models.
We also propose a simple pre-training technique that leads to fully or partially shared models.
Results demonstrate the effectiveness of our methods on both autoregressive and autoencoding PLMs.
arXiv Detail & Related papers (2023-10-19T15:13:58Z) - Memory-Efficient Fine-Tuning of Compressed Large Language Models via
sub-4-bit Integer Quantization [27.79783067245817]
Large language models (LLMs) face challenges in fine-tuning and deployment due to their high memory demands and computational costs.
This paper presents Parameter-Efficient and Quantization-aware Adaptation (PEQA), a simple yet effective method that combines the advantages of PEFT with quantized LLMs.
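A toy PyTorch sketch of the idea as we understand it: the weights are frozen as sub-4-bit integers, and only the per-channel quantization scales receive gradients. The bit width, layout, and class name are illustrative assumptions, not PEQA's reference code.

```python
import torch
import torch.nn as nn

class ScaleOnlyLinear(nn.Module):
    """Sketch of scale-only fine-tuning over frozen integer weights (our reading of PEQA)."""
    def __init__(self, in_f, out_f, bits=4):
        super().__init__()
        qmax = 2 ** (bits - 1) - 1
        w = torch.randn(out_f, in_f)
        scale = w.abs().amax(dim=1, keepdim=True) / qmax
        # Frozen sub-4-bit integer weights, stored as a buffer (no gradient).
        self.register_buffer("q_w", torch.clamp(torch.round(w / scale), -qmax, qmax))
        # Only the per-channel scales are trainable.
        self.scale = nn.Parameter(scale)

    def forward(self, x):
        return x @ (self.q_w * self.scale).t()

layer = ScaleOnlyLinear(16, 8)
layer(torch.randn(2, 16)).sum().backward()
print(layer.scale.grad.shape)  # only the scales receive gradients
```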
arXiv Detail & Related papers (2023-05-23T15:20:01Z) - Energy-efficient Task Adaptation for NLP Edge Inference Leveraging
Heterogeneous Memory Architectures [68.91874045918112]
Adapter-ALBERT is an efficient model optimization that enables maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z) - AlphaTuning: Quantization-Aware Parameter-Efficient Adaptation of
Large-Scale Pre-Trained Language Models [19.640997611256168]
We propose AlphaTuning, consisting of post-training quantization of the pre-trained language model and fine-tuning only some parts of quantized parameters for a target task.
Specifically, AlphaTuning works by employing binary-coding quantization, which factorizes the full-precision parameters into binary parameters and a separate set of scaling factors.
We demonstrate that AlphaTuning, when applied to GPT-2 and OPT, performs competitively with full fine-tuning on a variety of downstream tasks while achieving >10x compression ratio under 4-bit quantization and >1,000x reduction in the number of trainable parameters.
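To illustrate binary-coding quantization as summarized above, here is a small PyTorch sketch: the weight is approximated as a sum of binary matrices weighted by per-row scaling factors, and adaptation updates only the scaling factors. The greedy residual fitting below is one standard construction, not necessarily the paper's exact procedure.

```python
import torch
import torch.nn as nn

def binary_code(w, n_bits=3):
    """Greedy binary-coding quantization: w ~= sum_k alpha_k * b_k, b_k in {-1,+1}."""
    residual = w.clone()
    alphas, binaries = [], []
    for _ in range(n_bits):
        b = torch.sign(residual)
        b[b == 0] = 1.0
        alpha = (residual * b).mean(dim=1, keepdim=True)  # per-row scale
        alphas.append(alpha)
        binaries.append(b)
        residual = residual - alpha * b
    return alphas, binaries

w = torch.randn(8, 16)
alphas, binaries = binary_code(w)

# AlphaTuning-style adaptation: binaries stay frozen, scales are trainable.
alphas = [nn.Parameter(a) for a in alphas]
w_hat = sum(a * b for a, b in zip(alphas, binaries))
loss = ((w_hat - w) ** 2).mean()
loss.backward()
print(alphas[0].grad.shape)  # gradients flow only into the scaling factors
```

Freezing the binary codes is what delivers the large reduction in trainable parameters: per weight matrix, only the per-row scaling factors remain to be updated.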
arXiv Detail & Related papers (2022-10-08T00:36:00Z) - Adaptive Subcarrier, Parameter, and Power Allocation for Partitioned
Edge Learning Over Broadband Channels [69.18343801164741]
Partitioned edge learning (PARTEL) implements parameter-server training, a well-known distributed learning method, in a wireless network.
We consider the case of deep neural network (DNN) models which can be trained using PARTEL by introducing some auxiliary variables.
arXiv Detail & Related papers (2020-10-08T15:27:50Z)