AlphaTuning: Quantization-Aware Parameter-Efficient Adaptation of
Large-Scale Pre-Trained Language Models
- URL: http://arxiv.org/abs/2210.03858v1
- Date: Sat, 8 Oct 2022 00:36:00 GMT
- Title: AlphaTuning: Quantization-Aware Parameter-Efficient Adaptation of
Large-Scale Pre-Trained Language Models
- Authors: Se Jung Kwon, Jeonghoon Kim, Jeongin Bae, Kang Min Yoo, Jin-Hwa Kim,
Baeseong Park, Byeongwook Kim, Jung-Woo Ha, Nako Sung and Dongsoo Lee
- Abstract summary: We propose AlphaTuning, which consists of post-training quantization of the pre-trained language model and fine-tuning of only a subset of the quantized parameters for a target task.
Specifically, AlphaTuning works by employing binary-coding quantization, which factorizes the full-precision parameters into binary parameters and a separate set of scaling factors.
We demonstrate that AlphaTuning, when applied to GPT-2 and OPT, performs competitively with full fine-tuning on a variety of downstream tasks while achieving a >10x compression ratio under 4-bit quantization and a >1,000x reduction in the number of trainable parameters.
- Score: 19.640997611256168
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There is growing interest in adapting large-scale language models using
parameter-efficient fine-tuning methods. However, accelerating the model itself
and achieving better inference efficiency through model compression have not
yet been thoroughly explored. Model compression could provide the benefits of
reducing memory footprints, enabling low-precision computations, and ultimately
achieving cost-effective inference. To combine parameter-efficient adaptation
and model compression, we propose AlphaTuning, which consists of post-training
quantization of the pre-trained language model and fine-tuning of only a subset
of the quantized parameters for a target task. Specifically, AlphaTuning works by
employing binary-coding quantization, which factorizes the full-precision
parameters into binary parameters and a separate set of scaling factors. During
the adaptation phase, the binary values are frozen for all tasks, while the
scaling factors are fine-tuned for the downstream task. We demonstrate that
AlphaTuning, when applied to GPT-2 and OPT, performs competitively with full
fine-tuning on a variety of downstream tasks while achieving a >10x compression
ratio under 4-bit quantization and a >1,000x reduction in the number of trainable
parameters.
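To make the adaptation scheme above concrete, here is a minimal PyTorch sketch of binary-coding quantization of a linear layer followed by alpha-only training. The greedy row-wise quantizer, the class names, and the 4-bit default are illustrative assumptions for the sketch, not the authors' released implementation.

```python
import torch
import torch.nn as nn


def bcq_quantize(weight: torch.Tensor, num_bits: int = 4):
    """Greedy binary-coding quantization: W ~= sum_i alpha_i * B_i,
    with B_i in {-1, +1} and per-row scaling factors alpha_i."""
    residual = weight.clone()
    binaries, alphas = [], []
    for _ in range(num_bits):
        b = torch.sign(residual)
        b[b == 0] = 1.0                                   # keep codes strictly binary
        a = residual.abs().mean(dim=1, keepdim=True)      # per-row scaling factor
        binaries.append(b)
        alphas.append(a)
        residual = residual - a * b
    return torch.stack(binaries), torch.stack(alphas)


class AlphaTunedLinear(nn.Module):
    """Linear layer whose binary codes are frozen buffers; only the scaling
    factors (and bias) remain trainable during downstream adaptation."""

    def __init__(self, linear: nn.Linear, num_bits: int = 4):
        super().__init__()
        binaries, alphas = bcq_quantize(linear.weight.data, num_bits)
        self.register_buffer("binaries", binaries)        # frozen {-1, +1} codes
        self.alphas = nn.Parameter(alphas)                 # trainable scaling factors
        self.bias = nn.Parameter(linear.bias.data.clone())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Reconstruct the weight from the frozen codes and the current scales.
        weight = (self.alphas * self.binaries).sum(dim=0)
        return nn.functional.linear(x, weight, self.bias)


# Adaptation: replace a pre-trained layer, then optimize only alphas and bias
# (the binary codes are buffers, so layer.parameters() excludes them).
layer = AlphaTunedLinear(nn.Linear(768, 768), num_bits=4)
optimizer = torch.optim.AdamW(layer.parameters(), lr=1e-4)
```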
Related papers
- SoftLMs: Efficient Adaptive Low-Rank Approximation of Language Models using Soft-Thresholding Mechanism [1.7170348600689374]
We propose a novel compression methodology that dynamically determines the rank of each layer using a soft thresholding mechanism.
We have successfully applied the proposed technique to attention-based architectures, including BERT for discriminative tasks and GPT2 and TinyLlama for generative tasks.
Our experiments demonstrate that the proposed technique achieves a speed-up of 1.33X to 1.72X in the encoder/decoder with a 50% reduction in total parameters.
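As a rough illustration of the soft-thresholding rank selection described above, the sketch below shrinks the singular values of one weight matrix and keeps only those that survive. The threshold value, the per-layer SVD, and the function name are assumptions for the sketch, not the paper's mechanism verbatim.

```python
import torch


def soft_threshold_low_rank(weight: torch.Tensor, tau: float):
    """Shrink singular values by tau and drop those that hit zero, so the
    retained rank is chosen by the threshold rather than fixed in advance."""
    u, s, vh = torch.linalg.svd(weight, full_matrices=False)
    s_shrunk = torch.clamp(s - tau, min=0.0)          # soft-thresholding operator
    keep = s_shrunk > 0                               # effective rank of this layer
    return u[:, keep] * s_shrunk[keep], vh[keep, :]   # factors A, B with W ~= A @ B


A, B = soft_threshold_low_rank(torch.randn(768, 3072), tau=30.0)
print("selected rank:", A.shape[1])
```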
arXiv Detail & Related papers (2024-11-15T19:29:51Z)
- Propulsion: Steering LLM with Tiny Fine-Tuning [0.0]
We propose Propulsion, a novel parameter-efficient fine-tuning (PEFT) method to optimize task-specific performance.
Inspired by the concept of controlled adjustments in physical motion, Propulsion selectively re-scales specific dimensions of a pre-trained model.
Our theoretical analysis, supported by Neural Tangent Kernel (NTK) theory, shows that Propulsion approximates the performance of full fine-tuning with far fewer trainable parameters.
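A minimal sketch of the re-scaling idea summarized above, assuming the scaling is applied to the outputs of a frozen linear layer; the module name and placement are illustrative, not Propulsion's exact formulation.

```python
import torch
import torch.nn as nn


class RescaledLinear(nn.Module):
    """Frozen pre-trained layer whose output dimensions are re-scaled by a
    small trainable vector; only that vector is updated during fine-tuning."""

    def __init__(self, linear: nn.Linear):
        super().__init__()
        self.linear = linear.requires_grad_(False)                # freeze pre-trained weights
        self.scale = nn.Parameter(torch.ones(linear.out_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(x) * self.scale                        # per-dimension re-scaling


layer = RescaledLinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print("trainable parameters:", trainable)                         # 768 vs. ~590k frozen
```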
arXiv Detail & Related papers (2024-09-17T06:51:59Z)
- Scaling Exponents Across Parameterizations and Optimizers [94.54718325264218]
We propose a new perspective on parameterization by investigating a key assumption in prior work.
Our empirical investigation includes tens of thousands of models trained with all combinations of optimizers and parameterizations.
We find that the best learning rate scaling prescription would often have been excluded by the assumptions in prior work.
arXiv Detail & Related papers (2024-07-08T12:32:51Z)
- ETHER: Efficient Finetuning of Large-Scale Models with Hyperplane Reflections [59.839926875976225]
We propose the ETHER transformation family, which performs Efficient fineTuning via HypErplane Reflections.
In particular, we introduce ETHER and its relaxation ETHER+, which match or outperform existing PEFT methods with significantly fewer parameters.
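A minimal sketch of finetuning through a learnable hyperplane (Householder) reflection, assuming the reflection is applied to a frozen linear layer's output; the initialization and placement are illustrative rather than the ETHER recipe.

```python
import torch
import torch.nn as nn


class ReflectionTunedLinear(nn.Module):
    """Frozen linear layer composed with a learnable Householder reflection
    H = I - 2 u u^T; only the hyperplane normal u is trained."""

    def __init__(self, linear: nn.Linear):
        super().__init__()
        self.linear = linear.requires_grad_(False)     # frozen pre-trained weights
        self.u = nn.Parameter(torch.randn(linear.out_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        u = self.u / self.u.norm()                     # unit normal keeps the map a reflection
        h = self.linear(x)
        # Apply H to the output without materializing the d x d matrix,
        # which is equivalent to replacing the weight W by H @ W.
        return h - 2.0 * (h @ u).unsqueeze(-1) * u


layer = ReflectionTunedLinear(nn.Linear(768, 768))     # 768 trainable parameters
```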
arXiv Detail & Related papers (2024-05-30T17:26:02Z)
- PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression [31.30170080420504]
State-of-the-art quantization methods include fine-tuning (part of) the compressed parameters over a limited amount of calibration data.
We propose PV-Tuning - a representation-agnostic framework that generalizes and improves upon existing fine-tuning strategies.
arXiv Detail & Related papers (2024-05-23T17:57:04Z)
- E^2VPT: An Effective and Efficient Approach for Visual Prompt Tuning [55.50908600818483]
Fine-tuning large-scale pretrained vision models for new tasks has become increasingly parameter-intensive.
We propose an Effective and Efficient Visual Prompt Tuning (E2VPT) approach for large-scale transformer-based model adaptation.
Our approach outperforms several state-of-the-art baselines on two benchmarks.
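For context, a minimal sketch of the general visual prompt tuning setup that E^2VPT builds on: learnable prompt tokens are prepended to the patch-token sequence of a frozen transformer, and only the prompts (plus any task head) are trained. The backbone, shapes, and names below are illustrative assumptions, not the E^2VPT implementation.

```python
import torch
import torch.nn as nn


class PromptedEncoder(nn.Module):
    """Frozen transformer encoder with learnable prompt tokens prepended
    to the input sequence; only the prompts are trainable here."""

    def __init__(self, encoder: nn.Module, embed_dim: int = 768, num_prompts: int = 10):
        super().__init__()
        self.encoder = encoder.requires_grad_(False)            # frozen backbone
        self.prompts = nn.Parameter(torch.empty(num_prompts, embed_dim))
        nn.init.normal_(self.prompts, std=0.02)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:    # tokens: (batch, seq, dim)
        prompts = self.prompts.unsqueeze(0).expand(tokens.shape[0], -1, -1)
        return self.encoder(torch.cat([prompts, tokens], dim=1))


backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True), num_layers=2)
model = PromptedEncoder(backbone)
out = model(torch.randn(4, 196, 768))                           # -> (4, 206, 768)
```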
arXiv Detail & Related papers (2023-07-25T19:03:21Z)
- Scaling & Shifting Your Features: A New Baseline for Efficient Model Tuning [126.84770886628833]
Existing finetuning methods either tune all parameters of the pretrained model (full finetuning) or only tune the last linear layer (linear probing).
We propose a new parameter-efficient finetuning method termed SSF, in which one only needs to Scale and Shift the deep Features extracted by a pre-trained model to match the performance of full finetuning.
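A minimal sketch of the scale-and-shift idea, assuming the modulation is inserted after a frozen layer and applied along the feature dimension; the module name and placement are assumptions.

```python
import torch
import torch.nn as nn


class ScaleShift(nn.Module):
    """Learnable per-feature scale and shift; these are the only parameters
    tuned, while the pre-trained layers they wrap stay frozen."""

    def __init__(self, dim: int):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))    # scale, identity at init
        self.beta = nn.Parameter(torch.zeros(dim))    # shift, zero at init

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return features * self.gamma + self.beta


frozen = nn.Linear(768, 768).requires_grad_(False)
ssf = ScaleShift(768)
out = ssf(frozen(torch.randn(4, 768)))                # train only ssf.parameters()
```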
arXiv Detail & Related papers (2022-10-17T08:14:49Z)
- Automatic Mixed-Precision Quantization Search of BERT [62.65905462141319]
Pre-trained language models such as BERT have shown remarkable effectiveness in various natural language processing tasks.
These models usually contain millions of parameters, which hinders their practical deployment on resource-constrained devices.
We propose an automatic mixed-precision quantization framework designed for BERT that can simultaneously conduct quantization and pruning at a subgroup-wise level.
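As a rough illustration of subgroup-wise mixed precision with pruning: a weight matrix is split into groups, each group gets its own bit-width (0 meaning pruned), and each retained group is uniformly quantized. The row-wise grouping, symmetric quantizer, and fixed bit-width assignment below are assumptions; the paper searches the assignment automatically.

```python
import torch


def quantize_subgroups(weight: torch.Tensor, bits_per_group: list) -> torch.Tensor:
    """Split rows into subgroups, prune groups assigned 0 bits, and apply
    symmetric uniform quantization to the rest at their assigned precision."""
    groups = torch.chunk(weight, len(bits_per_group), dim=0)
    out = []
    for group, bits in zip(groups, bits_per_group):
        if bits == 0:
            out.append(torch.zeros_like(group))                 # pruned subgroup
            continue
        qmax = 2 ** (bits - 1) - 1
        scale = group.abs().max() / qmax
        out.append(torch.round(group / scale).clamp(-qmax - 1, qmax) * scale)
    return torch.cat(out, dim=0)


w_q = quantize_subgroups(torch.randn(768, 768), bits_per_group=[8, 4, 2, 0])
```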
arXiv Detail & Related papers (2021-12-30T06:32:47Z)
- Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning [52.624194343095304]
We argue that analyzing fine-tuning through the lens of intrinsic dimension provides us with empirical and theoretical intuitions.
We empirically show that common pre-trained models have a very low intrinsic dimension.
arXiv Detail & Related papers (2020-12-22T07:42:30Z)
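A minimal sketch of the subspace construction used to measure intrinsic dimension: the trainable update lives in a small random subspace, theta = theta_0 + P z, with P a frozen random projection and z the only trainable vector. The dense projection and dimensions below are illustrative; practical implementations use structured projections.

```python
import torch
import torch.nn as nn


class SubspaceLinear(nn.Module):
    """Linear layer fine-tuned only through a d-dimensional vector z projected
    into the full weight space by a fixed random matrix."""

    def __init__(self, linear: nn.Linear, intrinsic_dim: int = 200):
        super().__init__()
        theta0 = linear.weight.data.clone().flatten()
        self.register_buffer("theta0", theta0)                          # frozen initialization
        self.register_buffer(
            "proj", torch.randn(theta0.numel(), intrinsic_dim) / intrinsic_dim ** 0.5)
        self.z = nn.Parameter(torch.zeros(intrinsic_dim))                # trainable coordinates
        self.bias = nn.Parameter(linear.bias.data.clone())
        self.weight_shape = linear.weight.shape

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weight = (self.theta0 + self.proj @ self.z).view(self.weight_shape)
        return nn.functional.linear(x, weight, self.bias)


layer = SubspaceLinear(nn.Linear(128, 128), intrinsic_dim=200)           # 200 + bias trainable
```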