Scaling Sparse Fine-Tuning to Large Language Models
- URL: http://arxiv.org/abs/2401.16405v2
- Date: Fri, 2 Feb 2024 14:53:14 GMT
- Title: Scaling Sparse Fine-Tuning to Large Language Models
- Authors: Alan Ansell, Ivan Vulić, Hannah Sterz, Anna Korhonen, and Edoardo M. Ponti
- Abstract summary: Large Language Models (LLMs) are difficult to fully fine-tune due to their sheer number of parameters.
We propose SpIEL, a novel sparse fine-tuning method which maintains an array of parameter indices and the deltas of these parameters relative to their pretrained values.
We show that SpIEL is superior to popular parameter-efficient fine-tuning methods like LoRA in terms of performance and comparable in terms of run time.
- Score: 67.59697720719672
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) are difficult to fully fine-tune (e.g., with
instructions or human feedback) due to their sheer number of parameters. A
family of parameter-efficient sparse fine-tuning methods have proven promising
in terms of performance but their memory requirements increase proportionally
to the size of the LLMs. In this work, we scale sparse fine-tuning to
state-of-the-art LLMs like LLaMA 2 7B and 13B. We propose SpIEL, a novel sparse
fine-tuning method which, for a desired density level, maintains an array of
parameter indices and the deltas of these parameters relative to their
pretrained values. It iterates over: (a) updating the active deltas, (b)
pruning indices (based on the change of magnitude of their deltas) and (c)
regrowth of indices. For regrowth, we explore two criteria based on either the
accumulated gradients of a few candidate parameters or their approximate
momenta estimated using the efficient SM3 optimizer. We experiment with
instruction-tuning of LLMs on standard dataset mixtures, finding that SpIEL is
often superior to popular parameter-efficient fine-tuning methods like LoRA
(low-rank adaptation) in terms of performance and comparable in terms of run
time. We additionally show that SpIEL is compatible with both quantization and
efficient optimizers, to facilitate scaling to ever-larger model sizes. We
release the code for SpIEL at https://github.com/AlanAnsell/peft and for the
instruction-tuning experiments at https://github.com/ducdauge/sft-llm.
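Below is a minimal PyTorch sketch of the update / prune / regrow cycle described in the abstract. It is an illustration rather than the released SpIEL implementation: the function name spiel_step, the plain SGD delta update, the pruning score based on the smallest delta magnitudes (the paper's criterion is based on the change of the deltas' magnitudes), and the dense accumulated-gradient regrowth score are simplifying assumptions.

```python
import torch

@torch.no_grad()
def spiel_step(weight, pretrained, indices, deltas, grad, acc_grad,
               lr=1e-4, swap_frac=0.1):
    """One illustrative update / prune / regrow cycle for a flattened weight.

    weight     -- dense parameter tensor (1-D view), modified in place
    pretrained -- frozen pretrained values (1-D, same shape)
    indices    -- LongTensor of currently active positions (the sparse support)
    deltas     -- FloatTensor holding weight[indices] - pretrained[indices]
    grad       -- gradient of the loss w.r.t. weight (1-D)
    acc_grad   -- running accumulator of |grad|, used here as the regrowth score
    """
    # Track accumulated gradient magnitudes for the regrowth criterion.
    acc_grad.add_(grad.abs())

    # (a) Update the active deltas with a gradient step on their positions.
    deltas = deltas - lr * grad[indices]

    # (b) Prune: drop a fraction of the active positions. As a simplification we
    #     drop the smallest |delta|; SpIEL's criterion is based on how much each
    #     delta's magnitude has changed.
    k = int(swap_frac * indices.numel())
    keep = torch.topk(deltas.abs(), indices.numel() - k).indices
    indices, deltas = indices[keep], deltas[keep]

    # (c) Regrow: reactivate the k inactive positions with the largest accumulated
    #     gradient magnitude (the paper alternatively uses approximate momenta from
    #     the SM3 optimizer); newly grown deltas start at zero.
    score = acc_grad.clone()
    score[indices] = float("-inf")          # exclude already-active positions
    grown = torch.topk(score, k).indices
    indices = torch.cat([indices, grown])
    deltas = torch.cat([deltas, torch.zeros(k, dtype=deltas.dtype,
                                            device=deltas.device)])

    # Materialise the sparse update into the dense weight.
    weight.copy_(pretrained)
    weight[indices] = pretrained[indices] + deltas
    return indices, deltas
```

Only the indices and deltas need to be stored per weight tensor, so fine-tuning memory scales with the chosen density rather than the model size; the dense grad and acc_grad tensors above are for illustration only, since the paper restricts gradient accumulation to a small set of candidate parameters (or reuses SM3's compressed momenta) precisely to avoid such dense optimizer state.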
Related papers
- LoLDU: Low-Rank Adaptation via Lower-Diag-Upper Decomposition for Parameter-Efficient Fine-Tuning [4.616740762629019]
Low-Rank Adaptation (LoRA) seeks to address the problem of handling the large number of updated parameters in full fine-tuning.
We propose LoLDU, a Parameter-Efficient Fine-Tuning (PEFT) approach that significantly reduces trainable parameters by a factor of 2600.
arXiv Detail & Related papers (2024-10-17T14:51:17Z)
- Simultaneous Computation and Memory Efficient Zeroth-Order Optimizer for Fine-Tuning Large Language Models [33.911521719528686]
Fine-tuning is powerful for adapting large language models to downstream tasks, but it often incurs huge memory usage.
A promising approach is to use Zeroth-Order (ZO) gradients, which are estimated from forward passes to replace First-Order (FO) gradients.
We introduce LeZO, a novel layer-wise sparse, computation- and memory-efficient ZO optimizer.
arXiv Detail & Related papers (2024-10-13T12:47:37Z)
- Zeroth-Order Fine-Tuning of LLMs in Random Subspaces [66.27334633749734]
As language models grow in size, memory demands for backpropagation increase.
Zeroth-order (ZO) optimization methods offer a memory-efficient alternative.
We show that SubZero enhances fine-tuning and converges faster than standard ZO approaches.
arXiv Detail & Related papers (2024-10-11T17:01:43Z)
- Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient [57.9629676017527]
We propose an optimization-based structural pruning method for Large Language Models.
We learn the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model.
Our method runs in 2.7 hours with around 35GB of memory for the 13B models on a single A100 GPU.
arXiv Detail & Related papers (2024-06-15T09:31:03Z)
- CLAQ: Pushing the Limits of Low-Bit Post-Training Quantization for LLMs [44.03692512352445]
Column-Level Adaptive weight Quantization (CLAQ) is a novel and effective framework for Large Language Model (LLM) quantization.
The CLAQ framework introduces three different types of adaptive strategies for LLM quantization.
Experiments on various mainstream open source LLMs including LLaMA-1, LLaMA-2 and Yi demonstrate that our methods achieve the state-of-the-art results across different bit settings.
arXiv Detail & Related papers (2024-05-27T14:49:39Z)
- Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment [56.44025052765861]
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks.
We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs.
We show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x.
arXiv Detail & Related papers (2024-05-06T16:03:32Z)
- AdaLomo: Low-memory Optimization with Adaptive Learning Rate [59.64965955386855]
We introduce low-memory optimization with adaptive learning rate (AdaLomo) for large language models.
AdaLomo achieves results on par with AdamW, while significantly reducing memory requirements, thereby lowering the hardware barrier to training large language models.
arXiv Detail & Related papers (2023-10-16T09:04:28Z)
- Fine-Tuning Language Models with Just Forward Passes [92.04219196752007]
Fine-tuning language models (LMs) has yielded success on diverse downstream tasks, but as LMs grow in size, backpropagation requires a large amount of memory.
We propose a memory-efficient zeroth-order optimizer (MeZO) that operates in-place, thereby fine-tuning LMs with the same memory footprint as inference (see the sketch after this list).
arXiv Detail & Related papers (2023-05-27T02:28:10Z)
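Several of the entries above (LeZO, SubZero, MeZO) rely on the same core idea: estimating the gradient from forward passes alone so that no backpropagation state has to be kept in memory. The sketch below shows the basic two-point (SPSA-style) estimate together with the trick, popularised by MeZO, of regenerating the perturbation from a stored random seed instead of keeping it in memory. It is a minimal illustration under stated assumptions: the loss_fn(model, batch) callable, the plain SGD update, and the single-device parameter placement are hypothetical simplifications, not the API of any of the cited papers.

```python
import torch

@torch.no_grad()
def zo_sgd_step(model, loss_fn, batch, lr=1e-6, eps=1e-3):
    """One zeroth-order step: two forward passes, no backpropagation."""
    params = [p for p in model.parameters() if p.requires_grad]
    device = params[0].device          # assumes all trainable params on one device
    seed = torch.seed()                # remember the perturbation by its seed only

    def perturb(scale):
        # Regenerate the same Gaussian noise z from the seed instead of storing it.
        gen = torch.Generator(device=device).manual_seed(seed)
        for p in params:
            z = torch.randn(p.shape, generator=gen, device=device, dtype=p.dtype)
            p.add_(scale * eps * z)

    perturb(+1)                        # theta + eps * z
    loss_plus = loss_fn(model, batch)
    perturb(-2)                        # theta - eps * z
    loss_minus = loss_fn(model, batch)
    perturb(+1)                        # restore theta

    # Two-point (SPSA) estimate of the directional derivative along z.
    g = float(loss_plus - loss_minus) / (2 * eps)

    # SGD update along the same noise directions, regenerated from the seed.
    gen = torch.Generator(device=device).manual_seed(seed)
    for p in params:
        z = torch.randn(p.shape, generator=gen, device=device, dtype=p.dtype)
        p.add_(-lr * g * z)
    return float(loss_plus)
```

The only extra state beyond inference is a random seed and two scalar losses, which is what makes ZO fine-tuning attractive as models grow; LeZO's layer-wise sparsity and SubZero's random subspaces refine how the perturbation directions are chosen.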
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.