SparseGrad: A Selective Method for Efficient Fine-tuning of MLP Layers
- URL: http://arxiv.org/abs/2410.07383v1
- Date: Wed, 9 Oct 2024 19:03:52 GMT
- Title: SparseGrad: A Selective Method for Efficient Fine-tuning of MLP Layers
- Authors: Viktoriia Chekalina, Anna Rudenko, Gleb Mezentsev, Alexander Mikhalev, Alexander Panchenko, Ivan Oseledets
- Abstract summary: We propose a new selective PEFT method, namely SparseGrad, that performs well on MLP blocks.
We apply SparseGrad to fine-tune BERT and RoBERTa for the NLU task and LLaMa-2 for the Question-Answering task.
- Score: 88.68985153780514
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The performance of Transformer models has been enhanced by increasing the number of parameters and the length of the processed text. Consequently, fine-tuning the entire model becomes a memory-intensive process. High-performance methods for parameter-efficient fine-tuning (PEFT) typically work with Attention blocks and often overlook MLP blocks, which contain about half of the model parameters. We propose a new selective PEFT method, namely SparseGrad, that performs well on MLP blocks. We transfer layer gradients to a space where only about 1% of the layer's elements remain significant. By converting gradients into a sparse structure, we reduce the number of updated parameters. We apply SparseGrad to fine-tune BERT and RoBERTa for the NLU task and LLaMa-2 for the Question-Answering task. In these experiments, with identical memory requirements, our method outperforms LoRA and MeProp, two robust and popular state-of-the-art PEFT approaches.
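To make the gradient-sparsification idea above concrete, here is a minimal PyTorch sketch of a SparseGrad-style update for a single MLP weight matrix. The choice of basis (the singular vectors of the pretrained weight), the top-magnitude selection rule, the 1% keep ratio, and the plain SGD step are illustrative assumptions for this sketch, not the paper's exact construction.

```python
import torch

def sparsegrad_style_update(weight: torch.Tensor,
                            grad: torch.Tensor,
                            lr: float = 1e-4,
                            keep_ratio: float = 0.01) -> torch.Tensor:
    """Illustrative selective update for one MLP weight matrix.

    The SVD basis and the plain SGD step are assumptions made for this
    sketch; SparseGrad defines its own gradient transform and training setup.
    """
    # 1. Express the gradient in an alternative basis where few entries
    #    are expected to matter (here: the singular-vector basis of the weight).
    U, _, Vh = torch.linalg.svd(weight, full_matrices=False)
    g_t = U.T @ grad @ Vh.T

    # 2. Keep only the largest ~1% of transformed entries by magnitude.
    k = max(1, int(keep_ratio * g_t.numel()))
    threshold = torch.topk(g_t.abs().flatten(), k).values.min()
    g_sparse = torch.where(g_t.abs() >= threshold, g_t, torch.zeros_like(g_t))

    # 3. Map the sparsified gradient back and apply the update.
    return weight - lr * (U @ g_sparse @ Vh)

# Example: one step on a random stand-in for a 3072x768 MLP projection.
W = torch.randn(3072, 768)
G = torch.randn(3072, 768)  # placeholder for a backpropagated gradient
W_updated = sparsegrad_style_update(W, G)
```

Storing only the surviving transformed entries (and their optimizer state) in a sparse structure is what would yield the reduction in updated parameters that the abstract describes.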
Related papers
- Gradient-Mask Tuning Elevates the Upper Limits of LLM Performance [51.36243421001282]
Gradient-Mask Tuning (GMT) is a method that selectively updates parameters during training based on their gradient information.
Our empirical results across various tasks demonstrate that GMT not only outperforms traditional fine-tuning methods but also elevates the upper limits of LLM performance.
arXiv Detail & Related papers (2024-06-21T17:42:52Z)
- Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient [57.9629676017527]
We propose an optimization-based structural pruning method for large language models.
We learn the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model.
Our method runs in 2.7 hours with around 35GB of memory for 13B models on a single A100 GPU.
arXiv Detail & Related papers (2024-06-15T09:31:03Z)
- Sparse Matrix in Large Language Model Fine-tuning [1.9874264019909988]
We introduce a method for selecting sparse sub-matrices that aims to minimize the performance gap between PEFT and full fine-tuning.
In experiments, we demonstrate that our method consistently surpasses other PEFT baselines.
We also examine how the performance of LoRA and DoRA tends to plateau and decline as the number of trainable parameters increases.
arXiv Detail & Related papers (2024-05-24T13:12:14Z)
- BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation [54.28841287750586]
Large language models (LLMs) have demonstrated outstanding performance in various tasks, such as text summarization and text question-answering, but at a substantial computational and memory cost.
Existing solutions such as SparseGPT and Wanda attempt to alleviate this cost through weight pruning.
This paper introduces a novel LLM pruning technique dubbed blockwise parameter-efficient sparsity allocation (BESA) by applying a blockwise reconstruction loss.
arXiv Detail & Related papers (2024-02-18T12:44:15Z)
- From PEFT to DEFT: Parameter Efficient Finetuning for Reducing Activation Density in Transformers [52.199303258423306]
We propose a novel density loss that encourages higher activation sparsity in pre-trained models.
Our proposed method, DEFT, can consistently reduce activation density by up to 44.94% on RoBERTa-Large and by 53.19% (encoder density) and 90.60% (decoder density) on Flan-T5-XXL.
arXiv Detail & Related papers (2024-02-02T21:25:46Z)
- QFT: Quantized Full-parameter Tuning of LLMs with Affordable Resources [37.265708531464746]
Large Language Models (LLMs) have showcased remarkable impacts across a wide spectrum of natural language processing tasks.
Fine-tuning these pre-trained models on downstream datasets provides further significant performance gains, but this process has been challenging due to its extraordinary resource requirements.
We propose QFT, a novel Quantized Full-parameter Tuning framework for LLMs that enables memory-efficient fine-tuning without harming performance.
arXiv Detail & Related papers (2023-10-11T02:47:40Z)
- Make Pre-trained Model Reversible: From Parameter to Memory Efficient Fine-Tuning [6.451743797015637]
We propose memory-efficient fine-tuning (MEFT) for pre-trained language models.
MEFT inserts adapters into a PLM, preserving the PLM's starting point and making it reversible without additional pre-training.
MEFT significantly reduces activation memory, by up to 84% relative to full fine-tuning, with a negligible number of additional trainable parameters.
arXiv Detail & Related papers (2023-06-01T09:26:17Z)
- Towards Robust Low-Resource Fine-Tuning with Multi-View Compressed Representations [51.75960511842552]
Fine-tuning of pretrained language models (PLMs) is prone to overfitting in low-resource scenarios.
We present a novel method that operates on the hidden representations of a PLM to reduce overfitting.
arXiv Detail & Related papers (2022-11-16T09:39:29Z)
- Sparse-MLP: A Fully-MLP Architecture with Conditional Computation [7.901786481399378]
Mixture-of-Experts (MoE) with sparse conditional computation has proven to be an effective architecture for scaling attention-based models to more parameters with comparable computation cost.
We propose Sparse-MLP, scaling the recent MLP-Mixer model with MoE, to achieve a more efficient architecture.
arXiv Detail & Related papers (2021-09-05T06:43:08Z)