DARE the Extreme: Revisiting Delta-Parameter Pruning For Fine-Tuned Models
- URL: http://arxiv.org/abs/2410.09344v1
- Date: Sat, 12 Oct 2024 03:21:58 GMT
- Title: DARE the Extreme: Revisiting Delta-Parameter Pruning For Fine-Tuned Models
- Authors: Wenlong Deng, Yize Zhao, Vala Vakilian, Minghui Chen, Xiaoxiao Li, Christos Thrampoulidis
- Abstract summary: We introduce DAREx-q, a rescaling factor modification that significantly boosts performance at high pruning rates.
We demonstrate that DAREx-q can be seamlessly combined with vanilla parameter-efficient fine-tuning techniques like LoRA.
We revisit the application of importance-based pruning techniques within DPP, demonstrating that they outperform random-based methods when delta parameters are large.
- Score: 39.411072236355515
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Storing open-source fine-tuned models separately introduces redundancy and increases response times in applications utilizing multiple models. Delta-parameter pruning (DPP), particularly the random drop and rescale (DARE) method proposed by Yu et al., addresses this by pruning the majority of delta parameters (the differences between fine-tuned and pre-trained model weights) while typically maintaining minimal performance loss. However, DARE fails when either the pruning rate or the magnitude of the delta parameters is large. We highlight two key reasons for this failure: (1) an excessively large rescaling factor as pruning rates increase, and (2) high mean and variance in the delta parameters. To push DARE's limits, we introduce DAREx (DARE the eXtreme), which features two algorithmic improvements: (1) DAREx-q, a rescaling factor modification that significantly boosts performance at high pruning rates (e.g., >30% on CoLA and SST-2 for encoder models, with even greater gains in decoder models), and (2) DAREx-L2, which combines DARE with AdamR, an in-training method that applies appropriate delta regularization before DPP. We also demonstrate that DAREx-q can be seamlessly combined with vanilla parameter-efficient fine-tuning techniques like LoRA and can facilitate structural DPP. Additionally, we revisit the application of importance-based pruning techniques within DPP, demonstrating that they outperform random-based methods when delta parameters are large. Through this comprehensive study, we develop a pipeline for selecting the most appropriate DPP method under various practical scenarios.
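As a rough, hedged illustration of the drop-and-rescale mechanics described in the abstract, the sketch below applies DARE-style random pruning to a single delta-parameter tensor. The 1/(1-p) rescaling is the standard DARE choice; the optional `rescale` argument merely stands in for DAREx-q's modified rescaling factor, whose actual selection procedure is not reproduced here. The function name, signature, and example values are illustrative assumptions, not the authors' implementation.

```python
from typing import Optional

import torch


def dare_prune(w_finetuned: torch.Tensor,
               w_pretrained: torch.Tensor,
               p: float,
               rescale: Optional[float] = None) -> torch.Tensor:
    """Randomly drop a fraction p of the delta parameters and rescale the rest.

    With rescale=None this follows vanilla DARE and multiplies surviving deltas
    by 1 / (1 - p); passing a smaller factor mimics the DAREx-q idea of taming
    the rescaling factor at high pruning rates (the paper's exact choice of
    factor is not reproduced here).
    """
    delta = w_finetuned - w_pretrained            # delta parameters
    keep = (torch.rand_like(delta) >= p).float()  # keep each entry with prob. 1 - p
    factor = rescale if rescale is not None else 1.0 / (1.0 - p)
    return w_pretrained + delta * keep * factor   # reconstructed fine-tuned weights


# Toy usage: prune 99% of the deltas of one weight matrix.
w_pre = torch.randn(768, 768)
w_ft = w_pre + 0.01 * torch.randn(768, 768)
w_dare = dare_prune(w_ft, w_pre, p=0.99)                 # vanilla DARE, factor = 100
w_darex = dare_prune(w_ft, w_pre, p=0.99, rescale=30.0)  # hypothetical smaller factor
```

At p close to 1 the vanilla factor 1/(1-p) blows up, which is exactly the failure mode the abstract attributes to DARE; DAREx-q addresses it by choosing the factor differently, while DAREx-L2 instead regularizes the deltas during training (via AdamR) so they are more amenable to pruning.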
Related papers
- Activated Parameter Locating via Causal Intervention for Model Merging [26.98015572633289]
Model merging combines multiple models into one, achieving convincing generalization without requiring additional training.
Existing methods have demonstrated that dropping a portion of delta parameters can alleviate conflicts while maintaining performance.
We propose an Activated Parameter Locating (APL) method that utilizes causal intervention to estimate parameter importance, enabling more precise parameter drops and better conflict mitigation (see the magnitude-based sketch after this list).
arXiv Detail & Related papers (2024-08-18T14:00:00Z) - Adapter-X: A Novel General Parameter-Efficient Fine-Tuning Framework for Vision [52.80792724919329]
We introduce a novel framework named Adapter-X to improve fine-tuning in 2D image and 3D point cloud modalities.
It is the first to outperform full fine-tuning in both 2D image and 3D point cloud modalities with significantly fewer parameters, i.e., only 0.20% and 1.88% of the original trainable parameters for 2D and 3D classification tasks, respectively.
arXiv Detail & Related papers (2024-06-05T08:26:44Z) - SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models [53.638791265113625]
SPP is a sparsity-preserved, parameter-efficient fine-tuning method for large language models.
Code will be made available at https://github.com/Lucky-Lance/SPP.
arXiv Detail & Related papers (2024-05-25T04:55:27Z) - DiffEnc: Variational Diffusion with a Learned Encoder [14.045374947755922]
We introduce a data- and depth-dependent mean function in the diffusion process, which leads to a modified diffusion loss.
Our proposed framework, DiffEnc, achieves a statistically significant improvement in likelihood on CIFAR-10.
arXiv Detail & Related papers (2023-10-30T17:54:36Z) - Sensi-BERT: Towards Sensitivity Driven Fine-Tuning for Parameter-Efficient BERT [6.029590006321152]
We present Sensi-BERT, a sensitivity-driven, parameter-efficient fine-tuning approach for BERT models on downstream tasks.
Our experiments show the efficacy of Sensi-BERT across different downstream tasks including MNLI, QQP, QNLI, SST-2 and SQuAD.
arXiv Detail & Related papers (2023-07-14T17:24:15Z) - OpenDelta: A Plug-and-play Library for Parameter-efficient Adaptation of Pre-trained Models [81.7855202178564]
We present OpenDelta, an open-source library that overcomes limitations by providing a plug-and-play implementation of various delta tuning methods.
Our novel techniques eliminate the need to modify the backbone PTMs' code, making OpenDelta compatible with different, even novel PTMs.
arXiv Detail & Related papers (2023-07-05T16:30:14Z) - Delta Tuning: A Comprehensive Study of Parameter Efficient Methods for Pre-trained Language Models [90.24999406296867]
In contrast with the standard fine-tuning, delta tuning only fine-tunes a small portion of the model parameters while keeping the rest untouched.
Recent studies have demonstrated that a series of delta tuning methods with distinct tuned-parameter selection can achieve performance on a par with full-parameter fine-tuning.
arXiv Detail & Related papers (2022-03-14T07:56:32Z) - A Generic Network Compression Framework for Sequential Recommender Systems [71.81962915192022]
Sequential recommender systems (SRS) have become the key technology in capturing users' dynamic interests and generating high-quality recommendations.
We propose a compressed sequential recommendation framework, termed as CpRec, where two generic model shrinking techniques are employed.
Through extensive ablation studies, we demonstrate that the proposed CpRec can achieve up to 4-8 times compression rates on real-world SRS datasets.
arXiv Detail & Related papers (2020-04-21T08:40:55Z)
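The main abstract above also revisits importance-based pruning within DPP, and the APL entry's causal-intervention importance estimation points in the same direction. The following sketch uses plain delta magnitude as a stand-in importance score; the actual criteria in those papers are more sophisticated, so treat this as an assumption-laden illustration (the function name and top-k thresholding scheme are mine, not from the papers).

```python
import torch


def magnitude_prune_delta(w_finetuned: torch.Tensor,
                          w_pretrained: torch.Tensor,
                          p: float) -> torch.Tensor:
    """Keep only the largest-magnitude (1 - p) fraction of delta parameters.

    A minimal importance-based alternative to random drop-and-rescale: entries
    are ranked per tensor by |delta| and no rescaling is applied.
    """
    delta = w_finetuned - w_pretrained
    k = max(1, int(round((1.0 - p) * delta.numel())))    # number of deltas to keep
    threshold = torch.topk(delta.abs().flatten(), k).values.min()
    keep = (delta.abs() >= threshold).float()
    return w_pretrained + delta * keep


# Toy usage: keep only the top 1% of deltas by magnitude.
w_pre = torch.randn(768, 768)
w_ft = w_pre + 0.05 * torch.randn(768, 768)
w_pruned = magnitude_prune_delta(w_ft, w_pre, p=0.99)
```

Per the abstract's finding, importance-based criteria along these lines tend to outperform random dropping when the delta parameters themselves are large.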