PACE: Marrying generalization in PArameter-efficient fine-tuning with Consistency rEgularization
- URL: http://arxiv.org/abs/2409.17137v4
- Date: Wed, 15 Jan 2025 16:56:26 GMT
- Title: PACE: Marrying generalization in PArameter-efficient fine-tuning with Consistency rEgularization
- Authors: Yao Ni, Shan Zhang, Piotr Koniusz,
- Abstract summary: PACE improves the generalization of PArameter-efficient fine-tuning by combining it with Consistency rEgularization.
It not only implicitly regularizes gradients for enhanced generalization, but also implicitly aligns the fine-tuned and pre-trained models to retain knowledge.
It also improves LoRA in text classification (GLUE) and mathematical reasoning (GSM-8K).
- Score: 35.922096876707975
- License:
- Abstract: Parameter-Efficient Fine-Tuning (PEFT) effectively adapts pre-trained transformers to downstream tasks. However, optimizing task performance often comes at the cost of generalizability in fine-tuned models. To address this issue, we theoretically connect smaller weight gradient norms during training and larger datasets to improvements in model generalization. Motivated by this connection, we propose reducing gradient norms for enhanced generalization and aligning the fine-tuned model with its pre-trained counterpart to retain knowledge from large-scale pre-training data. Yet, naive alignment does not guarantee gradient reduction and can potentially cause gradient explosion, complicating efforts to manage gradients. To address this issue, we propose PACE, marrying generalization of PArameter-efficient fine-tuning with Consistency rEgularization. We perturb features learned from the adapter with multiplicative noise and ensure the fine-tuned model remains consistent for the same sample under different perturbations. Theoretical analysis shows that PACE not only implicitly regularizes gradients for enhanced generalization, but also implicitly aligns the fine-tuned and pre-trained models to retain knowledge. Experimental evidence supports our theories. PACE surpasses existing PEFT methods in visual adaptation tasks (VTAB-1k, FGVC, few-shot learning, domain adaptation), showcasing its potential for resource-efficient fine-tuning. It also improves LoRA in text classification (GLUE) and mathematical reasoning (GSM-8K). The code is available at https://github.com/MaxwellYaoNi/PACE
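The consistency term described in the abstract can be sketched in a few lines of PyTorch. The snippet below is a minimal illustration under assumed names and hyperparameters (`head`, `sigma`), not the released implementation in the linked repository: it perturbs the adapter features with multiplicative Gaussian noise and penalizes the squared difference between two perturbed forward passes of the same sample.

```python
import torch

def pace_consistency_loss(adapter_features: torch.Tensor,
                          head: torch.nn.Module,
                          sigma: float = 0.1) -> torch.Tensor:
    """Minimal sketch of a PACE-style consistency penalty (names/values assumed).

    adapter_features: adapter output for a batch of samples.
    head: remaining layers of the fine-tuned model applied on top of the adapter.
    sigma: standard deviation of the multiplicative Gaussian noise.
    """
    # Two independent multiplicative perturbations of the same features.
    noise1 = 1.0 + sigma * torch.randn_like(adapter_features)
    noise2 = 1.0 + sigma * torch.randn_like(adapter_features)
    out1 = head(adapter_features * noise1)
    out2 = head(adapter_features * noise2)
    # Penalize disagreement between the two perturbed predictions.
    return ((out1 - out2) ** 2).mean()
```

In training, this penalty would be added to the task loss with a weighting coefficient; per the abstract, such consistency implicitly shrinks gradient norms and keeps the fine-tuned model aligned with the pre-trained one.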
Related papers
- HG-Adapter: Improving Pre-Trained Heterogeneous Graph Neural Networks with Dual Adapters [53.97380482341493]
"pre-train, prompt-tuning" has demonstrated impressive performance for tuning pre-trained heterogeneous graph neural networks (HGNNs)
We propose a unified framework that combines two new adapters with potential labeled data extension to improve the generalization of pre-trained HGNN models.
arXiv Detail & Related papers (2024-11-02T06:43:54Z) - Large Continual Instruction Assistant [59.585544987096974]
Continual Instruction Tuning (CIT) is adopted to instruct Large Models to follow human intent, dataset by dataset.
Existing gradient updates can severely degrade performance on previous datasets during the CIT process.
We propose a general continual instruction tuning framework to address the challenge.
arXiv Detail & Related papers (2024-10-08T11:24:59Z) - AdAdaGrad: Adaptive Batch Size Schemes for Adaptive Gradient Methods [17.043034606088234]
We introduce AdAdaGrad's scalar variant AdAdaGradNorm, which increases batch sizes during training.
We also perform image classification experiments, highlighting the merits of our proposed strategies.
arXiv Detail & Related papers (2024-02-17T07:49:50Z) - Sparse is Enough in Fine-tuning Pre-trained Large Language Models [98.46493578509039]
We propose a gradient-based sparse fine-tuning algorithm, named Sparse Increment Fine-Tuning (SIFT)
We validate its effectiveness on a range of tasks including the GLUE Benchmark and Instruction-tuning.
arXiv Detail & Related papers (2023-12-19T06:06:30Z) - DR-Tune: Improving Fine-tuning of Pretrained Visual Models by
Distribution Regularization with Semantic Calibration [38.4461170690033]
We propose a novel fine-tuning framework, namely distribution regularization with semantic calibration (DR-Tune)
DR-Tune employs distribution regularization by requiring the downstream task head to also decrease its classification error on the pretrained feature distribution.
To alleviate the interference caused by semantic drift, we develop the semantic calibration (SC) module.
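The distribution-regularization term above can be illustrated with a short, simplified PyTorch sketch: the downstream head is also asked to classify features produced by a frozen copy of the pre-trained backbone. Names and the weighting are assumptions, and the semantic calibration module and the paper's feature bank are omitted.

```python
import torch
import torch.nn.functional as F

def dr_tune_style_loss(x, y, backbone, frozen_pretrained, head, reg_weight=1.0):
    """Simplified sketch of distribution regularization (SC module omitted).

    backbone: model being fine-tuned; frozen_pretrained: frozen pre-trained copy.
    head: downstream classification head; reg_weight is an assumed hyperparameter.
    """
    # Standard fine-tuning loss on features from the trainable backbone.
    task_loss = F.cross_entropy(head(backbone(x)), y)
    # Distribution regularization: the same head should also classify well
    # on features drawn from the pre-trained feature distribution.
    with torch.no_grad():
        pretrained_feats = frozen_pretrained(x)
    reg_loss = F.cross_entropy(head(pretrained_feats), y)
    return task_loss + reg_weight * reg_loss
```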
arXiv Detail & Related papers (2023-08-23T10:59:20Z) - Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators called WTA-CRS, for matrix products with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
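For context, the sketch below shows plain column-row sampling, the classic unbiased matrix-product estimator that WTA-CRS builds on; it is not the winner-take-all variant from the paper, and all names are illustrative.

```python
import torch

def cr_sample_matmul(A: torch.Tensor, B: torch.Tensor, c: int) -> torch.Tensor:
    """Unbiased column-row sampling estimate of A @ B using c sampled index pairs."""
    # Sampling probabilities proportional to column/row norms reduce variance.
    w = A.norm(dim=0) * B.norm(dim=1)
    p = w / w.sum()
    idx = torch.multinomial(p, c, replacement=True)
    # Rescale each sampled outer product by 1 / (c * p_i) to keep the estimate unbiased.
    scale = 1.0 / (c * p[idx])
    return (A[:, idx] * scale) @ B[idx, :]
```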
arXiv Detail & Related papers (2023-05-24T15:52:08Z) - Trainable Projected Gradient Method for Robust Fine-tuning [36.470333094917436]
We propose the Trainable Projected Gradient Method (TPGM) to automatically learn the constraint imposed on each layer, enabling fine-grained fine-tuning regularization.
This is motivated by formulating fine-tuning as a bi-level constrained optimization problem.
We show that TPGM outperforms existing fine-tuning methods in OOD performance while matching the best in-distribution (ID) performance.
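The core projection behind this kind of constrained fine-tuning can be sketched as follows; the per-layer radii, which TPGM learns through its bi-level formulation, are simply passed in here, and the helper names are illustrative.

```python
import torch

@torch.no_grad()
def project_to_pretrained_ball(model, pretrained_state, radii):
    """Sketch of a per-layer projection step: keep each fine-tuned weight tensor
    within a norm ball of given radius around its pre-trained value."""
    for name, param in model.named_parameters():
        if name not in radii:
            continue
        delta = param - pretrained_state[name]
        norm = delta.norm()
        if norm > radii[name]:
            # Pull the weights back onto the boundary of the constraint ball.
            param.copy_(pretrained_state[name] + delta * (radii[name] / norm))
```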
arXiv Detail & Related papers (2023-03-19T17:30:44Z) - Evaluating Prediction-Time Batch Normalization for Robustness under
Covariate Shift [81.74795324629712]
We evaluate a technique we call prediction-time batch normalization, which significantly improves model accuracy and calibration under covariate shift.
We show that prediction-time batch normalization provides complementary benefits to existing state-of-the-art approaches for improving robustness.
The method has mixed results when used alongside pre-training, and does not seem to perform as well under more natural types of dataset shift.
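In PyTorch terms, prediction-time batch normalization can be sketched by letting the BatchNorm layers use the statistics of the incoming prediction batch instead of their stored running estimates; the helper name below is illustrative.

```python
import torch
from torch import nn

@torch.no_grad()
def predict_with_batch_stats(model: nn.Module, batch: torch.Tensor) -> torch.Tensor:
    """Sketch of prediction-time batch normalization: normalize the test batch
    with its own statistics rather than the training-time running estimates."""
    model.eval()
    # Switch only the BatchNorm layers back to training mode so they normalize
    # with the current batch's statistics. Note this also updates their running
    # buffers; save and restore the state_dict if those are needed later.
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            m.train()
    outputs = model(batch)
    model.eval()
    return outputs
```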
arXiv Detail & Related papers (2020-06-19T05:08:43Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)