LoRA Meets Dropout under a Unified Framework
- URL: http://arxiv.org/abs/2403.00812v2
- Date: Mon, 27 May 2024 02:16:43 GMT
- Title: LoRA Meets Dropout under a Unified Framework
- Authors: Sheng Wang, Liheng Chen, Jiyue Jiang, Boyang Xue, Lingpeng Kong, Chuan Wu,
- Abstract summary: Large language models (LLMs) have emerged as essential elements in numerous NLP applications.
Various dropout methods, initially designed for full finetuning with all the parameters updated, alleviates overfitting associated with excessive parameter redundancy.
We introduce a unified framework for a comprehensive investigation, which instantiates these methods based on dropping position, structural pattern and compensation measure.
- Score: 38.5176197615878
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the remarkable capabilities, large language models (LLMs) have emerged as essential elements in numerous NLP applications, while parameter-efficient finetuning, especially LoRA, has gained popularity as a lightweight approach for model customization. Meanwhile, various dropout methods, initially designed for full finetuning with all the parameters updated, alleviates overfitting associated with excessive parameter redundancy. Hence, a possible contradiction arises from negligible trainable parameters of LoRA and the effectiveness of previous dropout methods, which has been largely overlooked. To fill this gap, we first confirm that parameter-efficient LoRA is also overfitting-prone. We then revisit transformer-specific dropout methods, and establish their equivalence and distinctions mathematically and empirically. Building upon this comparative analysis, we introduce a unified framework for a comprehensive investigation, which instantiates these methods based on dropping position, structural pattern and compensation measure. Through this framework, we reveal the new preferences and performance comparisons of them when involved with limited trainable parameters. This framework also allows us to amalgamate the most favorable aspects into a novel dropout method named HiddenKey. Extensive experiments verify the remarkable superiority and sufficiency of HiddenKey across multiple models and tasks, which highlights it as the preferred approach for high-performance and parameter-efficient finetuning of LLMs.
Related papers
- Mitigating Parameter Degeneracy using Joint Conditional Diffusion Model for WECC Composite Load Model in Power Systems [2.7212274374272543]
We develop a joint conditional diffusion model-based inverse problem solver (JCDI)
JCDI incorporates a joint conditioning architecture with simultaneous inputs of multi-event observations to improve parameter generalizability.
Simulation studies on the WECC CLM show that the proposed JCDI effectively reduces uncertainties of degenerate parameters.
arXiv Detail & Related papers (2024-11-15T18:53:08Z) - Randomized Asymmetric Chain of LoRA: The First Meaningful Theoretical Framework for Low-Rank Adaptation [58.288682735160585]
Low-Rank Adaptation (LoRA) is a popular technique for finetuning models.
LoRA often under performs when compared to full- parameter fine-tuning.
We present a framework that rigorously analyzes the adaptation rates of LoRA methods.
arXiv Detail & Related papers (2024-10-10T18:51:53Z) - LoRTA: Low Rank Tensor Adaptation of Large Language Models [70.32218116940393]
Low Rank Adaptation (LoRA) is a popular Efficient Fine Tuning (PEFT) method that effectively adapts large pre-trained models for downstream tasks.
We propose a novel approach that employs a low rank tensor parametrization for model updates.
Our method is both efficient and effective for fine-tuning large language models, achieving a substantial reduction in the number of parameters while maintaining comparable performance.
arXiv Detail & Related papers (2024-10-05T06:59:50Z) - MoS: Unleashing Parameter Efficiency of Low-Rank Adaptation with Mixture of Shards [35.163843138935455]
The rapid scaling of large language models requires more lightweight finetuning methods to reduce the explosive GPU memory overhead.
Our research highlights the indispensable role of differentiation in reversing the detrimental effects of pure sharing.
We propose Mixture of Shards (MoS), incorporating both inter-layer and intra-layer sharing schemes, and integrating four nearly cost-free differentiation strategies.
arXiv Detail & Related papers (2024-10-01T07:47:03Z) - SMILE: Zero-Shot Sparse Mixture of Low-Rank Experts Construction From Pre-Trained Foundation Models [85.67096251281191]
We present an innovative approach to model fusion called zero-shot Sparse MIxture of Low-rank Experts (SMILE) construction.
SMILE allows for the upscaling of source models into an MoE model without extra data or further training.
We conduct extensive experiments across diverse scenarios, such as image classification and text generation tasks, using full fine-tuning and LoRA fine-tuning.
arXiv Detail & Related papers (2024-08-19T17:32:15Z) - Activated Parameter Locating via Causal Intervention for Model Merging [26.98015572633289]
Model merging combines multiple models into one model, achieving convincing generalization without the necessity of additional training.
Existing models have demonstrated that dropping a portion of delta parameters can alleviate conflicts while maintaining performance.
We propose an Activated Locating (APL) method that utilizes causal intervention to estimate importance, enabling more precise parameter drops and better conflict mitigation.
arXiv Detail & Related papers (2024-08-18T14:00:00Z) - MELoRA: Mini-Ensemble Low-Rank Adapters for Parameter-Efficient Fine-Tuning [71.50432879573614]
Low-rank adaptation (LoRA) is based on the idea that the adaptation process is intrinsically low-dimensional.
We present MELoRA, a mini-ensemble low-rank adapters that uses fewer trainable parameters while maintaining a higher rank.
Our experimental results show that, compared to LoRA, MELoRA achieves better performance with 8 times fewer trainable parameters on natural language understanding tasks and 36 times fewer trainable parameters on instruction following tasks.
arXiv Detail & Related papers (2024-02-27T07:14:12Z) - Tied-Lora: Enhancing parameter efficiency of LoRA with weight tying [6.172790376076545]
We introduce Tied-LoRA, a novel paradigm leveraging weight tying and selective training to enhance the parameter efficiency of Low-rank Adaptation (LoRA)
Our exploration encompasses different plausible combinations of parameter training and freezing, coupled with weight tying, aimed at identifying the optimal trade-off between performance and the count of trainable parameters.
arXiv Detail & Related papers (2023-11-16T05:29:39Z) - IncreLoRA: Incremental Parameter Allocation Method for
Parameter-Efficient Fine-tuning [15.964205804768163]
IncreLoRA is an incremental parameter allocation method that adaptively adds trainable parameters during training.
We conduct extensive experiments on GLUE to demonstrate the effectiveness of IncreLoRA.
arXiv Detail & Related papers (2023-08-23T10:08:10Z) - On the Effectiveness of Parameter-Efficient Fine-Tuning [79.6302606855302]
Currently, many research works propose to only fine-tune a small portion of the parameters while keeping most of the parameters shared across different tasks.
We show that all of the methods are actually sparse fine-tuned models and conduct a novel theoretical analysis of them.
Despite the effectiveness of sparsity grounded by our theory, it still remains an open problem of how to choose the tunable parameters.
arXiv Detail & Related papers (2022-11-28T17:41:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.