Derivative-Free Optimization for Low-Rank Adaptation in Large Language Models
- URL: http://arxiv.org/abs/2403.01754v1
- Date: Mon, 4 Mar 2024 06:20:31 GMT
- Title: Derivative-Free Optimization for Low-Rank Adaptation in Large Language Models
- Authors: Feihu Jin, Yin Liu, Ying Tan
- Abstract summary: We propose a derivative-free optimization method that avoids gradient computation and shows greater robustness in few-shot settings.
Our method achieves substantial improvements and exhibits clear advantages in memory usage and convergence speed over existing gradient-based parameter-efficient tuning and derivative-free optimization methods in few-shot settings.
- Score: 4.926283917321645
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Parameter-efficient tuning methods such as LoRA can achieve performance
comparable to full model tuning while updating only a small fraction of the
parameters. However, substantial computational resources are still required,
because this process involves computing gradients and performing
back-propagation throughout the model. Much recent effort has been devoted to
derivative-free optimization methods that avoid gradient computation and show
greater robustness in few-shot settings. In this paper, we prepend low-rank
modules to each self-attention layer of the model and employ two
derivative-free optimization methods to optimize these low-rank modules at
each layer alternately. Extensive results on various tasks and language models
demonstrate that our method achieves substantial improvements and exhibits
clear advantages in memory usage and convergence speed over existing
gradient-based parameter-efficient tuning and derivative-free optimization
methods in few-shot settings.
Related papers
- Compact Model Parameter Extraction via Derivative-Free Optimization [0.0]
Traditionally, parameter extraction is performed manually by dividing the complete set of parameters into smaller subsets.
We employ derivative-free optimization to identify a good parameter set that best fits the compact model without performing an exhaustive number of simulations.
We demonstrate the effectiveness of our methodology by successfully modeling two semiconductor devices.
arXiv Detail & Related papers (2024-06-24T06:52:50Z)
- Edge-Efficient Deep Learning Models for Automatic Modulation Classification: A Performance Analysis [0.7428236410246183]
We investigate optimized convolutional neural networks (CNNs) developed for automatic modulation classification (AMC) of wireless signals.
We propose optimized models with the combinations of these techniques to fuse the complementary optimization benefits.
The experimental results show that the proposed individual and combined optimization techniques are highly effective for developing models with significantly less complexity.
arXiv Detail & Related papers (2024-04-11T06:08:23Z)
- Simulated Overparameterization [35.12611686956487]
We introduce a novel paradigm called Simulated Overparameterization (SOP).
SOP proposes a unique approach to model training and inference, in which a model with a significantly larger number of parameters is trained such that a smaller, efficient subset of these parameters is used for the actual computation during inference.
We present a novel, architecture-agnostic algorithm called "majority kernels", which seamlessly integrates with predominant architectures, including Transformer models.
arXiv Detail & Related papers (2024-02-07T17:07:41Z)
- Partial Fine-Tuning: A Successor to Full Fine-Tuning for Vision Transformers [50.23439411530435]
We show that Partial Fine-Tuning can be an innovative and promising direction capable of concurrently enhancing both efficiency and accuracy.
We propose a novel fine-tuned angle metric to guide the selection of appropriate layers for partial fine-tuning.
Comprehensive experiments on a wide range of datasets and models validate the great potential of partial fine-tuning.
arXiv Detail & Related papers (2023-12-25T10:11:34Z)
- Multi-fidelity Constrained Optimization for Stochastic Black Box Simulators [1.6385815610837167]
We introduce the algorithm Scout-Nd (Stochastic Constrained Optimization for N dimensions) to tackle the issues mentioned earlier.
Scout-Nd efficiently estimates the gradient, reduces the noise of the estimator gradient, and applies multi-fidelity schemes to further reduce computational effort.
We validate our approach on standard benchmarks, demonstrating its effectiveness in optimizing parameters and its better performance compared to existing methods.
arXiv Detail & Related papers (2023-11-25T23:36:38Z)
- Towards Compute-Optimal Transfer Learning [82.88829463290041]
We argue that zero-shot structured pruning of pretrained models allows them to increase compute efficiency with minimal reduction in performance.
Our results show that pruning convolutional filters of pretrained models can lead to more than 20% performance improvement in low computational regimes.
arXiv Detail & Related papers (2023-04-25T21:49:09Z)
- An Empirical Evaluation of Zeroth-Order Optimization Methods on AI-driven Molecule Optimization [78.36413169647408]
We study the effectiveness of various ZO optimization methods for optimizing molecular objectives.
We show the advantages of ZO sign-based gradient descent (ZO-signGD).
We demonstrate the potential effectiveness of ZO optimization methods on widely used benchmark tasks from the Guacamol suite.
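The ZO-signGD method named above can be sketched in a few lines: estimate the gradient from function evaluations only, then step using the sign of the estimate. This is a generic illustration under assumed hyperparameters, with a toy smooth objective standing in for a molecular score; it is not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    # Toy smooth objective standing in for a black-box molecular score.
    return float(np.sum((x - 1.0) ** 2))

def zo_signgd(x, steps=300, lr=0.02, mu=1e-3, q=10):
    # Zeroth-order sign-based gradient descent: no analytic gradients,
    # only 1-point finite differences along q random directions per step.
    for _ in range(steps):
        g = np.zeros_like(x)
        for _ in range(q):
            u = rng.normal(size=x.shape)           # random direction
            g += (f(x + mu * u) - f(x)) / mu * u   # finite-difference estimate
        x = x - lr * np.sign(g / q)                # step with the sign only
    return x

x0 = rng.normal(size=5)
x_opt = zo_signgd(x0.copy())
```

Using only the sign of the averaged estimate makes the step size insensitive to the (noisy) gradient magnitude, which is the property the paper highlights for ZO-signGD.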
arXiv Detail & Related papers (2022-10-27T01:58:10Z)
- Self-Tuning Stochastic Optimization with Curvature-Aware Gradient Filtering [53.523517926927894]
We explore the use of exact per-sample Hessian-vector products and gradients to construct self-tuning quadratics.
We prove that our model-based procedure converges in the noisy gradient setting.
This is an interesting step toward constructing self-tuning quadratics.
arXiv Detail & Related papers (2020-11-09T22:07:30Z)
- Efficient Learning of Generative Models via Finite-Difference Score Matching [111.55998083406134]
We present a generic strategy to efficiently approximate any-order directional derivative with finite difference.
Our approximation only involves function evaluations, which can be executed in parallel, and no gradient computations.
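The core idea of approximating a directional derivative with function evaluations alone can be illustrated with a central finite difference. The objective `f` and direction `v` below are illustrative, not from the paper.

```python
import numpy as np

def f(x):
    # Illustrative smooth function; exact directional derivative is known.
    return float(np.sum(x ** 3))

def directional_derivative_fd(f, x, v, eps=1e-5):
    # Central difference: (f(x + eps*v) - f(x - eps*v)) / (2*eps).
    # Only two function evaluations, no gradient computation; the two
    # evaluations are independent and could run in parallel.
    return (f(x + eps * v) - f(x - eps * v)) / (2 * eps)

x = np.array([1.0, 2.0])
v = np.array([1.0, 0.0])
approx = directional_derivative_fd(f, x, v)
exact = float(np.dot(3 * x ** 2, v))  # analytic grad of sum(x^3) is 3x^2
```

The central scheme has O(eps^2) truncation error, so `approx` agrees with the analytic value to high precision here.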
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.