Sparse MeZO: Less Parameters for Better Performance in Zeroth-Order LLM
Fine-Tuning
- URL: http://arxiv.org/abs/2402.15751v1
- Date: Sat, 24 Feb 2024 07:22:04 GMT
- Title: Sparse MeZO: Less Parameters for Better Performance in Zeroth-Order LLM
Fine-Tuning
- Authors: Yong Liu, Zirui Zhu, Chaoyu Gong, Minhao Cheng, Cho-Jui Hsieh and Yang
You
- Abstract summary: This paper introduces Sparse MeZO, a memory-efficient zeroth-order optimization approach that applies ZO only to a carefully chosen subset of parameters.
We show that Sparse-MeZO consistently improves both performance and convergence speed over MeZO without any overhead.
- Score: 67.44661423463927
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While fine-tuning large language models (LLMs) for specific tasks often
yields impressive results, it comes at the cost of memory inefficiency due to
back-propagation in gradient-based training. Memory-efficient Zeroth-order
(MeZO) optimizers, recently proposed to address this issue, only require
forward passes during training, making them more memory-friendly. However, the
quality of gradient estimates in zeroth-order optimization often depends on the
data dimensionality, potentially explaining why MeZO still exhibits significant
performance drops compared to standard fine-tuning across various tasks.
Inspired by the success of Parameter-Efficient Fine-Tuning (PEFT), this paper
introduces Sparse MeZO, a novel memory-efficient zeroth-order optimization
approach that applies ZO only to a carefully chosen subset of parameters. We
propose a simple yet effective parameter selection scheme that yields
significant performance gains with Sparse-MeZO. Additionally, we develop a
memory-optimized implementation for sparse masking, ensuring the algorithm
requires only inference-level memory consumption, allowing Sparse-MeZO to
fine-tune LLaMA-30b on a single A100 GPU. Experimental results illustrate that
Sparse-MeZO consistently improves both performance and convergence speed over
MeZO without any overhead. For example, it achieves a 9% absolute accuracy
improvement and 3.5x speedup over MeZO on the RTE task.
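
The idea in the abstract, applying the zeroth-order update only to a chosen subset of parameters, can be sketched on a toy problem. This is a minimal illustration, not the authors' implementation: the quadratic `loss`, the helper names `magnitude_mask` and `sparse_zo_step`, and the rule of keeping the smallest-magnitude entries are all assumptions made for the example, since the abstract does not spell out the selection scheme.

```python
import random

def loss(theta):
    # Toy quadratic loss standing in for one forward pass of a model.
    return sum((t - 1.0) ** 2 for t in theta)

def magnitude_mask(theta, keep_fraction):
    # Hypothetical selection rule (an assumption for this sketch):
    # mark the smallest-magnitude entries as trainable.
    k = max(1, int(len(theta) * keep_fraction))
    order = sorted(range(len(theta)), key=lambda i: abs(theta[i]))
    chosen = set(order[:k])
    return [1.0 if i in chosen else 0.0 for i in range(len(theta))]

def sparse_zo_step(theta, mask, lr, eps, rng):
    # Sample a Gaussian direction and zero it outside the mask.
    z = [m * rng.gauss(0.0, 1.0) for m in mask]
    plus = [t + eps * zi for t, zi in zip(theta, z)]
    minus = [t - eps * zi for t, zi in zip(theta, z)]
    # Two forward passes give a scalar estimate of the directional derivative.
    g = (loss(plus) - loss(minus)) / (2.0 * eps)
    # Only masked coordinates move, because z is zero elsewhere.
    return [t - lr * g * zi for t, zi in zip(theta, z)]

rng = random.Random(0)
theta0 = [0.0, 0.1, -0.2, 3.0, -4.0, 5.0]
mask = magnitude_mask(theta0, keep_fraction=0.5)
theta = list(theta0)
for _ in range(200):
    theta = sparse_zo_step(theta, mask, lr=0.05, eps=1e-3, rng=rng)
```

Because the perturbation is zero outside the mask, the unselected coordinates are never touched, and the gradient estimate only has to cover the lower-dimensional selected subspace.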
Related papers
- Simultaneous Computation and Memory Efficient Zeroth-Order Optimizer for Fine-Tuning Large Language Models [33.911521719528686]
Fine-tuning is powerful for adapting large language models to downstream tasks, but it often results in huge memory usage.
A promising approach is to use Zeroth-Order (ZO) optimizers, which estimate gradients to replace First-Order (FO) gradients.
We introduce a novel layer-wise sparse computation and memory efficient ZO, named LeZO.
arXiv Detail & Related papers (2024-10-13T12:47:37Z)
- Zeroth-Order Fine-Tuning of LLMs in Random Subspaces [66.27334633749734]
As language models grow in size, memory demands for backpropagation increase.
Zeroth-order (ZO) optimization methods offer a memory-efficient alternative.
We show that SubZero enhances fine-tuning and achieves faster results compared to standard ZO approaches.
arXiv Detail & Related papers (2024-10-11T17:01:43Z)
- Addax: Utilizing Zeroth-Order Gradients to Improve Memory Efficiency and Performance of SGD for Fine-Tuning Language Models [35.84667536915878]
This paper introduces Addax, a novel method that improves both memory efficiency and performance of IP-SGD by integrating it with MeZO.
In our experiments, Addax consistently outperforms MeZO regarding accuracy and convergence speed while having a comparable memory footprint.
arXiv Detail & Related papers (2024-10-09T00:49:08Z)
- Zeroth-Order Fine-Tuning of LLMs with Extreme Sparsity [66.67596152389591]
Zeroth-order optimization (ZO) is a memory-efficient strategy for fine-tuning Large Language Models.
In this study, we investigate the feasibility of fine-tuning an extremely small subset of LLM parameters using ZO.
Our results demonstrate that fine-tuning the 0.1% sensitive parameters of the LLM with ZO can outperform full ZO fine-tuning.
arXiv Detail & Related papers (2024-06-05T04:07:35Z)
- Variance-reduced Zeroth-Order Methods for Fine-Tuning Language Models [17.027512781038617]
Zeroth-order (ZO) optimization methods can leverage memory-efficient forward passes to estimate gradients.
MeZO, an adaptation of ZO-SGD, has been shown to consistently outperform zero-shot and in-context learning.
MeZO-SVRG significantly reduces the required memory footprint compared to first-order SGD.
arXiv Detail & Related papers (2024-04-11T18:35:49Z)
- Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark [166.40879020706151]
This paper proposes a shift towards BP-free, zeroth-order (ZO) optimization as a solution for reducing memory costs during fine-tuning.
Unlike traditional ZO-SGD methods, our work expands the exploration to a wider array of ZO optimization techniques.
Our study unveils previously overlooked optimization principles, highlighting the importance of task alignment, the role of the forward gradient method, and the balance between algorithm complexity and fine-tuning performance.
arXiv Detail & Related papers (2024-02-18T14:08:48Z)
- AdaLomo: Low-memory Optimization with Adaptive Learning Rate [59.64965955386855]
We introduce low-memory optimization with adaptive learning rate (AdaLomo) for large language models.
AdaLomo achieves results on par with AdamW while significantly reducing memory requirements, thereby lowering the hardware barrier to training large language models.
arXiv Detail & Related papers (2023-10-16T09:04:28Z)
- Fine-Tuning Language Models with Just Forward Passes [92.04219196752007]
Fine-tuning language models (LMs) has yielded success on diverse downstream tasks, but as LMs grow in size, backpropagation requires a large amount of memory.
We propose a memory-efficient zeroth-order optimizer (MeZO) that operates in-place, thereby fine-tuning LMs with the same memory footprint as inference.
arXiv Detail & Related papers (2023-05-27T02:28:10Z)
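
The in-place operation mentioned in the MeZO entry above can be illustrated with a small sketch. It assumes one common way to avoid storing the random direction: regenerate it from a seed each time it is needed and modify the parameters in place. The names `perturb_in_place`, `mezo_step`, and `toy_loss` are hypothetical and chosen for this example only; the quadratic loss stands in for a model forward pass.

```python
import random

def perturb_in_place(theta, seed, scale):
    # Replay the same Gaussian direction from the seed instead of storing it.
    rng = random.Random(seed)
    for i in range(len(theta)):
        theta[i] += scale * rng.gauss(0.0, 1.0)

def mezo_step(theta, loss_fn, lr, eps, seed):
    perturb_in_place(theta, seed, eps)          # theta + eps * z
    loss_plus = loss_fn(theta)
    perturb_in_place(theta, seed, -2.0 * eps)   # theta - eps * z
    loss_minus = loss_fn(theta)
    perturb_in_place(theta, seed, eps)          # back to theta, up to rounding
    # Scalar estimate of the directional derivative along z.
    g = (loss_plus - loss_minus) / (2.0 * eps)
    perturb_in_place(theta, seed, -lr * g)      # theta - lr * g * z
    return g

def toy_loss(theta):
    # Toy quadratic loss standing in for a forward pass.
    return sum((t - 1.0) ** 2 for t in theta)

params = [0.0, 2.0, -1.0, 0.5]
start_loss = toy_loss(params)
params_id = id(params)
for step in range(300):
    mezo_step(params, toy_loss, lr=0.05, eps=1e-3, seed=step)
end_loss = toy_loss(params)
```

Only the parameter list itself and a few scalars are live at any time, which is the sense in which the memory footprint matches inference in this sketch.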