Fine-Tuning Language Models with Just Forward Passes
- URL: http://arxiv.org/abs/2305.17333v3
- Date: Thu, 11 Jan 2024 13:56:52 GMT
- Title: Fine-Tuning Language Models with Just Forward Passes
- Authors: Sadhika Malladi, Tianyu Gao, Eshaan Nichani, Alex Damian, Jason D.
Lee, Danqi Chen, Sanjeev Arora
- Abstract summary: Fine-tuning language models (LMs) has yielded success on diverse downstream tasks, but as LMs grow in size, backpropagation requires a large amount of memory.
We propose a memory-efficient zeroth-order optimizer (MeZO) that operates in-place, thereby fine-tuning LMs with the same memory footprint as inference.
- Score: 92.04219196752007
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fine-tuning language models (LMs) has yielded success on diverse downstream
tasks, but as LMs grow in size, backpropagation requires a prohibitively large
amount of memory. Zeroth-order (ZO) methods can in principle estimate gradients
using only two forward passes but are theorized to be catastrophically slow for
optimizing large models. In this work, we propose a memory-efficient
zeroth-order optimizer (MeZO), adapting the classical ZO-SGD method to operate
in-place, thereby fine-tuning LMs with the same memory footprint as inference.
For example, with a single A100 80GB GPU, MeZO can train a 30-billion parameter
model, whereas fine-tuning with backpropagation can train only a 2.7B LM with
the same budget. We conduct comprehensive experiments across model types
(masked and autoregressive LMs), model scales (up to 66B), and downstream tasks
(classification, multiple-choice, and generation). Our results demonstrate that
(1) MeZO significantly outperforms in-context learning and linear probing; (2)
MeZO achieves comparable performance to fine-tuning with backpropagation across
multiple tasks, with up to 12x memory reduction and up to 2x GPU-hour reduction
in our implementation; (3) MeZO is compatible with both full-parameter and
parameter-efficient tuning techniques such as LoRA and prefix tuning; (4) MeZO
can effectively optimize non-differentiable objectives (e.g., maximizing
accuracy or F1). We support our empirical findings with theoretical insights,
highlighting how adequate pre-training and task prompts enable MeZO to
fine-tune huge models, despite classical ZO analyses suggesting otherwise.
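As a concrete illustration, here is a minimal sketch of the in-place ZO-SGD (SPSA) step the abstract describes: perturb the parameters with Gaussian noise regenerated from a saved random seed, take two forward passes, and apply an SGD update scaled by the resulting scalar projected gradient. The PyTorch framing, the function names (perturb, zo_step), and the hyperparameter values are illustrative assumptions, not the authors' reference implementation.

```python
import torch

def perturb(params, seed, scale):
    # Add scale * z to every parameter in place, where z ~ N(0, I) is
    # regenerated from `seed` so the noise never has to be stored.
    torch.manual_seed(seed)
    for p in params:
        z = torch.randn_like(p)
        p.data.add_(scale * z)

@torch.no_grad()
def zo_step(model, loss_fn, batch, epsilon=1e-3, lr=1e-6, seed=0):
    # One MeZO-style step; a fresh `seed` should be sampled for every step.
    params = [p for p in model.parameters() if p.requires_grad]

    perturb(params, seed, +epsilon)        # theta + eps * z
    loss_plus = loss_fn(model, batch)      # first forward pass
    perturb(params, seed, -2 * epsilon)    # theta - eps * z
    loss_minus = loss_fn(model, batch)     # second forward pass
    perturb(params, seed, +epsilon)        # restore theta

    # Two-point (SPSA) estimate of the directional derivative along z.
    projected_grad = (loss_plus - loss_minus) / (2 * epsilon)

    # In-place SGD update, regenerating the same z from the seed.
    torch.manual_seed(seed)
    for p in params:
        z = torch.randn_like(p)
        p.data.add_(-lr * projected_grad * z)
    return loss_plus
```

Because the step consumes only loss values, the same routine also applies to the non-differentiable objectives mentioned in finding (4), e.g. by swapping the loss for negative accuracy or negative F1.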
Related papers
- Simultaneous Computation and Memory Efficient Zeroth-Order Optimizer for Fine-Tuning Large Language Models [33.911521719528686]
Fine-tuning is powerful for adapting large language models to downstream tasks, but it often results in huge memory usage.
A promising approach uses Zeroth-Order (ZO) gradient estimates in place of First-Order (FO) gradients.
We introduce LeZO, a novel layer-wise sparse, computation- and memory-efficient ZO optimizer.
arXiv Detail & Related papers (2024-10-13T12:47:37Z) - Zeroth-Order Fine-Tuning of LLMs in Random Subspaces [66.27334633749734]
As language models grow in size, memory demands for backpropagation increase.
Zeroth-order (ZO) optimization methods offer a memory-efficient alternative.
We show that SubZero enhances fine-tuning and achieves results faster than standard ZO approaches.
arXiv Detail & Related papers (2024-10-11T17:01:43Z) - Enabling Efficient On-Device Fine-Tuning of LLMs Using Only Inference Engines [17.539008562641303]
Large Language Models (LLMs) are currently pre-trained and fine-tuned on large cloud servers.
The next frontier is LLM personalization, where a foundation model can be fine-tuned with user/task-specific data.
Fine-tuning on resource-constrained edge devices presents significant challenges due to substantial memory and computational demands.
arXiv Detail & Related papers (2024-09-23T20:14:09Z) - AdaZeta: Adaptive Zeroth-Order Tensor-Train Adaption for Memory-Efficient Large Language Models Fine-Tuning [22.950914612765494]
Fine-tuning large language models (LLMs) has achieved remarkable performance across various natural language processing tasks.
Memory-efficient Zeroth-order (MeZO) methods attempt to fine-tune LLMs using only forward passes, thereby avoiding the need for a backpropagation graph.
We propose the Adaptive Zeroth-order Tensor-Train Adaption (AdaZeta) framework, specifically designed to improve the performance and convergence of ZO methods.
arXiv Detail & Related papers (2024-06-26T04:33:13Z) - Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient [57.9629676017527]
We propose an optimization-based structural pruning method for Large Language Models.
We learn the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model.
Our method operates for 2.7 hours with around 35GB memory for the 13B models on a single A100 GPU.
arXiv Detail & Related papers (2024-06-15T09:31:03Z) - Variance-reduced Zeroth-Order Methods for Fine-Tuning Language Models [17.027512781038617]
Zeroth-order (ZO) optimization methods can leverage memory-efficient forward passes to estimate gradients.
MeZO, an adaptation of ZO-SGD, has been shown to consistently outperform zero-shot and in-context learning.
MeZO-SVRG significantly reduces the required memory footprint compared to first-order SGD.
arXiv Detail & Related papers (2024-04-11T18:35:49Z) - Sparse MeZO: Less Parameters for Better Performance in Zeroth-Order LLM Fine-Tuning [67.44661423463927]
This paper introduces Sparse MeZO, a memory-efficient zeroth-order optimization approach that applies ZO only to a carefully chosen subset of parameters (a sketch of this masking idea appears after this list).
We show that Sparse MeZO consistently improves both performance and convergence speed over MeZO without any overhead.
arXiv Detail & Related papers (2024-02-24T07:22:04Z) - Scaling Sparse Fine-Tuning to Large Language Models [67.59697720719672]
Large Language Models (LLMs) are difficult to fully fine-tune due to their sheer number of parameters.
We propose SpIEL, a novel sparse fine-tuning method which maintains an array of parameter indices and the deltas of these parameters relative to their pretrained values.
We show that SpIEL is superior to popular parameter-efficient fine-tuning methods like LoRA in terms of performance and comparable in terms of run time.
arXiv Detail & Related papers (2024-01-29T18:43:49Z) - CPM-2: Large-scale Cost-effective Pre-trained Language Models [71.59893315671997]
We present a suite of cost-effective techniques for using PLMs that address the efficiency issues of pre-training, fine-tuning, and inference.
We introduce knowledge inheritance to accelerate the pre-training process by exploiting existing PLMs instead of training models from scratch.
We implement a new inference toolkit, namely InfMoE, for using large-scale PLMs with limited computational resources.
arXiv Detail & Related papers (2021-06-20T15:43:54Z)
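Several of the entries above (LeZO, Sparse MeZO) restrict the zeroth-order perturbation and update to a chosen subset of parameters. The fragment below is only a hedged sketch of that masking idea, reusing the structure of the perturb helper from the earlier example; the mask-selection policy, which is the actual contribution of those papers, is stubbed out with a random placeholder (random_masks).

```python
import torch

def random_masks(params, sparsity=0.75):
    # Placeholder policy: keep ~25% of entries at random. The cited papers
    # choose masks layer-wise or per-parameter; that logic is not reproduced here.
    return [(torch.rand_like(p) > sparsity).float() for p in params]

def sparse_perturb(params, masks, seed, scale):
    # Same in-place perturbation as before, but only on the masked entries,
    # so unmasked parameters are neither perturbed nor updated.
    torch.manual_seed(seed)
    for p, m in zip(params, masks):
        z = torch.randn_like(p) * m
        p.data.add_(scale * z)
```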