Related papers: Zeroth-Order Fine-Tuning of LLMs in Random Subspaces

Related papers

$\ abla$-Reasoner: LLM Reasoning via Test-Time Gradient Descent in Latent Space [71.23672814629448]
$nabla$-Reasoner is an iterative generation framework that integrates differentiable optimization over token logits into the decoding loop.<n>$nabla$-Reasoner achieves over 20% accuracy improvement on a challenging mathematical reasoning benchmark.
arXiv Detail & Related papers (2026-03-05T08:42:54Z)
Zero-Order Optimization for LLM Fine-Tuning via Learnable Direction Sampling [40.94400211806987]
We propose a policy-driven ZO framework that treats the sampling distribution over perturbation directions as a learnable policy.<n>We show that learned sampling improves quality gradient information and relax the explicit dependence on $d$ in convergence bounds.<n>Our results suggest that adaptive direction sampling is a promising route to make ZO fine-tuning viable at scale.
arXiv Detail & Related papers (2026-02-14T08:01:41Z)
Towards Fast LLM Fine-tuning through Zeroth-Order Optimization with Projected Gradient-Aligned Perturbations [23.409093103129706]
Fine-tuning large language models (LLMs) using zeroth-order (ZO) optimization has emerged as a promising alternative to traditional gradient-based methods.<n>Existing ZO methods suffer from high variance in gradient estimation, leading to slow convergence and suboptimal performance on large-scale models.<n>We propose P-GAP, a fast LLM fine-tuning approach through zeroth-order optimization with Projected Gradient-Aligned Perturbations.
arXiv Detail & Related papers (2025-10-21T02:19:11Z)
Leveraging Coordinate Momentum in SignSGD and Muon: Memory-Optimized Zero-Order [38.99428012275441]
Fine-tuning Large Language Models (LLMs) is essential for adapting pre-trained models to downstream tasks.<n>Traditional first-order algorithms incur prohibitive memory and computational costs that scale poorly with model size.<n>We propose zero-order (ZO) optimization methods as a memory- and compute-efficient alternative.
arXiv Detail & Related papers (2025-06-04T20:27:17Z)
LORENZA: Enhancing Generalization in Low-Rank Gradient LLM Training via Efficient Zeroth-Order Adaptive SAM [13.180761892449736]
We study robust parameter-efficient fine-tuning (PEFT) techniques for large-language models (LLMs) We present a new highly computationally efficient framework called AdaZo-SAM, combining Adam and Sharpness-Aware Minimization (SAM) We also design a low-rank gradient optimization method named LORENZA, which is a memory-efficient version of AdaZo-SAM.
arXiv Detail & Related papers (2025-02-26T21:30:34Z)
Training Deep Learning Models with Norm-Constrained LMOs [56.00317694850397]
We propose a new family of algorithms that uses the linear minimization oracle (LMO) to adapt to the geometry of the problem.<n>We demonstrate significant speedups on nanoGPT training using our algorithm, Scion, without any reliance on Adam.
arXiv Detail & Related papers (2025-02-11T13:10:34Z)
I3S: Importance Sampling Subspace Selection for Low-Rank Optimization in LLM Pretraining [50.89661053183944]
Low-rank optimization has emerged as a promising approach to enabling memory-efficient training of large language models (LLMs) Existing low-rank optimization methods typically project gradients onto a low-rank subspace, reducing the memory cost of storing states. We propose importance sampling subspace selection (I3S) for low-rank optimization, which theoretically offers a comparable convergence rate to the dominant subspace approach.
arXiv Detail & Related papers (2025-02-09T06:30:19Z)
Harmony in Divergence: Towards Fast, Accurate, and Memory-efficient Zeroth-order LLM Fine-tuning [37.507489928116804]
Large language models (LLMs) excel across various tasks, but standard first-order (FO) fine-tuning demands considerable memory. We introduce a novel layer-wise divergence analysis that uncovers the distinct update pattern of FO and ZO optimization. We propose textbfDivergence-driven textbfZeroth-textbfOrder (textbfDiZO) optimization.
arXiv Detail & Related papers (2025-02-05T16:03:17Z)
TeZO: Empowering the Low-Rankness on the Temporal Dimension in the Zeroth-Order Optimization for Fine-tuning LLMs [58.19080159470868]
We propose a novel low-rank ZO estimator, TeZO, which captures the low-rankness across both the model and temporal dimension. Specifically, we represent ZO perturbations along the temporal dimension as a 3D tensor and employ Canonical Polyadic Decomposition (CPD) to extract each low-rank 2D matrix.
arXiv Detail & Related papers (2025-01-31T11:34:03Z)
HELENE: Hessian Layer-wise Clipping and Gradient Annealing for Accelerating Fine-tuning LLM with Zeroth-order Optimization [18.00873866263434]
Fine-tuning large language models (LLMs) poses significant memory challenges. Recent work, MeZO, addresses this issue using a zeroth-order (ZO) optimization method. We introduce HELENE, a novel scalable and memory-efficient pre-conditioner.
arXiv Detail & Related papers (2024-11-16T04:27:22Z)
Simultaneous Computation and Memory Efficient Zeroth-Order Optimizer for Fine-Tuning Large Language Models [33.911521719528686]
Fine-tuning is powerful for adapting large language models to downstream tasks, but it often results in huge memory usages. A promising approach is using Zeroth-Order (ZO) gradients, which estimates to replace First-Order (FO) gradients. We introduce a novel layer-wise sparse computation and memory efficient ZO, named LeZO.
arXiv Detail & Related papers (2024-10-13T12:47:37Z)
Enhancing Zeroth-order Fine-tuning for Language Models with Low-rank Structures [21.18741772731095]
Zeroth-order (ZO) algorithms offer a promising alternative by approximating gradients using finite differences of function values. Existing ZO methods struggle to capture the low-rank gradient structure common in LLM fine-tuning, leading to suboptimal performance. This paper proposes a low-rank ZO algorithm (LOZO) that effectively captures this structure in LLMs.
arXiv Detail & Related papers (2024-10-10T08:10:53Z)
AdaZeta: Adaptive Zeroth-Order Tensor-Train Adaption for Memory-Efficient Large Language Models Fine-Tuning [22.950914612765494]
Fine-tuning large language models (LLMs) has achieved remarkable performance across various natural language processing tasks. Memory-efficient Zeroth-order (MeZO) methods attempt to fine-tune LLMs using only forward passes, thereby avoiding the need for a backpropagation graph. We propose the Adaptive Zeroth-order-Train Adaption (AdaZeta) framework, specifically designed to improve the performance and convergence of the ZO methods.
arXiv Detail & Related papers (2024-06-26T04:33:13Z)
Multi-Reference Preference Optimization for Large Language Models [56.84730239046117]
We introduce a novel closed-form formulation for direct preference optimization using multiple reference models. The resulting algorithm, Multi-Reference Preference Optimization (MRPO), leverages broader prior knowledge from diverse reference models. Our experiments demonstrate that LLMs finetuned with MRPO generalize better in various preference data, regardless of data scarcity or abundance.
arXiv Detail & Related papers (2024-05-26T00:29:04Z)
Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark [166.40879020706151]
This paper proposes a shift towards BP-free, zeroth-order (ZO) optimization as a solution for reducing memory costs during fine-tuning. Unlike traditional ZO-SGD methods, our work expands the exploration to a wider array of ZO optimization techniques. Our study unveils previously overlooked optimization principles, highlighting the importance of task alignment, the role of the forward gradient method, and the balance between algorithm complexity and fine-tuning performance.
arXiv Detail & Related papers (2024-02-18T14:08:48Z)
Differentially Private Zeroth-Order Methods for Scalable Large Language Model Finetuning [0.0]
differentially private (DP) fine-tuning of pretrained LLMs has been widely used to safeguarding the privacy of task-specific datasets.<n>Despite pushing the scalability of DP-SGD to its limit, DP-SGD-based fine-tuning methods are unfortunately limited by the inherent inefficiency of SGD.
arXiv Detail & Related papers (2024-02-12T17:24:15Z)
Scaling Sparse Fine-Tuning to Large Language Models [67.59697720719672]
Large Language Models (LLMs) are difficult to fully fine-tune due to their sheer number of parameters. We propose SpIEL, a novel sparse finetuning method which maintains an array of parameter indices and the deltas of these parameters relative to their pretrained values. We show that SpIEL is superior to popular parameter-efficient fine-tuning methods like LoRA in terms of performance and comparable in terms of run time.
arXiv Detail & Related papers (2024-01-29T18:43:49Z)
AdaLomo: Low-memory Optimization with Adaptive Learning Rate [59.64965955386855]
We introduce low-memory optimization with adaptive learning rate (AdaLomo) for large language models. AdaLomo results on par with AdamW, while significantly reducing memory requirements, thereby lowering the hardware barrier to training large language models.
arXiv Detail & Related papers (2023-10-16T09:04:28Z)
Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose. We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)

This list is automatically generated from the titles and abstracts of the papers in this site.