Low-Rank Curvature for Zeroth-Order Optimization in LLM Fine-Tuning
- URL: http://arxiv.org/abs/2511.07971v1
- Date: Wed, 12 Nov 2025 01:31:50 GMT
- Title: Low-Rank Curvature for Zeroth-Order Optimization in LLM Fine-Tuning
- Authors: Hyunseok Seung, Jaewoo Lee, Hyunsuk Ko,
- Abstract summary: We introduce LOREN, a curvature-aware zeroth-order (ZO) optimization method for fine-tuning large language models (LLMs)<n>Existing ZO methods, which estimate gradients via finite differences using random perturbations, often suffer from high variance and suboptimal search directions.<n>Our approach addresses these challenges by: (i) adaptively estimating an anisotropic perturbation distribution for gradient estimation, (ii) capturing curvature through a low-rank block diagonal preconditioner, and (iii) applying a REINFORCE leave-one-out (RLOO) gradient estimator to reduce variance.
- Score: 8.349781300731225
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce LOREN, a curvature-aware zeroth-order (ZO) optimization method for fine-tuning large language models (LLMs). Existing ZO methods, which estimate gradients via finite differences using random perturbations, often suffer from high variance and suboptimal search directions. Our approach addresses these challenges by: (i) reformulating the problem of gradient preconditioning as that of adaptively estimating an anisotropic perturbation distribution for gradient estimation, (ii) capturing curvature through a low-rank block diagonal preconditioner using the framework of natural evolution strategies, and (iii) applying a REINFORCE leave-one-out (RLOO) gradient estimator to reduce variance. Experiments on standard LLM benchmarks show that our method outperforms state-of-the-art ZO methods by achieving higher accuracy and faster convergence, while cutting peak memory usage by up to 27.3% compared with MeZO-Adam.
Related papers
- Powering Up Zeroth-Order Training via Subspace Gradient Orthogonalization [40.95701844244596]
We show that ZO optimization can be substantially improved by unifying two complementary principles.<n>We instantiate in a new method, ZO-Muon, admitting a natural interpretation as a low-rank Muon in the ZO setting.
arXiv Detail & Related papers (2026-02-19T08:08:33Z) - SVD-Preconditioned Gradient Descent Method for Solving Nonlinear Least Squares Problems [27.21342746802453]
This paper introduces a novel optimization algorithm designed for nonlinear least-squares problems.<n>The method is derived by preconditioning the gradient descent direction using the Singular Value Decomposition (SVD) of the Jacobian.<n>We establish the local linear convergence of the proposed method under standard regularity assumptions and prove global convergence for a modified version of the algorithm under suitable conditions.
arXiv Detail & Related papers (2026-02-07T18:53:00Z) - Prior-Informed Zeroth-Order Optimization with Adaptive Direction Alignment for Memory-Efficient LLM Fine-Tuning [4.278794376089146]
We propose a plug-and-play method that incorporates prior-informed perturbations to refine gradient estimation.<n>Our method significantly accelerates convergence compared to standard ZO approaches.<n>We prove that our gradient estimator achieves stronger alignment with the true gradient direction.
arXiv Detail & Related papers (2026-01-08T08:27:15Z) - Tuning-Free Sampling via Optimization on the Space of Probability Measures [14.429970585649945]
tuning-free sampling algorithms for gradient-based sampling algorithms obtained as time-discretizations of Wasserstein gradient flows.<n>We establish strong theoretical guarantees for our approach. In particular, we recover the convergence rate of optimally tuned versions of these algorithms up to logarithmic factors.<n>Across a variety of tasks, our algorithms achieve similar performance to the optimal performance of existing algorithms, with no need to tune a step size parameter.
arXiv Detail & Related papers (2025-10-29T09:29:39Z) - On the Optimal Construction of Unbiased Gradient Estimators for Zeroth-Order Optimization [57.179679246370114]
A potential limitation of existing methods is the bias inherent in most perturbation estimators unless a stepsize is proposed.<n>We propose a novel family of unbiased gradient scaling estimators that eliminate bias while maintaining favorable construction.
arXiv Detail & Related papers (2025-10-22T18:25:43Z) - Towards Fast LLM Fine-tuning through Zeroth-Order Optimization with Projected Gradient-Aligned Perturbations [23.409093103129706]
Fine-tuning large language models (LLMs) using zeroth-order (ZO) optimization has emerged as a promising alternative to traditional gradient-based methods.<n>Existing ZO methods suffer from high variance in gradient estimation, leading to slow convergence and suboptimal performance on large-scale models.<n>We propose P-GAP, a fast LLM fine-tuning approach through zeroth-order optimization with Projected Gradient-Aligned Perturbations.
arXiv Detail & Related papers (2025-10-21T02:19:11Z) - From Noisy Traces to Stable Gradients: Bias-Variance Optimized Preference Optimization for Aligning Large Reasoning Models [90.45197506653341]
Large reasoning models generate intermediate reasoning traces before producing final answers.<n> aligning LRMs with human preferences, a crucial prerequisite for model deployment, remains underexplored.<n>A common workaround optimized a single sampled trajectory, which introduces substantial gradient variance from trace sampling.
arXiv Detail & Related papers (2025-10-06T17:58:01Z) - Gradient Normalization Provably Benefits Nonconvex SGD under Heavy-Tailed Noise [60.92029979853314]
We investigate the roles of gradient normalization and clipping in ensuring the convergence of Gradient Descent (SGD) under heavy-tailed noise.
Our work provides the first theoretical evidence demonstrating the benefits of gradient normalization in SGD under heavy-tailed noise.
We introduce an accelerated SGD variant incorporating gradient normalization and clipping, further enhancing convergence rates under heavy-tailed noise.
arXiv Detail & Related papers (2024-10-21T22:40:42Z) - Zeroth-Order Fine-Tuning of LLMs in Random Subspaces [63.10833446782114]
As language models grow in size, memory demands for backpropagation increase.<n>Zeroth-order (ZO) optimization methods offer a memory-efficient alternative.<n>In this paper, we propose Subspace Zero-order optimization to address the challenges posed by posed by high dimensionality perturbations.
arXiv Detail & Related papers (2024-10-11T17:01:43Z) - Differentially Private Optimization with Sparse Gradients [60.853074897282625]
We study differentially private (DP) optimization problems under sparsity of individual gradients.
Building on this, we obtain pure- and approximate-DP algorithms with almost optimal rates for convex optimization with sparse gradients.
arXiv Detail & Related papers (2024-04-16T20:01:10Z) - Signal Processing Meets SGD: From Momentum to Filter [6.751292200515355]
In deep learning, gradient descent (SGD) and its momentum-based variants are widely used for optimization.<n>In this paper, we analyze gradient behavior through a signal processing lens, isolating key factors that influence updates.<n>We introduce a novel method SGDF based on Wiener Filter principles, which derives an optimal time-varying gain to refine updates.
arXiv Detail & Related papers (2023-11-06T01:41:46Z) - Model-Based Reparameterization Policy Gradient Methods: Theory and
Practical Algorithms [88.74308282658133]
Reization (RP) Policy Gradient Methods (PGMs) have been widely adopted for continuous control tasks in robotics and computer graphics.
Recent studies have revealed that, when applied to long-term reinforcement learning problems, model-based RP PGMs may experience chaotic and non-smooth optimization landscapes.
We propose a spectral normalization method to mitigate the exploding variance issue caused by long model unrolls.
arXiv Detail & Related papers (2023-10-30T18:43:21Z) - Differentiable Annealed Importance Sampling and the Perils of Gradient
Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.