Related papers: Powering Up Zeroth-Order Training via Subspace Gradient Orthogonalization

URL: http://arxiv.org/abs/2602.17155v2
Date: Sat, 21 Feb 2026 09:07:11 GMT
Title: Powering Up Zeroth-Order Training via Subspace Gradient Orthogonalization
Authors: Yicheng Lang, Changsheng Wang, Yihua Zhang, Mingyi Hong, Zheng Zhang, Wotao Yin, Sijia Liu
Abstract summary: We show that ZO optimization can be substantially improved by unifying two complementary principles. We instantiate the resulting framework in a new method, ZO-Muon, which admits a natural interpretation as a low-rank Muon optimizer in the ZO setting.
Score: 40.95701844244596
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Zeroth-order (ZO) optimization provides a gradient-free alternative to first-order (FO) methods by estimating gradients via finite differences of function evaluations, and has recently emerged as a memory-efficient paradigm for fine-tuning large-scale models by avoiding backpropagation. However, ZO optimization has a fundamental tension between accuracy and query efficiency. In this work, we show that ZO optimization can be substantially improved by unifying two complementary principles: (i) a projection-based subspace view that reduces gradient estimation variance by exploiting the intrinsic low-rank structure of model updates, and (ii) Muon-style spectral optimization that applies gradient orthogonalization to extract informative spectral structure from noisy ZO gradients. These findings form a unified framework of subspace gradient orthogonalization, which we instantiate in a new method, ZO-Muon, admitting a natural interpretation as a low-rank Muon optimizer in the ZO setting. Extensive experiments on large language models (LLMs) and vision transformers (ViTs) demonstrate that ZO-Muon significantly accelerates convergence and achieves a win-win improvement in accuracy and query/runtime efficiency. Notably, compared to the popular MeZO baseline, ZO-Muon requires only 24.7% of the queries to reach the same SST-2 performance for LLM fine-tuning, and improves accuracy by 25.1% on ViT-B fine-tuning on CIFAR-100.
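The abstract names two ingredients without giving formulas. As a hedged reference point, the block below writes out the standard two-point finite-difference estimator (the form used by MeZO-style methods) and the orthogonalization map usually associated with Muon. The symbols f, x, \mu, u, and G are notation assumed here, not taken from the paper, and how ZO-Muon combines the two inside a low-rank subspace is not specified on this page.

```latex
% Two-point ZO gradient estimate along a random direction u (assumed notation)
\hat{g}(x) = \frac{f(x + \mu u) - f(x - \mu u)}{2\mu}\, u,
\qquad u \sim \mathcal{N}(0, I)

% Gradient orthogonalization: keep the singular vectors, set singular values to 1
G = U \Sigma V^{\top}
\quad \Longrightarrow \quad
\operatorname{msign}(G) = U V^{\top}
```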

Related papers

Prior-Informed Zeroth-Order Optimization with Adaptive Direction Alignment for Memory-Efficient LLM Fine-Tuning [4.278794376089146]
We propose a plug-and-play method that incorporates prior-informed perturbations to refine gradient estimation. Our method significantly accelerates convergence compared to standard ZO approaches. We prove that our gradient estimator achieves stronger alignment with the true gradient direction.
arXiv Detail & Related papers (2026-01-08T08:27:15Z)
Low-Rank Curvature for Zeroth-Order Optimization in LLM Fine-Tuning [8.349781300731225]
We introduce LOREN, a curvature-aware zeroth-order (ZO) optimization method for fine-tuning large language models (LLMs). Existing ZO methods, which estimate gradients via finite differences using random perturbations, often suffer from high variance and suboptimal search directions. Our approach addresses these challenges by: (i) adaptively estimating an anisotropic perturbation distribution for gradient estimation, (ii) capturing curvature through a low-rank block diagonal preconditioner, and (iii) applying a REINFORCE leave-one-out (RLOO) gradient estimator to reduce variance.
arXiv Detail & Related papers (2025-11-11T08:34:09Z)
Towards Fast LLM Fine-tuning through Zeroth-Order Optimization with Projected Gradient-Aligned Perturbations [23.409093103129706]
Fine-tuning large language models (LLMs) using zeroth-order (ZO) optimization has emerged as a promising alternative to traditional gradient-based methods. Existing ZO methods suffer from high variance in gradient estimation, leading to slow convergence and suboptimal performance on large-scale models. We propose P-GAP, a fast LLM fine-tuning approach through zeroth-order optimization with Projected Gradient-Aligned Perturbations.
arXiv Detail & Related papers (2025-10-21T02:19:11Z)
KerZOO: Kernel Function Informed Zeroth-Order Optimization for Accurate and Accelerated LLM Fine-Tuning [15.81250204481401]
We introduce a kernel-function-based ZO framework aimed at mitigating gradient estimation bias. KerZOO achieves comparable or superior performance to existing ZO baselines. We show that the kernel function is an effective avenue for reducing estimation bias in ZO methods.
arXiv Detail & Related papers (2025-05-24T21:56:03Z)
Refining Adaptive Zeroth-Order Optimization at Ease [24.327161891577727]
This paper introduces Refined Adaptive Zeroth-Order Optimization (R-AdaZO). We first show the untapped variance reduction effect of the first moment estimate on ZO gradient estimation. We then refine the second moment estimate based on these variance-reduced gradient estimates to better capture the geometry of the optimization landscape.
arXiv Detail & Related papers (2025-02-03T03:10:44Z)
TeZO: Empowering the Low-Rankness on the Temporal Dimension in the Zeroth-Order Optimization for Fine-tuning LLMs [58.19080159470868]
We propose a novel low-rank ZO estimator, TeZO, which captures the low-rankness across both the model and temporal dimensions. Specifically, we represent ZO perturbations along the temporal dimension as a 3D tensor and employ Canonical Polyadic Decomposition (CPD) to extract each low-rank 2D matrix.
arXiv Detail & Related papers (2025-01-31T11:34:03Z)
Gradient Normalization Provably Benefits Nonconvex SGD under Heavy-Tailed Noise [60.92029979853314]
We investigate the roles of gradient normalization and clipping in ensuring the convergence of Stochastic Gradient Descent (SGD) under heavy-tailed noise. Our work provides the first theoretical evidence demonstrating the benefits of gradient normalization in SGD under heavy-tailed noise. We introduce an accelerated SGD variant incorporating gradient normalization and clipping, further enhancing convergence rates under heavy-tailed noise.
arXiv Detail & Related papers (2024-10-21T22:40:42Z)
Zeroth-Order Fine-Tuning of LLMs in Random Subspaces [63.10833446782114]
As language models grow in size, memory demands for backpropagation increase. Zeroth-order (ZO) optimization methods offer a memory-efficient alternative. In this paper, we propose Subspace Zero-order optimization to address the challenges posed by high-dimensional perturbations.
arXiv Detail & Related papers (2024-10-11T17:01:43Z)
Zeroth-Order Hybrid Gradient Descent: Towards A Principled Black-Box Optimization Framework [100.36569795440889]
This work is on the iteration of zeroth-order (ZO) optimization, which does not require first-order information. We show that with a graceful design in coordinate importance sampling, the proposed ZO optimization method is efficient in terms of both complexity and function query cost.
arXiv Detail & Related papers (2020-12-21T17:29:58Z)
Self-Tuning Stochastic Optimization with Curvature-Aware Gradient Filtering [53.523517926927894]
We explore the use of exact per-sample Hessian-vector products and gradients to construct self-tuning quadratics. We prove that our model-based procedure converges in the noisy-gradient setting. This is an interesting step for constructing self-tuning quadratics.
arXiv Detail & Related papers (2020-11-09T22:07:30Z)
A Primer on Zeroth-Order Optimization in Signal Processing and Machine Learning [95.85269649177336]
ZO optimization iteratively performs three major steps: gradient estimation, descent direction, and solution update. We demonstrate promising applications of ZO optimization, such as evaluating and generating explanations from black-box deep learning models, and efficient online sensor management.
arXiv Detail & Related papers (2020-06-11T06:50:35Z)
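
The primer entry above names three steps of a ZO iteration: gradient estimation, descent direction, and solution update. A minimal sketch of that loop follows, assuming a two-point Gaussian finite-difference estimator and a plain negative-gradient direction. The function names `zo_gradient` and `zo_minimize` and all default parameters are hypothetical choices for illustration and do not come from any paper listed here.

```python
import random


def zo_gradient(f, x, mu, n_queries, rng):
    """Average two-point finite-difference estimates along Gaussian directions.

    Illustrative estimator only; each query pair costs two evaluations of f.
    """
    grad = [0.0] * len(x)
    for _ in range(n_queries):
        u = [rng.gauss(0.0, 1.0) for _ in x]
        x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
        x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
        # Central difference approximates the directional derivative along u.
        slope = (f(x_plus) - f(x_minus)) / (2.0 * mu)
        for i, ui in enumerate(u):
            grad[i] += slope * ui / n_queries
    return grad


def zo_minimize(f, x0, steps=500, lr=0.05, mu=1e-3, n_queries=4, seed=0):
    """Run the three-step ZO loop using function evaluations only."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        g = zo_gradient(f, x, mu, n_queries, rng)            # 1. gradient estimation
        direction = [-gi for gi in g]                        # 2. descent direction
        x = [xi + lr * di for xi, di in zip(x, direction)]   # 3. solution update
    return x


# Usage on a toy quadratic (hypothetical objective, chosen for illustration).
target = [1.0, -2.0, 0.5]


def quadratic(x):
    return sum((xi - ti) ** 2 for xi, ti in zip(x, target))


solution = zo_minimize(quadratic, [0.0, 0.0, 0.0])
print(quadratic(solution))
```

Only step 2 changes when a method swaps in a different direction rule, such as momentum or an orthogonalized update, which is why the sketch keeps the three steps on separate lines.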

This list is automatically generated from the titles and abstracts of the papers on this site.