Robust and Efficient Zeroth-Order LLM Fine-Tuning via Adaptive Bayesian Subspace Optimizer
- URL: http://arxiv.org/abs/2601.01452v2
- Date: Mon, 12 Jan 2026 11:20:01 GMT
- Title: Robust and Efficient Zeroth-Order LLM Fine-Tuning via Adaptive Bayesian Subspace Optimizer
- Authors: Jian Feng, Zhihong Huang,
- Abstract summary: Fine-tuning large language models (LLMs) with zeroth-order (ZO) optimization reduces memory by approximating gradients through function evaluations.<n>We introduce BSZO, an adaptive textbfBayesian textbfSubspace textbfZeroth-Order textbfOptimizer.
- Score: 4.6561758107970395
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fine-tuning large language models (LLMs) with zeroth-order (ZO) optimization reduces memory by approximating gradients through function evaluations. However, existing methods essentially perform updates in a one-dimensional space, and suffer from collapse or substantial performance degradation under low-precision training. We introduce BSZO, an adaptive \textbf{B}ayesian \textbf{S}ubspace \textbf{Z}eroth-Order \textbf{O}ptimizer, which applies Kalman filtering to combine finite-difference information across multiple perturbation directions within a subspace. By treating each finite-difference measurement as a noisy observation, BSZO builds a posterior distribution over the subspace-projected gradient and updates it through Bayesian inference, with a residual-based adaptive mechanism to adapt to noise variations. Theoretical analysis shows that BSZO improves the convergence rate by a factor of $k/γ$ compared to standard ZO methods. Experiments on RoBERTa, Mistral, and OPT models show that BSZO outperforms the baselines across various tasks, achieving up to 6.67\% absolute average improvement on OPT-13B while remaining robust under fp16/bf16 precision and keeping memory usage close to inference-only baselines (1.00$\times$--1.08$\times$ of MeZO).
Related papers
- Powering Up Zeroth-Order Training via Subspace Gradient Orthogonalization [40.95701844244596]
We show that ZO optimization can be substantially improved by unifying two complementary principles.<n>We instantiate in a new method, ZO-Muon, admitting a natural interpretation as a low-rank Muon in the ZO setting.
arXiv Detail & Related papers (2026-02-19T08:08:33Z) - Zero-Order Optimization for LLM Fine-Tuning via Learnable Direction Sampling [40.94400211806987]
We propose a policy-driven ZO framework that treats the sampling distribution over perturbation directions as a learnable policy.<n>We show that learned sampling improves quality gradient information and relax the explicit dependence on $d$ in convergence bounds.<n>Our results suggest that adaptive direction sampling is a promising route to make ZO fine-tuning viable at scale.
arXiv Detail & Related papers (2026-02-14T08:01:41Z) - AGZO: Activation-Guided Zeroth-Order Optimization for LLM Fine-Tuning [8.698253005940503]
We propose Activation-Guided Zeroth-Order optimization (AGZO)<n>Unlike prior methods, AGZO extracts a compact, activation-informed subspace on the fly during the forward pass and restricts perturbations to this low-rank subspace.<n>AGZO consistently outperforms state-of-the-art ZO baselines and significantly narrows the performance gap with first-order fine-tuning.
arXiv Detail & Related papers (2026-01-24T02:28:15Z) - Low-Rank Curvature for Zeroth-Order Optimization in LLM Fine-Tuning [8.349781300731225]
We introduce LOREN, a curvature-aware zeroth-order (ZO) optimization method for fine-tuning large language models (LLMs)<n>Existing ZO methods, which estimate gradients via finite differences using random perturbations, often suffer from high variance and suboptimal search directions.<n>Our approach addresses these challenges by: (i) adaptively estimating an anisotropic perturbation distribution for gradient estimation, (ii) capturing curvature through a low-rank block diagonal preconditioner, and (iii) applying a REINFORCE leave-one-out (RLOO) gradient estimator to reduce variance.
arXiv Detail & Related papers (2025-11-11T08:34:09Z) - Estimation of Toeplitz Covariance Matrices using Overparameterized Gradient Descent [1.7188280334580195]
We revisit Toeplitz covariance estimation through the lens of simple descent (GD)<n>We show that when $K = P$, GD may converge to suboptimal solutions.<n>We propose an accelerated GD variant with separate learning rates for amplitudes and frequencies.
arXiv Detail & Related papers (2025-11-03T14:07:53Z) - On the Optimal Construction of Unbiased Gradient Estimators for Zeroth-Order Optimization [57.179679246370114]
A potential limitation of existing methods is the bias inherent in most perturbation estimators unless a stepsize is proposed.<n>We propose a novel family of unbiased gradient scaling estimators that eliminate bias while maintaining favorable construction.
arXiv Detail & Related papers (2025-10-22T18:25:43Z) - Inference-Time Scaling of Diffusion Language Models with Particle Gibbs Sampling [70.8832906871441]
We study how to steer generation toward desired rewards without retraining the models.<n>Prior methods typically resample or filter within a single denoising trajectory, optimizing rewards step-by-step without trajectory-level refinement.<n>We introduce particle Gibbs sampling for diffusion language models (PG-DLM), a novel inference-time algorithm enabling trajectory-level refinement while preserving generation perplexity.
arXiv Detail & Related papers (2025-07-11T08:00:47Z) - Elucidating Subspace Perturbation in Zeroth-Order Optimization: Theory and Practice at Scale [33.38543010618118]
Zeroth-order (ZO) optimization has emerged as a promising alternative to gradient-based backpropagation methods.<n>We show that high dimensionality is the primary bottleneck and introduce the notion of textitsubspace alignment to explain how the subspace perturbations reduce gradient noise and accelerate convergence.<n>We propose an efficient ZO method using block coordinate descent (MeZO-BCD), which perturbs and updates only a subset of parameters at each step.
arXiv Detail & Related papers (2025-01-31T12:46:04Z) - ALoRE: Efficient Visual Adaptation via Aggregating Low Rank Experts [71.91042186338163]
ALoRE is a novel PETL method that reuses the hypercomplex parameterized space constructed by Kronecker product to Aggregate Low Rank Experts.<n>Thanks to the artful design, ALoRE maintains negligible extra parameters and can be effortlessly merged into the frozen backbone.
arXiv Detail & Related papers (2024-12-11T12:31:30Z) - Zeroth-Order Fine-Tuning of LLMs in Random Subspaces [63.10833446782114]
As language models grow in size, memory demands for backpropagation increase.<n>Zeroth-order (ZO) optimization methods offer a memory-efficient alternative.<n>In this paper, we propose Subspace Zero-order optimization to address the challenges posed by posed by high dimensionality perturbations.
arXiv Detail & Related papers (2024-10-11T17:01:43Z) - Differentiable Annealed Importance Sampling and the Perils of Gradient
Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z) - Attentional-Biased Stochastic Gradient Descent [74.49926199036481]
We present a provable method (named ABSGD) for addressing the data imbalance or label noise problem in deep learning.
Our method is a simple modification to momentum SGD where we assign an individual importance weight to each sample in the mini-batch.
ABSGD is flexible enough to combine with other robust losses without any additional cost.
arXiv Detail & Related papers (2020-12-13T03:41:52Z) - ROOT-SGD: Sharp Nonasymptotics and Near-Optimal Asymptotics in a Single Algorithm [71.13558000599839]
We study the problem of solving strongly convex and smooth unconstrained optimization problems using first-order algorithms.
We devise a novel, referred to as Recursive One-Over-T SGD, based on an easily implementable, averaging of past gradients.
We prove that it simultaneously achieves state-of-the-art performance in both a finite-sample, nonasymptotic sense and an sense.
arXiv Detail & Related papers (2020-08-28T14:46:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.