Related papers: Hi-ZFO: Hierarchical Zeroth- and First-Order LLM Fine-Tuning via Importance-Guided Tensor Selection

URL: http://arxiv.org/abs/2601.05501v1
Date: Fri, 09 Jan 2026 03:20:54 GMT
Title: Hi-ZFO: Hierarchical Zeroth- and First-Order LLM Fine-Tuning via Importance-Guided Tensor Selection
Authors: Feihu Jin, Ying Tan
Abstract summary: We propose Hi-ZFO (Hierarchical Zeroth- and First-Order optimization) to synergize FO gradients with ZO estimation. We show that Hi-ZFO consistently achieves superior performance while significantly reducing the training time.
Score: 4.808936079900314
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Fine-tuning large language models (LLMs) using standard first-order (FO) optimization often drives training toward sharp, poorly generalizing minima. Conversely, zeroth-order (ZO) methods offer stronger exploratory behavior without relying on explicit gradients, yet suffer from slow convergence. More critically, our analysis reveals that in generative tasks, the vast output and search space significantly amplify estimation variance, rendering ZO methods both noisy and inefficient. To address these challenges, we propose Hi-ZFO (Hierarchical Zeroth- and First-Order optimization), a hybrid framework designed to synergize the precision of FO gradients with the exploratory capability of ZO estimation. Hi-ZFO adaptively partitions the model through layer-wise importance profiling, applying precise FO updates to critical layers while leveraging ZO optimization for less sensitive ones. Notably, ZO in Hi-ZFO is not merely a memory-saving surrogate; it is intentionally introduced as a source of "beneficial stochasticity" to help the model escape the local minima where pure FO optimization tends to stagnate. Validated across diverse generative, mathematical, and code reasoning tasks, Hi-ZFO consistently achieves superior performance while significantly reducing the training time. These results demonstrate the effectiveness of hierarchical hybrid optimization for LLM fine-tuning.
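
The hybrid idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy quadratic loss, the use of the initial gradient norm as the importance score, the two-point random-direction ZO estimator, and every name below (`loss`, `fo_grad`, `zo_grad`, `hybrid_step`, `run_demo`) are assumptions made purely for illustration. The sketch only shows the general pattern of applying exact-gradient (FO) updates to the layers ranked most important and forward-pass-only (ZO) updates to the rest.

```python
import random


def loss(params, targets, scales):
    """Toy quadratic loss over per-layer parameter lists (stand-in for an LLM loss)."""
    return sum(0.5 * s * sum((x - y) ** 2 for x, y in zip(p, t))
               for p, t, s in zip(params, targets, scales))


def fo_grad(p, t, s):
    """Exact gradient of the toy loss for one layer (stand-in for backpropagation)."""
    return [s * (x - y) for x, y in zip(p, t)]


def zo_grad(params, idx, targets, scales, rng, eps=1e-3):
    """Two-point random-direction gradient estimate for one layer (loss evaluations only)."""
    z = [rng.gauss(0.0, 1.0) for _ in params[idx]]
    plus = [list(p) for p in params]
    minus = [list(p) for p in params]
    plus[idx] = [x + eps * d for x, d in zip(params[idx], z)]
    minus[idx] = [x - eps * d for x, d in zip(params[idx], z)]
    slope = (loss(plus, targets, scales) - loss(minus, targets, scales)) / (2 * eps)
    return [slope * d for d in z]


def hybrid_step(params, fo_layers, targets, scales, rng, lr_fo=0.1, lr_zo=0.01):
    """One update: FO on layers listed in fo_layers, ZO on every other layer."""
    new_params = []
    for i, p in enumerate(params):
        if i in fo_layers:
            g, lr = fo_grad(p, targets[i], scales[i]), lr_fo
        else:
            g, lr = zo_grad(params, i, targets, scales, rng), lr_zo
        new_params.append([x - lr * gx for x, gx in zip(p, g)])
    return new_params


def run_demo(steps=200, seed=0):
    rng = random.Random(seed)
    targets = [[1.0] * 4, [1.0] * 4]
    scales = [5.0, 1.0]  # layer 0 is the more sensitive layer in this toy problem
    params = [[0.0] * 4, [0.0] * 4]
    # ASSUMPTION: importance is the initial gradient norm; the top-ranked layer gets FO.
    norms = [sum(g * g for g in fo_grad(p, t, s)) ** 0.5
             for p, t, s in zip(params, targets, scales)]
    fo_layers = {norms.index(max(norms))}
    start = loss(params, targets, scales)
    for _ in range(steps):
        params = hybrid_step(params, fo_layers, targets, scales, rng)
    return fo_layers, start, loss(params, targets, scales)


if __name__ == "__main__":
    chosen, start, final = run_demo()
    print("FO layers:", chosen, "loss:", start, "->", final)
```

In this toy setting the ZO layer needs a smaller learning rate than the FO layer because its gradient estimate is noisy, which mirrors the slow-convergence caveat in the abstract. How Hi-ZFO actually scores layers and schedules the two update types is specified in the paper, not here.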

Related papers

LFPO: Likelihood-Free Policy Optimization for Masked Diffusion Models [48.68246945083386]
Likelihood-Free Policy Optimization (LFPO) is a native framework that maps the concept of vector field flow matching to the discrete token space. LFPO formulates alignment as geometric velocity rectification, which directly optimizes denoising logits via contrastive updates. Experiments demonstrate that LFPO not only outperforms state-of-the-art baselines on code and reasoning benchmarks but also accelerates inference by approximately 20% through reduced diffusion steps.
arXiv Detail & Related papers (2026-03-02T07:42:55Z)
Prior-Informed Zeroth-Order Optimization with Adaptive Direction Alignment for Memory-Efficient LLM Fine-Tuning [4.278794376089146]
We propose a plug-and-play method that incorporates prior-informed perturbations to refine gradient estimation. Our method significantly accelerates convergence compared to standard ZO approaches. We prove that our gradient estimator achieves stronger alignment with the true gradient direction.
arXiv Detail & Related papers (2026-01-08T08:27:15Z)
Divergence Minimization Preference Optimization for Diffusion Model Alignment [66.31417479052774]
Divergence Minimization Preference Optimization (DMPO) is a principled method for aligning diffusion models by minimizing reverse KL divergence. DMPO can consistently outperform or match existing techniques across different base models and test sets.
arXiv Detail & Related papers (2025-07-10T07:57:30Z)
Harmony in Divergence: Towards Fast, Accurate, and Memory-efficient Zeroth-order LLM Fine-tuning [44.907586955452295]
Large language models (LLMs) excel across various tasks, but standard first-order (FO) fine-tuning demands considerable memory. Recently, zeroth-order (ZO) optimization stood out as a promising memory-efficient training paradigm. We introduce a novel layer-wise divergence analysis that uncovers the distinct update pattern of FO and ZO optimization. Our results demonstrate that DiZO significantly reduces the needed iterations for convergence without sacrificing throughput.
arXiv Detail & Related papers (2025-02-05T16:03:17Z)
Zeroth-Order Fine-Tuning of LLMs in Random Subspaces [63.10833446782114]
As language models grow in size, memory demands for backpropagation increase. Zeroth-order (ZO) optimization methods offer a memory-efficient alternative. In this paper, we propose Subspace Zero-order optimization to address the challenges posed by high-dimensionality perturbations.
arXiv Detail & Related papers (2024-10-11T17:01:43Z)
Enhancing Zeroth-order Fine-tuning for Language Models with Low-rank Structures [21.18741772731095]
Zeroth-order (ZO) algorithms offer a promising alternative by approximating gradients using finite differences of function values. Existing ZO methods struggle to capture the low-rank gradient structure common in LLM fine-tuning, leading to suboptimal performance. This paper proposes a low-rank ZO algorithm (LOZO) that effectively captures this structure in LLMs.
arXiv Detail & Related papers (2024-10-10T08:10:53Z)
AdaZeta: Adaptive Zeroth-Order Tensor-Train Adaption for Memory-Efficient Large Language Models Fine-Tuning [22.950914612765494]
Fine-tuning large language models (LLMs) has achieved remarkable performance across various natural language processing tasks. Memory-efficient Zeroth-order (MeZO) methods attempt to fine-tune LLMs using only forward passes, thereby avoiding the need for a backpropagation graph. We propose the Adaptive Zeroth-order Tensor-Train Adaption (AdaZeta) framework, specifically designed to improve the performance and convergence of the ZO methods.
arXiv Detail & Related papers (2024-06-26T04:33:13Z)
Discovering Preference Optimization Algorithms with and for Large Language Models [50.843710797024805]
Offline preference optimization is a key method for enhancing and controlling the quality of Large Language Model (LLM) outputs. We perform objective discovery to automatically discover new state-of-the-art preference optimization algorithms without (expert) human intervention. Experiments demonstrate the state-of-the-art performance of DiscoPOP, a novel algorithm that adaptively blends logistic and exponential losses.
arXiv Detail & Related papers (2024-06-12T16:58:41Z)
Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark [166.40879020706151]
This paper proposes a shift towards BP-free, zeroth-order (ZO) optimization as a solution for reducing memory costs during fine-tuning. Unlike traditional ZO-SGD methods, our work expands the exploration to a wider array of ZO optimization techniques. Our study unveils previously overlooked optimization principles, highlighting the importance of task alignment, the role of the forward gradient method, and the balance between algorithm complexity and fine-tuning performance.
arXiv Detail & Related papers (2024-02-18T14:08:48Z)
Conditional Denoising Diffusion for Sequential Recommendation [62.127862728308045]
Of two prominent generative models, Generative Adversarial Networks (GANs) and Variational AutoEncoders (VAEs), GANs suffer from unstable optimization, while VAEs are prone to posterior collapse and over-smoothed generations. We present a conditional denoising diffusion model, which includes a sequence encoder, a cross-attentive denoising decoder, and a step-wise diffuser.
arXiv Detail & Related papers (2023-04-22T15:32:59Z)
Debiasing a First-order Heuristic for Approximate Bi-level Optimization [38.068090269482425]
Approximate bi-level optimization (ABLO) consists of (outer-level) optimization problems involving numerical (inner-level) optimization loops. There is a lack of theoretical understanding of the convergence properties of the first-order method (FOM) heuristic. We propose an unbiased FOM enjoying constant memory complexity as a function of $r$.
arXiv Detail & Related papers (2021-06-04T13:46:48Z)

This list is automatically generated from the titles and abstracts of the papers on this site.