KerZOO: Kernel Function Informed Zeroth-Order Optimization for Accurate and Accelerated LLM Fine-Tuning
- URL: http://arxiv.org/abs/2505.18886v1
- Date: Sat, 24 May 2025 21:56:03 GMT
- Title: KerZOO: Kernel Function Informed Zeroth-Order Optimization for Accurate and Accelerated LLM Fine-Tuning
- Authors: Zhendong Mi, Qitao Tan, Xiaodong Yu, Zining Zhu, Geng Yuan, Shaoyi Huang,
- Abstract summary: We introduce a kernel-function-based ZO framework aimed at mitigating gradient estimation bias.<n>KerZOO achieves comparable or superior performance to existing ZO baselines.<n>We show that the kernel function is an effective avenue for reducing estimation bias in ZO methods.
- Score: 15.81250204481401
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have demonstrated impressive capabilities across numerous NLP tasks. Nevertheless, conventional first-order fine-tuning techniques impose heavy memory demands, creating practical obstacles to real-world applications. Zeroth-order (ZO) optimization has recently emerged as a promising memory-efficient alternative, as it circumvents the need for backpropagation by estimating gradients solely through forward passes--making it particularly suitable for resource-limited environments. Despite its efficiency, ZO optimization suffers from gradient estimation bias, which significantly hinders convergence speed. To address this, we analytically identify and characterize the lower-order bias introduced during ZO-based gradient estimation in LLM fine-tuning. Motivated by tools in mathematical physics, we introduce a kernel-function-based ZO framework aimed at mitigating this bias and improving optimization stability. KerZOO achieves comparable or superior performance to existing ZO baselines in both full-parameter and parameter-efficient fine-tuning settings of LLMs, while significantly reducing the number of iterations required to reach convergence. For example, KerZOO reduces total GPU training hours by as much as 74% and 44% on WSC and MultiRC datasets in fine-tuning OPT-2.7B model and can exceed the MeZO baseline by 2.9% and 2.6% in accuracy. We show that the kernel function is an effective avenue for reducing estimation bias in ZO methods.
Related papers
- Dynamic Noise Preference Optimization for LLM Self-Improvement via Synthetic Data [51.62162460809116]
We introduce Dynamic Noise Preference Optimization (DNPO) to ensure consistent improvements across iterations.<n>In experiments with Zephyr-7B, DNPO consistently outperforms existing methods, showing an average performance boost of 2.6%.<n> DNPO shows a significant improvement in model-generated data quality, with a 29.4% win-loss rate gap compared to the baseline in GPT-4 evaluations.
arXiv Detail & Related papers (2025-02-08T01:20:09Z) - Refining Adaptive Zeroth-Order Optimization at Ease [24.327161891577727]
This paper introduces Refined Adaptive Zeroth-Order Optimization (R-AdaZO)<n>We first show the untapped variance reduction effect of first moment estimate on ZO gradient estimation.<n>We then refine the second moment estimate based on these variance-reduced gradient estimates to better capture the geometry of the optimization landscape.
arXiv Detail & Related papers (2025-02-03T03:10:44Z) - HALO: Hadamard-Assisted Lower-Precision Optimization for LLMs [45.37278584462772]
We present HALO, a novel quantization-aware training approach for Transformers.<n>Our approach ensures that all large matrix multiplications during the forward and backward passes are executed in lower precision.<n>Applying to LLAMA-family models, HALO achieves near-full-precision-equivalent results during fine-tuning on various tasks.
arXiv Detail & Related papers (2025-01-05T18:41:54Z) - HELENE: Hessian Layer-wise Clipping and Gradient Annealing for Accelerating Fine-tuning LLM with Zeroth-order Optimization [18.00873866263434]
Fine-tuning large language models (LLMs) poses significant memory challenges.
Recent work, MeZO, addresses this issue using a zeroth-order (ZO) optimization method.
We introduce HELENE, a novel scalable and memory-efficient pre-conditioner.
arXiv Detail & Related papers (2024-11-16T04:27:22Z) - Simultaneous Computation and Memory Efficient Zeroth-Order Optimizer for Fine-Tuning Large Language Models [33.911521719528686]
Fine-tuning is powerful for adapting large language models to downstream tasks, but it often results in huge memory usages.
A promising approach is using Zeroth-Order (ZO) gradients, which estimates to replace First-Order (FO) gradients.
We introduce a novel layer-wise sparse computation and memory efficient ZO, named LeZO.
arXiv Detail & Related papers (2024-10-13T12:47:37Z) - Zeroth-Order Fine-Tuning of LLMs in Random Subspaces [66.27334633749734]
As language models grow in size, memory demands for backpropagation increase.
Zeroth-order (ZOZO) optimization methods offer a memory-efficient alternative.
We show that SubZero enhances fine-tuning and achieves faster results compared to standard ZOZO approaches.
arXiv Detail & Related papers (2024-10-11T17:01:43Z) - Enhancing Zeroth-order Fine-tuning for Language Models with Low-rank Structures [21.18741772731095]
Zeroth-order (ZO) algorithms offer a promising alternative by approximating gradients using finite differences of function values.
Existing ZO methods struggle to capture the low-rank gradient structure common in LLM fine-tuning, leading to suboptimal performance.
This paper proposes a low-rank ZO algorithm (LOZO) that effectively captures this structure in LLMs.
arXiv Detail & Related papers (2024-10-10T08:10:53Z) - Multi-fidelity Constrained Optimization for Stochastic Black Box
Simulators [1.6385815610837167]
We introduce the algorithm Scout-Nd (Stochastic Constrained Optimization for N dimensions) to tackle the issues mentioned earlier.
Scout-Nd efficiently estimates the gradient, reduces the noise of the estimator gradient, and applies multi-fidelity schemes to further reduce computational effort.
We validate our approach on standard benchmarks, demonstrating its effectiveness in optimizing parameters highlighting better performance compared to existing methods.
arXiv Detail & Related papers (2023-11-25T23:36:38Z) - A Specialized Semismooth Newton Method for Kernel-Based Optimal
Transport [92.96250725599958]
Kernel-based optimal transport (OT) estimators offer an alternative, functional estimation procedure to address OT problems from samples.
We show that our SSN method achieves a global convergence rate of $O (1/sqrtk)$, and a local quadratic convergence rate under standard regularity conditions.
arXiv Detail & Related papers (2023-10-21T18:48:45Z) - Zeroth-Order Hybrid Gradient Descent: Towards A Principled Black-Box
Optimization Framework [100.36569795440889]
This work is on the iteration of zero-th-order (ZO) optimization which does not require first-order information.
We show that with a graceful design in coordinate importance sampling, the proposed ZO optimization method is efficient both in terms of complexity as well as as function query cost.
arXiv Detail & Related papers (2020-12-21T17:29:58Z) - Self-Tuning Stochastic Optimization with Curvature-Aware Gradient
Filtering [53.523517926927894]
We explore the use of exact per-sample Hessian-vector products and gradients to construct self-tuning quadratics.
We prove that our model-based procedure converges in noisy gradient setting.
This is an interesting step for constructing self-tuning quadratics.
arXiv Detail & Related papers (2020-11-09T22:07:30Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.