TeZO: Empowering the Low-Rankness on the Temporal Dimension in the Zeroth-Order Optimization for Fine-tuning LLMs
- URL: http://arxiv.org/abs/2501.19057v1
- Date: Fri, 31 Jan 2025 11:34:03 GMT
- Title: TeZO: Empowering the Low-Rankness on the Temporal Dimension in the Zeroth-Order Optimization for Fine-tuning LLMs
- Authors: Yan Sun, Tiansheng Huang, Liang Ding, Li Shen, Dacheng Tao
- Abstract summary: We propose a novel low-rank ZO estimator, TeZO, which captures the low-rankness across both the model and temporal dimension. Specifically, we represent ZO perturbations along the temporal dimension as a 3D tensor and employ Canonical Polyadic Decomposition (CPD) to extract each low-rank 2D matrix.
- Score: 58.19080159470868
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Zeroth-order optimization (ZO) has demonstrated remarkable promise in efficient fine-tuning tasks for Large Language Models (LLMs). In particular, recent advances incorporate the low-rankness of gradients, introducing low-rank ZO estimators to further reduce GPU memory consumption. However, most existing works focus solely on the low-rankness of each individual gradient, overlooking a broader property shared by all gradients throughout training, i.e., all gradients approximately reside within a similar subspace. In this paper, we consider these two factors together and propose a novel low-rank ZO estimator, TeZO, which captures the low-rankness across both the model and temporal dimensions. Specifically, we represent ZO perturbations along the temporal dimension as a 3D tensor and employ Canonical Polyadic Decomposition (CPD) to extract each low-rank 2D matrix, significantly reducing the training cost. TeZO can also be easily extended to the Adam variant while consuming less memory than MeZO-SGD and requiring only about 35% of the memory of MeZO-Adam. Comprehensive theoretical analysis and extensive experiments validate its efficiency, achieving SOTA-comparable results with lower time and memory overhead.
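The CP-decomposition idea in the abstract can be illustrated with a short sketch. The code below is a minimal reconstruction from the abstract alone, not the authors' implementation: it assumes the factor matrices U and V are sampled once and shared across all steps (the temporal low-rankness), while only a small coefficient vector s_t is re-drawn each step, so the per-step perturbation Z_t = U diag(s_t) V^T costs r random numbers instead of m·n. The rank r, step size, and toy least-squares objective are illustrative choices.

```python
# Minimal sketch of a CP-style low-rank ZO perturbation, reconstructed from the
# abstract only (not the authors' released code). Assumption: the shared factors
# U, V encode the temporal low-rankness, so the per-step perturbation
# Z_t = U diag(s_t) V^T needs only r fresh random numbers per weight matrix.
import torch

torch.manual_seed(0)
m, n, r = 64, 32, 4                  # toy weight shape and assumed CP rank
W = 0.01 * torch.randn(m, n)         # weight matrix being fine-tuned
x, y = torch.randn(16, n), torch.randn(16, m)

def loss_fn(weight):                 # stand-in for the LLM fine-tuning loss
    return ((x @ weight.T) - y).pow(2).mean()

# Shared factors, drawn once and reused at every step (temporal dimension).
U = torch.randn(m, r) / r ** 0.5
V = torch.randn(n, r) / r ** 0.5

eps, lr = 1e-3, 1e-2                 # illustrative hyperparameters
for step in range(200):
    s = torch.randn(r)               # only the temporal coefficients are re-sampled
    Z = (U * s) @ V.T                # low-rank perturbation U diag(s) V^T
    # Two-point (SPSA-style) zeroth-order estimate of the directional derivative.
    g = (loss_fn(W + eps * Z) - loss_fn(W - eps * Z)) / (2 * eps)
    W = W - lr * g * Z               # SGD-style update along the probe direction
```

Because U and V are fixed across iterations, each step only draws the r temporal coefficients and runs two forward passes, which is consistent with the memory savings the abstract claims over MeZO-SGD and MeZO-Adam.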
Related papers
- CompGS++: Compressed Gaussian Splatting for Static and Dynamic Scene Representation [60.712165339762116]
CompGS++ is a novel framework that leverages compact Gaussian primitives to achieve accurate 3D modeling.
Our design is based on the principle of eliminating redundancy both between and within primitives.
Our implementation will be made publicly available on GitHub to facilitate further research.
arXiv Detail & Related papers (2025-04-17T15:33:01Z) - Second-order Optimization of Gaussian Splats with Importance Sampling [51.95046424364725]
3D Gaussian Splatting (3DGS) is widely used for novel view rendering due to its high quality and fast inference time.
We propose a novel second-order optimization strategy based on Levenberg-Marquardt (LM) and Conjugate Gradient (CG).
Our method achieves a $3\times$ speedup over standard LM and outperforms Adam by $6\times$ when the Gaussian count is low.
arXiv Detail & Related papers (2025-04-17T12:52:08Z) - Micro-splatting: Maximizing Isotropic Constraints for Refined Optimization in 3D Gaussian Splatting [0.3749861135832072]
This work implements an adaptive densification strategy that dynamically refines regions with high image gradients.
This yields denser and more detailed Gaussian means where needed, without sacrificing rendering efficiency.
arXiv Detail & Related papers (2025-04-08T07:15:58Z) - Harmony in Divergence: Towards Fast, Accurate, and Memory-efficient Zeroth-order LLM Fine-tuning [37.507489928116804]
Large language models (LLMs) excel across various tasks, but standard first-order (FO) fine-tuning demands considerable memory.
We introduce a novel layer-wise divergence analysis that uncovers the distinct update pattern of FO and ZO optimization.
We propose Divergence-driven Zeroth-Order (DiZO) optimization.
arXiv Detail & Related papers (2025-02-05T16:03:17Z) - HELENE: Hessian Layer-wise Clipping and Gradient Annealing for Accelerating Fine-tuning LLM with Zeroth-order Optimization [18.00873866263434]
Fine-tuning large language models (LLMs) poses significant memory challenges.
Recent work, MeZO, addresses this issue using a zeroth-order (ZO) optimization method.
We introduce HELENE, a novel scalable and memory-efficient pre-conditioner.
arXiv Detail & Related papers (2024-11-16T04:27:22Z) - Simultaneous Computation and Memory Efficient Zeroth-Order Optimizer for Fine-Tuning Large Language Models [33.911521719528686]
Fine-tuning is powerful for adapting large language models to downstream tasks, but it often results in huge memory usage.
A promising approach is to use Zeroth-Order (ZO) gradient estimates in place of First-Order (FO) gradients.
We introduce LeZO, a novel layer-wise sparse, computation- and memory-efficient ZO optimizer.
arXiv Detail & Related papers (2024-10-13T12:47:37Z) - Zeroth-Order Fine-Tuning of LLMs in Random Subspaces [66.27334633749734]
As language models grow in size, memory demands for backpropagation increase.
Zeroth-order (ZO) optimization methods offer a memory-efficient alternative.
We show that SubZero enhances fine-tuning and achieves faster convergence compared to standard ZO approaches.
arXiv Detail & Related papers (2024-10-11T17:01:43Z) - Variance-reduced Zeroth-Order Methods for Fine-Tuning Language Models [17.027512781038617]
Zeroth-order (ZO) optimization methods can leverage memory-efficient forward passes to estimate gradients.
MeZO, an adaptation of ZO-SGD, has been shown to consistently outperform zero-shot and in-context learning.
MeZO-SVRG significantly reduces the required memory footprint compared to first-order SGD.
arXiv Detail & Related papers (2024-04-11T18:35:49Z) - Fine-Tuning Language Models with Just Forward Passes [92.04219196752007]
Fine-tuning language models (LMs) has yielded success on diverse downstream tasks, but as LMs grow in size, backpropagation requires a large amount of memory.
We propose a memory-efficient zeroth-order optimizer (MeZO) that operates in-place, thereby fine-tuning LMs with the same memory footprint as inference (see the sketch after this list).
arXiv Detail & Related papers (2023-05-27T02:28:10Z) - P-STMO: Pre-Trained Spatial Temporal Many-to-One Model for 3D Human Pose Estimation [78.83305967085413]
This paper introduces a novel Pre-trained Spatial Temporal Many-to-One (P-STMO) model for the 2D-to-3D human pose estimation task.
Our method outperforms state-of-the-art methods with fewer parameters and less computational overhead.
arXiv Detail & Related papers (2022-03-15T04:00:59Z)
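Several of the entries above (MeZO, MeZO-SVRG, LeZO, SubZero) build on the same in-place, two-forward-pass ZO update popularized by MeZO in "Fine-Tuning Language Models with Just Forward Passes". The sketch below is a simplified illustration under assumed settings (a toy linear model, illustrative eps and learning rate), not the official implementation; it shows how regenerating the perturbation from a stored seed keeps parameter memory at the inference footprint.

```python
# Hedged sketch of an in-place, two-forward-pass ZO (MeZO-style) update.
# The model, eps, and lr are toy stand-ins, not values from any of the papers.
import torch

torch.manual_seed(0)
model = torch.nn.Linear(32, 8)                # stand-in for an LLM
x, y = torch.randn(64, 32), torch.randn(64, 8)

def loss_fn():
    with torch.no_grad():                     # forward passes only, no autograd graph
        return ((model(x) - y) ** 2).mean().item()

def perturb(coef, seed):
    # Regenerate the same Gaussian noise from the seed instead of storing it,
    # so parameter memory stays at the inference footprint.
    gen = torch.Generator().manual_seed(seed)
    with torch.no_grad():
        for p in model.parameters():
            z = torch.randn(p.shape, generator=gen)
            p.add_(coef * z)

eps, lr = 1e-3, 1e-3                          # illustrative hyperparameters
for step in range(200):
    seed = step                               # any fresh per-step seed works
    perturb(+eps, seed); loss_pos = loss_fn()
    perturb(-2 * eps, seed); loss_neg = loss_fn()
    perturb(+eps, seed)                       # restore the original parameters
    g = (loss_pos - loss_neg) / (2 * eps)     # projected gradient along z
    perturb(-lr * g, seed)                    # SGD step: p -= lr * g * z
```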
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.