Memory-Efficient LLM Training with Online Subspace Descent
- URL: http://arxiv.org/abs/2408.12857v1
- Date: Fri, 23 Aug 2024 05:54:53 GMT
- Title: Memory-Efficient LLM Training with Online Subspace Descent
- Authors: Kaizhao Liang, Bo Liu, Lizhang Chen, Qiang Liu
- Abstract summary: We propose Online Subspace Descent, a new family of subspace descent optimizers that requires no singular value decomposition (SVD).
Online Subspace Descent is flexible and introduces only minimal overhead to training.
We show that for the task of pretraining LLaMA models ranging from 60M to 7B parameters on the C4 dataset, Online Subspace Descent achieves lower perplexity and better downstream task performance than state-of-the-art low-rank training methods.
- Score: 8.393403749426097
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, a wide range of memory-efficient LLM training algorithms have gained substantial popularity. These methods leverage the low-rank structure of gradients to project optimizer states into a subspace using a projection matrix found by singular value decomposition (SVD). However, the convergence of these algorithms is highly dependent on the update rules of their projection matrix. In this work, we provide the first convergence guarantee for arbitrary update rules of the projection matrix. This guarantee is generally applicable to optimizers that can be analyzed with Hamiltonian Descent, including most common ones such as LION and Adam. Inspired by our theoretical understanding, we propose Online Subspace Descent, a new family of subspace descent optimizers that requires no SVD. Instead of updating the projection matrix with eigenvectors, Online Subspace Descent updates the projection matrix with online PCA. Online Subspace Descent is flexible and introduces only minimal overhead to training. We show that for the task of pretraining LLaMA models ranging from 60M to 7B parameters on the C4 dataset, Online Subspace Descent achieves lower perplexity and better downstream task performance than state-of-the-art low-rank training methods across different settings and narrows the gap with full-rank baselines.
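To make the mechanism concrete, below is a minimal NumPy sketch of a subspace-descent optimizer step in the spirit of the abstract: the optimizer moments live in an r-dimensional subspace defined by a projection matrix P, and P is refreshed with an Oja-style online PCA step rather than a periodic SVD. The shapes, hyperparameters, toy loss, and the specific online PCA rule are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Minimal sketch of one subspace-descent step with an online PCA projection
# update. All shapes, hyperparameters, the toy loss, and the Oja-style PCA
# rule are illustrative assumptions, not the paper's exact algorithm.

def online_pca_update(P, G, lr_pca=1e-3):
    """Pull the columns of P toward the top eigenvectors of E[G G^T]
    with one Oja-style step, then re-orthonormalize (no SVD needed)."""
    P = P + lr_pca * (G @ (G.T @ P))
    Q, _ = np.linalg.qr(P)  # keep the projection orthonormal
    return Q

def subspace_adam_step(W, G, P, m, v, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Adam-like update whose moments are stored in the rank-r subspace."""
    R = P.T @ G                   # project the (m x n) gradient to (r x n)
    m = b1 * m + (1 - b1) * R     # first moment, kept at rank r
    v = b2 * v + (1 - b2) * R**2  # second moment, kept at rank r
    W = W - lr * (P @ (m / (np.sqrt(v) + eps)))  # map the update back to full size
    return W, m, v

# Toy usage on a quadratic loss 0.5 * ||W - 1||^2, whose gradient is W - 1.
rng = np.random.default_rng(0)
m_dim, n_dim, r = 64, 32, 4
W = rng.normal(size=(m_dim, n_dim))
P = np.linalg.qr(rng.normal(size=(m_dim, r)))[0]
m_state = np.zeros((r, n_dim))
v_state = np.zeros((r, n_dim))
for _ in range(100):
    G = W - 1.0
    W, m_state, v_state = subspace_adam_step(W, G, P, m_state, v_state)
    P = online_pca_update(P, G)
```

The point of the sketch is only to show where the memory saving comes from: the Adam moments are (r x n) instead of (m x n), and P is refreshed without ever forming an SVD.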
Related papers
- Regularizing Subspace Redundancy of Low-Rank Adaptation [54.473090597164834]
We propose ReSoRA, a method that explicitly models redundancy between mapping subspaces and adaptively regularizes the subspace redundancy of low-rank adaptation.
The proposed method consistently improves existing state-of-the-art PETL methods across various backbones and datasets on vision-language retrieval and standard visual classification benchmarks.
As a training supervision, ReSoRA can be seamlessly integrated into existing approaches in a plug-and-play manner, with no additional inference cost.
arXiv Detail & Related papers (2025-07-28T11:52:56Z)
- SVD-Free Low-Rank Adaptive Gradient Optimization for Large Language Models [37.60342078872549]
We propose a two-step procedure to approximate SVD-based gradient projections into lower-dimensional spaces.
Our experiments on both pre-training and fine-tuning tasks demonstrate the effectiveness of our dual strategy.
arXiv Detail & Related papers (2025-05-23T14:37:00Z)
- I3S: Importance Sampling Subspace Selection for Low-Rank Optimization in LLM Pretraining [50.89661053183944]
Low-rank optimization has emerged as a promising approach to enabling memory-efficient training of large language models (LLMs).
Existing low-rank optimization methods typically project gradients onto a low-rank subspace, reducing the memory cost of storing states.
We propose importance sampling subspace selection (I3S) for low-rank optimization, which theoretically offers a comparable convergence rate to the dominant subspace approach.
arXiv Detail & Related papers (2025-02-09T06:30:19Z)
- DoTA: Weight-Decomposed Tensor Adaptation for Large Language Models [33.4538652558253]
Low-rank adaptation (LoRA) reduces the computational and memory demands of fine-tuning large language models (LLMs) by approximating updates with low-rank matrices.
We propose Weight-Decomposed Tensor Adaptation (DoTA), which leverages the Matrix Product Operator (MPO) decomposition of pre-trained weights.
We also introduce QDoTA, a quantized version of DoTA designed for 4-bit quantization.
arXiv Detail & Related papers (2024-12-30T12:00:47Z)
- Efficient Adaptation of Pre-trained Vision Transformer via Householder Transformation [53.88562288388169]
A common strategy for Parameter-Efficient Fine-Tuning (PEFT) of pre-trained Vision Transformers (ViTs) involves adapting the model to downstream tasks.
We propose a novel PEFT approach inspired by Singular Value Decomposition (SVD) for representing the adaptation matrix.
SVD decomposes a matrix into the product of a left unitary matrix, a diagonal matrix of scaling values, and a right unitary matrix.
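As a quick, generic illustration of that factorization (plain NumPy, not this paper's adaptation scheme):

```python
import numpy as np

# Generic illustration of the factorization A = U @ diag(S) @ Vt, where the
# columns of U and the rows of Vt are orthonormal and S holds the scaling values.
A = np.random.default_rng(0).normal(size=(5, 3))
U, S, Vt = np.linalg.svd(A, full_matrices=False)

assert np.allclose(U @ np.diag(S) @ Vt, A)  # exact reconstruction
assert np.allclose(U.T @ U, np.eye(3))      # left factor has orthonormal columns
assert np.allclose(Vt @ Vt.T, np.eye(3))    # right factor has orthonormal rows
```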
arXiv Detail & Related papers (2024-10-30T12:08:30Z)
- Zeroth-Order Fine-Tuning of LLMs in Random Subspaces [66.27334633749734]
As language models grow in size, memory demands for backpropagation increase.
Zeroth-order (ZO) optimization methods offer a memory-efficient alternative.
We show that the proposed SubZero method enhances fine-tuning and converges faster than standard ZO approaches.
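For intuition, here is a generic two-point zeroth-order gradient estimate in NumPy: the gradient is approximated from two forward passes along a random direction, so no backpropagation state is kept. This is an SPSA-style illustration under assumed toy settings, not the SubZero algorithm itself.

```python
import numpy as np

# Two-point zeroth-order (ZO) gradient estimate: approximate the gradient
# from two function evaluations along a random direction, no backprop needed.

def zo_gradient(loss_fn, params, rng, mu=1e-3):
    u = rng.normal(size=params.shape)  # random perturbation direction
    delta = loss_fn(params + mu * u) - loss_fn(params - mu * u)
    return (delta / (2 * mu)) * u      # directional gradient estimate

# Toy usage: descend a quadratic loss using only function evaluations.
rng = np.random.default_rng(0)
loss = lambda p: float(np.sum((p - 1.0) ** 2))
p = np.zeros(8)
for _ in range(300):
    p -= 0.02 * zo_gradient(loss, p, rng)
# p drifts toward the minimizer at 1.0, up to ZO estimation noise
```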
arXiv Detail & Related papers (2024-10-11T17:01:43Z)
- LoRTA: Low Rank Tensor Adaptation of Large Language Models [70.32218116940393]
Low Rank Adaptation (LoRA) is a popular Parameter-Efficient Fine-Tuning (PEFT) method.
We propose a higher-order CANDECOMP/PARAFAC (CP) decomposition, enabling a more compact and flexible representation.
Our method can achieve a reduction in the number of parameters while maintaining comparable performance.
arXiv Detail & Related papers (2024-10-05T06:59:50Z)
- From GaLore to WeLore: How Low-Rank Weights Non-uniformly Emerge from Low-Rank Gradients [86.40635601953446]
We study the emergence of low-rank structures across different layers of Modern Large Language Models.
We present Weight Low-Rank Projection (WeLore), which unifies weight compression and memory-efficient fine-tuning into a single framework.
arXiv Detail & Related papers (2024-07-15T21:05:20Z)
- Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient [57.9629676017527]
We propose an optimization-based structural pruning method for large language models.
We learn the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model.
Our method runs in 2.7 hours with around 35GB of memory for 13B models on a single A100 GPU.
arXiv Detail & Related papers (2024-06-15T09:31:03Z)
- MiLoRA: Harnessing Minor Singular Components for Parameter-Efficient LLM Finetuning [16.67302585857681]
We propose MiLoRA, a simple yet effective LLM finetuning approach that only updates the minor singular components of the weight matrix.
It is observed that the minor matrix corresponds to the noisy or long-tail information, while the principal matrix contains important knowledge.
During finetuning, MiLoRA makes the most use of the less-optimized subspace for learning the labeled dataset.
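For intuition, here is an illustrative NumPy sketch of splitting a weight into principal and minor SVD components as described above; the sizes and rank threshold are hypothetical, and this is not MiLoRA's training code.

```python
import numpy as np

# Illustrative split of a weight matrix into principal and minor SVD parts:
# the principal part (important knowledge) would stay frozen, while only the
# minor singular directions are exposed to finetuning. Sizes and the rank
# threshold k are assumptions for illustration.

W = np.random.default_rng(0).normal(size=(64, 64))
U, S, Vt = np.linalg.svd(W)
k = 48  # number of principal singular directions kept frozen

W_principal = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]  # frozen part
U_min, S_min, Vt_min = U[:, k:], S[k:], Vt[k:, :]    # trainable minor factors

# The two parts recompose the original weight exactly.
assert np.allclose(W_principal + U_min @ np.diag(S_min) @ Vt_min, W)
```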
arXiv Detail & Related papers (2024-06-13T12:30:02Z)
- Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs [67.38165028487242]
We introduce Dynamic Sparse No Training (DSnoT), a training-free approach to fine-tuning sparse large language models (LLMs).
Inspired by Dynamic Sparse Training, DSnoT minimizes the reconstruction error between the dense and sparse LLMs.
Our paper offers fresh insights into how to fine-tune sparse LLMs in an efficient, training-free manner and opens new avenues for scaling the potential of sparsity to LLMs.
arXiv Detail & Related papers (2023-10-13T07:38:52Z)
- Mode-wise Principal Subspace Pursuit and Matrix Spiked Covariance Model [13.082805815235975]
We introduce a novel framework called Mode-wise Principal Subspace Pursuit (MOP-UP) to extract hidden variations in both the row and column dimensions for matrix data.
The effectiveness and practical merits of the proposed framework are demonstrated through experiments on both simulated and real datasets.
arXiv Detail & Related papers (2023-07-02T13:59:47Z)
- Unfolding Projection-free SDP Relaxation of Binary Graph Classifier via GDPA Linearization [59.87663954467815]
Algorithm unfolding creates an interpretable and parsimonious neural network architecture by implementing each iteration of a model-based algorithm as a neural layer.
In this paper, leveraging a recent linear algebraic theorem called Gershgorin disc perfect alignment (GDPA), we unroll a projection-free algorithm for the semi-definite programming relaxation (SDR) of a binary graph classifier.
Experimental results show that our unrolled network outperformed pure model-based graph classifiers, and achieved comparable performance to pure data-driven networks while using far fewer parameters.
arXiv Detail & Related papers (2021-09-10T07:01:15Z)
- Online high rank matrix completion [39.570686604641836]
Recent advances in matrix completion enable data imputation in full-rank matrices by exploiting low dimensional (nonlinear) latent structure.
We develop a new model for high rank matrix completion, together with batch and online methods to fit the model and out-of-sample extension.
arXiv Detail & Related papers (2020-02-20T18:31:04Z)