Related papers: FFT-based Dynamic Subspace Selection for Low-Rank Adaptive Optimization of Large Language Models

FFT-based Dynamic Subspace Selection for Low-Rank Adaptive Optimization of Large Language Models

URL: http://arxiv.org/abs/2505.17967v4
Date: Wed, 08 Oct 2025 15:33:25 GMT
Title: FFT-based Dynamic Subspace Selection for Low-Rank Adaptive Optimization of Large Language Models
Authors: Ionut-Vlad Modoranu, Mher Safaryan, Erik Schultheis, Max Ryabinin, Artem Chumachenko, Dan Alistarh,
Abstract summary: We propose a two-step procedure to approximate SVD/QR-based gradient projections into lower-dimensional spaces.<n>We show that our strategy achieves faster runtime and reduced memory usage by up to $25%$ across different model sizes.
Score: 49.397861654088636
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Low-rank optimization has emerged as a promising direction in training large language models (LLMs) to improve running time and reduce the memory usage of adaptive optimizers by constraining learning to a lower-dimensional space. Prior work typically projects gradients of linear layers using approaches based on Singular Value Decomposition (SVD) or QR-decomposition. Applying these techniques individually to each layer in large models is computationally expensive and incurs additional memory costs due to storing the projection matrices. In this work, we propose a computationally efficient and conceptually simple, two-step procedure to approximate SVD/QR-based gradient projections into lower-dimensional spaces by using a predefined orthogonal matrix of the Discrete Cosine Transform (DCT). We dynamically select columns from the DCT matrix based on their alignment with the gradient of each layer. The effective projection matrices are obtained via a simple matmul with the DCT matrix in $O(n^3)$ time, followed by a lightweight sorting step to identify the most relevant basis vectors. For large layers, DCT can be computed via Makhoul's $N$-point algorithm based on Fast Fourier Transform (FFT) in $O(n^2 \log(n))$ time. Due to the predefined nature of the orthogonal bases, they are computed once at the start of training. Our numerical experiments on both pre-training and fine-tuning tasks demonstrate the effectiveness of our dual strategy in approximating optimal low-rank projections, obtaining an approach with rank-independent running time that matches the performance of costly SVD/QR-based methods while achieving faster runtime and reduced memory usage by up to $25\%$ across different model sizes. Our code is available at \href{https://github.com/IST-DASLab/ISTA-DASLab-Optimizers}{\texttt{https://github.com/IST-DASLab/ISTA-DASLab-Optimizers}}.

Related papers

A Minimalist Optimizer Design for LLM Pretraining [31.996047271119156]
Training large language models typically relies on adaptives such as Adam.<n>Recent works such as GaLore Fira, and APOLLO have proposed state-compressed variants to reduce memory consumption.<n>In this work, we investigate what is the minimal amount of state that is truly necessary to retain state-of-the-art performance in LLM pretraining.
arXiv Detail & Related papers (2025-06-20T00:10:35Z)
FedSVD: Adaptive Orthogonalization for Private Federated Learning with LoRA [68.44043212834204]
Low-Rank Adaptation (LoRA) is widely used for efficient fine-tuning of language models in learning (FL)<n>Low-Rank Adaptation (LoRA) is widely used for efficient fine-tuning of language models in learning (FL)
arXiv Detail & Related papers (2025-05-19T07:32:56Z)
Efficient Adaptation of Pre-trained Vision Transformer via Householder Transformation [53.88562288388169]
A common strategy for. Efficient Fine-Tuning (PEFT) of pre-trained Vision Transformers (ViTs) involves adapting the model to downstream tasks. We propose a novel PEFT approach inspired by Singular Value Decomposition (SVD) for representing the adaptation matrix. SVD decomposes a matrix into the product of a left unitary matrix, a diagonal matrix of scaling values, and a right unitary matrix.
arXiv Detail & Related papers (2024-10-30T12:08:30Z)
Unified Gradient-Based Machine Unlearning with Remain Geometry Enhancement [29.675650285351768]
Machine unlearning (MU) has emerged to enhance the privacy and trustworthiness of deep neural networks. Approximate MU is a practical method for large-scale models. We propose a fast-slow parameter update strategy to implicitly approximate the up-to-date salient unlearning direction.
arXiv Detail & Related papers (2024-09-29T15:17:33Z)
Compute Better Spent: Replacing Dense Layers with Structured Matrices [77.61728033234233]
We identify more efficient alternatives to dense matrices, as exemplified by the success of convolutional networks in the image domain. We show that different structures often require drastically different initialization scales and learning rates, which are crucial to performance. We propose a novel matrix family containing Monarch matrices, the Block-Train, which we show performs better than dense for the same compute on multiple tasks.
arXiv Detail & Related papers (2024-06-10T13:25:43Z)
PMaF: Deep Declarative Layers for Principal Matrix Features [37.662505982849844]
We explore two differentiable deep declarative layers, namely least squares on sphere (LESS) and implicit eigen decomposition (IED) LESS can be used to represent data features with a low-dimensional vector containing dominant information from a high-dimensional matrix. IED can be used to represent data features with a low-dimensional vector containing dominant information from a high-dimensional matrix.
arXiv Detail & Related papers (2023-06-26T15:13:36Z)
Layer-wise Adaptive Step-Sizes for Stochastic First-Order Methods for Deep Learning [8.173034693197351]
We propose a new per-layer adaptive step-size procedure for first-order optimization methods in deep learning. The proposed approach exploits the layer-wise curvature information contained in the diagonal blocks of the Hessian in deep neural networks (DNNs) to compute adaptive step-sizes (i.e., LRs) for each layer. Numerical experiments show that SGD with momentum and AdamW combined with the proposed per-layer step-sizes are able to choose effective LR schedules.
arXiv Detail & Related papers (2023-05-23T04:12:55Z)
projUNN: efficient method for training deep networks with unitary matrices [21.11571804661279]
We introduce two variants of a method that can parameterize full $N$-dimensional unitary or matrices with a training runtime scaling as $O(kN2)$. Even in the fastest setting, projUNN is able to train a model's unitary parameters to reach comparable performances against baseline implementations.
arXiv Detail & Related papers (2022-03-10T17:04:41Z)
Memory-Efficient Backpropagation through Large Linear Layers [107.20037639738433]
In modern neural networks like Transformers, linear layers require significant memory to store activations during backward pass. This study proposes a memory reduction approach to perform backpropagation through linear layers.
arXiv Detail & Related papers (2022-01-31T13:02:41Z)
Implicit SVD for Graph Representation Learning [33.761179632722]
We make Graph Representational Learning more computationally tractable for those with modest hardware. We derive linear approximation of a SOTA model, and train the model, in closed-form, via SVD of $mathbfM$, without calculating entries. Our models show competitive empirical test performance over various graphs.
arXiv Detail & Related papers (2021-11-11T16:58:17Z)
Unfolding Projection-free SDP Relaxation of Binary Graph Classifier via GDPA Linearization [59.87663954467815]
Algorithm unfolding creates an interpretable and parsimonious neural network architecture by implementing each iteration of a model-based algorithm as a neural layer. In this paper, leveraging a recent linear algebraic theorem called Gershgorin disc perfect alignment (GDPA), we unroll a projection-free algorithm for semi-definite programming relaxation (SDR) of a binary graph. Experimental results show that our unrolled network outperformed pure model-based graph classifiers, and achieved comparable performance to pure data-driven networks but using far fewer parameters.
arXiv Detail & Related papers (2021-09-10T07:01:15Z)
Efficient Matrix-Free Approximations of Second-Order Information, with Applications to Pruning and Optimization [16.96639526117016]
We investigate matrix-free, linear-time approaches for estimating Inverse-Hessian Vector Products (IHVPs) These algorithms yield state-of-the-art results for network pruning and optimization with lower computational overhead relative to existing second-order methods.
arXiv Detail & Related papers (2021-07-07T17:01:34Z)
Why Approximate Matrix Square Root Outperforms Accurate SVD in Global Covariance Pooling? [59.820507600960745]
We propose a new GCP meta-layer that uses SVD in the forward pass, and Pad'e Approximants in the backward propagation to compute the gradients. The proposed meta-layer has been integrated into different CNN models and achieves state-of-the-art performances on both large-scale and fine-grained datasets.
arXiv Detail & Related papers (2021-05-06T08:03:45Z)
On the Efficient Implementation of the Matrix Exponentiated Gradient Algorithm for Low-Rank Matrix Optimization [26.858608065417663]
Convex optimization over the spectrahedron has important applications in machine learning, signal processing and statistics. We propose efficient implementations of MEG, which are tailored for optimization with low-rank matrices, and only use a single low-rank SVD on each iteration. We also provide efficiently-computable certificates for the correct convergence of our methods.
arXiv Detail & Related papers (2020-12-18T19:14:51Z)
Learning Low-rank Deep Neural Networks via Singular Vector Orthogonality Regularization and Singular Value Sparsification [53.50708351813565]
We propose SVD training, the first method to explicitly achieve low-rank DNNs during training without applying SVD on every step. We empirically show that SVD training can significantly reduce the rank of DNN layers and achieve higher reduction on computation load under the same accuracy.
arXiv Detail & Related papers (2020-04-20T02:40:43Z)

This list is automatically generated from the titles and abstracts of the papers in this site.