A Memory Efficient Randomized Subspace Optimization Method for Training Large Language Models
- URL: http://arxiv.org/abs/2502.07222v1
- Date: Tue, 11 Feb 2025 03:32:10 GMT
- Title: A Memory Efficient Randomized Subspace Optimization Method for Training Large Language Models
- Authors: Yiming Chen, Yuan Zhang, Yin Liu, Kun Yuan, Zaiwen Wen
- Abstract summary: We introduce a Randomized Subspace Optimization framework for pre-training and fine-tuning Large Language Models. Our approach decomposes the high-dimensional training problem into a series of lower-dimensional subproblems. This structured reduction in dimensionality allows our method to simultaneously reduce memory usage for both activations and optimizer states.
- Score: 22.725326215887435
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The memory challenges associated with training Large Language Models (LLMs) have become a critical concern, particularly when using the Adam optimizer. To address this issue, numerous memory-efficient techniques have been proposed, with GaLore standing out as a notable example designed to reduce the memory footprint of optimizer states. However, these approaches do not alleviate the memory burden imposed by activations, rendering them unsuitable for scenarios involving long context sequences or large mini-batches. Moreover, their convergence properties are still not well-understood in the literature. In this work, we introduce a Randomized Subspace Optimization framework for pre-training and fine-tuning LLMs. Our approach decomposes the high-dimensional training problem into a series of lower-dimensional subproblems. At each iteration, a random subspace is selected, and the parameters within that subspace are optimized. This structured reduction in dimensionality allows our method to simultaneously reduce memory usage for both activations and optimizer states. We establish comprehensive convergence guarantees and derive rates for various scenarios, accommodating different optimization strategies to solve the subproblems. Extensive experiments validate the superior memory and communication efficiency of our method, achieving performance comparable to GaLore and Adam.
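To make the subspace idea concrete, here is a minimal NumPy sketch of one outer iteration under assumed details: a toy quadratic loss stands in for the LLM objective, the subspace is spanned by a random Gaussian matrix, and the subproblem is solved with a few plain gradient steps. The names (`loss_fn`, `randomized_subspace_step`) and all hyperparameters are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def loss_fn(w):
    # Toy quadratic standing in for the LLM training objective.
    return 0.5 * np.sum(w ** 2)

def grad_fn(w):
    return w  # gradient of the toy quadratic

def randomized_subspace_step(w, r=8, inner_steps=5, lr=0.5, rng=None):
    """One outer iteration: optimize the loss restricted to a random r-dim subspace.

    Parametrize w = w0 + P @ z with a random Gaussian sketch P (d x r); only the
    r-dimensional variable z (and any optimizer state attached to it) is updated
    while solving the subproblem.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = w.size
    P = rng.standard_normal((d, r)) / np.sqrt(d)  # random subspace basis
    z = np.zeros(r)
    for _ in range(inner_steps):
        g_full = grad_fn(w + P @ z)   # gradient at the current point
        g_sub = P.T @ g_full          # chain rule: gradient with respect to z
        z -= lr * g_sub               # plain gradient descent on the subproblem
    return w + P @ z                  # fold the subspace update back into w

w = np.random.default_rng(0).standard_normal(1000)
print("initial loss:", loss_fn(w))
for t in range(20):
    w = randomized_subspace_step(w, rng=np.random.default_rng(t))
print("final loss:", loss_fn(w))
```

In this sketch only the r-dimensional variable z would carry optimizer state, so that state scales with r rather than with the full parameter dimension; how the method additionally saves activation memory depends on details not shown here.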
Related papers
- COSMOS: A Hybrid Adaptive Optimizer for Memory-Efficient Training of LLMs [81.01082659623552]
Large Language Models (LLMs) have demonstrated remarkable success across various domains.
Their optimization remains a significant challenge due to the complex and high-dimensional loss landscapes they inhabit.
arXiv Detail & Related papers (2025-02-24T18:42:19Z) - I3S: Importance Sampling Subspace Selection for Low-Rank Optimization in LLM Pretraining [50.89661053183944]
Low-rank optimization has emerged as a promising approach to enabling memory-efficient training of large language models (LLMs). Existing low-rank optimization methods typically project gradients onto a low-rank subspace, reducing the memory cost of storing optimizer states. We propose importance sampling subspace selection (I3S) for low-rank optimization, which theoretically offers a convergence rate comparable to that of the dominant-subspace approach.
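As a rough illustration of the gradient-projection idea described above (a GaLore-style update, not the I3S importance-sampling rule itself), the sketch below keeps Adam-style moment estimates only in a low-rank subspace obtained from an SVD of the gradient, then maps the update back to full size. All names and hyperparameters are assumptions.

```python
import numpy as np

def low_rank_adam_step(W, G, state, rank=4, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Adam-style update whose optimizer state lives in a rank-`rank` subspace.

    The gradient G (m x n) is projected to (rank x n) using its top left singular
    vectors, so the first/second moments cost O(rank * n) instead of O(m * n).
    In practice the projection would be refreshed only occasionally so the stored
    moments stay consistent across steps; here it is recomputed each call for brevity.
    """
    U, _, _ = np.linalg.svd(G, full_matrices=False)
    P = U[:, :rank]                              # basis of the dominant subspace
    g = P.T @ G                                  # low-rank gradient representation
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1 - b1) * g
    state["v"] = b2 * state["v"] + (1 - b2) * g ** 2
    m_hat = state["m"] / (1 - b1 ** state["t"])
    v_hat = state["v"] / (1 - b2 ** state["t"])
    W -= lr * (P @ (m_hat / (np.sqrt(v_hat) + eps)))  # project the update back
    return W

m, n, rank = 64, 32, 4
state = {"m": np.zeros((rank, n)), "v": np.zeros((rank, n)), "t": 0}
W = np.zeros((m, n))
G = np.random.default_rng(0).standard_normal((m, n))
W = low_rank_adam_step(W, G, state, rank=rank)
```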
arXiv Detail & Related papers (2025-02-09T06:30:19Z) - Sparse Gradient Compression for Fine-Tuning Large Language Models [58.44973963468691]
Fine-tuning large language models (LLMs) for downstream tasks has become increasingly crucial due to their widespread use and the growing availability of open-source models. The high memory costs associated with fine-tuning remain a significant challenge, especially as models increase in size. We propose sparse gradient compression (SGC) to address these limitations.
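For intuition, here is a minimal sketch of generic top-k gradient sparsification, one standard way to compress gradients; it is an assumed illustration of the general idea, not the SGC algorithm from the paper.

```python
import numpy as np

def topk_sparsify(grad, k):
    """Keep only the k largest-magnitude entries of a gradient and return them
    in index/value form, a standard compressed representation."""
    flat = grad.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]   # indices of the k largest entries
    return idx, flat[idx], grad.shape

def densify(idx, vals, shape):
    """Rebuild a dense gradient from the compressed (index, value) form."""
    dense = np.zeros(int(np.prod(shape)), dtype=vals.dtype)
    dense[idx] = vals
    return dense.reshape(shape)

g = np.random.default_rng(0).standard_normal((512, 128))
idx, vals, shape = topk_sparsify(g, k=1024)        # keep roughly 1.6% of the entries
g_hat = densify(idx, vals, shape)
print("kept fraction:", vals.size / g.size)
```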
arXiv Detail & Related papers (2025-02-01T04:18:28Z) - Breaking Memory Limits: Gradient Wavelet Transform Enhances LLMs Training [45.225732322141994]
Large language models (LLMs) have shown impressive performance across a range of natural language processing tasks. Their vast number of parameters introduces significant memory challenges during training. Existing memory-efficient algorithms often rely on techniques such as singular value decomposition projection or weight freezing. We propose a novel solution called Gradient Wavelet Transform (GWT), which applies wavelet transforms to gradients in order to significantly reduce memory requirements.
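To illustrate the basic building block, the sketch below applies a single-level Haar wavelet transform to a gradient vector and inverts it exactly; an optimizer could then, for example, keep moment estimates only for the half-length approximation coefficients. This is a generic wavelet example under assumed details, not the GWT method itself.

```python
import numpy as np

def haar_forward(x):
    """One-level Haar transform: split x (even length) into approximation and
    detail coefficients, each half the original size."""
    a = (x[0::2] + x[1::2]) / np.sqrt(2)   # low-frequency (approximation) part
    d = (x[0::2] - x[1::2]) / np.sqrt(2)   # high-frequency (detail) part
    return a, d

def haar_inverse(a, d):
    """Exactly invert the one-level Haar transform."""
    x = np.empty(2 * a.size)
    x[0::2] = (a + d) / np.sqrt(2)
    x[1::2] = (a - d) / np.sqrt(2)
    return x

g = np.random.default_rng(0).standard_normal(1024)
a, d = haar_forward(g)
# A memory-saving variant might keep Adam-style moments only for the 512
# approximation coefficients and use a stateless update for the details.
g_rec = haar_inverse(a, d)
print("max reconstruction error:", np.max(np.abs(g - g_rec)))
```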
arXiv Detail & Related papers (2025-01-13T11:35:09Z) - Zeroth-Order Fine-Tuning of LLMs in Random Subspaces [66.27334633749734]
As language models grow in size, memory demands for backpropagation increase.
Zeroth-order (ZO) optimization methods offer a memory-efficient alternative.
We show that SubZero enhances fine-tuning and achieves faster convergence than standard ZO approaches.
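For context, the classic two-point zeroth-order gradient estimator that such methods build on looks roughly like the sketch below; it avoids backpropagation entirely by querying only the loss. The toy loss, step sizes, and dimensions are illustrative assumptions, and this is not SubZero's specific random-subspace perturbation scheme.

```python
import numpy as np

def zo_grad_estimate(loss_fn, w, mu=1e-3, rng=None):
    """Two-point zeroth-order gradient estimate: perturb the parameters along a
    random direction u and take a finite difference of the loss, so no
    backpropagation (and no stored activations) is needed."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.standard_normal(w.shape)
    slope = (loss_fn(w + mu * u) - loss_fn(w - mu * u)) / (2 * mu)
    return slope * u                             # estimated gradient direction

def loss_fn(w):
    return 0.5 * np.sum((w - 1.0) ** 2)          # toy stand-in for a fine-tuning loss

w = np.zeros(32)
for t in range(500):
    w -= 1e-2 * zo_grad_estimate(loss_fn, w, rng=np.random.default_rng(t))
print("final loss:", loss_fn(w))
```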
arXiv Detail & Related papers (2024-10-11T17:01:43Z) - Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark [166.40879020706151]
This paper proposes a shift towards BP-free, zeroth-order (ZO) optimization as a solution for reducing memory costs during fine-tuning.
Rather than focusing on traditional ZO-SGD, our work expands the exploration to a wider array of ZO optimization techniques.
Our study unveils previously overlooked optimization principles, highlighting the importance of task alignment, the role of the forward gradient method, and the balance between algorithm complexity and fine-tuning performance.
arXiv Detail & Related papers (2024-02-18T14:08:48Z) - AdaLomo: Low-memory Optimization with Adaptive Learning Rate [59.64965955386855]
We introduce low-memory optimization with adaptive learning rate (AdaLomo) for large language models.
AdaLomo achieves results on par with AdamW while significantly reducing memory requirements, thereby lowering the hardware barrier to training large language models.
arXiv Detail & Related papers (2023-10-16T09:04:28Z) - Vocabulary-level Memory Efficiency for Language Model Fine-tuning [36.1039389951318]
We show that a significant proportion of the vocabulary remains unused during fine-tuning.
We propose a simple yet effective approach that leverages this finding to minimize memory usage.
Our approach does not impact downstream task performance, while allowing more efficient use of computational resources.
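A generic way to exploit this observation (an assumed sketch, not necessarily the paper's exact procedure) is to scan the fine-tuning data for the token ids that actually occur and keep only those rows of the embedding matrix, together with an id remapping:

```python
import numpy as np

def restrict_vocabulary(embedding, token_id_batches):
    """Keep only the embedding rows for token ids that appear in the fine-tuning
    data, and return an old-id -> new-id remapping for use during training."""
    used = np.unique(np.concatenate([np.asarray(b).ravel() for b in token_id_batches]))
    remap = {int(old): new for new, old in enumerate(used)}
    return embedding[used], remap

vocab_size, dim = 50_000, 64
embedding = np.random.default_rng(0).standard_normal((vocab_size, dim)).astype(np.float32)
# Hypothetical fine-tuning batches that only touch a small slice of the vocabulary.
batches = [np.random.default_rng(i).integers(0, 5_000, size=(8, 128)) for i in range(10)]
small_emb, remap = restrict_vocabulary(embedding, batches)
print("embedding rows kept:", small_emb.shape[0], "of", vocab_size)
```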
arXiv Detail & Related papers (2023-09-15T19:00:00Z) - CAME: Confidence-guided Adaptive Memory Efficient Optimization [20.009302737137787]
Adaptive gradient methods have demonstrated excellent performance in the training of large language models.
Maintaining second-moment estimates of the gradients, however, incurs a high extra memory overhead.
Several memory-efficient optimizers have been proposed to drastically reduce this auxiliary memory usage, but they come with a performance penalty.
We propose CAME to simultaneously achieve two goals: fast convergence as in traditional adaptive methods, and low memory usage as in memory-efficient methods.
arXiv Detail & Related papers (2023-07-05T06:05:36Z)