ReLoRA: High-Rank Training Through Low-Rank Updates
- URL: http://arxiv.org/abs/2307.05695v4
- Date: Sun, 10 Dec 2023 16:21:29 GMT
- Title: ReLoRA: High-Rank Training Through Low-Rank Updates
- Authors: Vladislav Lialin, Namrata Shivagunde, Sherin Muckatira, Anna Rumshisky
- Abstract summary: We introduce a novel method called ReLoRA, which utilizes low-rank updates to train high-rank networks.
ReLoRA saves up to 5.5 GB of RAM per GPU and improves training speed by 9-40%, depending on the model size and hardware setup.
- Score: 14.606961537327345
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the dominance and effectiveness of scaling, resulting in large networks with hundreds of billions of parameters, the necessity to train overparameterized models remains poorly understood, while training costs grow exponentially. In this paper, we explore parameter-efficient training techniques as an approach to training large neural networks. We introduce a novel method called ReLoRA, which utilizes low-rank updates to train high-rank networks. We apply ReLoRA to training transformer language models with up to 1.3B parameters and demonstrate comparable performance to regular neural network training. ReLoRA saves up to 5.5 GB of RAM per GPU and improves training speed by 9-40%, depending on the model size and hardware setup. Our findings show the potential of parameter-efficient techniques for large-scale pre-training.
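The core mechanism described in the abstract is that a sequence of low-rank updates can accumulate into a high-rank change of the weights. The sketch below is a minimal illustration of that idea, not the authors' released implementation: a frozen base weight carries a LoRA-style trainable low-rank product that is periodically merged into the base and re-initialized. The names ReLoRALinear and merge_and_reinit are our own, and details the full method relies on (an initial full-rank warm-up and optimizer-state/learning-rate resets at restarts) are only noted in comments.

```python
# Minimal ReLoRA-style sketch (illustrative, not the authors' code): a frozen
# dense weight W plus a trainable low-rank update B @ A that is periodically
# merged into W and re-initialized, so successive low-rank updates can add up
# to a high-rank total change.
import math
import torch
import torch.nn as nn

class ReLoRALinear(nn.Module):
    def __init__(self, in_features, out_features, rank=8):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.kaiming_uniform_(self.weight, a=math.sqrt(5))
        self.weight.requires_grad = False          # base weight stays frozen
        self.lora_A = nn.Parameter(torch.zeros(rank, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))  # B starts at zero

    def forward(self, x):
        return x @ (self.weight + self.lora_B @ self.lora_A).T

    @torch.no_grad()
    def merge_and_reinit(self):
        # Fold the current low-rank update into the frozen weight, then restart.
        self.weight += self.lora_B @ self.lora_A
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        self.lora_B.zero_()

def pretrain(model, data_loader, loss_fn, steps, restart_every=2000, lr=1e-3):
    # Only the low-rank factors are trainable, so optimizer state stays small.
    opt = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=lr)
    for step, (x, y) in zip(range(steps), data_loader):
        loss = loss_fn(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
        if (step + 1) % restart_every == 0:
            for m in model.modules():
                if isinstance(m, ReLoRALinear):
                    m.merge_and_reinit()
            # The paper also resets part of the optimizer state and re-warms
            # the learning rate at each restart; omitted here for brevity.
```

Each merge folds a rank-r increment into the frozen weight, so after several restarts the total update is no longer constrained to rank r, which is what distinguishes this from training a single fixed LoRA adapter.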
Related papers
- SwitchLoRA: Switched Low-Rank Adaptation Can Learn Full-Rank Information [3.6859322366469933]
SwitchLoRA is a parameter-efficient training technique that frequently and smoothly replaces the trainable parameters of LoRA with alternative parameters.
Our results demonstrate that SwitchLoRA actually surpasses full-rank training.
arXiv Detail & Related papers (2024-06-03T05:40:34Z)
- An Empirical Analysis of Forgetting in Pre-trained Models with Incremental Low-Rank Updates [11.90029443742706]
We study the impact of Low-Rank Adaptation (LoRA) rank on the forgetting of the pretraining foundation task and on the plasticity and forgetting of subsequent ones.
We also observe that vision transformers finetuned in that way exhibit a sort of "contextual" forgetting, a behaviour that we do not observe for residual networks.
arXiv Detail & Related papers (2024-05-28T11:29:25Z)
- Sparse Spectral Training and Inference on Euclidean and Hyperbolic Neural Networks [9.96381061452642]
We propose Sparse Spectral Training (SST), an advanced training methodology that updates all singular values and selectively updates singular vectors of network weights.
SST refines the training process by employing a targeted updating strategy for singular vectors, which is determined by a multinomial sampling method weighted by the significance of the singular values.
On OPT-125M, with a rank equal to 8.3% of the embedding dimension, SST reduces the perplexity gap to full-rank training by 67.6%, substantially narrowing the performance loss incurred by prevalent low-rank methods.
arXiv Detail & Related papers (2024-05-24T11:59:41Z)
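A rough sketch of the selection step described in the SST entry above, under our own naming and simplifications (the function select_singular_vectors and the one-shot decomposition are illustrative; the actual method re-draws the selection and updates the factors throughout training): all singular values are treated as trainable, while the singular-vector pairs to update are drawn by multinomial sampling weighted by the singular values.

```python
# Rough sketch of the SST-style selection rule (not the authors' code):
# every singular value is updated, while the singular-vector pairs to update
# are drawn by multinomial sampling weighted by singular-value magnitude.
import torch

def select_singular_vectors(weight: torch.Tensor, num_active: int):
    """Factor `weight` and choose which singular-vector pairs to train."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    probs = S / S.sum()                    # significance ~ singular value
    active = torch.multinomial(probs, num_active, replacement=False)
    return U, S, Vh, active

# Usage: S would always receive gradients; only the sampled columns of U and
# rows of Vh would be trainable for the current period.
W = torch.randn(256, 512)
U, S, Vh, active = select_singular_vectors(W, num_active=16)
print(active)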
- Always-Sparse Training by Growing Connections with Guided Stochastic Exploration [46.4179239171213]
We propose an efficient always-sparse training algorithm with excellent scaling to larger and sparser models.
We evaluate our method on CIFAR-10/100 and ImageNet using VGG and ViT models, and compare it against a range of sparsification methods.
arXiv Detail & Related papers (2024-01-12T21:32:04Z)
- Run LoRA Run: Faster and Lighter LoRA Implementations [50.347242693025336]
LoRA is a technique that reduces the number of trainable parameters in a neural network by introducing low-rank adapters to linear layers.
This paper presents the RunLoRA framework for efficient implementations of LoRA.
Experiments show up to 28% speedup on language modeling networks.
arXiv Detail & Related papers (2023-12-06T10:54:34Z)
- Low-rank Adaptation Method for Wav2vec2-based Fake Audio Detection [57.537583869961885]
Self-supervised speech models are a rapidly developing research topic in fake audio detection.
We apply low-rank adaptation (LoRA) to the wav2vec2 model, freezing the pre-trained model weights and injecting a trainable rank-decomposition matrix into each layer of the transformer architecture.
Compared with fine-tuning the wav2vec2 model (317M training parameters) with Adam, LoRA achieves similar performance while reducing the number of trainable parameters by a factor of 198.
arXiv Detail & Related papers (2023-06-09T01:43:41Z)
- Dimensionality Reduced Training by Pruning and Freezing Parts of a Deep Neural Network, a Survey [69.3939291118954]
State-of-the-art deep learning models have a parameter count that reaches into the billions. Training, storing and transferring such models is energy and time consuming, thus costly.
Model compression lowers storage and transfer costs, and can further make training more efficient by decreasing the number of computations in the forward and/or backward pass.
This work is a survey on methods which reduce the number of trained weights in deep learning models throughout the training.
arXiv Detail & Related papers (2022-05-17T05:37:08Z)
- Online Convolutional Re-parameterization [51.97831675242173]
We present online convolutional re-parameterization (OREPA), a two-stage pipeline, aiming to reduce the huge training overhead by squeezing the complex training-time block into a single convolution.
Compared with the state-of-the-art re-param models, OREPA is able to save the training-time memory cost by about 70% and accelerate the training speed by around 2x.
We also conduct experiments on object detection and semantic segmentation and show consistent improvements on the downstream tasks.
arXiv Detail & Related papers (2022-04-02T09:50:19Z)
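OREPA's training-time blocks are more elaborate than a single convolution plus normalization, but the basic re-parameterization move it builds on can be shown with the standard folding of a Conv2d and its BatchNorm2d into one equivalent convolution. The sketch below illustrates only that generic folding; it is not OREPA's pipeline.

```python
# Generic structural re-parameterization sketch: fold a Conv2d followed by a
# BatchNorm2d into one equivalent Conv2d. OREPA squeezes more complex
# training-time blocks; this only shows the basic folding idea.
import torch
import torch.nn as nn

@torch.no_grad()
def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    fused = nn.Conv2d(conv.in_channels, conv.out_channels,
                      kernel_size=conv.kernel_size, stride=conv.stride,
                      padding=conv.padding, dilation=conv.dilation,
                      groups=conv.groups, bias=True)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)  # gamma / std per channel
    fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
    conv_bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.copy_((conv_bias - bn.running_mean) * scale + bn.bias)
    return fused

# Quick check: the fused conv matches conv -> BN in eval mode.
conv, bn = nn.Conv2d(8, 16, 3, padding=1), nn.BatchNorm2d(16)
bn.eval()
x = torch.randn(2, 8, 32, 32)
print(torch.allclose(bn(conv(x)), fuse_conv_bn(conv, bn)(x), atol=1e-5))
```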
- M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion Parameter Pretraining [55.16088793437898]
Training extreme-scale models requires enormous amounts of compute and memory.
We propose a simple training strategy called "Pseudo-to-Real" for large models with high memory footprints.
arXiv Detail & Related papers (2021-10-08T04:24:51Z)
- LoRA: Low-Rank Adaptation of Large Language Models [71.75808607987281]
Low-Rank Adaptation, or LoRA, freezes the pre-trained model weights and injects trainable rank-decomposition matrices into each layer of the Transformer architecture.
For GPT-3, LoRA can reduce the number of trainable parameters by 10,000 times and the computation hardware requirement by 3 times compared to full fine-tuning.
arXiv Detail & Related papers (2021-06-17T17:37:18Z)
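Several of the entries above rely on the same low-rank injection, so a minimal, generic sketch of a LoRA-style linear layer is given here for reference. The scaling convention alpha / rank and the initialization follow common open-source implementations and are assumptions on our part, not details taken from the summaries.

```python
# Minimal LoRA-style wrapper around a frozen linear layer: the pretrained
# weight stays fixed and a trainable low-rank product B @ A (scaled by
# alpha / rank) is added to its output.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # freeze pretrained weights
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

# Usage: wrap an existing projection so only ~2 * rank * d parameters train.
layer = LoRALinear(nn.Linear(768, 768), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 2 * 8 * 768 = 12288
```

Because lora_B starts at zero, the wrapped layer initially computes exactly the pretrained projection, and only the small factors accumulate task-specific updates during training.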