PreLoRA: Hybrid Pre-training of Vision Transformers with Full Training and Low-Rank Adapters
- URL: http://arxiv.org/abs/2509.21619v1
- Date: Thu, 25 Sep 2025 21:34:17 GMT
- Title: PreLoRA: Hybrid Pre-training of Vision Transformers with Full Training and Low-Rank Adapters
- Authors: Krishu K Thapa, Reet Barik, Krishna Teja Chitty-Venkata, Murali Emani, Venkatram Vishwanath,
- Abstract summary: We propose an approach to identify states of partial convergence and switch from full-parameter training to Low-Rank Adaptation (LoRA) on the ViT-Large model. Experimental results show that this approach preserves model accuracy while reducing the number of trainable parameters to 10% of the original.
- Score: 2.5547655072779
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Training large models ranging from millions to billions of parameters is highly resource-intensive, requiring significant time, compute, and memory. Most of the learning (the largest changes in weights) is observed to take place in the early stages of training; these changes then stabilize as training continues, enabling them to be captured by matrices of low intrinsic rank. We therefore propose an approach to identify such states of partial convergence and dynamically switch from full-parameter training to Low-Rank Adaptation (LoRA) on the ViT-Large model. We introduce a flexible approach that leverages user-defined hyperparameters to determine the switching point and assigns each module layer a rank based on its level of convergence. Experimental results show that this approach preserves model accuracy while reducing the number of trainable parameters to 10% of the original, resulting in a 3x improvement in throughput, a 1.5x reduction in average training time per epoch, and a 20% reduction in GPU memory consumption.
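The switching mechanism can be illustrated with a minimal sketch (the helper names, thresholds, and the rank-assignment rule below are illustrative assumptions, not the authors' implementation): track how much each module's weights have changed since the last checkpoint, and once that change falls below a user-defined threshold, freeze the module and attach a LoRA adapter whose rank reflects its remaining rate of change.

```python
# Illustrative sketch of switching a layer from full training to LoRA once its
# weights stabilize. Names (LoRALinear, maybe_switch_to_lora) and thresholds are
# hypothetical; the paper's actual criterion and rank assignment may differ.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (x @ A @ B)."""
    def __init__(self, base: nn.Linear, rank: int, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # stop full-parameter training
        self.lora_a = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(rank, base.out_features))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.lora_a @ self.lora_b) * self.scale

def relative_change(w_now: torch.Tensor, w_prev: torch.Tensor) -> float:
    """Relative movement of a weight matrix since the previous checkpoint."""
    return ((w_now - w_prev).norm() / (w_prev.norm() + 1e-12)).item()

def maybe_switch_to_lora(layer: nn.Linear, w_prev, tau=0.01, r_min=4, r_max=64):
    """Switch to LoRA when the layer has partially converged; layers that are
    still changing more get a larger rank, nearly static layers a smaller one."""
    delta = relative_change(layer.weight.detach(), w_prev)
    if delta >= tau:
        return layer                                      # keep full training
    rank = max(r_min, min(r_max, int(round(r_max * delta / tau))))
    return LoRALinear(layer, rank)
```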
Related papers
- AdaRankGrad: Adaptive Gradient-Rank and Moments for Memory-Efficient LLMs Training and Fine-Tuning [9.51289606759621]
Training and fine-tuning large language models (LLMs) come with challenges related to memory and computational requirements. Various techniques have been developed to tackle these challenges, such as low-rank adaptation (LoRA). We introduce a new method inspired by a phenomenon we formally prove: as training progresses, the rank of the estimated gradient gradually decreases.
arXiv Detail & Related papers (2024-10-23T13:53:26Z)
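The quantity this entry adapts to, the effective rank of the gradient, can be estimated with a short sketch (the energy threshold and function name are illustrative, not the paper's procedure):

```python
# Illustrative estimate of a gradient matrix's effective rank: the smallest k
# whose top-k singular values capture a given fraction of the spectral energy.
import torch

def effective_rank(grad: torch.Tensor, energy: float = 0.99) -> int:
    s = torch.linalg.svdvals(grad)                    # singular values, descending
    cum = torch.cumsum(s ** 2, dim=0) / (s ** 2).sum()
    k = int(torch.searchsorted(cum, torch.tensor(energy)).item()) + 1
    return min(k, s.numel())

g = torch.randn(512, 64) @ torch.randn(64, 512)       # a "gradient" of rank at most 64
print(effective_rank(g))                              # tends to shrink as training converges
```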
- SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios.
In the early route, intermediate outputs are consolidated via an anti-redundancy operation.
In the late route, utilizing minimal late pre-trained layers could alleviate the peak demand on memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z)
- SwitchLoRA: Switched Low-Rank Adaptation Can Learn Full-Rank Information [3.6859322366469933]
Methods like ReLoRA and GaLore have attempted to address this challenge by updating the low-rank subspace. In this paper, we introduce SwitchLoRA, a parameter-efficient training technique that frequently and smoothly replaces the trainable parameters of LoRA with alternative parameters.
arXiv Detail & Related papers (2024-06-03T05:40:34Z)
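One plausible reading of the switching step, as a heavily simplified sketch (the actual candidate selection, scheduling, and smoothing are more involved): fold a few LoRA directions into the frozen weight, then replace them with fresh ones so the accumulated update is not confined to a fixed low-rank subspace.

```python
# Simplified sketch only: swap out a few LoRA directions for fresh candidates.
# Shapes follow the convention weight: (out, in), lora_a: (in, r), lora_b: (r, out).
import torch

@torch.no_grad()
def switch_directions(weight, lora_a, lora_b, scale, num_swap=1):
    rank = lora_a.shape[1]
    idx = torch.randperm(rank)[:num_swap]
    # Merge the outgoing directions into the frozen weight so the function is preserved.
    weight += scale * (lora_a[:, idx] @ lora_b[idx, :]).T
    # Re-initialize the swapped directions; zeroed B rows keep the merged output unchanged.
    lora_a[:, idx] = torch.randn_like(lora_a[:, idx]) * 0.01
    lora_b[idx, :] = 0.0

W = torch.zeros(16, 16)
A, B = torch.randn(16, 4) * 0.01, torch.randn(4, 16)
switch_directions(W, A, B, scale=1.0, num_swap=2)
```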
- Sparse Spectral Training and Inference on Euclidean and Hyperbolic Neural Networks [9.96381061452642]
Low-Rank Adaptation (LoRA) and ReLoRA face challenges with their low-rank structure. We propose Sparse Spectral Training (SST) to optimize memory usage for pre-training. SST reduces the perplexity gap between other low-rank methods and full-rank training by 97.4%.
arXiv Detail & Related papers (2024-05-24T11:59:41Z)
- GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection [133.45193150403537]
Training Large Language Models (LLMs) presents significant memory challenges due to the growing size of weights and GPU states.
In this work, we propose Gradient Low-Rank Projection (GaLore) as a memory-efficient training strategy.
Our 8-bit GaLore further reduces optimizer memory by up to 82.5% and total training memory by 63.3%, compared to a BF16 baseline.
arXiv Detail & Related papers (2024-03-06T07:29:57Z)
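A minimal sketch of the gradient low-rank projection idea in this entry (refresh schedule, scaling, and optimizer integration simplified): optimizer state is kept only for a projected r x n gradient instead of the full m x n one, which is where the memory saving comes from.

```python
# Simplified gradient low-rank projection: optimizer moments live in the small
# projected space rather than alongside the full gradient.
import torch

class LowRankGradProjector:
    def __init__(self, rank: int, refresh_every: int = 200):
        self.rank, self.refresh_every = rank, refresh_every
        self.step, self.P = 0, None

    def project(self, grad: torch.Tensor) -> torch.Tensor:
        """Map the full (m x n) gradient into an (r x n) subspace."""
        if self.P is None or self.step % self.refresh_every == 0:
            U, _, _ = torch.linalg.svd(grad, full_matrices=False)
            self.P = U[:, : self.rank]                 # (m x r) projection basis
        self.step += 1
        return self.P.T @ grad                         # what the optimizer actually sees

    def project_back(self, low_rank_update: torch.Tensor) -> torch.Tensor:
        """Lift the optimizer's (r x n) update back to the full parameter shape."""
        return self.P @ low_rank_update
```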
- Time-, Memory- and Parameter-Efficient Visual Adaptation [75.28557015773217]
We propose an adaptation method which does not backpropagate gradients through the backbone.
We achieve this by designing a lightweight network in parallel that operates on features from the frozen, pretrained backbone.
arXiv Detail & Related papers (2024-02-05T10:55:47Z)
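The general recipe can be sketched as follows (module names and head size are illustrative): the backbone runs under `no_grad`, so no activations are kept for its backward pass, and only the small parallel head is trained.

```python
# Sketch: train only a lightweight head on features from a frozen backbone.
import torch
import torch.nn as nn

class FrozenBackboneAdapter(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone.eval()
        for p in self.backbone.parameters():
            p.requires_grad = False
        self.head = nn.Sequential(                     # the only trainable parameters
            nn.Linear(feat_dim, 256), nn.GELU(), nn.Linear(256, num_classes)
        )

    def forward(self, x):
        with torch.no_grad():                          # no gradients through the backbone
            feats = self.backbone(x)
        return self.head(feats)
```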
- Dynamic Layer Tying for Parameter-Efficient Transformers [65.268245109828]
We employ Reinforcement Learning to select layers during training and tie them together.
This facilitates weight sharing, reduces the number of trainable parameters, and also serves as an effective regularization technique.
In particular, the memory consumption during training is up to one order of magnitude lower than that of the conventional training method.
arXiv Detail & Related papers (2024-01-23T14:53:20Z)
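The weight-sharing part can be sketched as below; in the paper the tying assignment is chosen by a reinforcement-learning policy during training, whereas here it is simply passed in as a list.

```python
# Sketch of layer tying: several depth positions reuse the same module (and
# therefore the same parameters). The RL policy that picks the assignment is omitted.
import torch.nn as nn

def build_tied_stack(layer_factory, assignment):
    """assignment[i] is the index of the unique layer reused at depth i,
    e.g. [0, 0, 1, 1, 0, 2] gives a 6-layer stack with only 3 sets of weights."""
    unique = nn.ModuleList(layer_factory() for _ in range(max(assignment) + 1))
    return nn.ModuleList(unique[j] for j in assignment)

stack = build_tied_stack(
    lambda: nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True),
    assignment=[0, 0, 1, 1, 0, 2],
)
print(sum(p.numel() for p in stack.parameters()))   # counts 3 layers' worth of weights, not 6
```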
- ReLoRA: High-Rank Training Through Low-Rank Updates [14.606961537327345]
We introduce a novel method called ReLoRA, which utilizes low-rank updates to train high-rank networks.
ReLoRA saves up to 5.5 GB of RAM per GPU and improves training speed by 9-40% depending on the model size and hardware setup.
arXiv Detail & Related papers (2023-07-11T18:02:09Z)
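The core merge-and-restart step can be sketched as follows (optimizer-state resets and the learning-rate schedule around each restart are omitted): each merged low-rank update adds new directions to the base weight, so the accumulated change can become high-rank.

```python
# Sketch of a ReLoRA-style restart. Shapes: weight (out, in), lora_a (in, r), lora_b (r, out).
import torch

@torch.no_grad()
def merge_and_restart(weight, lora_a, lora_b, scale):
    weight += scale * (lora_a @ lora_b).T     # fold the current low-rank update into W
    lora_a.normal_(std=0.01)                  # fresh directions for the next cycle
    lora_b.zero_()                            # B = 0 keeps the merged function unchanged
```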
- Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators called WTA-CRS for matrix multiplication with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
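The column-row sampling estimator this work builds on can be sketched briefly (the winner-take-all modification that further cuts variance is omitted):

```python
# Unbiased column-row sampling (CRS) estimate of A @ B: sample k column/row
# pairs with probability proportional to their norms and rescale.
import torch

def crs_matmul(A: torch.Tensor, B: torch.Tensor, k: int) -> torch.Tensor:
    norms = A.norm(dim=0) * B.norm(dim=1)          # importance of each shared index
    p = norms / norms.sum()
    idx = torch.multinomial(p, k, replacement=True)
    return (A[:, idx] / (k * p[idx])) @ B[idx, :]  # E[estimate] = A @ B

A, B = torch.randn(64, 512), torch.randn(512, 64)
err = (crs_matmul(A, B, k=128) - A @ B).norm() / (A @ B).norm()
print(float(err))                                  # relative error shrinks as k grows
```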
- Cuttlefish: Low-Rank Model Training without All the Tuning [55.984294012024755]
We introduce Cuttlefish, an automated low-rank training approach.
Cuttlefish switches from full-rank to low-rank training once the stable ranks of all layers have converged.
Our results show that Cuttlefish generates models up to 5.6 times smaller than full-rank models, and attains up to a 1.2 times faster end-to-end training process.
arXiv Detail & Related papers (2023-05-04T04:20:20Z)
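The convergence signal can be sketched with the stable rank of each layer's weight matrix (the window and tolerance below are illustrative, not the paper's settings):

```python
# Sketch of the stable-rank signal: switch a layer to factorized low-rank training
# once its stable rank stops changing across epochs.
import torch

def stable_rank(W: torch.Tensor) -> float:
    """||W||_F^2 / ||W||_2^2, a smooth proxy for the rank of W."""
    return float(W.pow(2).sum() / torch.linalg.matrix_norm(W, ord=2) ** 2)

def has_converged(history, window=5, tol=0.05):
    """Stable rank considered converged when it barely moves over recent epochs."""
    if len(history) < window:
        return False
    recent = history[-window:]
    return (max(recent) - min(recent)) / max(recent) < tol
```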
- A Fast and Efficient Conditional Learning for Tunable Trade-Off between Accuracy and Robustness [11.35810118757863]
Existing models that achieve state-of-the-art (SOTA) performance on both clean and adversarially-perturbed images rely on convolution operations conditioned with feature-wise linear modulation (FiLM) layers.
We present a fast learnable once-for-all adversarial training (FLOAT) algorithm which, instead of the existing FiLM-based conditioning, uses a weight-conditioned learning scheme that requires no additional layers.
In particular, scaled noise is added to the weight tensors, which enables a trade-off between clean and adversarial performance.
arXiv Detail & Related papers (2022-03-28T19:25:36Z)
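A heavily simplified sketch of the weight-conditioning idea (the noise model and the way the conditioning scalar is handled are simplifications of the paper's formulation): fixed, scaled noise is added to the weight tensor, and a scalar lets the same model move between clean and robust operating points.

```python
# Simplified sketch: a linear layer whose weights can be perturbed by fixed,
# scaled noise; lam interpolates between a "clean" and a "noisy" operating point.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    def __init__(self, in_f: int, out_f: int, noise_scale: float = 0.1):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_f, in_f) * 0.02)
        self.register_buffer("noise", torch.randn(out_f, in_f) * noise_scale)
        self.lam = 1.0                                # 0.0 = clean mode, 1.0 = noisy mode

    def forward(self, x):
        return F.linear(x, self.weight + self.lam * self.noise)
```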