Stabilizing Native Low-Rank LLM Pretraining
- URL: http://arxiv.org/abs/2602.12429v1
- Date: Thu, 12 Feb 2026 21:33:14 GMT
- Title: Stabilizing Native Low-Rank LLM Pretraining
- Authors: Paul Janson, Edouard Oyallon, Eugene Belilovsky
- Abstract summary: Low-rank factorization offers a promising route to reduce training and inference costs. We demonstrate that Large Language Models (LLMs) can be trained from scratch using exclusively low-rank factorized weights. Our method enables stable, end-to-end factorized training with negligible overhead.
- Score: 24.2079184778031
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Foundation models have achieved remarkable success, yet their growing parameter counts pose significant computational and memory challenges. Low-rank factorization offers a promising route to reduce training and inference costs, but the community lacks a stable recipe for training models from scratch using exclusively low-rank weights while matching the performance of the dense model. We demonstrate that Large Language Models (LLMs) can be trained from scratch using exclusively low-rank factorized weights for all non-embedding matrices without auxiliary "full-rank" guidance required by prior methods. While native low-rank training often suffers from instability and loss spikes, we identify uncontrolled growth in the spectral norm (largest singular value) of the weight matrix update as the dominant factor. To address this, we introduce Spectron: Spectral renormalization with orthogonalization, which dynamically bounds the resultant weight updates based on the current spectral norms of the factors. Our method enables stable, end-to-end factorized training with negligible overhead. Finally, we establish compute-optimal scaling laws for natively low-rank transformers, demonstrating predictable power-law behavior and improved inference efficiency relative to dense models.
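As an illustration of the ideas described in the abstract, the sketch below shows a natively low-rank linear layer together with a post-step renormalization that bounds the spectral norm of the factorized weight. This is a minimal sketch, not the paper's released implementation: the names LowRankLinear and spectron_renormalize_, the cap threshold, the initialization scales, and the symmetric rescaling rule are assumptions made for this example, and the orthogonalization component of Spectron is omitted.

```python
# Illustrative sketch only; names, defaults, and the rescaling rule are
# assumptions, not the paper's released Spectron implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LowRankLinear(nn.Module):
    """A linear layer whose weight is stored as a low-rank product W ~= B @ A."""

    def __init__(self, in_features: int, out_features: int, rank: int):
        super().__init__()
        # The factors are trained directly ("native" low-rank training);
        # the dense weight is never materialized.
        self.A = nn.Parameter(torch.randn(rank, in_features) / in_features ** 0.5)
        self.B = nn.Parameter(torch.randn(out_features, rank) / rank ** 0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (x @ A^T) @ B^T costs O(d * r) per token instead of O(d^2).
        return F.linear(F.linear(x, self.A), self.B)


@torch.no_grad()
def spectron_renormalize_(layer: LowRankLinear, cap: float = 1.0) -> None:
    """Rescale the factors so the spectral norm of B @ A stays below `cap`.

    Uses the bound ||B @ A||_2 <= ||B||_2 * ||A||_2: if the product of the
    factors' spectral norms exceeds the cap, both factors are shrunk by the
    same factor. The `cap` value of 1.0 is an arbitrary choice for this sketch.
    """
    sigma_a = torch.linalg.matrix_norm(layer.A, ord=2)  # largest singular value of A
    sigma_b = torch.linalg.matrix_norm(layer.B, ord=2)  # largest singular value of B
    product = sigma_a * sigma_b
    if product > cap:
        scale = (cap / product) ** 0.5
        layer.A.mul_(scale)
        layer.B.mul_(scale)
```

In a training loop, such a renormalization would be applied to every factorized (non-embedding) layer immediately after each optimizer step, so the spectral norm of each resultant weight remains bounded throughout training.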
Related papers
- Sparsity Induction for Accurate Post-Training Pruning of Large Language Models [23.002927923453118]
Post-training sparsity (PTS) reduces model cost by removing weights from dense networks. However, natively dense weight matrices are not highly sparse, so existing approaches that directly remove weights disrupt the model's internal states. We propose Sparsity Induction, which promotes models toward higher sparsity at both the distribution and feature levels before pruning.
arXiv Detail & Related papers (2026-02-25T07:25:01Z)
- Dynamic Rank Adjustment for Accurate and Efficient Neural Network Training [6.601283320267934]
We argue that strategically interleaving full-rank training epochs within low-rank training epochs can effectively restore the rank of the model's weights. Our empirical study shows that the proposed method incurs almost the same computational cost as SVD-based low-rank training.
arXiv Detail & Related papers (2025-08-12T04:30:52Z)
- Inference-Time Scaling of Diffusion Language Models with Particle Gibbs Sampling [70.8832906871441]
We study how to steer generation toward desired rewards without retraining the models. Prior methods typically resample or filter within a single denoising trajectory, optimizing rewards step-by-step without trajectory-level refinement. We introduce particle Gibbs sampling for diffusion language models (PG-DLM), a novel inference-time algorithm enabling trajectory-level refinement while preserving generation perplexity.
arXiv Detail & Related papers (2025-07-11T08:00:47Z)
- RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models [53.571195477043496]
We propose an algorithm named Rotated Straight-Through-Estimator (RoSTE). RoSTE combines quantization-aware supervised fine-tuning (QA-SFT) with an adaptive rotation strategy to reduce activation outliers. Our findings reveal that the prediction error is directly proportional to the quantization error of the converged weights, which can be effectively managed through an optimized rotation configuration.
arXiv Detail & Related papers (2025-02-13T06:44:33Z)
- DoTA: Weight-Decomposed Tensor Adaptation for Large Language Models [33.4538652558253]
Low-rank adaptation (LoRA) reduces the computational and memory demands of fine-tuning large language models (LLMs) by approximating updates with low-rank matrices. We propose Weight-Decomposed Tensor Adaptation (DoTA), which leverages the Matrix Product Operator (MPO) decomposition of pre-trained weights. We also introduce QDoTA, a quantized version of DoTA designed for 4-bit quantization.
arXiv Detail & Related papers (2024-12-30T12:00:47Z)
- NEAT: Nonlinear Parameter-efficient Adaptation of Pre-trained Models [26.808251361020066]
Fine-tuning pre-trained models often yields state-of-the-art performance but is computationally expensive when updating all parameters. We propose NEAT, a nonlinear PEFT approach that employs a lightweight neural network to learn a nonlinear transformation of the pre-trained weights. Our theoretical analysis shows that NEAT achieves greater efficiency than LoRA while maintaining equivalent expressivity.
arXiv Detail & Related papers (2024-10-02T17:29:23Z)
- Compensate Quantization Errors+: Quantized Models Are Inquisitive Learners [51.32182730502002]
We introduce Singular-value Diagonal Expansion to refine weight distributions for better quantization alignment. Our plug-and-play weight-quantization methods demonstrate substantial performance improvements over state-of-the-art approaches.
arXiv Detail & Related papers (2024-07-22T09:45:16Z)
- From Low Rank Gradient Subspace Stabilization to Low-Rank Weights: Observations, Theories, and Applications [85.17672240603011]
We study the non-uniform low-rank properties of weight matrices in Large Language Models. We present Weight Low-Rank Projection (WeLore), which unifies weight compression and memory-efficient fine-tuning in a single framework.
arXiv Detail & Related papers (2024-07-15T21:05:20Z)
- TRAWL: Tensor Reduced and Approximated Weights for Large Language Models [11.064868044313855]
We introduce TRAWL (Tensor Reduced and Approximated Weights for Large Language Models), a technique that applies tensor decomposition across multiple weight matrices to effectively denoise LLMs by capturing global structural patterns. Our experiments show that TRAWL improves model performance by up to 16% over baseline models on benchmark datasets, without requiring additional data, training, or fine-tuning.
arXiv Detail & Related papers (2024-06-25T04:01:32Z)
- Robust low-rank training via approximate orthonormal constraints [2.519906683279153]
We introduce a robust low-rank training algorithm that maintains the network's weights on the low-rank matrix manifold.
The resulting model reduces both training and inference costs while ensuring well-conditioning and thus better adversarial robustness, without compromising model accuracy.
arXiv Detail & Related papers (2023-06-02T12:22:35Z)
- Powerpropagation: A sparsity inducing weight reparameterisation [65.85142037667065]
We introduce Powerpropagation, a new weight-parameterisation for neural networks that leads to inherently sparse models.
Models trained in this manner exhibit similar performance, but have a distribution with markedly higher density at zero, allowing more parameters to be pruned safely.
Here, we combine Powerpropagation with a traditional weight-pruning technique as well as recent state-of-the-art sparse-to-sparse algorithms, showing superior performance on the ImageNet benchmark.
arXiv Detail & Related papers (2021-10-01T10:03:57Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)