LoRDO: Distributed Low-Rank Optimization with Infrequent Communication
- URL: http://arxiv.org/abs/2602.04396v1
- Date: Wed, 04 Feb 2026 10:25:24 GMT
- Title: LoRDO: Distributed Low-Rank Optimization with Infrequent Communication
- Authors: Andrej Jovanović, Alex Iacob, Mher Safaryan, Ionut-Vlad Modoranu, Lorenzo Sani, William F. Shen, Xinchi Qiu, Dan Alistarh, Nicholas D. Lane
- Abstract summary: $\texttt{LoRDO}$ is a principled framework for low-rank optimization with infrequent synchronization. We show that $\texttt{LoRDO}$ achieves near-parity with low-rank $\texttt{DDP}$ in language modeling and downstream tasks. We also show that $\texttt{LoRDO}$ improves performance even more in very low-memory settings with small rank/batch size.
- Score: 43.00539790635802
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Distributed training of foundation models via $\texttt{DDP}$ is limited by interconnect bandwidth. While infrequent communication strategies reduce synchronization frequency, they remain bottlenecked by the memory and communication requirements of optimizer states. Low-rank optimizers can alleviate these constraints; however, in the local-update regime, workers lack access to the full-batch gradients required to compute low-rank projections, which degrades performance. We propose $\texttt{LoRDO}$, a principled framework unifying low-rank optimization with infrequent synchronization. We first demonstrate that, while global projections based on pseudo-gradients are theoretically superior, they permanently restrict the optimization trajectory to a low-rank subspace. To restore subspace exploration, we introduce a full-rank quasi-hyperbolic update. $\texttt{LoRDO}$ achieves near-parity with low-rank $\texttt{DDP}$ in language modeling and downstream tasks at model scales of $125$M--$720$M, while reducing communication by $\approx 10 \times$. Finally, we show that $\texttt{LoRDO}$ improves performance even more in very low-memory settings with small rank/batch size.
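The full-rank quasi-hyperbolic update described in the abstract can be illustrated with a minimal NumPy sketch: momentum is kept in a low-rank subspace obtained by projecting a pseudo-gradient, while a fraction of the raw full-rank gradient is mixed back in to restore subspace exploration. All names, shapes, and hyperparameters here are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def low_rank_projection(pseudo_grad, r):
    # Project onto the top-r left singular subspace of the pseudo-gradient.
    U, _, _ = np.linalg.svd(pseudo_grad, full_matrices=False)
    return U[:, :r]                                   # m x r orthonormal basis

def quasi_hyperbolic_step(W, grad, mom, P, lr=1e-2, beta=0.9, nu=0.7):
    # Momentum lives in the low-rank subspace spanned by P; the (1 - nu)
    # term mixes the raw full-rank gradient back in, so the trajectory is
    # not permanently confined to the subspace.
    mom = beta * mom + (1 - beta) * (P.T @ grad)      # r x n low-rank momentum
    update = nu * (P @ mom) + (1 - nu) * grad         # full-rank mixture
    return W - lr * update, mom

# Toy usage: one synchronization round fixes the projection, then local steps.
m, n, r = 8, 6, 2
W = rng.standard_normal((m, n))
P = low_rank_projection(rng.standard_normal((m, n)), r)
mom = np.zeros((r, n))
for _ in range(3):
    grad = rng.standard_normal((m, n))
    W, mom = quasi_hyperbolic_step(W, grad, mom, P)
```

With `nu=1` this reduces to a purely low-rank momentum step; `nu<1` is the exploration-restoring mixture.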
Related papers
- From $O(mn)$ to $O(r^2)$: Two-Sided Low-Rank Communication for Adam in Distributed Training with Memory Efficiency [28.885724420612323]
We propose TSR, which brings two-sided low-rank communication to Adam-family updates (TSR-Adam). To further reduce the peak communication from subspace refresh, TSR-Adam adopts a randomized SVD-based refresh that avoids full-gradient synchronization. Across pretraining from 60M to 1B model scales, TSR-Adam reduces average communicated bytes per step by $13\times$, and on GLUE fine-tuning it reduces communication by $25\times$, while achieving comparable performance.
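The two-sided idea (communicating an $r \times r$ core instead of an $m \times n$ gradient) can be sketched in a few lines. The fixed orthonormal bases and all shapes below are assumptions for illustration; per the abstract, the actual method refreshes the bases periodically via randomized SVD.

```python
import numpy as np

rng = np.random.default_rng(0)

def compress(G, P, Q):
    # Two-sided projection: the communicated payload is r x r, not m x n.
    return P.T @ G @ Q

def decompress(C, P, Q):
    # Reconstruct a rank-<=r approximation of the gradient on the receiver.
    return P @ C @ Q.T

m, n, r = 64, 32, 4
G = rng.standard_normal((m, n))
# Hypothetical fixed orthonormal bases shared by sender and receiver.
P, _ = np.linalg.qr(rng.standard_normal((m, r)))
Q, _ = np.linalg.qr(rng.standard_normal((n, r)))
core = compress(G, P, Q)        # r*r = 16 floats instead of m*n = 2048
G_hat = decompress(core, P, Q)
```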
arXiv Detail & Related papers (2026-02-08T15:23:09Z)
- BOOST: BOttleneck-Optimized Scalable Training Framework for Low-Rank Large Language Models [16.973973367103508]
Low-rank bottleneck architectures offer a promising solution to significantly reduce training time and memory footprint. We propose BOOST, an efficient training framework tailored for large-scale low-rank bottleneck architectures. We show that BOOST achieves 1.46-1.91$\times$ speedup over full-rank model baselines and 1.87-2.27$\times$ speedup over low-rank models with naively integrated 3D parallelism.
arXiv Detail & Related papers (2025-12-13T01:50:18Z)
- Evolution Strategies at the Hyperscale [57.75314521465674]
We introduce EGGROLL, an evolution strategies (ES) algorithm designed to scale backprop-free optimization to large population sizes. ES is a set of powerful blackbox optimisation methods that can handle non-differentiable or noisy objectives. EGGROLL overcomes these bottlenecks by generating random matrices $A \in \mathbb{R}^{m \times r}$, $B \in \mathbb{R}^{n \times r}$ with $r \ll \min(m,n)$ to form a low-rank matrix perturbation $AB^\top$.
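The low-rank perturbation named in the abstract is simple to sketch: sampling the two thin factors costs $O((m+n)r)$ random numbers instead of $O(mn)$ for a dense perturbation. The $\sigma/\sqrt{r}$ scaling below is an assumption for illustration, not necessarily the paper's normalization.

```python
import numpy as np

rng = np.random.default_rng(0)

def low_rank_perturbation(m, n, r, sigma=0.1):
    # Sample A in R^{m x r}, B in R^{n x r}; A @ B.T has rank at most r,
    # requiring O((m + n) * r) random numbers instead of O(m * n).
    A = rng.standard_normal((m, r))
    B = rng.standard_normal((n, r))
    return sigma * (A @ B.T) / np.sqrt(r)

m, n, r = 128, 96, 4
W = np.zeros((m, n))
E = low_rank_perturbation(m, n, r)   # one population member's perturbation
W_perturbed = W + E
```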
arXiv Detail & Related papers (2025-11-20T18:56:05Z)
- Communication-Efficient and Accurate Approach for Aggregation in Federated Low-Rank Adaptation [7.127777651952882]
Current Federated Low-Rank Adaptation (FedLoRA) methods face notable challenges due to inexact updates. We propose Federated Low-Rank Aggregation with Nearly Accurate Estimation (FLoRA-NA). FLoRA-NA bridges the gap between local personalization and global generalization, addressing a key limitation of prior personalized FedLoRA approaches.
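The inexact-update problem motivating FLoRA-NA can be demonstrated in a few lines: naively averaging LoRA factors across clients generally differs from averaging the clients' actual low-rank updates, because the product of means is not the mean of products. Shapes and client count below are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two clients, each holding rank-2 LoRA factors for a 6 x 4 weight delta.
As = [rng.standard_normal((6, 2)) for _ in range(2)]
Bs = [rng.standard_normal((2, 4)) for _ in range(2)]

# Exact aggregate of the clients' actual low-rank updates:
exact = sum(A @ B for A, B in zip(As, Bs)) / 2

# Naive FedAvg of the factors, a generally different (inexact) update:
naive = (sum(As) / 2) @ (sum(Bs) / 2)

gap = float(np.linalg.norm(exact - naive))
```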
arXiv Detail & Related papers (2025-09-30T15:32:26Z)
- Proving the Limited Scalability of Centralized Distributed Optimization via a New Lower Bound Construction [57.93371273485736]
We consider a centralized distributed learning setup where all workers jointly minimize a smooth objective using unbiased stochastic gradients. Our new lower bound construction shows that no method can improve on the $\frac{L\Delta}{\epsilon^2}$ complexity bound by more than a poly-logarithmic factor in $n$, even in the homogeneous (i.i.d.) case, where all workers access the same distribution.
arXiv Detail & Related papers (2025-06-30T13:27:39Z)
- FRUGAL: Memory-Efficient Optimization by Reducing State Overhead for Scalable Training [51.39495282347475]
We introduce $\texttt{FRUGAL}$ ($\textbf{F}$ull-$\textbf{R}$ank $\textbf{U}$pdates with $\textbf{G}$r$\textbf{A}$dient sp$\textbf{L}$itting), a new memory-efficient optimization framework. Our framework can be integrated with various low-rank update selection techniques, including GaLore and BAdam.
arXiv Detail & Related papers (2024-11-12T14:41:07Z)
- ACCO: Accumulate While You Communicate for Communication-Overlapped Sharded LLM Training [22.940404796500985]
We propose a memory-efficient optimization algorithm for distributed training of LLMs. By synchronizing delayed gradients while computing new ones, ACCO reduces GPU idle time and supports heterogeneous hardware. Compared to ZeRO-1, our approach is significantly faster and scales effectively across heterogeneous hardware.
arXiv Detail & Related papers (2024-06-03T08:23:45Z)
- Transfer Q Star: Principled Decoding for LLM Alignment [105.89114186982972]
Transfer $Q^*$ estimates the optimal value function for a target reward $r$ through a baseline model.
Our approach significantly reduces the sub-optimality gap observed in prior SoTA methods.
arXiv Detail & Related papers (2024-05-30T21:36:12Z)
- $\textbf{A}^2\textbf{CiD}^2$: Accelerating Asynchronous Communication in Decentralized Deep Learning [0.0]
We introduce a principled asynchronous, randomized, gossip-based optimization algorithm which works thanks to a continuous local momentum named $\textbf{A}^2\textbf{CiD}^2$.
Our theoretical analysis proves accelerated rates compared to previous asynchronous decentralized baselines.
We show consistent improvement on the ImageNet dataset using up to 64 asynchronous workers.
arXiv Detail & Related papers (2023-06-14T06:52:07Z)
- Communication-Efficient Adam-Type Algorithms for Distributed Data Mining [93.50424502011626]
We propose a class of novel distributed Adam-type algorithms (i.e., SketchedAMSGrad) utilizing sketching.
Our new algorithm achieves a fast convergence rate of $O(\frac{1}{\sqrt{nT}} + \frac{1}{(k/d)^2 T})$ with the communication cost of $O(k \log(d))$ at each iteration.
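The $O(k \log(d))$ per-iteration communication cost comes from sketching: a $d$-dimensional gradient is compressed into $k$ buckets before transmission. Below is a generic count-sketch as an illustration (an assumption; the paper's exact sketch may differ).

```python
import numpy as np

rng = np.random.default_rng(0)

def count_sketch(v, k):
    # Hash each of the d coordinates into one of k signed buckets; the
    # transmitted sketch has k entries regardless of the dimension d.
    d = v.size
    bucket = rng.integers(0, k, size=d)
    sign = rng.choice([-1.0, 1.0], size=d)
    S = np.zeros(k)
    np.add.at(S, bucket, sign * v)     # accumulates over repeated indices
    return S, bucket, sign

def unsketch(S, bucket, sign):
    # Estimate each coordinate from its own bucket (exact up to collisions).
    return sign * S[bucket]

v = rng.standard_normal(1000)
S, bucket, sign = count_sketch(v, k=64)
v_hat = unsketch(S, bucket, sign)
```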
arXiv Detail & Related papers (2022-10-14T01:42:05Z)
- Faster Non-Convex Federated Learning via Global and Local Momentum [57.52663209739171]
$\texttt{FedGLOMO}$ is the first (first-order) FL algorithm of this kind.
Our algorithm is provably optimal even with compressed communication between the clients and the server.
arXiv Detail & Related papers (2020-12-07T21:05:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.