LoRDO: Distributed Low-Rank Optimization with Infrequent Communication
- URL: http://arxiv.org/abs/2602.04396v1
- Date: Wed, 04 Feb 2026 10:25:24 GMT
- Title: LoRDO: Distributed Low-Rank Optimization with Infrequent Communication
- Authors: Andrej Jovanović, Alex Iacob, Mher Safaryan, Ionut-Vlad Modoranu, Lorenzo Sani, William F. Shen, Xinchi Qiu, Dan Alistarh, Nicholas D. Lane
- Abstract summary: $\texttt{LoRDO}$ is a principled framework for low-rank optimization with infrequent synchronization. We show that $\texttt{LoRDO}$ achieves near-parity with low-rank $\texttt{DDP}$ in language modeling and downstream tasks. We also show that $\texttt{LoRDO}$ improves performance even more in very low-memory settings with small rank/batch size.
- Score: 43.00539790635802
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Distributed training of foundation models via $\texttt{DDP}$ is limited by interconnect bandwidth. While infrequent communication strategies reduce synchronization frequency, they remain bottlenecked by the memory and communication requirements of optimizer states. Low-rank optimizers can alleviate these constraints; however, in the local-update regime, workers lack access to the full-batch gradients required to compute low-rank projections, which degrades performance. We propose $\texttt{LoRDO}$, a principled framework unifying low-rank optimization with infrequent synchronization. We first demonstrate that, while global projections based on pseudo-gradients are theoretically superior, they permanently restrict the optimization trajectory to a low-rank subspace. To restore subspace exploration, we introduce a full-rank quasi-hyperbolic update. $\texttt{LoRDO}$ achieves near-parity with low-rank $\texttt{DDP}$ in language modeling and downstream tasks at model scales of $125$M--$720$M, while reducing communication by $\approx 10 \times$. Finally, we show that $\texttt{LoRDO}$ improves performance even more in very low-memory settings with small rank/batch size.
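The full-rank quasi-hyperbolic update described in the abstract can be illustrated with a minimal NumPy sketch: momentum is kept in a low-rank subspace obtained by projecting a pseudo-gradient, while a fraction of the raw full-rank gradient is mixed back in to restore subspace exploration. All names, shapes, and hyperparameters here are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def low_rank_projection(pseudo_grad, r):
    # Project onto the top-r left singular subspace of the pseudo-gradient.
    U, _, _ = np.linalg.svd(pseudo_grad, full_matrices=False)
    return U[:, :r]                                   # m x r orthonormal basis

def quasi_hyperbolic_step(W, grad, mom, P, lr=1e-2, beta=0.9, nu=0.7):
    # Momentum lives in the low-rank subspace spanned by P; the (1 - nu)
    # term mixes the raw full-rank gradient back in, so the trajectory is
    # not permanently confined to the subspace.
    mom = beta * mom + (1 - beta) * (P.T @ grad)      # r x n low-rank momentum
    update = nu * (P @ mom) + (1 - nu) * grad         # full-rank mixture
    return W - lr * update, mom

# Toy usage: one synchronization round fixes the projection, then local steps.
m, n, r = 8, 6, 2
W = rng.standard_normal((m, n))
P = low_rank_projection(rng.standard_normal((m, n)), r)
mom = np.zeros((r, n))
for _ in range(3):
    grad = rng.standard_normal((m, n))
    W, mom = quasi_hyperbolic_step(W, grad, mom, P)
```

With `nu=1` this reduces to a purely low-rank momentum step; `nu<1` is the exploration-restoring mixture.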
Related papers
- From $O(mn)$ to $O(r^2)$: Two-Sided Low-Rank Communication for Adam in Distributed Training with Memory Efficiency [28.885724420612323]
We propose TSR, which brings two-sided low-rank communication to Adam-family updates (TSR-Adam). To further reduce the peak communication from subspace refresh, TSR-Adam adopts a randomized SVD-based refresh that avoids full-gradient synchronization. Across pretraining from 60M to 1B model scales, TSR-Adam reduces average communicated bytes per step by $13\times$, and on GLUE fine-tuning it reduces communication by $25\times$, while achieving comparable performance.
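The two-sided idea (communicating an $r \times r$ core instead of an $m \times n$ gradient) can be sketched in a few lines. The fixed orthonormal bases and all shapes below are assumptions for illustration; per the abstract, the actual method refreshes the bases periodically via randomized SVD.

```python
import numpy as np

rng = np.random.default_rng(0)

def compress(G, P, Q):
    # Two-sided projection: the communicated payload is r x r, not m x n.
    return P.T @ G @ Q

def decompress(C, P, Q):
    # Reconstruct a rank-<=r approximation of the gradient on the receiver.
    return P @ C @ Q.T

m, n, r = 64, 32, 4
G = rng.standard_normal((m, n))
# Hypothetical fixed orthonormal bases shared by sender and receiver.
P, _ = np.linalg.qr(rng.standard_normal((m, r)))
Q, _ = np.linalg.qr(rng.standard_normal((n, r)))
core = compress(G, P, Q)        # r*r = 16 floats instead of m*n = 2048
G_hat = decompress(core, P, Q)
```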
arXiv Detail & Related papers (2026-02-08T15:23:09Z)
- BOOST: BOttleneck-Optimized Scalable Training Framework for Low-Rank Large Language Models [16.973973367103508]
Low-rank bottleneck architectures offer a promising solution to significantly reduce training time and memory footprint. We propose BOOST, an efficient training framework tailored for large-scale low-rank bottleneck architectures. We show that BOOST achieves 1.46-1.91$\times$ speedup over full-rank model baselines and 1.87-2.27$\times$ speedup over low-rank models with naively integrated 3D parallelism.
arXiv Detail & Related papers (2025-12-13T01:50:18Z)
- Evolution Strategies at the Hyperscale [57.75314521465674]
We introduce EGGROLL, an evolution strategies (ES) algorithm designed to scale backprop-free optimization to large population sizes. ES is a set of powerful blackbox optimisation methods that can handle non-differentiable or noisy objectives. EGGROLL overcomes these bottlenecks by generating random matrices $A \in \mathbb{R}^{m \times r}$, $B \in \mathbb{R}^{n \times r}$ with $r \ll \min(m,n)$ to form a low-rank matrix perturbation $AB^\top$.
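The low-rank perturbation named in the abstract is simple to sketch: sampling the two thin factors costs $O((m+n)r)$ random numbers instead of $O(mn)$ for a dense perturbation. The $\sigma/\sqrt{r}$ scaling below is an assumption for illustration, not necessarily the paper's normalization.

```python
import numpy as np

rng = np.random.default_rng(0)

def low_rank_perturbation(m, n, r, sigma=0.1):
    # Sample A in R^{m x r}, B in R^{n x r}; A @ B.T has rank at most r,
    # requiring O((m + n) * r) random numbers instead of O(m * n).
    A = rng.standard_normal((m, r))
    B = rng.standard_normal((n, r))
    return sigma * (A @ B.T) / np.sqrt(r)

m, n, r = 128, 96, 4
W = np.zeros((m, n))
E = low_rank_perturbation(m, n, r)   # one population member's perturbation
W_perturbed = W + E
```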
arXiv Detail & Related papers (2025-11-20T18:56:05Z)
- Communication-Efficient and Accurate Approach for Aggregation in Federated Low-Rank Adaptation [7.127777651952882]
Current Federated Low-Rank Adaptation (FedLoRA) methods face notable challenges due to inexact updates. We propose Federated Low-Rank Aggregation with Nearly Accurate Estimation (FLoRA-NA). FLoRA-NA bridges the gap between local personalization and global generalization, addressing a key limitation of prior personalized FedLoRA approaches.
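The inexact-update problem motivating FLoRA-NA can be demonstrated in a few lines: naively averaging LoRA factors across clients generally differs from averaging the clients' actual low-rank updates, because the product of means is not the mean of products. Shapes and client count below are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two clients, each holding rank-2 LoRA factors for a 6 x 4 weight delta.
As = [rng.standard_normal((6, 2)) for _ in range(2)]
Bs = [rng.standard_normal((2, 4)) for _ in range(2)]

# Exact aggregate of the clients' actual low-rank updates:
exact = sum(A @ B for A, B in zip(As, Bs)) / 2

# Naive FedAvg of the factors, a generally different (inexact) update:
naive = (sum(As) / 2) @ (sum(Bs) / 2)

gap = float(np.linalg.norm(exact - naive))
```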
arXiv Detail & Related papers (2025-09-30T15:32:26Z)
- Proving the Limited Scalability of Centralized Distributed Optimization via a New Lower Bound Construction [57.93371273485736]
We consider a centralized distributed learning setup where all workers jointly minimize a smooth objective using unbiased stochastic gradients. Our new lower bound construction shows that no method can improve on the $\frac{L\Delta}{\epsilon^2}$ complexity bound by more than a poly-logarithmic factor in $n$, even in the homogeneous (i.i.d.) case, where all workers access the same distribution.
arXiv Detail & Related papers (2025-06-30T13:27:39Z)
- FRUGAL: Memory-Efficient Optimization by Reducing State Overhead for Scalable Training [51.39495282347475]
We introduce $\texttt{FRUGAL}$ ($\textbf{F}$ull-$\textbf{R}$ank $\textbf{U}$pdates with $\textbf{G}$r$\textbf{A}$dient sp$\textbf{L}$itting), a new memory-efficient optimization framework. Our framework can be integrated with various low-rank update selection techniques, including GaLore and BAdam.
arXiv Detail & Related papers (2024-11-12T14:41:07Z)
- ACCO: Accumulate While You Communicate for Communication-Overlapped Sharded LLM Training [22.940404796500985]
We propose a memory-efficient optimization algorithm for distributed training of LLMs. By synchronizing delayed gradients while computing new ones, ACCO reduces GPU idle time and supports heterogeneous hardware. Compared to ZeRO-1, our approach is significantly faster and scales effectively across heterogeneous hardware.
arXiv Detail & Related papers (2024-06-03T08:23:45Z)
- Transfer Q Star: Principled Decoding for LLM Alignment [105.89114186982972]
Transfer $Q^*$ estimates the optimal value function for a target reward $r$ through a baseline model.
Our approach significantly reduces the sub-optimality gap observed in prior SoTA methods.
arXiv Detail & Related papers (2024-05-30T21:36:12Z)
- $\textbf{A}^2\textbf{CiD}^2$: Accelerating Asynchronous Communication in Decentralized Deep Learning [0.0]
We introduce a principled asynchronous, randomized, gossip-based optimization algorithm which works thanks to a continuous local momentum named $\textbf{A}^2\textbf{CiD}^2$.
Our theoretical analysis proves accelerated rates compared to previous asynchronous decentralized baselines.
We show consistent improvement on the ImageNet dataset using up to 64 asynchronous workers.
arXiv Detail & Related papers (2023-06-14T06:52:07Z)
- Communication-Efficient Adam-Type Algorithms for Distributed Data Mining [93.50424502011626]
We propose a class of novel distributed Adam-type algorithms (i.e., SketchedAMSGrad) utilizing sketching.
Our new algorithm achieves a fast convergence rate of $O(\frac{1}{\sqrt{nT}} + \frac{1}{(k/d)^2 T})$ with the communication cost of $O(k \log(d))$ at each iteration.
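The $O(k \log(d))$ per-iteration communication cost comes from sketching: a $d$-dimensional gradient is compressed into $k$ buckets before transmission. Below is a generic count-sketch as an illustration (an assumption; the paper's exact sketch may differ).

```python
import numpy as np

rng = np.random.default_rng(0)

def count_sketch(v, k):
    # Hash each of the d coordinates into one of k signed buckets; the
    # transmitted sketch has k entries regardless of the dimension d.
    d = v.size
    bucket = rng.integers(0, k, size=d)
    sign = rng.choice([-1.0, 1.0], size=d)
    S = np.zeros(k)
    np.add.at(S, bucket, sign * v)     # accumulates over repeated indices
    return S, bucket, sign

def unsketch(S, bucket, sign):
    # Estimate each coordinate from its own bucket (exact up to collisions).
    return sign * S[bucket]

v = rng.standard_normal(1000)
S, bucket, sign = count_sketch(v, k=64)
v_hat = unsketch(S, bucket, sign)
```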
arXiv Detail & Related papers (2022-10-14T01:42:05Z)
- Faster Non-Convex Federated Learning via Global and Local Momentum [57.52663209739171]
$\texttt{FedGLOMO}$ is the first (first-order) FL algorithm of this kind.
Our algorithm is provably optimal even with compressed communication between the clients and the server.
arXiv Detail & Related papers (2020-12-07T21:05:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.