Distributed Training under Packet Loss
- URL: http://arxiv.org/abs/2507.07114v1
- Date: Wed, 02 Jul 2025 11:07:20 GMT
- Title: Distributed Training under Packet Loss
- Authors: Erez Weintraub, Ron Banner, Ariel Orda
- Abstract summary: Leveraging unreliable connections will reduce latency but may sacrifice model accuracy and convergence once packets are dropped. We introduce a principled, end-to-end solution that preserves accuracy and convergence guarantees under genuine packet loss. This work bridges the gap between communication-efficient protocols and the accuracy and guarantees demanded by modern large-model training.
- Score: 8.613477072763404
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: State-of-the-art language and vision models are routinely trained across thousands of GPUs, often spanning multiple data-centers, yet today's distributed frameworks still assume reliable connections (e.g., InfiniBand or RoCE). The resulting acknowledgment traffic and retransmissions inflate tail latencies and limit scalability. Leveraging unreliable connections will reduce latency but may sacrifice model accuracy and convergence once packets are dropped. A principled, end-to-end solution that preserves accuracy and convergence guarantees under genuine packet loss has previously been missing. We address this critical gap by introducing a novel distributed training framework capable of operating over unreliable connections, offering unbiased gradient aggregation and bounded parameter drift without modifying model code or optimizers. The key insight is a two-stage defense against missing messages: (i) Unbiased gradient aggregation: each worker reconstructs a consistent gradient estimate from whatever packets arrive, guaranteeing expectation-level correctness; and (ii) Bounded-drift parameter broadcasts: we prove the inter-worker model discrepancy remains O(1) even after arbitrarily many iterations, preventing the unbounded divergence typical of asynchronous setups. Analytical bounds are matched by experiments on the LLAMA2 7B model with 64 GPUs: tolerating 10% random packet loss yields at most 0.8% perplexity change. This work bridges the gap between communication-efficient datacenter protocols and the accuracy and generalization guarantees demanded by modern large-model training, enabling robust, high-throughput learning on commodity or wide-area networks.
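To make the two-stage defense concrete, here is a minimal Python sketch, assuming i.i.d. packet loss with a known drop probability p; the function names (`aggregate_unbiased`, `merge_broadcast`) and the simple mixing rule are illustrative assumptions, not the paper's actual implementation. Rescaling each surviving gradient shard by 1/(1-p) makes the aggregate correct in expectation, and mixing local parameters toward whichever broadcasts do arrive is one way to keep inter-worker discrepancy from accumulating.

```python
# Illustrative sketch of the two-stage defense described in the abstract,
# under an assumed i.i.d. packet-loss model with known drop probability p.
# Not the authors' implementation; all names here are hypothetical.
import numpy as np

def aggregate_unbiased(shards, received, loss_prob):
    """Stage (i): unbiased gradient aggregation from whatever packets arrive.

    shards    -- list of per-worker gradient arrays (packet payloads)
    received  -- list of bools, True where the packet actually arrived
    loss_prob -- assumed drop probability p for every packet

    Each surviving shard is rescaled by 1/(1-p), so the estimate is correct
    in expectation: E[1{arrived} / (1-p)] = 1 for every packet.
    """
    scale = 1.0 / (1.0 - loss_prob)
    total = np.zeros_like(shards[0])
    for shard, ok in zip(shards, received):
        if ok:
            total += scale * shard
    return total / len(shards)

def merge_broadcast(local_params, received_params, mix=0.5):
    """Stage (ii): bounded-drift parameter broadcast (illustrative only).

    When a peer's parameter packet arrives, pull the local copy toward it;
    when it is lost, keep the local copy. Regularly mixing toward received
    broadcasts is the kind of coupling that keeps inter-worker discrepancy
    from growing without bound across iterations.
    """
    if received_params is None:  # broadcast packet was dropped this round
        return local_params
    return (1.0 - mix) * local_params + mix * received_params

# Toy run: 8 workers, ~10% simulated packet loss (as in the experiments).
rng = np.random.default_rng(0)
grads = [rng.normal(size=4) for _ in range(8)]
arrived = list(rng.random(8) > 0.10)
g_hat = aggregate_unbiased(grads, arrived, loss_prob=0.10)

w_local = rng.normal(size=4)
w_peer = rng.normal(size=4) if arrived[0] else None
w_local = merge_broadcast(w_local, w_peer)
```

In this toy run, `g_hat` is an unbiased estimate of the mean of `grads`: each packet contributes its gradient with probability 1-p, scaled by 1/(1-p), so the two factors cancel in expectation.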
Related papers
- SPIRE: Conditional Personalization for Federated Diffusion Generative Models [7.8583640700306585]
Shared Backbone Personal Identity Representation Embeddings (SPIRE) is a framework that casts per-client diffusion-based generation as conditional generation in FL. SPIRE factorizes the network into (i) a high-capacity global backbone that learns a population-level score function and (ii) lightweight, learnable client embeddings that encode local data statistics. Our analysis also hints at how client embeddings act as biases that steer a shared score network toward personalized distributions.
arXiv Detail & Related papers (2025-06-14T01:40:31Z)
- Adaptive Deadline and Batch Layered Synchronized Federated Learning [66.93447103966439]
Federated learning (FL) enables collaborative model training across distributed edge devices while preserving data privacy, and typically operates in a round-based synchronous manner. We propose ADEL-FL, a novel framework that jointly optimizes per-round deadlines and user-specific batch sizes for layer-wise aggregation.
arXiv Detail & Related papers (2025-05-29T19:59:18Z)
- Lightweight Federated Learning with Differential Privacy and Straggler Resilience [19.94124499453864]
Federated learning (FL) enables collaborative model training through model parameter exchanges instead of raw data. To avoid potential inference attacks from exchanged parameters, differential privacy (DP) offers rigorous guarantees against various attacks. We propose LightDP-FL, a novel lightweight scheme that ensures provable DP against untrusted peers and the server.
arXiv Detail & Related papers (2024-12-09T00:54:00Z)
- Federated Smoothing Proximal Gradient for Quantile Regression with Non-Convex Penalties [3.269165283595478]
Distributed sensors in the Internet of Things (IoT) generate vast amounts of sparse data.
We propose a federated smoothing proximal gradient (FSPG) algorithm that integrates a smoothing mechanism with the proximal gradient framework, thereby improving both precision and computational speed.
arXiv Detail & Related papers (2024-08-10T21:50:19Z)
- PeFAD: A Parameter-Efficient Federated Framework for Time Series Anomaly Detection [51.20479454379662]
We propose a Parameter-Efficient Federated Anomaly Detection framework named PeFAD, motivated by increasing privacy concerns.
We conduct extensive evaluations on four real datasets, where PeFAD outperforms existing state-of-the-art baselines by up to 28.74%.
arXiv Detail & Related papers (2024-06-04T13:51:08Z)
- Asynchronous Federated Stochastic Optimization for Heterogeneous Objectives Under Arbitrary Delays [0.0]
Federated learning (FL) was recently proposed to securely train models with data held over multiple locations ("clients").
Two major challenges hindering the performance of FL algorithms are long training times caused by straggling clients, and a decline in model accuracy under non-iid local data distributions ("client drift").
We propose and analyze Asynchronous Exact Averaging (AREA), a new (sub)gradient algorithm that utilizes communication to speed up convergence and enhance scalability, and employs client memory to correct the client drift caused by variations in client update frequencies.
arXiv Detail & Related papers (2024-05-16T14:22:49Z)
- Adaptive Model Pruning and Personalization for Federated Learning over Wireless Networks [72.59891661768177]
Federated learning (FL) enables distributed learning across edge devices while protecting data privacy.
We consider an FL framework with partial model pruning and personalization to overcome these challenges.
This framework splits the learning model into a global part, pruned and shared with all devices to learn data representations, and a personalized part that is fine-tuned for a specific device.
arXiv Detail & Related papers (2023-09-04T21:10:45Z)
- Quantized Distributed Training of Large Models with Convergence Guarantees [34.054462975511996]
We present QSDP, a variant of FSDP which supports both gradient and weight quantization with theoretical guarantees.
We show that QSDP preserves model accuracy while completely removing the communication bottlenecks of FSDP, providing end-to-end speedups of up to 2.2x.
arXiv Detail & Related papers (2023-02-05T14:20:55Z)
- Low-Latency Federated Learning over Wireless Channels with Differential Privacy [142.5983499872664]
In federated learning (FL), model training is distributed over clients and local models are aggregated by a central server.
In this paper, we aim to minimize FL training delay over wireless channels, constrained by overall training performance as well as each client's differential privacy (DP) requirement.
arXiv Detail & Related papers (2021-06-20T13:51:18Z)
- Adaptive Subcarrier, Parameter, and Power Allocation for Partitioned Edge Learning Over Broadband Channels [69.18343801164741]
Partitioned edge learning (PARTEL) implements parameter-server training, a well-known distributed learning method, in a wireless network.
We consider the case of deep neural network (DNN) models which can be trained using PARTEL by introducing some auxiliary variables.
arXiv Detail & Related papers (2020-10-08T15:27:50Z)
- Uncertainty Estimation Using a Single Deep Deterministic Neural Network [66.26231423824089]
We propose a method for training a deterministic deep model that can find and reject out-of-distribution data points at test time with a single forward pass.
We scale training in these models with a novel loss function and centroid-updating scheme, and match the accuracy of softmax models.
arXiv Detail & Related papers (2020-03-04T12:27:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.