Communication-Efficient Distributed Training for Collaborative Flat Optima Recovery in Deep Learning
- URL: http://arxiv.org/abs/2507.20424v2
- Date: Fri, 10 Oct 2025 00:35:34 GMT
- Title: Communication-Efficient Distributed Training for Collaborative Flat Optima Recovery in Deep Learning
- Authors: Tolga Dimlioglu, Anna Choromanska
- Abstract summary: We study distributed data parallel training of deep neural networks (DNNs) to improve the trade-off between communication efficiency and model performance. We introduce a simple, yet effective, sharpness measure, Inverse Mean Valley, and demonstrate its strong correlation with the generalization gap of DNNs. We show that DPPF outperforms other communication-efficient approaches and achieves better generalization performance than local gradient methods and gradient averaging.
- Score: 9.245468958723182
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study centralized distributed data parallel training of deep neural networks (DNNs), aiming to improve the trade-off between communication efficiency and model performance of the local gradient methods. To this end, we revisit the flat-minima hypothesis, which suggests that models with better generalization tend to lie in flatter regions of the loss landscape. We introduce a simple, yet effective, sharpness measure, Inverse Mean Valley, and demonstrate its strong correlation with the generalization gap of DNNs. We incorporate an efficient relaxation of this measure into the distributed training objective as a lightweight regularizer that encourages workers to collaboratively seek wide minima. The regularizer exerts a pushing force that counteracts the consensus step pulling the workers together, giving rise to the Distributed Pull-Push Force (DPPF) algorithm. Empirically, we show that DPPF outperforms other communication-efficient approaches and achieves better generalization performance than local gradient methods and synchronous gradient averaging, while maintaining communication efficiency. In addition, our loss landscape visualizations confirm the ability of DPPF to locate flatter minima. On the theoretical side, we show that DPPF guides workers to span flat valleys, with the final valley width governed by the interplay between push and pull strengths, and that its pull-push dynamics is self-stabilizing. We further provide generalization guarantees linked to the valley width and prove convergence in the non-convex setting.
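The abstract describes the pull-push mechanics only at a high level. The sketch below is a minimal, illustrative single-worker update, assuming a fixed-magnitude push term and hypothetical hyperparameter names (`lr`, `pull`, `push`); it is not the authors' implementation of DPPF.

```python
import torch

def pull_push_step(p_local, p_mean, grad, lr=0.1, pull=0.1, push=0.05, eps=1e-12):
    """One worker's parameter update under a pull-push scheme (illustrative).

    The pull term draws the worker toward the mean of all workers (consensus),
    while a fixed-magnitude push away from the mean stands in for the relaxed
    Inverse Mean Valley regularizer. The two forces balance at a distance of
    roughly push/pull from the mean, echoing the claim that the spanned valley
    width is governed by the interplay of push and pull strengths.
    """
    offset = p_local - p_mean                 # offset from the consensus mean
    unit = offset / (offset.norm() + eps)     # direction away from the mean
    return p_local - lr * grad - pull * offset + push * unit
```

In a full run, `p_mean` would be refreshed by periodic (communication-efficient) averaging across workers rather than being available at every step.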
Related papers
- SENTINEL: Stagewise Integrity Verification for Pipeline Parallel Decentralized Training [54.8494905524997]
Decentralized training introduces critical security risks when executed across untrusted, geographically distributed nodes. We propose SENTINEL, a verification mechanism for pipeline-parallel (PP) training without duplication. Experiments demonstrate successful training of up to 4B-parameter LLMs across untrusted distributed environments with up to 176 workers while maintaining model convergence and performance.
arXiv Detail & Related papers (2026-03-03T23:51:10Z)
- Diffusing to Coordinate: Efficient Online Multi-Agent Diffusion Policies [51.24079409973799]
Diffusion-based generative models are well-positioned to meet the needs of online Multi-Agent Reinforcement Learning (MARL). We propose one of the first online off-policy MARL frameworks that uses diffusion policies to orchestrate coordination. Our key innovation is a relaxed policy objective that maximizes scaled joint entropy, facilitating effective exploration without relying on a tractable likelihood.
arXiv Detail & Related papers (2026-02-20T15:38:02Z)
- Local adapt-then-combine algorithms for distributed nonsmooth optimization: Achieving provable communication acceleration [50.67878993903822]
We propose a communication-efficient Adapt-Then-Combine (ATC) framework, FlexATC, unifying numerous ATC-based distributed algorithms. We show for the first time that local updates provably lead to communication acceleration for ATC-based distributed algorithms.
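For context, a generic adapt-then-combine round with multiple local updates might look as follows; this is not FlexATC itself, and the step count, learning rate, and mixing weights are placeholder assumptions.

```python
def atc_round(x_i, grad_fn, neighbor_iterates, weights, lr=0.1, local_steps=5):
    """Adapt: several local (sub)gradient steps on the agent's own objective.
    Combine: weighted averaging with neighbor iterates; weights[0] is the
    agent's own mixing weight and the weights sum to one."""
    for _ in range(local_steps):                              # adapt phase (local updates)
        x_i = x_i - lr * grad_fn(x_i)
    iterates = [x_i] + list(neighbor_iterates)
    return sum(w * x for w, x in zip(weights, iterates))      # combine phase
```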
arXiv Detail & Related papers (2026-02-18T02:47:05Z)
- Strategies for Improving Communication Efficiency in Distributed and Federated Learning: Compression, Local Training, and Personalization [8.579148218325168]
This dissertation explores strategies to improve communication efficiency, focusing on model compression, local training, and personalization. We establish a unified framework for biased and unbiased compression operators with convergence guarantees. We propose adaptive local training strategies that incorporate personalization to accelerate convergence and mitigate client drift.
arXiv Detail & Related papers (2025-09-10T02:19:56Z)
- Implicit Reward as the Bridge: A Unified View of SFT and DPO Connections [65.36449542323277]
We present a unified theoretical framework bridging Supervised Fine-Tuning (SFT) and preference learning in Large Language Model (LLM) post-training. We propose a simple yet effective learning rate reduction approach that yields significant performance improvements.
arXiv Detail & Related papers (2025-06-15T05:42:29Z)
- Decentralized Federated Learning with Gradient Tracking over Time-Varying Directed Networks [42.92231921732718]
We propose a consensus-based algorithm called DSGTm-TV.
It incorporates gradient tracking and heavy-ball momentum to optimize a global objective function.
Under DSGTm-TV, agents update local model parameters and gradient estimates by exchanging information with neighboring agents.
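A rough single-agent sketch of gradient tracking with heavy-ball momentum, in the spirit of the summary above; the variable names, mixing weights, and update order are assumptions rather than the paper's DSGTm-TV algorithm.

```python
def gradient_tracking_step(x_i, x_prev, y_i, grad_fn, neighbor_x, neighbor_y, w, lr=0.05, beta=0.5):
    """x_i: local model, y_i: local tracker of the global gradient.
    neighbor_x / neighbor_y include the agent's own iterates, and the mixing
    weights w sum to one (time-varying in the directed-network setting)."""
    x_mix = sum(w_j * xj for w_j, xj in zip(w, neighbor_x))   # consensus on models
    y_mix = sum(w_j * yj for w_j, yj in zip(w, neighbor_y))   # consensus on trackers
    x_new = x_mix - lr * y_i + beta * (x_i - x_prev)          # descent + heavy-ball momentum
    y_new = y_mix + grad_fn(x_new) - grad_fn(x_i)             # keep estimating the global gradient
    return x_new, y_new
```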
arXiv Detail & Related papers (2024-09-25T06:23:16Z)
- Bridging and Modeling Correlations in Pairwise Data for Direct Preference Optimization [75.1240295759264]
We propose an effective framework for Bridging and Modeling Correlations in pairwise data, named BMC. We increase the consistency and informativeness of the pairwise preference signals through targeted modifications. We identify that DPO alone is insufficient to model these correlations and capture nuanced variations.
arXiv Detail & Related papers (2024-08-14T11:29:47Z)
- Decentralized Directed Collaboration for Personalized Federated Learning [39.29794569421094]
We concentrate on Decentralized Personalized Federated Learning (DPFL), in which model training is performed in a fully distributed manner.
We propose a directed collaboration framework by incorporating Decentralized Federated Partial Gradient Push (DFedPGP).
arXiv Detail & Related papers (2024-05-28T06:52:19Z)
- DRAG: Divergence-based Adaptive Aggregation in Federated Learning on Non-IID Data [11.830891255837788]
Local stochastic gradient descent (SGD) is a fundamental approach for achieving communication efficiency in Federated Learning (FL).
We introduce a novel metric called "degree of divergence," quantifying the angle between the local gradient and the global reference direction.
We propose the divergence-based adaptive aggregation (DRAG) algorithm, which dynamically "drags" the received local updates toward the reference direction in each round without requiring extra communication overhead.
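An illustrative sketch of the "dragging" idea for a single flattened update vector; the angle threshold, blending rule, and names are placeholder assumptions, not the DRAG paper's exact procedure.

```python
import torch

def drag_like_aggregate(local_update, reference_dir, cos_threshold=0.5, keep_ratio=0.5):
    """local_update and reference_dir are 1-D tensors. If the update is well
    aligned with the reference direction it is kept as-is; otherwise only part
    of its orthogonal residual survives, i.e. the update is dragged toward
    the reference."""
    ref_unit = reference_dir / reference_dir.norm()
    cos_sim = torch.dot(local_update, ref_unit) / local_update.norm()
    if cos_sim >= cos_threshold:
        return local_update                                  # sufficiently aligned
    projection = torch.dot(local_update, ref_unit) * ref_unit
    residual = local_update - projection
    return projection + keep_ratio * residual                # dragged toward reference
```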
arXiv Detail & Related papers (2023-09-04T19:40:58Z)
- Magnitude Matters: Fixing SIGNSGD Through Magnitude-Aware Sparsification in the Presence of Data Heterogeneity [60.791736094073]
Communication overhead has become one of the major bottlenecks in the distributed training of deep neural networks.
We propose a magnitude-driven sparsification scheme, which addresses the non-convergence issue of SIGNSGD.
The proposed scheme is validated through experiments on Fashion-MNIST, CIFAR-10, and CIFAR-100 datasets.
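As a hedged sketch of what magnitude-aware sign sparsification could look like, the snippet below keeps signs only for the top-k entries by magnitude and sends a single scale; the top-k rule and parameters are placeholder assumptions, not the paper's exact scheme.

```python
import torch

def magnitude_aware_sign_compress(grad, k_fraction=0.1):
    """Return a sparse, sign-based surrogate of the gradient: signs of the
    largest-magnitude entries, zeros elsewhere, rescaled by their mean
    magnitude so the update keeps a sensible size."""
    flat = grad.flatten()
    k = max(1, int(k_fraction * flat.numel()))
    _, idx = torch.topk(flat.abs(), k)
    compressed = torch.zeros_like(flat)
    compressed[idx] = torch.sign(flat[idx])
    scale = flat[idx].abs().mean()            # one scalar transmitted with the signs
    return (scale * compressed).reshape(grad.shape)
```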
arXiv Detail & Related papers (2023-02-19T17:42:35Z)
- FedLAP-DP: Federated Learning by Sharing Differentially Private Loss Approximations [53.268801169075836]
We propose FedLAP-DP, a novel privacy-preserving approach for federated learning.
A formal privacy analysis demonstrates that FedLAP-DP incurs the same privacy costs as typical gradient-sharing schemes.
Our approach achieves faster convergence than typical gradient-sharing methods.
arXiv Detail & Related papers (2023-02-02T12:56:46Z)
- Analyzing the Effect of Sampling in GNNs on Individual Fairness [79.28449844690566]
Graph neural network (GNN) based methods have saturated the field of recommender systems.
We extend an existing method for promoting individual fairness on graphs to support mini-batch, or sub-sample based, training of a GNN.
We show that mini-batch training facilitates individual fairness by allowing local nuance to guide the fairness-promotion process in representation learning.
arXiv Detail & Related papers (2022-09-08T16:20:25Z)
- Distributed Adversarial Training to Robustify Deep Neural Networks at Scale [100.19539096465101]
Current deep neural networks (DNNs) are vulnerable to adversarial attacks, where adversarial perturbations to the inputs can change or manipulate classification.
To defend against such attacks, an effective approach, known as adversarial training (AT), has been shown to mitigate their impact.
We propose a large-batch adversarial training framework implemented over multiple machines.
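A hedged sketch of one distributed adversarial-training step: each worker perturbs its local batch (a basic FGSM step stands in for whatever attack the paper uses) and gradients are averaged across machines. The attack choice, epsilon, and collective calls are illustrative assumptions, not the paper's framework.

```python
import torch
import torch.distributed as dist

def distributed_at_step(model, loss_fn, x, y, optimizer, eps=8 / 255):
    # Craft adversarial examples on the local shard (FGSM as a placeholder attack).
    x_adv = x.clone().detach().requires_grad_(True)
    loss_fn(model(x_adv), y).backward()
    x_adv = (x_adv + eps * x_adv.grad.sign()).detach()

    # Train on the adversarial batch and average gradients across workers.
    optimizer.zero_grad()
    loss = loss_fn(model(x_adv), y)
    loss.backward()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= dist.get_world_size()
    optimizer.step()
    return loss.item()
```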
arXiv Detail & Related papers (2022-06-13T15:39:43Z)
- Communication-Compressed Adaptive Gradient Method for Distributed Nonconvex Optimization [21.81192774458227]
One of the major bottlenecks is the large communication cost between the central server and the local workers.
Our proposed distributed learning framework features an effective gradient compression strategy.
arXiv Detail & Related papers (2021-11-01T04:54:55Z)
- Intermittent Pulling with Local Compensation for Communication-Efficient Federated Learning [20.964434898554344]
Federated Learning is a powerful machine learning paradigm to train a global model with highly distributed data.
A major bottleneck in the performance of distributed SGD is the communication overhead of pushing local gradients and pulling the global model.
We propose a novel approach named Pulling Reduction with Local Compensation (PRLC) to reduce communication overhead.
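A rough sketch of the intermittent-pulling idea: the worker pulls the global model only every few rounds and otherwise keeps updating its stale local copy. The interval and compensation rule are assumptions about the general idea, not the paper's PRLC algorithm.

```python
def intermittent_pull_round(round_idx, pull_interval, local_model, local_update, fetch_global):
    """One round on a worker; parameters are assumed to be torch tensors.
    Pulling only every pull_interval rounds saves communication; in skipped
    rounds the worker compensates by applying its own update to the stale copy."""
    if round_idx % pull_interval == 0:
        base = [p.clone() for p in fetch_global()]   # expensive pull, done rarely
    else:
        base = local_model                           # stale local copy, no communication
    return [p - u for p, u in zip(base, local_update)]
```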
arXiv Detail & Related papers (2020-01-22T20:53:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.