FlexDeMo: Decoupled Momentum Optimization for Fully and Hybrid Sharded Training
- URL: http://arxiv.org/abs/2502.06728v1
- Date: Mon, 10 Feb 2025 17:55:59 GMT
- Title: FlexDeMo: Decoupled Momentum Optimization for Fully and Hybrid Sharded Training
- Authors: Mogens Henrik From, Jacob Nielsen, Lukas Galke, Peter Schneider-Kamp
- Abstract summary: Training large neural network models requires extensive computational resources.
Recent findings suggest that it may be sufficient to only exchange the fast-moving components of the gradients.
We propose a hybrid strategy, FlexDeMo, whereby nodes fully synchronize locally between different GPUs, while inter-node communication uses only the fast-moving components.
- Score: 5.191183730031093
- Abstract: Training large neural network models requires extensive computational resources, often distributed across several nodes and accelerators. Recent findings suggest that it may be sufficient to only exchange the fast-moving components of the gradients, while accumulating momentum locally (Decoupled Momentum, or DeMo). However, when considering larger models that do not fit on a single accelerator, the exchange of gradient information and the integration of DeMo need to be reconsidered. Here, we propose employing a hybrid strategy, FlexDeMo, whereby nodes fully synchronize locally between different GPUs, while inter-node communication is limited to the fast-moving components. This effectively combines previous hybrid sharding strategies with the advantages of decoupled momentum. Our experimental results show that FlexDeMo is on par with AdamW in terms of validation loss, demonstrating its viability.
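As a rough illustration of the hybrid strategy described in the abstract, the sketch below simulates one FlexDeMo-style update for a single parameter tensor: gradients are fully averaged across the GPUs of a node (standing in for the intra-node all-reduce of hybrid sharding), momentum is accumulated locally, and only a small set of fast-moving components is extracted for inter-node exchange. This is a minimal, hypothetical sketch, not the authors' implementation: the top-k magnitude filter is a stand-in for DeMo's actual fast-component extraction, and the helper names (`flexdemo_step`, `topk_sparsify`) are invented for illustration.

```python
import torch


def topk_sparsify(tensor: torch.Tensor, k: int):
    """Keep the k largest-magnitude entries (a stand-in for DeMo's
    fast-component extraction); return (fast part, local residual)."""
    flat = tensor.flatten()
    idx = flat.abs().topk(k).indices
    fast = torch.zeros_like(flat)
    fast[idx] = flat[idx]
    return fast.view_as(tensor), (flat - fast).view_as(tensor)


def flexdemo_step(param, local_grads, momentum, lr=1e-3, beta=0.9, k=16):
    """One hypothetical FlexDeMo-style update for a single parameter tensor.

    local_grads: gradients from the GPUs within one node; their mean emulates
                 the full intra-node synchronization of hybrid sharding.
    momentum:    locally accumulated momentum, never synchronized in full.
    Returns the fast component that would be exchanged between nodes.
    """
    # 1) Full intra-node synchronization: average gradients across local GPUs.
    grad = torch.stack(local_grads).mean(dim=0)

    # 2) Accumulate momentum locally (the slow-moving part stays on the node).
    momentum.mul_(beta).add_(grad)

    # 3) Extract the fast-moving components; only these would cross the
    #    (slower) inter-node link, while the residual stays local.
    fast, residual = topk_sparsify(momentum, k)
    momentum.copy_(residual)

    # 4) In a multi-node run the `fast` tensors would be averaged across nodes
    #    here; with a single simulated node we apply the update directly.
    param.add_(fast, alpha=-lr)
    return fast


# Tiny usage example with two simulated intra-node GPUs.
param = torch.randn(64)
momentum = torch.zeros_like(param)
grads = [torch.randn(64), torch.randn(64)]
flexdemo_step(param, grads, momentum)
```

In an actual multi-node setup, step 1 would correspond to an all-reduce over the intra-node process group and step 4 to a much smaller exchange over the inter-node group; the sketch only mimics that communication pattern with local tensors.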
Related papers
- DeMo: Decoupled Momentum Optimization [6.169574689318864]
Training large neural networks typically requires sharing gradients between accelerators through specialized high-speed interconnects.
We introduce Decoupled Momentum (DeMo), a fused optimizer and data parallel algorithm that reduces inter-accelerator communication requirements by several orders of magnitude.
Empirical results show that models trained with DeMo match or exceed the performance of equivalent models trained with AdamW.
arXiv Detail & Related papers (2024-11-29T17:31:47Z) - MobileMamba: Lightweight Multi-Receptive Visual Mamba Network [51.33486891724516]
Previous research on lightweight models has primarily focused on CNNs and Transformer-based designs.
We propose the MobileMamba framework, which balances efficiency and performance.
MobileMamba achieves up to 83.6% Top-1 accuracy, surpassing existing state-of-the-art methods.
arXiv Detail & Related papers (2024-11-24T18:01:05Z) - Efficient and Effective Weight-Ensembling Mixture of Experts for Multi-Task Model Merging [111.8456671452411]
Multi-task learning (MTL) leverages a shared model to accomplish multiple tasks and facilitate knowledge transfer.
We propose a Weight-Ensembling Mixture of Experts (WEMoE) method for multi-task model merging.
We show that WEMoE and E-WEMoE outperform state-of-the-art (SOTA) model merging methods in terms of MTL performance, generalization, and robustness.
arXiv Detail & Related papers (2024-10-29T07:16:31Z) - Unlocking FedNL: Self-Contained Compute-Optimized Implementation [56.16884466478886]
Federated Learning (FL) is an emerging paradigm that enables intelligent agents to collaboratively train Machine Learning (ML) models in a distributed manner.
Recent work introduces a family of Federated Newton Learn (FedNL) algorithms, marking a significant step towards applying second-order methods to FL and large-scale optimization.
We present a self-contained implementation of FedNL, FedNL-LS, and FedNL-PP for single-node and multi-node settings.
arXiv Detail & Related papers (2024-10-11T12:19:18Z) - LocMoE: A Low-Overhead MoE for Large Language Model Training [13.153904674287546]
We propose a novel routing strategy that combines load balance and locality by converting part of the inter-node communication into intra-node communication.
The proposed LocMoE reduces training time per epoch by 12.68% to 22.24% compared to classical routers.
arXiv Detail & Related papers (2024-01-25T03:36:39Z) - Straggler-resilient Federated Learning: Tackling Computation Heterogeneity with Layer-wise Partial Model Training in Mobile Edge Network [4.1813760301635705]
We propose Federated Partial Model Training (FedPMT), where devices with smaller computational capabilities work on partial models and contribute to the global model.
As such, all devices in FedPMT prioritize the most crucial parts of the global model.
Empirical results show that FedPMT significantly outperforms the existing benchmark FedDrop.
arXiv Detail & Related papers (2023-11-16T16:30:04Z) - Decouple Graph Neural Networks: Train Multiple Simple GNNs Simultaneously Instead of One [60.5818387068983]
Graph neural networks (GNNs) suffer from severe inefficiency.
We propose to decouple a multi-layer GNN as multiple simple modules for more efficient training.
We show that the proposed framework is highly efficient with reasonable performance.
arXiv Detail & Related papers (2023-04-20T07:21:32Z) - Multi-Carrier NOMA-Empowered Wireless Federated Learning with Optimal Power and Bandwidth Allocation [31.80744279032665]
Wireless federated learning (WFL) suffers from a communication bottleneck in the uplink, limiting the number of users that can upload their local models in each global aggregation round.
This paper presents a new multi-carrier non-orthogonal multiple-access (MC-NOMA) WFL that allows the users to train different numbers of iterations per round.
As corroborated using a convolutional neural network and an 18-layer residual network, the proposed MC-NOMA WFL can efficiently reduce communication, increase local model training times, and accelerate convergence by over 40% compared to its existing alternative.
arXiv Detail & Related papers (2023-02-13T22:41:14Z) - AutoMoE: Heterogeneous Mixture-of-Experts with Adaptive Computation for Efficient Neural Machine Translation [104.0979785739202]
Mixture-of-Expert (MoE) models have obtained state-of-the-art performance in Neural Machine Translation (NMT) tasks.
Existing MoE models mostly consider a homogeneous design where the same number of experts of the same size are placed uniformly throughout the network.
We develop AutoMoE -- a framework for designing heterogeneous MoE's under computational constraints.
arXiv Detail & Related papers (2022-10-14T05:32:17Z) - Predictive GAN-powered Multi-Objective Optimization for Hybrid Federated Split Learning [56.125720497163684]
We propose a hybrid federated split learning framework in wireless networks.
We design a parallel computing scheme for model splitting without label sharing, and theoretically analyze the influence of the delayed gradient caused by the scheme on the convergence speed.
arXiv Detail & Related papers (2022-09-02T10:29:56Z) - Tutel: Adaptive Mixture-of-Experts at Scale [20.036168971435306]
Sparsely-gated mixture-of-experts (MoE) has been widely adopted to scale deep learning models to trillion-plus parameters with fixed computational cost.
We present Flex, a highly scalable stack design and implementation for MoE with dynamically adaptive parallelism and pipelining.
Our evaluation shows that Flex efficiently and effectively runs a real-world MoE-based model named SwinV2-MoE, built upon Swin Transformer V2, a state-of-the-art computer vision architecture.
arXiv Detail & Related papers (2022-06-07T15:20:20Z)