TACOS: Topology-Aware Collective Algorithm Synthesizer for Distributed Machine Learning
- URL: http://arxiv.org/abs/2304.05301v3
- Date: Wed, 02 Oct 2024 19:15:05 GMT
- Title: TACOS: Topology-Aware Collective Algorithm Synthesizer for Distributed Machine Learning
- Authors: William Won, Midhilesh Elavazhagan, Sudarshan Srinivasan, Swati Gupta, Tushar Krishna
- Abstract summary: This paper presents TACOS, an autonomous synthesizer capable of automatically generating topology-aware collective algorithms.
TACOS is highly flexible, synthesizing an All-Reduce algorithm for a heterogeneous 128-NPU system in just 1.08 seconds.
It achieves up to a 4.27x performance improvement over state-of-the-art synthesizers.
- Score: 9.196825913937472
- Abstract: The surge of artificial intelligence, particularly large language models, has driven the rapid development of large-scale machine learning clusters. Executing distributed models on these clusters is often constrained by communication overhead, making efficient utilization of available network resources crucial. As a result, the routing algorithm employed for collective communications (i.e., collective algorithms) plays a pivotal role in determining overall performance. Unfortunately, existing collective communication libraries for distributed machine learning are limited by a fixed set of basic collective algorithms. This limitation hinders communication optimization, especially in modern clusters with heterogeneous and asymmetric topologies. Furthermore, manually designing collective algorithms for all possible combinations of network topologies and collective patterns requires heavy engineering and validation efforts. To address these challenges, this paper presents TACOS, an autonomous synthesizer capable of automatically generating topology-aware collective algorithms tailored to specific collective patterns and network topologies. TACOS is highly flexible, synthesizing an All-Reduce algorithm for a heterogeneous 128-NPU system in just 1.08 seconds, while achieving up to a 4.27x performance improvement over state-of-the-art synthesizers. Additionally, TACOS demonstrates better scalability with polynomial synthesis times, in contrast to NP-hard approaches which only scale to systems with tens of NPUs. TACOS can synthesize for 40K NPUs in just 2.52 hours.
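To make the notion of a topology-aware collective algorithm concrete, below is a minimal sketch of a greedy All-Gather scheduler over an arbitrary directed topology (an All-Reduce can be composed from a Reduce-Scatter followed by an All-Gather). This is an illustration written for this summary, not TACOS's actual synthesis procedure; the function name synthesize_all_gather, the link model (one chunk per link per time step), and the 4-NPU ring example are assumptions made purely for the sketch.

```python
# Illustrative sketch only (not TACOS): greedily schedule an All-Gather on a
# given directed topology. Each NPU starts holding one chunk; in every time
# step each link may carry one chunk, and a link forwards some chunk that the
# receiver is still missing. The output is a list of (step, src, dst, chunk).
from collections import defaultdict

def synthesize_all_gather(links, num_npus):
    """links: iterable of directed (src, dst) pairs describing the topology."""
    out_links = defaultdict(list)
    for src, dst in links:
        out_links[src].append(dst)

    # Initially, NPU i holds only chunk i.
    holds = {npu: {npu} for npu in range(num_npus)}
    schedule = []  # (step, src, dst, chunk)
    step = 0

    # Keep scheduling synchronous steps until every NPU holds every chunk.
    while any(len(holds[n]) < num_npus for n in range(num_npus)):
        sends = []
        used_links = set()
        for src in range(num_npus):
            for dst in out_links[src]:
                if (src, dst) in used_links:
                    continue
                missing = holds[src] - holds[dst]
                if missing:
                    chunk = min(missing)  # arbitrary greedy choice
                    sends.append((src, dst, chunk))
                    used_links.add((src, dst))
        if not sends:  # disconnected topology: the collective cannot finish
            raise ValueError("topology cannot complete the collective")
        for src, dst, chunk in sends:  # apply all sends of this step at once
            holds[dst].add(chunk)
            schedule.append((step, src, dst, chunk))
        step += 1
    return schedule, step

# Example: a 4-NPU bidirectional ring.
ring = [(i, (i + 1) % 4) for i in range(4)] + [((i + 1) % 4, i) for i in range(4)]
sched, steps = synthesize_all_gather(ring, 4)
print(f"All-Gather completes in {steps} steps, {len(sched)} sends")
```

Running the example yields a 2-step schedule for the 4-NPU bidirectional ring. TACOS targets the far harder setting where links have heterogeneous bandwidths and the topology is arbitrary and asymmetric, which is where automatic synthesis pays off.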
Related papers
- NAR-*ICP: Neural Execution of Classical ICP-based Pointcloud Registration Algorithms [7.542220697870245]
This study explores the intersection of neural networks and classical robotics algorithms through the Neural Algorithmic Reasoning framework.
We propose a Graph Neural Network (GNN)-based learning framework, NAR-*ICP, which learns the intermediate algorithmic steps of classical ICP-based pointcloud registration algorithms.
We evaluate our approach across diverse datasets, from real-world to synthetic, demonstrating its flexibility in handling complex and noisy inputs.
arXiv Detail & Related papers (2024-10-14T19:33:46Z)
- CORE: Common Random Reconstruction for Distributed Optimization with Provable Low Communication Complexity [110.50364486645852]
Communication complexity has become a major bottleneck for speeding up training and scaling up the number of machines.
We propose Common Random Reconstruction (CORE), which can be used to compress the information transmitted between machines.
arXiv Detail & Related papers (2023-09-23T08:45:27Z)
- Partitioning Distributed Compute Jobs with Reinforcement Learning and Graph Neural Networks [58.720142291102135]
Large-scale machine learning models are bringing advances to a broad range of fields.
Many of these models are too large to be trained on a single machine, and must be distributed across multiple devices.
We show that maximum parallelisation is sub-optimal in relation to user-critical metrics such as throughput and blocking rate.
arXiv Detail & Related papers (2023-01-31T17:41:07Z)
- Faster Adaptive Momentum-Based Federated Methods for Distributed Composition Optimization [14.579475552088692]
We propose a class of faster federated composition optimization algorithms (i.e., MFCGD and AdaMFCGD) to solve nonconvex distributed composition problems.
In particular, our adaptive algorithm (i.e., AdaMFCGD) uses a unified adaptive matrix to flexibly incorporate various adaptive learning rates.
arXiv Detail & Related papers (2022-11-03T15:17:04Z)
- FPGA-based AI Smart NICs for Scalable Distributed AI Training Systems [62.20308752994373]
We propose a new smart network interface card (NIC) for distributed AI training systems using field-programmable gate arrays (FPGAs).
Our proposed FPGA-based AI smart NIC enhances overall training performance by 1.6x at 6 nodes, with an estimated 2.5x performance improvement at 32 nodes, compared to the baseline system using conventional NICs.
arXiv Detail & Related papers (2022-04-22T21:57:00Z)
- Time-Correlated Sparsification for Efficient Over-the-Air Model Aggregation in Wireless Federated Learning [23.05003652536773]
Federated edge learning (FEEL) is a promising distributed machine learning (ML) framework to drive edge intelligence applications.
We propose time-correlated sparsification with hybrid aggregation (TCS-H) for communication-efficient FEEL.
arXiv Detail & Related papers (2022-02-17T02:48:07Z)
- Efficient Direct-Connect Topologies for Collective Communications [2.9394897655215555]
We provide an algorithmic framework for constructing direct-connect topologies optimized for the latency vs. bandwidth trade-off associated with the workload.
Our approach synthesizes many different topologies and schedules for a given cluster size and degree and then identifies the appropriate topology and schedule for a given workload.
arXiv Detail & Related papers (2022-02-07T16:59:05Z)
- Synthesizing Collective Communication Algorithms for Heterogeneous Networks with TACCL [1.5528708400965123]
We present TACCL, a synthesizer for collective communication primitives for large-scale multi-GPU systems.
TACCL encodes a profiled topology and input size into a synthesis problem to generate optimized communication algorithms.
Using TACCL's algorithms speeds up the end-to-end training of an internal mixture-of-experts model by 17%.
arXiv Detail & Related papers (2021-11-08T23:20:52Z)
- Clustered Federated Learning via Generalized Total Variation Minimization [83.26141667853057]
We study optimization methods to train local (or personalized) models for local datasets with a decentralized network structure.
Our main conceptual contribution is to formulate federated learning as generalized total variation (GTV) minimization.
Our main algorithmic contribution is a fully decentralized federated learning algorithm.
arXiv Detail & Related papers (2021-05-26T18:07:19Z)
- A Low Complexity Decentralized Neural Net with Centralized Equivalence using Layer-wise Learning [49.15799302636519]
We design a low-complexity decentralized learning algorithm to train a recently proposed large neural network across distributed processing nodes (workers).
In our setup, the training data is distributed among the workers but is not shared in the training process due to privacy and security concerns.
We show that it is possible to achieve equivalent learning performance as if the data is available in a single place.
arXiv Detail & Related papers (2020-09-29T13:08:12Z)
- Self-organizing Democratized Learning: Towards Large-scale Distributed Learning Systems [71.14339738190202]
Democratized learning (Dem-AI) lays out a holistic philosophy with underlying principles for building large-scale distributed and democratized machine learning systems.
Inspired by Dem-AI philosophy, a novel distributed learning approach is proposed in this paper.
The proposed algorithms achieve better generalization performance of the agents' learning models than conventional FL algorithms.
arXiv Detail & Related papers (2020-07-07T08:34:48Z)