ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems
for Large-model Training at Scale
- URL: http://arxiv.org/abs/2303.14006v1
- Date: Fri, 24 Mar 2023 14:00:18 GMT
- Title: ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems
for Large-model Training at Scale
- Authors: William Won, Taekyung Heo, Saeed Rashidi, Srinivas Sridharan,
Sudarshan Srinivasan, Tushar Krishna
- Abstract summary: We extend the open-source ASTRA-sim infrastructure to model state-of-the-art and emerging distributed training models and platforms.
We run comprehensive case studies targeting emerging distributed models and platforms.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As deep learning models and input data scale at an unprecedented
rate, moving to distributed training platforms is inevitable to fit the model
and increase training throughput. State-of-the-art approaches and
techniques, such as wafer-scale nodes, multi-dimensional network topologies,
disaggregated memory systems, and parallelization strategies, have been
actively adopted by emerging distributed training systems. This results in a
complex SW/HW co-design stack of distributed training, necessitating a
modeling/simulation infrastructure for design-space exploration. In this paper,
we extend the open-source ASTRA-sim infrastructure and endow it with the
capabilities to model state-of-the-art and emerging distributed training models
and platforms. More specifically, (i) we enable ASTRA-sim to support arbitrary
model parallelization strategies via a graph-based training-loop
implementation, (ii) we implement a parameterizable multi-dimensional
heterogeneous topology generation infrastructure with analytical performance
estimates enabling simulating target systems at scale, and (iii) we enhance the
memory system modeling to support accurate modeling of in-network collective
communication and disaggregated memory systems. With such capabilities, we run
comprehensive case studies targeting emerging distributed models and platforms.
This infrastructure lets system designers swiftly traverse the complex
co-design stack and gain meaningful insights when designing and deploying
distributed training platforms at scale.
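To illustrate the kind of analytical performance estimates the abstract describes for multi-dimensional topologies, here is a minimal sketch of an alpha-beta cost model for a hierarchical (dimension-by-dimension) ring all-reduce. All function names, parameters, and numbers below are hypothetical illustrations, not ASTRA-sim's actual API.

```python
# Hypothetical alpha-beta cost model for a ring all-reduce executed
# dimension-by-dimension over a multi-dimensional network topology.
# This is only a sketch of the modeling approach, not ASTRA-sim code.

def ring_allreduce_time(msg_bytes, n, alpha, beta):
    """Ring all-reduce over n endpoints: 2*(n-1) steps,
    each moving msg_bytes/n at per-message latency alpha (s)
    and per-byte cost beta (s/byte)."""
    if n <= 1:
        return 0.0
    steps = 2 * (n - 1)
    return steps * (alpha + (msg_bytes / n) * beta)

def hierarchical_allreduce_time(msg_bytes, dims):
    """dims: list of (size, alpha, beta) tuples, one per network
    dimension; the collective runs in each dimension in sequence."""
    return sum(ring_allreduce_time(msg_bytes, n, a, b) for n, a, b in dims)

# Example: a 2D topology with 8 fast intra-node links and
# 16 slower inter-node links (illustrative numbers only).
dims = [
    (8, 1e-6, 1 / 200e9),   # intra-node: 1 us latency, 200 GB/s
    (16, 5e-6, 1 / 25e9),   # inter-node: 5 us latency, 25 GB/s
]
t = hierarchical_allreduce_time(100e6, dims)  # 100 MB gradient all-reduce
```

A closed-form model like this lets a simulator sweep topology shapes and link bandwidths at scale without cycle-level network simulation, which is the trade-off the analytical topology infrastructure in the abstract targets.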
Related papers
- Flow-Through Tensors: A Unified Computational Graph Architecture for Multi-Layer Transportation Network Optimization [20.685856719515026]
Flow-Through Tensors (FTT) is a unified computational graph architecture that connects origin-destination flows, path probabilities, and link travel times as interconnected tensors. The framework makes three key contributions: first, it establishes a consistent mathematical structure that enables gradient-based optimization across previously separate modeling elements. Second, it supports multidimensional analysis of traffic patterns over time, space, and user groups with precise quantification of system efficiency.
arXiv Detail & Related papers (2025-06-30T06:42:23Z) - Transferable Post-training via Inverse Value Learning [83.75002867411263]
We propose modeling changes at the logits level during post-training using a separate neural network (i.e., the value network).
After training this network on a small base model using demonstrations, this network can be seamlessly integrated with other pre-trained models during inference.
We demonstrate that the resulting value network has broad transferability across pre-trained models of different parameter sizes.
arXiv Detail & Related papers (2024-10-28T13:48:43Z) - Towards a Better Theoretical Understanding of Independent Subnetwork Training [56.24689348875711]
We take a closer theoretical look at Independent Subnetwork Training (IST), a recently proposed and highly effective technique for solving the aforementioned problems.
We identify fundamental differences between IST and alternative approaches, such as distributed methods with compressed communication.
arXiv Detail & Related papers (2023-06-28T18:14:22Z) - Vertical Federated Learning over Cloud-RAN: Convergence Analysis and
System Optimization [82.12796238714589]
We propose a novel cloud radio access network (Cloud-RAN) based vertical FL system to enable fast and accurate model aggregation.
We characterize the convergence behavior of the vertical FL algorithm considering both uplink and downlink transmissions.
We establish a system optimization framework by joint transceiver and fronthaul quantization design, for which successive convex approximation and alternate convex search based system optimization algorithms are developed.
arXiv Detail & Related papers (2023-05-04T09:26:03Z) - On Optimizing the Communication of Model Parallelism [74.15423270435949]
We study a novel and important communication pattern in large-scale model-parallel deep learning (DL).
In cross-mesh resharding, a sharded tensor needs to be sent from a source device mesh to a destination device mesh.
We propose two contributions to address cross-mesh resharding: an efficient broadcast-based communication system, and an "overlapping-friendly" pipeline schedule.
arXiv Detail & Related papers (2022-11-10T03:56:48Z) - State-driven Implicit Modeling for Sparsity and Robustness in Neural
Networks [3.604879434384177]
We present a new approach to training implicit models, called State-driven Implicit Modeling (SIM).
SIM constrains the internal states and outputs to match that of a baseline model, circumventing costly backward computations.
We demonstrate how the SIM approach can be applied to significantly improve sparsity and robustness of baseline models trained on datasets.
arXiv Detail & Related papers (2022-09-19T23:58:48Z) - Edge-assisted Democratized Learning Towards Federated Analytics [67.44078999945722]
We show the hierarchical learning structure of the proposed edge-assisted democratized learning mechanism, namely Edge-DemLearn.
We also validate Edge-DemLearn as a flexible model training mechanism to build a distributed control and aggregation methodology in regions.
arXiv Detail & Related papers (2020-12-01T11:46:03Z) - S2RMs: Spatially Structured Recurrent Modules [105.0377129434636]
We take a step towards models capable of simultaneously exploiting both modular and temporal structures.
We find our models to be robust to the number of available views and better capable of generalization to novel tasks without additional training.
arXiv Detail & Related papers (2020-07-13T17:44:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.