LIBRA: Enabling Workload-aware Multi-dimensional Network Topology Optimization for Distributed Training of Large AI Models
- URL: http://arxiv.org/abs/2109.11762v2
- Date: Sun, 5 May 2024 05:53:40 GMT
- Title: LIBRA: Enabling Workload-aware Multi-dimensional Network Topology Optimization for Distributed Training of Large AI Models
- Authors: William Won, Saeed Rashidi, Sudarshan Srinivasan, Tushar Krishna
- Abstract summary: We motivate the design of multi-dimensional networks within machine learning systems as a cost-efficient mechanism to enhance overall network bandwidth.
We introduce LIBRA, a framework specifically focused on optimizing multi-dimensional fabric architectures.
- Score: 6.980277221943408
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As model sizes in machine learning continue to scale, distributed training is necessary to accommodate model weights within each device and to reduce training time. However, this comes with the expense of increased communication overhead due to the exchange of gradients and activations, which become the critical bottleneck of the end-to-end training process. In this work, we motivate the design of multi-dimensional networks within machine learning systems as a cost-efficient mechanism to enhance overall network bandwidth. We also identify that optimal bandwidth allocation is pivotal for multi-dimensional networks to ensure efficient resource utilization. We introduce LIBRA, a framework specifically focused on optimizing multi-dimensional fabric architectures. Through case studies, we demonstrate the value of LIBRA, both in architecting optimized fabrics under diverse constraints and in enabling co-optimization opportunities.
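To make the bandwidth-allocation point in the abstract concrete, here is a minimal toy sketch. It is not LIBRA's actual formulation. It assumes that each network dimension carries a fixed traffic volume, that the dimensions operate in a pipelined fashion so the slowest one sets the communication time, and that a fixed total bandwidth budget is split across dimensions. The function names, volumes, and budget below are all hypothetical.

```python
# Toy model (an assumption, not taken from the paper): communication time is
# set by the slowest dimension, i.e. max_i(volume_i / bandwidth_i).

def bottleneck_time(volumes, bandwidths):
    """Return the time of the slowest dimension for the given allocation."""
    return max(v / b for v, b in zip(volumes, bandwidths))


def proportional_allocation(volumes, total_bandwidth):
    """Split the bandwidth budget in proportion to per-dimension traffic.

    Under the toy model this equalizes volume_i / bandwidth_i across
    dimensions, which minimizes the maximum for a fixed total budget.
    """
    total_volume = sum(volumes)
    return [total_bandwidth * v / total_volume for v in volumes]


# Hypothetical workload: traffic per dimension (GB) and total budget (GB/s).
volumes = [8.0, 4.0, 1.0]
total_bw = 300.0

equal = [total_bw / len(volumes)] * len(volumes)
tuned = proportional_allocation(volumes, total_bw)

print("equal split :", bottleneck_time(volumes, equal))
print("proportional:", bottleneck_time(volumes, tuned))
```

In this toy setting the equal split leaves the busiest dimension as the bottleneck while the others sit partly idle. The proportional split spends the same budget but finishes all dimensions at the same time.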
Related papers
- Federated Split Learning with Model Pruning and Gradient Quantization in Wireless Networks [7.439160287320074]
Federated split learning (FedSL) implements collaborative training across the edge devices and the server through model splitting.
We propose a lightweight FedSL scheme that further alleviates the training burden on resource-constrained edge devices.
We conduct theoretical analysis to quantify the convergence performance of the proposed scheme.
arXiv Detail & Related papers (2024-12-09T11:43:03Z) - Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design [59.00758127310582]
We propose a novel framework Read-ME that transforms pre-trained dense LLMs into smaller MoE models.
Our approach employs activation sparsity to extract experts.
Read-ME outperforms other popular open-source dense models of similar scales.
arXiv Detail & Related papers (2024-10-24T19:48:51Z) - RTF-Q: Efficient Unsupervised Domain Adaptation with Retraining-free Quantization [14.447148108341688]
We propose efficient unsupervised domain adaptation with ReTraining-Free Quantization (RTF-Q).
Our approach uses low-precision quantization architectures with varying computational costs, adapting to devices with dynamic budgets.
We demonstrate that our network achieves competitive accuracy with state-of-the-art methods across three benchmarks.
arXiv Detail & Related papers (2024-08-11T11:53:29Z) - RL-MUL 2.0: Multiplier Design Optimization with Parallel Deep Reinforcement Learning and Space Reduction [8.093985979285533]
We propose a multiplier design optimization framework based on reinforcement learning.
We utilize matrix and tensor representations for the compressor tree of a multiplier, enabling seamless integration of convolutional neural networks as the agent network.
Experiments conducted on different bit widths of multipliers demonstrate that multipliers produced by our approach outperform all baseline designs in terms of area, power, and delay.
arXiv Detail & Related papers (2024-03-31T10:43:33Z) - When Computing Power Network Meets Distributed Machine Learning: An Efficient Federated Split Learning Framework [6.871107511111629]
CPN-FedSL is a Federated Split Learning (FedSL) framework over Computing Power Network (CPN).
We build a dedicated model to capture the basic settings and learning characteristics (e.g., latency, flow, convergence).
arXiv Detail & Related papers (2023-05-22T12:36:52Z) - Vertical Federated Learning over Cloud-RAN: Convergence Analysis and System Optimization [82.12796238714589]
We propose a novel cloud radio access network (Cloud-RAN) based vertical FL system to enable fast and accurate model aggregation.
We characterize the convergence behavior of the vertical FL algorithm considering both uplink and downlink transmissions.
We establish a system optimization framework by joint transceiver and fronthaul quantization design, for which successive convex approximation and alternate convex search based system optimization algorithms are developed.
arXiv Detail & Related papers (2023-05-04T09:26:03Z) - Vertical Layering of Quantized Neural Networks for Heterogeneous Inference [57.42762335081385]
We study a new vertical-layered representation of neural network weights for encapsulating all quantized models into a single one.
We can theoretically achieve any precision network for on-demand service while only needing to train and maintain one model.
arXiv Detail & Related papers (2022-12-10T15:57:38Z) - All at Once Network Quantization via Collaborative Knowledge Transfer [56.95849086170461]
We develop a novel collaborative knowledge transfer approach for efficiently training the all-at-once quantization network.
Specifically, we propose an adaptive selection strategy to choose a high-precision "teacher" for transferring knowledge to the low-precision student.
To effectively transfer knowledge, we develop a dynamic block swapping method by randomly replacing the blocks in the lower-precision student network with the corresponding blocks in the higher-precision teacher network.
arXiv Detail & Related papers (2021-03-02T03:09:03Z) - Edge-assisted Democratized Learning Towards Federated Analytics [67.44078999945722]
We show the hierarchical learning structure of the proposed edge-assisted democratized learning mechanism, namely Edge-DemLearn.
We also validate Edge-DemLearn as a flexible model training mechanism to build a distributed control and aggregation methodology in regions.
arXiv Detail & Related papers (2020-12-01T11:46:03Z) - On the Difficulty of Designing Processor Arrays for Deep Neural Networks [0.0]
Camuy is a lightweight model of a weight-stationary systolic array for linear algebra operations.
We present an analysis of popular models to illustrate how it can estimate required cycles, data movement costs, as well as systolic array utilization.
arXiv Detail & Related papers (2020-06-24T19:24:08Z) - Efficient Crowd Counting via Structured Knowledge Transfer [122.30417437707759]
Crowd counting is an application-oriented task and its inference efficiency is crucial for real-world applications.
We propose a novel Structured Knowledge Transfer framework to generate a lightweight but still highly effective student network.
Our models obtain at least 6.5$\times$ speed-up on an Nvidia 1080 GPU and even achieve state-of-the-art performance.
arXiv Detail & Related papers (2020-03-23T08:05:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.