T3: Transparent Tracking & Triggering for Fine-grained Overlap of
Compute & Collectives
- URL: http://arxiv.org/abs/2401.16677v1
- Date: Tue, 30 Jan 2024 01:55:34 GMT
- Title: T3: Transparent Tracking & Triggering for Fine-grained Overlap of
Compute & Collectives
- Authors: Suchita Pati, Shaizeen Aga, Mahzabeen Islam, Nuwan Jayasena and
Matthew D. Sinclair
- Abstract summary: Large Language Models increasingly rely on distributed techniques for their training and inference.
Techniques such as Tensor Parallelism inherently serialize communication with model execution.
One approach to hiding this serialized communication is to interleave it, in a fine-grained manner, with the producer operation that generates the communicated data.
We propose T3, which applies hardware-software co-design to transparently overlap serialized communication while minimizing resource contention with compute.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models increasingly rely on distributed techniques for their
training and inference. These techniques require communication across devices
which can reduce scaling efficiency as the number of devices increases. While
some distributed techniques can overlap, and thus hide, this communication with
independent computations, techniques such as Tensor Parallelism (TP) inherently
serialize communication with model execution. One approach to hide this
serialized communication is to interleave it with the producer operation (of
the communicated data) in a fine-grained manner. However, this fine-grained
interleaving of communication and computation in software can be difficult.
Furthermore, as with any concurrent execution, it requires compute and memory
resources to be shared between computation and communication, causing resource
contention that reduces overlapping efficacy.
To overcome these challenges, we propose T3, which applies hardware-software
co-design to transparently overlap serialized communication while minimizing
resource contention with compute. T3 transparently fuses producer operations
with the subsequent communication via a simple configuration of the producer's
output address space and requires minor software changes. At the hardware
level, T3 adds a lightweight track-and-trigger mechanism to orchestrate the
producer's compute and communication. It further uses compute-enhanced
memories for communication's attendant compute. As a result, T3 reduces
resource contention and efficiently overlaps serialized communication with
computation. For important Transformer models like T-NLG, T3 speeds up
communication-heavy sublayers by 30% geomean (max 47%) and reduces data
movement by 22% geomean (max 36%). Furthermore, T3's benefits persist as models
scale: 29% geomean for sublayers in ~500-billion parameter models, PaLM
and MT-NLG.
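
As a software-level illustration of the fine-grained interleaving the abstract describes, the sketch below chunks a producer GEMM and launches each chunk's all-reduce as soon as that chunk is ready, so communication proceeds while later chunks are still being computed. This is a hedged analogue of what T3 automates in hardware, not the paper's implementation; the chunk count, tensor shapes, and single-process "gloo" group are assumptions made only to keep the example runnable.

```python
# Minimal sketch: overlap a producer matmul with its all-reduce by chunking
# along the output rows. Illustrative analogue of T3's fine-grained overlap,
# not the paper's hardware mechanism.
import os
import torch
import torch.distributed as dist

def chunked_gemm_allreduce(x, w, n_chunks=4):
    """Compute x @ w in row chunks, starting each chunk's all-reduce
    (asynchronously) while subsequent chunks are still computing."""
    step = (x.shape[0] + n_chunks - 1) // n_chunks
    pending = []
    for lo in range(0, x.shape[0], step):
        part = x[lo:lo + step] @ w                   # producer compute for one chunk
        work = dist.all_reduce(part, async_op=True)  # its communication starts now
        pending.append((work, part))
    for work, _ in pending:                          # drain in-flight collectives
        work.wait()
    return torch.cat([part for _, part in pending])

if __name__ == "__main__":
    # Single-process group so the sketch runs standalone; real TP spans devices.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29501")
    dist.init_process_group("gloo", rank=0, world_size=1)
    y = chunked_gemm_allreduce(torch.randn(128, 64), torch.randn(64, 32))
    print(y.shape)  # torch.Size([128, 32])
    dist.destroy_process_group()
```

Even in this software analogue, each chunk's collective still competes with the remaining compute for cores and memory bandwidth; T3's contribution is to move the tracking, triggering, and reduction work off those shared resources.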
Related papers
- Ladder-residual: parallelism-aware architecture for accelerating large model inference with communication overlapping
We introduce Ladder Residual, a simple architectural modification applicable to all residual-based models.
Applying Ladder Residual to all its layers can achieve a 29% end-to-end wall-clock speedup at inference time with TP sharding over 8 devices.
We train a 1B and 3B Ladder Transformer from scratch and observe comparable performance to a standard dense transformer baseline.
arXiv Detail & Related papers (2025-01-11T17:06:30Z)
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve a 1.45-9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- FedFT: Improving Communication Performance for Federated Learning with Frequency Space Transformation
We introduce FedFT (federated frequency-space transformation), a simple yet effective methodology for communicating model parameters in a Federated Learning setting.
FedFT uses the Discrete Cosine Transform (DCT) to represent model parameters in frequency space, enabling efficient compression and reducing communication overhead (a schematic sketch of this idea appears after this list).
We demonstrate the generalisability of the FedFT methodology on four datasets using comparative studies with three state-of-the-art FL baselines.
arXiv Detail & Related papers (2024-09-08T23:05:35Z)
- ISO: Overlap of Computation and Communication within Seqenence For LLM Inference
This paper introduces a novel strategy for computation-communication overlap that operates at the sequence level.
Experimental evaluations conducted using 30B/70B models have demonstrated significant improvements in efficiency.
arXiv Detail & Related papers (2024-09-04T05:22:17Z)
- High-Dimensional Distributed Sparse Classification with Scalable Communication-Efficient Global Updates
We develop methods that enable us to learn a communication-efficient distributed logistic regression model.
In our experiments we demonstrate a large improvement in accuracy over distributed algorithms, with only a few distributed update steps needed.
arXiv Detail & Related papers (2024-07-08T19:34:39Z)
- SpaFL: Communication-Efficient Federated Learning with Sparse Models and Low computational Overhead
We propose SpaFL, a communication-efficient FL framework that optimizes sparse model structures with low computational overhead.
To optimize the pruning process itself, only thresholds are communicated between a server and clients instead of parameters.
Global thresholds are used to update model parameters by extracting aggregated parameter importance.
arXiv Detail & Related papers (2024-06-01T13:10:35Z)
- An Adaptive Device-Edge Co-Inference Framework Based on Soft Actor-Critic
High-dimensional parameter models and large-scale mathematical calculations restrict execution efficiency, especially for Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL) method, Soft Actor-Critic for discrete actions (SAC-d), which generates the exit point and compressing bits via soft policy iterations.
With a latency- and accuracy-aware reward design, such a framework can adapt well to complex environments like dynamic wireless channels and arbitrary processing, and is capable of supporting 5G URLLC.
arXiv Detail & Related papers (2022-01-09T09:31:50Z)
- Synthesizing Collective Communication Algorithms for Heterogeneous Networks with TACCL
We present TACCL, a synthesizer for collective communication primitives for large-scale multi-GPU systems.
TACCL encodes a profiled topology and input size into a synthesis problem to generate optimized communication algorithms.
Using TACCL's algorithms speeds up the end-to-end training of an internal mixture of experts model by 17%.
arXiv Detail & Related papers (2021-11-08T23:20:52Z)
- Training Recommender Systems at Scale: Communication-Efficient Model and Data Parallelism
We propose a compression framework called Dynamic Communication Thresholding (DCT) for communication-efficient hybrid training.
DCT reduces communication by at least 100x and 20x during DP and MP, respectively.
It improves end-to-end training time for a state-of-the-art industrial recommender model by 37%, without any loss in performance (a schematic sketch of threshold-based compression appears after this list).
arXiv Detail & Related papers (2020-10-18T01:44:42Z)
- Straggler-aware Distributed Learning: Communication Computation Latency Trade-off
Straggling workers can be tolerated by assigning redundant computations and coding across data and computations.
In most existing schemes, each non-straggling worker transmits one message per iteration to the parameter server (PS) after completing all its computations.
Imposing such a limitation results in two main drawbacks: over-computation due to inaccurate prediction of the straggling behaviour, and under-utilization due to treating workers as strictly straggler or non-straggler.
arXiv Detail & Related papers (2020-04-10T08:39:36Z)
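
As referenced in the FedFT entry above, here is a hedged sketch of DCT-based parameter compression: transform a parameter vector into frequency space, transmit only the lowest-frequency coefficients, and invert on the receiving side. The keep fraction and vector contents are illustrative assumptions, not values from the paper.

```python
# Sketch of frequency-space parameter compression in the spirit of FedFT.
import numpy as np
from scipy.fft import dct, idct

def compress(params, keep=0.25):
    """DCT the flat parameter vector and keep the lowest-frequency fraction."""
    coeffs = dct(params, norm="ortho")
    k = max(1, int(keep * coeffs.size))
    return coeffs[:k], coeffs.size

def decompress(kept, size):
    """Zero-pad the truncated coefficients and invert the transform."""
    full = np.zeros(size)
    full[:kept.size] = kept
    return idct(full, norm="ortho")

# A smooth "parameter" vector compresses well under the DCT.
w = np.sin(np.linspace(0, 8 * np.pi, 1024))
kept, n = compress(w)
w_hat = decompress(kept, n)
print(f"sent {kept.size}/{n} values, reconstruction MSE = {np.mean((w - w_hat) ** 2):.6f}")
```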
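
Similarly, for the Dynamic Communication Thresholding entry, the sketch below shows magnitude-based gradient compression: only the largest entries are communicated, as index/value pairs. The fixed top-k rule and keep ratio are assumptions standing in for the paper's dynamic thresholding policy.

```python
# Sketch of threshold-based gradient compression; the fixed top-k rule here
# stands in for Dynamic Communication Thresholding's adaptive policy.
import numpy as np

def threshold_compress(grad, keep=0.01):
    """Keep only the largest-magnitude entries; send (indices, values)."""
    k = max(1, int(keep * grad.size))
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    return idx, grad[idx]

def threshold_decompress(idx, vals, size):
    """Rebuild a dense gradient with zeros everywhere else."""
    out = np.zeros(size)
    out[idx] = vals
    return out

g = np.random.default_rng(1).standard_normal(10_000)
idx, vals = threshold_compress(g)
g_hat = threshold_decompress(idx, vals, g.size)
print(f"communicated {idx.size + vals.size} numbers instead of {g.size}")
```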