Efficient Data-Plane Memory Scheduling for In-Network Aggregation
- URL: http://arxiv.org/abs/2201.06398v1
- Date: Mon, 17 Jan 2022 13:29:18 GMT
- Title: Efficient Data-Plane Memory Scheduling for In-Network Aggregation
- Authors: Hao Wang, Yuxuan Qin, ChonLam Lao, Yanfang Le, Wenfei Wu, Kai Chen
- Abstract summary: We propose ESA, an $\underline{E}$fficient Switch Memory $\underline{S}$cheduler for In-Network $\underline{A}$ggregation.
At its core, ESA enforces the preemptive aggregator allocation primitive and introduces priority scheduling at the data-plane.
Experiments show that ESA can improve the average JCT by up to $1.35\times$.
- Score: 14.52822604368543
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: As the scale of distributed training grows, communication becomes a
bottleneck. To accelerate the communication, recent works introduce In-Network
Aggregation (INA), which moves the gradients summation into network
middle-boxes, e.g., programmable switches, to reduce the traffic volume.
However, switch memory is scarce compared to the volume of gradients
transmitted in distributed training. Although prior work applies methods like
pool-based streaming or dynamic sharing to tackle this mismatch, switch memory
is still a potential performance bottleneck. Furthermore, we observe the
under-utilization of switch memory due to the synchronization requirement for
aggregator deallocation in recent works. To improve the switch memory
utilization, we propose ESA, an $\underline{E}$fficient Switch Memory
$\underline{S}$cheduler for In-Network $\underline{A}$ggregation. At its core,
ESA enforces the preemptive aggregator allocation primitive and introduces
priority scheduling at the data-plane, which improves the switch memory
utilization and average job completion time (JCT). Experiments show that ESA
can improve the average JCT by up to $1.35\times$.
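The preemptive allocation plus priority scheduling idea can be sketched at a high level. The following is an illustrative host-side model only; the class and method names are hypothetical, and ESA's actual mechanism runs in the switch data plane, not in Python:

```python
import heapq


class AggregatorPool:
    """Toy model of priority-based preemptive aggregator allocation.

    Higher numeric priority wins. When the pool is exhausted, an
    incoming higher-priority job may preempt the lowest-priority
    holder instead of waiting for a synchronized deallocation.
    """

    def __init__(self, num_aggregators):
        self.free = list(range(num_aggregators))
        # Min-heap of (priority, aggregator_id, job): the lowest-priority
        # holder sits at the top and is the preemption victim.
        self.in_use = []

    def allocate(self, job, priority):
        if self.free:
            agg = self.free.pop()
            heapq.heappush(self.in_use, (priority, agg, job))
            return agg
        # Pool exhausted: preempt only if strictly higher priority.
        lowest_prio, agg, victim = self.in_use[0]
        if priority > lowest_prio:
            heapq.heappop(self.in_use)
            heapq.heappush(self.in_use, (priority, agg, job))
            return agg  # victim's partial sums would be flushed back
        return None  # denied; job falls back to server-side aggregation

    def release(self, agg_id):
        self.in_use = [(p, a, j) for p, a, j in self.in_use if a != agg_id]
        heapq.heapify(self.in_use)
        self.free.append(agg_id)
```

For example, with a single aggregator, a priority-5 job preempts a priority-1 holder, while a priority-2 request against the priority-5 holder is denied. This captures why preemption raises memory utilization: scarce aggregators are never idly pinned by low-priority jobs waiting on deallocation.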
Related papers
- SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios.
In the early route, intermediate outputs are consolidated via an anti-redundancy operation.
In the late route, using a minimal number of late pre-trained layers alleviates the peak memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z) - ACCO: Accumulate while you Communicate, Hiding Communications in Distributed LLM Training [16.560270624096706]
We propose a memory-efficient optimization algorithm tailored for distributed training of Large Language Models.
Our method relies on a novel technique to mitigate the one-step delay inherent in parallel execution of gradient computations and communications.
arXiv Detail & Related papers (2024-06-03T08:23:45Z) - UniPT: Universal Parallel Tuning for Transfer Learning with Efficient
Parameter and Memory [69.33445217944029]
PETL is an effective strategy for adapting pre-trained models to downstream domains.
Recent PETL works focus on the more valuable memory-efficient characteristic.
We propose a new memory-efficient PETL strategy, Universal Parallel Tuning (UniPT)
arXiv Detail & Related papers (2023-08-28T05:38:43Z) - Fast Distributed Inference Serving for Large Language Models [12.682341873843882]
Large language models (LLMs) power a new generation of interactive AI applications exemplified by ChatGPT.
The interactive nature of these applications demands low job completion time (JCT) for model inference.
We present FastServe, a distributed inference serving system for LLMs.
arXiv Detail & Related papers (2023-05-10T06:17:50Z) - Accelerating Transfer Learning with Near-Data Computation on Cloud
Object Stores [5.057544107331778]
This paper identifies transfer learning (TL) as a natural fit for the disaggregated cloud.
We show how to leverage the unique structure of TL's fine-tuning phase to flexibly address the aforementioned constraints.
We present HAPI, a processing system for TL that spans the compute and storage tiers while remaining transparent to the user.
arXiv Detail & Related papers (2022-10-16T22:28:36Z) - NumS: Scalable Array Programming for the Cloud [82.827921577004]
We present NumS, an array programming library which optimizes NumPy-like expressions on task-based distributed systems.
This is achieved through a novel scheduler called Load Simulated Hierarchical Scheduling (LSHS)
We show that LSHS enhances performance on Ray by decreasing network load by a factor of 2x, requiring 4x less memory, and reducing execution time by 10x on the logistic regression problem.
arXiv Detail & Related papers (2022-06-28T20:13:40Z) - Memory-Guided Semantic Learning Network for Temporal Sentence Grounding [55.31041933103645]
We propose a memory-augmented network that learns and memorizes the rarely appeared content in TSG tasks.
MGSL-Net consists of three main parts: a cross-modal interaction module, a memory augmentation module, and a heterogeneous attention module.
arXiv Detail & Related papers (2022-01-03T02:32:06Z) - Layer-Parallel Training of Residual Networks with Auxiliary-Variable
Networks [28.775355111614484]
Auxiliary-variable methods have attracted much interest lately but suffer from significant communication overhead and lack of data augmentation.
We present a novel joint learning framework for training realistic ResNets across multiple compute devices.
We demonstrate the effectiveness of our methods on ResNets and WideResNets across CIFAR-10, CIFAR-100, and ImageNet datasets.
arXiv Detail & Related papers (2021-12-10T08:45:35Z) - Mesa: A Memory-saving Training Framework for Transformers [58.78933015299703]
We present Mesa, a memory-saving training framework for Transformers.
Mesa uses exact activations during the forward pass while storing a low-precision version of the activations to reduce memory consumption during training.
Experiments on ImageNet, CIFAR-100 and ADE20K demonstrate that Mesa can halve the memory footprint during training.
arXiv Detail & Related papers (2021-11-22T11:23:01Z) - MAFAT: Memory-Aware Fusing and Tiling of Neural Networks for Accelerated
Edge Inference [1.7894377200944507]
Machine learning networks can easily exceed available memory, increasing latency due to excessive OS swapping.
We propose a memory usage predictor coupled with a search algorithm to provide optimized fusing and tiling configurations.
Results show that our approach can run in less than half the memory, with a speedup of up to 2.78 under severe memory constraints.
arXiv Detail & Related papers (2021-07-14T19:45:49Z) - Sparse Communication for Training Deep Networks [56.441077560085475]
Synchronous stochastic gradient descent (SGD) is the most common method used for distributed training of deep learning models.
In this algorithm, each worker shares its local gradients with others and updates the parameters using the average gradients of all workers.
We study several compression schemes and identify how three key parameters affect the performance.
arXiv Detail & Related papers (2020-09-19T17:28:11Z)
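The synchronous data-parallel update described above (each worker shares its local gradient and applies the average of all workers' gradients) can be sketched as follows. The function names are illustrative, not from any specific framework, and the all-reduce is modeled as a plain elementwise average:

```python
def allreduce_average(local_grads):
    """Average per-worker gradient vectors elementwise.

    Models the all-reduce step: every worker ends up with the
    same averaged gradient.
    """
    n = len(local_grads)
    dim = len(local_grads[0])
    return [sum(g[i] for g in local_grads) / n for i in range(dim)]


def sync_sgd_step(params, local_grads, lr=0.1):
    """One synchronous SGD step: average gradients, then apply
    the identical update on every worker's parameter copy."""
    avg = allreduce_average(local_grads)
    return [p - lr * g for p, g in zip(params, avg)]
```

For instance, two workers holding gradients `[1.0, 3.0]` and `[3.0, 1.0]` both apply the averaged gradient `[2.0, 2.0]`, keeping their parameter replicas in sync. Gradient compression schemes such as those studied in the paper above sparsify or quantize `local_grads` before this exchange to cut communication cost.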
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.