Efficient Data-Plane Memory Scheduling for In-Network Aggregation
- URL: http://arxiv.org/abs/2201.06398v1
- Date: Mon, 17 Jan 2022 13:29:18 GMT
- Title: Efficient Data-Plane Memory Scheduling for In-Network Aggregation
- Authors: Hao Wang, Yuxuan Qin, ChonLam Lao, Yanfang Le, Wenfei Wu, Kai Chen
- Abstract summary: We propose ESA, an $\underline{E}$fficient Switch Memory $\underline{S}$cheduler for In-Network $\underline{A}$ggregation.
At its core, ESA enforces the preemptive aggregator allocation primitive and introduces priority scheduling at the data-plane.
Experiments show that ESA can improve the average JCT by up to $1.35\times$.
- Score: 14.52822604368543
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: As the scale of distributed training grows, communication becomes a
bottleneck. To accelerate the communication, recent works introduce In-Network
Aggregation (INA), which moves the gradients summation into network
middle-boxes, e.g., programmable switches, to reduce the traffic volume.
However, switch memory is scarce compared to the volume of gradients
transmitted in distributed training. Although prior work applies methods like
pool-based streaming or dynamic sharing to tackle this mismatch, switch memory
is still a potential performance bottleneck. Furthermore, we observe the
under-utilization of switch memory due to the synchronization requirement for
aggregator deallocation in recent works. To improve the switch memory
utilization, we propose ESA, an $\underline{E}$fficient Switch Memory
$\underline{S}$cheduler for In-Network $\underline{A}$ggregation. At its core,
ESA enforces the preemptive aggregator allocation primitive and introduces
priority scheduling at the data-plane, which improves the switch memory
utilization and average job completion time (JCT). Experiments show that ESA
can improve the average JCT by up to $1.35\times$.
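The preemptive allocation plus priority scheduling idea can be sketched at a high level. The following is an illustrative host-side model only; the class and method names are hypothetical, and ESA's actual mechanism runs in the switch data plane, not in Python:

```python
import heapq


class AggregatorPool:
    """Toy model of priority-based preemptive aggregator allocation.

    Higher numeric priority wins. When the pool is exhausted, an
    incoming higher-priority job may preempt the lowest-priority
    holder instead of waiting for a synchronized deallocation.
    """

    def __init__(self, num_aggregators):
        self.free = list(range(num_aggregators))
        # Min-heap of (priority, aggregator_id, job): the lowest-priority
        # holder sits at the top and is the preemption victim.
        self.in_use = []

    def allocate(self, job, priority):
        if self.free:
            agg = self.free.pop()
            heapq.heappush(self.in_use, (priority, agg, job))
            return agg
        # Pool exhausted: preempt only if strictly higher priority.
        lowest_prio, agg, victim = self.in_use[0]
        if priority > lowest_prio:
            heapq.heappop(self.in_use)
            heapq.heappush(self.in_use, (priority, agg, job))
            return agg  # victim's partial sums would be flushed back
        return None  # denied; job falls back to server-side aggregation

    def release(self, agg_id):
        self.in_use = [(p, a, j) for p, a, j in self.in_use if a != agg_id]
        heapq.heapify(self.in_use)
        self.free.append(agg_id)
```

For example, with a single aggregator, a priority-5 job preempts a priority-1 holder, while a priority-2 request against the priority-5 holder is denied. This captures why preemption raises memory utilization: scarce aggregators are never idly pinned by low-priority jobs waiting on deallocation.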
Related papers
- SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios.
In the early route, intermediate outputs are consolidated via an anti-redundancy operation.
In the late route, using a minimal number of late pre-trained layers alleviates the peak memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z) - ACCO: Accumulate while you Communicate, Hiding Communications in Distributed LLM Training [16.560270624096706]
We propose a memory-efficient optimization algorithm tailored for distributed training of Large Language Models.
Our method relies on a novel technique to mitigate the one-step delay inherent in parallel execution of gradient computations and communications.
arXiv Detail & Related papers (2024-06-03T08:23:45Z) - UniPT: Universal Parallel Tuning for Transfer Learning with Efficient
Parameter and Memory [69.33445217944029]
PETL is an effective strategy for adapting pre-trained models to downstream domains.
Recent PETL works focus on the more valuable memory-efficient characteristic.
We propose a new memory-efficient PETL strategy, Universal Parallel Tuning (UniPT)
arXiv Detail & Related papers (2023-08-28T05:38:43Z) - Fast Distributed Inference Serving for Large Language Models [12.682341873843882]
Large language models (LLMs) power a new generation of interactive AI applications exemplified by ChatGPT.
The interactive nature of these applications demands low job completion time (JCT) for model inference.
We present FastServe, a distributed inference serving system for LLMs.
arXiv Detail & Related papers (2023-05-10T06:17:50Z) - Accelerating Transfer Learning with Near-Data Computation on Cloud
Object Stores [5.057544107331778]
This paper identifies transfer learning (TL) as a natural fit for the disaggregated cloud.
We show how to leverage the unique structure of TL's fine-tuning phase to flexibly address the aforementioned constraints.
We present HAPI, a processing system for TL that spans the compute and storage tiers while remaining transparent to the user.
arXiv Detail & Related papers (2022-10-16T22:28:36Z) - NumS: Scalable Array Programming for the Cloud [82.827921577004]
We present NumS, an array programming library which optimizes NumPy-like expressions on task-based distributed systems.
This is achieved through a novel scheduler called Load Simulated Hierarchical Scheduling (LSHS)
We show that LSHS enhances performance on Ray by decreasing network load by a factor of 2x, requiring 4x less memory, and reducing execution time by 10x on the logistic regression problem.
arXiv Detail & Related papers (2022-06-28T20:13:40Z) - Memory-Guided Semantic Learning Network for Temporal Sentence Grounding [55.31041933103645]
We propose a memory-augmented network that learns and memorizes the rarely appeared content in TSG tasks.
MGSL-Net consists of three main parts: a cross-modal interaction module, a memory augmentation module, and a heterogeneous attention module.
arXiv Detail & Related papers (2022-01-03T02:32:06Z) - Layer-Parallel Training of Residual Networks with Auxiliary-Variable
Networks [28.775355111614484]
Auxiliary-variable methods have attracted much interest lately but suffer from significant communication overhead and lack of data augmentation.
We present a novel joint learning framework for training realistic ResNets across multiple compute devices.
We demonstrate the effectiveness of our methods on ResNets and WideResNets across CIFAR-10, CIFAR-100, and ImageNet datasets.
arXiv Detail & Related papers (2021-12-10T08:45:35Z) - Mesa: A Memory-saving Training Framework for Transformers [58.78933015299703]
We present Mesa, a memory-saving training framework for Transformers.
Mesa uses exact activations during the forward pass while storing a low-precision version of the activations to reduce memory consumption during training.
Experiments on ImageNet, CIFAR-100 and ADE20K demonstrate that Mesa can halve the memory footprint during training.
arXiv Detail & Related papers (2021-11-22T11:23:01Z) - MAFAT: Memory-Aware Fusing and Tiling of Neural Networks for Accelerated
Edge Inference [1.7894377200944507]
Machine learning networks can easily exceed available memory, increasing latency due to excessive OS swapping.
We propose a memory usage predictor coupled with a search algorithm to provide optimized fusing and tiling configurations.
Results show that our approach can run in less than half the memory, with a speedup of up to 2.78 under severe memory constraints.
arXiv Detail & Related papers (2021-07-14T19:45:49Z) - Sparse Communication for Training Deep Networks [56.441077560085475]
Synchronous stochastic gradient descent (SGD) is the most common method used for distributed training of deep learning models.
In this algorithm, each worker shares its local gradients with others and updates the parameters using the average gradients of all workers.
We study several compression schemes and identify how three key parameters affect the performance.
arXiv Detail & Related papers (2020-09-19T17:28:11Z)
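The synchronous data-parallel update described above (each worker shares its local gradient and applies the average of all workers' gradients) can be sketched as follows. The function names are illustrative, not from any specific framework, and the all-reduce is modeled as a plain elementwise average:

```python
def allreduce_average(local_grads):
    """Average per-worker gradient vectors elementwise.

    Models the all-reduce step: every worker ends up with the
    same averaged gradient.
    """
    n = len(local_grads)
    dim = len(local_grads[0])
    return [sum(g[i] for g in local_grads) / n for i in range(dim)]


def sync_sgd_step(params, local_grads, lr=0.1):
    """One synchronous SGD step: average gradients, then apply
    the identical update on every worker's parameter copy."""
    avg = allreduce_average(local_grads)
    return [p - lr * g for p, g in zip(params, avg)]
```

For instance, two workers holding gradients `[1.0, 3.0]` and `[3.0, 1.0]` both apply the averaged gradient `[2.0, 2.0]`, keeping their parameter replicas in sync. Gradient compression schemes such as those studied in the paper above sparsify or quantize `local_grads` before this exchange to cut communication cost.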
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.