Domain-specific Genetic Algorithm for Multi-tenant DNN Accelerator Scheduling
- URL: http://arxiv.org/abs/2104.13997v2
- Date: Fri, 30 Apr 2021 14:41:36 GMT
- Title: Domain-specific Genetic Algorithm for Multi-tenant DNN Accelerator Scheduling
- Authors: Sheng-Chun Kao, Tushar Krishna
- Abstract summary: There is a growing trend towards building large accelerators with several sub-accelerator cores/chiplets.
This work looks at the problem of supporting multi-tenancy on such accelerators.
We develop a specialized genetic algorithm called G# with custom operators to enable structured sample-efficient exploration.
- Score: 3.8530020696501794
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As Deep Learning continues to drive a variety of applications in datacenters
and HPC, there is a growing trend towards building large accelerators with
several sub-accelerator cores/chiplets. This work looks at the problem of
supporting multi-tenancy on such accelerators. In particular, we focus on the
problem of mapping layers from several DNNs simultaneously on an accelerator.
Given the extremely large search space, we formulate the search as an
optimization problem and develop a specialized genetic algorithm called G#
with custom operators to enable structured sample-efficient exploration. We
quantitatively compare G# with several common heuristics, state-of-the-art
optimization methods, and reinforcement learning methods across different
accelerator settings (large/small accelerators) and different sub-accelerator
configurations (homogeneous/heterogeneous), and observe G# can consistently find
better solutions. Further, to enable real-time scheduling, we also demonstrate
a method to generalize the learnt schedules and transfer them to the next batch
of jobs, reducing schedule compute time to near zero.
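The search described above treats multi-tenant scheduling as black-box optimization over an assignment of DNN layers to sub-accelerator cores. As a rough illustration of that style of search (not the paper's G# algorithm or its custom operators), the following minimal Python sketch uses an invented per-layer latency model, a simple assignment encoding, and generic crossover/mutation operators; G# additionally specializes the encoding and operators to the accelerator-scheduling domain.

```python
# Hedged sketch: a toy genetic algorithm assigning DNN layers from several
# tenants to sub-accelerator cores. The latency model, encoding, and operators
# are illustrative assumptions, not the paper's G#.
import random

NUM_LAYERS = 12  # layers pooled from all tenant DNNs (assumed)
NUM_CORES = 4    # sub-accelerator cores/chiplets (assumed)
LAYER_COST = [random.uniform(1.0, 5.0) for _ in range(NUM_LAYERS)]  # toy per-layer latency

def makespan(assignment):
    """Cost of a schedule: latency of the most loaded core (toy model)."""
    load = [0.0] * NUM_CORES
    for layer, core in enumerate(assignment):
        load[core] += LAYER_COST[layer]
    return max(load)

def crossover(a, b):
    """Single-point crossover over the layer-to-core encoding."""
    cut = random.randrange(1, NUM_LAYERS)
    return a[:cut] + b[cut:]

def mutate(genome, rate=0.1):
    """Reassign a few layers to random cores."""
    return [random.randrange(NUM_CORES) if random.random() < rate else g
            for g in genome]

def evolve(pop_size=50, generations=100):
    pop = [[random.randrange(NUM_CORES) for _ in range(NUM_LAYERS)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=makespan)
        elite = pop[: pop_size // 4]  # keep the best quarter
        children = [mutate(crossover(*random.sample(elite, 2)))
                    for _ in range(pop_size - len(elite))]
        pop = elite + children
    return min(pop, key=makespan)

if __name__ == "__main__":
    best = evolve()
    print("best makespan:", round(makespan(best), 2), "assignment:", best)
```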
Related papers
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve 1.45 - 9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- AcceleratedLiNGAM: Learning Causal DAGs at the speed of GPUs [57.12929098407975]
We show that by efficiently parallelizing existing causal discovery methods, we can scale them to thousands of dimensions.
Specifically, we focus on the causal ordering subprocedure in DirectLiNGAM and implement GPU kernels to accelerate it.
This allows us to apply DirectLiNGAM to causal inference on large-scale gene expression data with genetic interventions yielding competitive results.
arXiv Detail & Related papers (2024-03-06T15:06:11Z)
- Teal: Learning-Accelerated Optimization of WAN Traffic Engineering [68.7863363109948]
We present Teal, a learning-based TE algorithm that leverages the parallel processing power of GPUs to accelerate TE control.
To reduce the problem scale and make learning tractable, Teal employs a multi-agent reinforcement learning (RL) algorithm to independently allocate each traffic demand.
Compared with other TE acceleration schemes, Teal satisfies 6--32% more traffic demand and yields 197--625x speedups.
arXiv Detail & Related papers (2022-10-25T04:46:30Z)
- Demystifying Map Space Exploration for NPUs [4.817475305740601]
Map Space Exploration is the problem of finding optimized mappings of a Deep Neural Network (DNN) model.
We do a first-of-its-kind apples-to-apples comparison of search techniques leveraged by different mappers.
Next, we propose two new techniques that can augment existing mappers.
arXiv Detail & Related papers (2022-10-07T17:58:45Z)
- Flipping the switch on local exploration: Genetic Algorithms with Reversals [0.0]
Authors show that gradient-free search techniques are suitable for providing an optimal solution in the discrete domain.
They also show that the use of multiple local searches can improve the performance of local search.
The proposed GA variants are observed to have the lowest average cost across all benchmarks, including the proposed problem, and IC performs better than its constituents.
arXiv Detail & Related papers (2022-02-02T08:27:11Z)
- Union: A Unified HW-SW Co-Design Ecosystem in MLIR for Evaluating Tensor Operations on Spatial Accelerators [4.055002321981825]
We present a HW-SW co-design ecosystem for spatial accelerators called Union.
Our framework allows exploring different algorithms and their mappings on several accelerator cost models.
We demonstrate the value of Union for the community with several case studies.
arXiv Detail & Related papers (2021-09-15T16:42:18Z)
- Multi-task Over-the-Air Federated Learning: A Non-Orthogonal Transmission Approach [52.85647632037537]
We propose a multi-task over-the-air federated learning (MOAFL) framework, where multiple learning tasks share edge devices for data collection and learning models under the coordination of an edge server (ES).
Both the convergence analysis and numerical results demonstrate that the MOAFL framework can significantly reduce the uplink bandwidth consumption of multiple tasks without causing substantial learning performance degradation.
arXiv Detail & Related papers (2021-06-27T13:09:32Z)
- CoSA: Scheduling by Constrained Optimization for Spatial Accelerators [1.9149970150912705]
We present CoSA, a constrained-optimization-based approach for scheduling Deep Neural Network (DNN) accelerators.
As opposed to existing approaches that either rely on designers' heuristics or iterative methods to navigate the search space, CoSA expresses scheduling decisions as a constrained-optimization problem (a minimal, hypothetical sketch of this style of formulation appears after this list).
We demonstrate that CoSA-generated schedules significantly outperform state-of-the-art approaches by a geometric mean of up to 2.5x.
arXiv Detail & Related papers (2021-05-05T07:17:25Z)
- The Programming of Deep Learning Accelerators as a Constraint Satisfaction Problem [0.0]
We propose a new approach to implementing operators efficiently with complex instructions such as matrix multiply.
By formulating the embedding as a constraint satisfaction problem over the scalar dataflow, every possible embedding solution is contained in the search space.
A detailed evaluation using the VTA hardware accelerator with the Baidu DeepBench inference benchmark suite shows that our approach can automatically generate code competitive to reference implementations.
arXiv Detail & Related papers (2021-04-10T10:39:47Z)
- Gradient Coding with Dynamic Clustering for Straggler-Tolerant Distributed Learning [55.052517095437]
Gradient descent (GD) is widely employed to parallelize the learning task by distributing the dataset across multiple workers.
A significant performance bottleneck for the per-iteration completion time in distributed synchronous GD is straggling workers.
Coded distributed techniques have been introduced recently to mitigate stragglers and to speed up GD iterations by assigning redundant computations to workers.
We propose a novel dynamic gradient coding (GC) scheme, which assigns redundant data to workers to acquire the flexibility to choose from among a set of possible codes depending on the past straggling behavior.
arXiv Detail & Related papers (2021-03-01T18:51:29Z)
- CATCH: Context-based Meta Reinforcement Learning for Transferrable Architecture Search [102.67142711824748]
CATCH is a novel Context-bAsed meTa reinforcement learning algorithm for transferrable arChitecture searcH.
The combination of meta-learning and RL allows CATCH to efficiently adapt to new tasks while being agnostic to search spaces.
It is also capable of handling cross-domain architecture search, identifying competitive networks on ImageNet, COCO, and Cityscapes.
arXiv Detail & Related papers (2020-07-18T09:35:53Z)
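As a point of contrast with genetic-algorithm search, the CoSA entry above expresses scheduling as a constrained-optimization problem. Below is a minimal, hypothetical sketch of that style of formulation, written with the PuLP package as an assumed dependency; the cost table, constraints, and objective are invented for illustration and do not reproduce CoSA's actual model.

```python
# Hedged sketch (assumes the PuLP package is installed): a toy layer-to-core
# scheduling decision posed as a constrained-optimization problem, in the
# spirit of CoSA's formulation but not its actual MIP model.
from pulp import LpProblem, LpMinimize, LpVariable, LpBinary, lpSum, PULP_CBC_CMD

layers, cores = range(6), range(2)
# Toy latency of running layer l on core c (invented numbers).
cost = {(l, c): (l + 1) * (1.0 if c == 0 else 1.5) for l in layers for c in cores}

prob = LpProblem("layer_scheduling", LpMinimize)
x = {(l, c): LpVariable(f"x_{l}_{c}", cat=LpBinary) for l in layers for c in cores}
span = LpVariable("makespan", lowBound=0)

for l in layers:  # each layer runs on exactly one core
    prob += lpSum(x[l, c] for c in cores) == 1
for c in cores:   # the makespan bounds every core's total load
    prob += lpSum(cost[l, c] * x[l, c] for l in layers) <= span
prob += lpSum([span])  # objective: minimize the makespan

prob.solve(PULP_CBC_CMD(msg=0))
schedule = {l: next(c for c in cores if x[l, c].value() > 0.5) for l in layers}
print("makespan:", span.value(), "schedule:", schedule)
```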
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.