The Programming of Deep Learning Accelerators as a Constraint
Satisfaction Problem
- URL: http://arxiv.org/abs/2104.04731v2
- Date: Tue, 13 Apr 2021 06:16:45 GMT
- Title: The Programming of Deep Learning Accelerators as a Constraint
Satisfaction Problem
- Authors: Dennis Rieber, Axel Acosta, Holger Fröning
- Abstract summary: We propose a new approach to implementing operators efficiently with complex instructions such as matrix multiply.
By formulating the embedding as a constraint satisfaction problem over the scalar dataflow, every possible embedding solution is contained in the search space.
- A detailed evaluation using the VTA hardware accelerator with the Baidu DeepBench inference benchmark suite shows that our approach can automatically generate code competitive with reference implementations.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The success of Deep Artificial Neural Networks (DNNs) in many domains created
a rich body of research concerned with hardware accelerators for
compute-intensive DNN operators. However, implementing such operators
efficiently with complex instructions such as matrix multiply is a task not yet
automated gracefully. Solving this task often requires complex program and
memory layout transformations. First solutions to this problem have been
proposed, such as TVM or ISAMIR, which work on a loop-level representation of
operators and rewrite the program before an instruction embedding into the
operator is performed. This top-down approach creates a tension between
exploration range and search space complexity. In this work, we propose a new
approach to this problem. We have created a bottom-up method that allows the
direct generation of implementations based on an accelerator's instruction set.
By formulating the embedding as a constraint satisfaction problem over the
scalar dataflow, every possible embedding solution is contained in the search
space. By adding additional constraints, a solver can produce the subset of
preferable solutions. A detailed evaluation using the VTA hardware accelerator
with the Baidu DeepBench inference benchmark suite shows that our approach can
automatically generate code competitive with reference implementations, and
furthermore that memory layout flexibility can be beneficial for overall
performance. While the reference implementation achieves very low hardware
utilization due to its fixed embedding strategy, we achieve a geomean speedup
of up to 2.49x, while individual operators can improve by as much as 238x.
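
To make the bottom-up formulation concrete, here is a minimal, self-contained sketch of posing an instruction embedding as a constraint satisfaction problem. It is an illustration under simplifying assumptions rather than the paper's implementation: the scalar dataflow is a toy 2x2 matrix multiply, the target is a hypothetical 2x2 GEMM intrinsic, and a plain backtracking search stands in for a real CSP solver.

```python
# Hypothetical sketch: embed the scalar dataflow of a 2x2 matmul into the
# processing-element (PE) slots of a 2x2 GEMM intrinsic. Variables are the
# scalar results C[i][j]; domains are the intrinsic slots; constraints
# encode operand reuse, so every feasible embedding is in the search space.
from itertools import product

N = 2                                                   # assumed intrinsic size
outputs = [(i, j) for i in range(N) for j in range(N)]  # scalar results C[i][j]
slots = list(product(range(N), range(N)))               # intrinsic PE slots

def consistent(assign):
    """Dataflow constraints: results reusing a row of A share an intrinsic
    row, results reusing a column of B share an intrinsic column, and no
    two results occupy the same slot."""
    for (a, sa), (b, sb) in product(assign.items(), repeat=2):
        if a == b:
            continue
        if sa == sb:                          # one slot per scalar result
            return False
        if a[0] == b[0] and sa[0] != sb[0]:   # shared A row -> same slot row
            return False
        if a[1] == b[1] and sa[1] != sb[1]:   # shared B col -> same slot col
            return False
    return True

def solve(assign, remaining):
    """Backtracking enumeration; adding further constraints would prune
    this set down to the preferable embeddings, as the abstract describes."""
    if not remaining:
        yield dict(assign)
        return
    out, rest = remaining[0], remaining[1:]
    for slot in slots:
        assign[out] = slot
        if consistent(assign):
            yield from solve(assign, rest)
        del assign[out]

for solution in solve({}, outputs):
    print(solution)
```

For this toy instance the search returns the four embeddings that permute rows and columns consistently; a real formulation would range over the full operator dataflow and the accelerator's instruction set.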
Related papers
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve a 1.45-9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z) - Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization [0.6445087473595953]
Large language models (LLMs) demonstrate outstanding performance in various tasks in machine learning.
However, deploying LLM inference poses challenges due to the high compute and memory requirements.
We present Tender, an algorithm-hardware co-design solution that enables efficient deployment of LLM inference at low precision.
arXiv Detail & Related papers (2024-06-16T09:51:55Z) - Complexity-Driven CNN Compression for Resource-constrained Edge AI [1.6114012813668934]
We propose a novel and computationally efficient pruning pipeline by exploiting the inherent layer-level complexities of CNNs.
We define three modes of pruning, namely parameter-aware (PA), FLOPs-aware (FA), and memory-aware (MA), to introduce versatile compression of CNNs.
arXiv Detail & Related papers (2022-08-26T16:01:23Z) - Implementing Reinforcement Learning Datacenter Congestion Control in NVIDIA NICs [64.26714148634228]
Congestion control (CC) algorithms are becoming extremely difficult to design.
It is currently not possible to deploy AI models on network devices due to their limited computational capabilities.
We build a computationally light solution based on a recent reinforcement learning CC algorithm.
arXiv Detail & Related papers (2022-07-05T20:42:24Z) - Union: A Unified HW-SW Co-Design Ecosystem in MLIR for Evaluating Tensor
Operations on Spatial Accelerators [4.055002321981825]
We present a HW-SW co-design ecosystem for spatial accelerators called Union.
Our framework allows exploring different algorithms and their mappings on several accelerator cost models.
We demonstrate the value of Union for the community with several case studies.
arXiv Detail & Related papers (2021-09-15T16:42:18Z) - DeepSplit: Scalable Verification of Deep Neural Networks via Operator
Splitting [70.62923754433461]
Analyzing the worst-case performance of deep neural networks against input perturbations amounts to solving a large-scale non-convex optimization problem.
We propose a novel method that can directly solve a convex relaxation of the problem to high accuracy, by splitting it into smaller subproblems that often have analytical solutions.
arXiv Detail & Related papers (2021-06-16T20:43:49Z) - CoSA: Scheduling by Constrained Optimization for Spatial Accelerators [1.9149970150912705]
We present CoSA, a constrained-optimization-based approach for scheduling Deep Neural Network (DNN) accelerators.
As opposed to existing approaches that either rely on designers' heuristics or iterative methods to navigate the search space, CoSA expresses scheduling decisions as a constrained-optimization problem (see the sketch after this list).
We demonstrate that CoSA-generated schedules significantly outperform state-of-the-art approaches by a geometric mean of up to 2.5x.
arXiv Detail & Related papers (2021-05-05T07:17:25Z) - Domain-specific Genetic Algorithm for Multi-tenant DNN Accelerator
Scheduling [3.8530020696501794]
There is a growing trend towards building large accelerators with several sub-accelerator cores/chiplets.
This work looks at the problem of supporting multi-tenancy on such accelerators.
We develop a specialized genetic algorithm called G# with custom operators to enable structured sample-efficient exploration.
arXiv Detail & Related papers (2021-04-28T19:57:55Z) - Fast and Complete: Enabling Complete Neural Network Verification with
Rapid and Massively Parallel Incomplete Verifiers [112.23981192818721]
We propose to use backward mode linear relaxation based perturbation analysis (LiRPA) to replace Linear Programming (LP) during the BaB process.
Unlike LP, LiRPA applied naively can produce much weaker bounds and may even fail to check certain conflicts of sub-domains during splitting.
We demonstrate an order of magnitude speedup compared to existing LP-based approaches.
arXiv Detail & Related papers (2020-11-27T16:42:12Z) - Jump Operator Planning: Goal-Conditioned Policy Ensembles and Zero-Shot
Transfer [71.44215606325005]
We propose a novel framework called Jump-Operator Dynamic Programming for quickly computing solutions within a super-exponential space of sequential sub-goal tasks.
This approach involves controlling an ensemble of reusable goal-conditioned policies functioning as temporally extended actions.
We then identify classes of objective functions on this subspace whose solutions are invariant to the grounding, resulting in optimal zero-shot transfer.
arXiv Detail & Related papers (2020-07-06T05:13:20Z) - Physarum Powered Differentiable Linear Programming Layers and
Applications [48.77235931652611]
We propose an efficient and differentiable solver for general linear programming problems.
We show the use of our solver in a video segmentation task and meta-learning for few-shot learning.
arXiv Detail & Related papers (2020-04-30T01:50:37Z)
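
As a companion to the CoSA entry above, the following is a minimal sketch of the general idea of scheduling as constrained optimization: choosing matmul tile sizes subject to a buffer-capacity constraint while maximizing PE utilization. All dimensions, capacities, and the brute-force "solver" are illustrative assumptions; CoSA itself formulates the problem for an off-the-shelf mathematical-programming solver.

```python
# Illustrative sketch (not CoSA's formulation): tile-size selection posed as
# constrained optimization, with exhaustive search standing in for a solver.
from itertools import product

M = N = K = 64     # hypothetical matmul dimensions
BUF = 4096         # hypothetical on-chip buffer capacity, in words
PES = 256          # hypothetical number of processing elements

def feasible(tm, tn, tk):
    # Constraint: tiles of A (tm x tk), B (tk x tn), and C (tm x tn)
    # must fit in the on-chip buffer simultaneously.
    return tm * tk + tk * tn + tm * tn <= BUF

def utilization(tm, tn, tk):
    # Objective: fraction of PEs kept busy by a (tm x tn) output tile.
    return min(tm * tn, PES) / PES

# Candidate tile sizes all divide the problem dimensions evenly.
candidates = [t for t in product([8, 16, 32, 64], repeat=3) if feasible(*t)]
best = max(candidates, key=lambda t: utilization(*t))
print("best (tm, tn, tk):", best, "utilization:", utilization(*best))
```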
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.