Union: A Unified HW-SW Co-Design Ecosystem in MLIR for Evaluating Tensor Operations on Spatial Accelerators
- URL: http://arxiv.org/abs/2109.07419v1
- Date: Wed, 15 Sep 2021 16:42:18 GMT
- Title: Union: A Unified HW-SW Co-Design Ecosystem in MLIR for Evaluating Tensor Operations on Spatial Accelerators
- Authors: Geonhwa Jeong, Gokcen Kestor, Prasanth Chatarasi, Angshuman Parashar,
Po-An Tsai, Sivasankaran Rajamanickam, Roberto Gioiosa, Tushar Krishna
- Abstract summary: We present a HW-SW co-design ecosystem for spatial accelerators called Union.
Our framework allows exploring different algorithms and their mappings on several accelerator cost models.
We demonstrate the value of Union for the community with several case studies.
- Score: 4.055002321981825
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: To meet the extreme compute demands for deep learning across commercial and
scientific applications, dataflow accelerators are becoming increasingly
popular. While these "domain-specific" accelerators are not fully programmable
like CPUs and GPUs, they retain varying levels of flexibility with respect to
data orchestration, i.e., dataflow and tiling optimizations to enhance
efficiency. There are several challenges when designing new algorithms and
mapping approaches to execute the algorithms for a target problem on new
hardware. Previous works have addressed these challenges individually. To
address these challenges as a whole, in this work, we present a HW-SW co-design
ecosystem for spatial accelerators called Union within the popular MLIR
compiler infrastructure. Our framework allows exploring different algorithms
and their mappings on several accelerator cost models. Union also includes a
plug-and-play library of accelerator cost models and mappers which can easily
be extended. The algorithms and accelerator cost models are connected via a
novel mapping abstraction that captures the map space of spatial accelerators
which can be systematically pruned based on constraints from the hardware,
workload, and mapper. We demonstrate the value of Union for the community with
several case studies which examine offloading different tensor
operations (CONV/GEMM/Tensor Contraction) on diverse accelerator architectures
using different mapping schemes.
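As a rough sketch of what such a mapping abstraction looks like (illustrative only, not Union's actual API), the Python snippet below enumerates a GEMM map space of loop orders and tile sizes, prunes it with a hypothetical hardware constraint (tiles must fit an assumed on-chip buffer) and a workload constraint (tiles must divide the problem), and ranks the survivors with a toy plug-in cost model:

```python
from dataclasses import dataclass
from itertools import permutations, product

BUFFER_WORDS = 16384  # hypothetical on-chip buffer capacity, in words

@dataclass(frozen=True)
class Mapping:
    loop_order: tuple   # temporal loop order, e.g. ("m", "n", "k")
    tile: tuple         # per-dimension tile sizes (Tm, Tn, Tk)

def map_space(M, N, K, tile_options=(8, 16, 32, 64)):
    """Enumerate candidate GEMM mappings, pruned by a workload constraint
    (tiles must divide the problem) and a hardware constraint (the three
    live tiles must fit in the buffer together)."""
    for order in permutations(("m", "n", "k")):
        for tm, tn, tk in product(tile_options, repeat=3):
            if M % tm or N % tn or K % tk:
                continue                                # workload constraint
            if tm * tk + tk * tn + tm * tn > BUFFER_WORDS:
                continue                                # hardware constraint
            yield Mapping(order, (tm, tn, tk))

def dram_traffic(M, N, K, m):
    """Toy plug-in cost model: words moved to/from DRAM under an
    output-stationary schedule (loop order is ignored here for brevity)."""
    tm, tn, tk = m.tile
    return M * K * (N // tn) + K * N * (M // tm) + M * N

# A trivial exhaustive "mapper": pick the mapping the cost model likes best.
best = min(map_space(512, 512, 512),
           key=lambda m: dram_traffic(512, 512, 512, m))
print(best, dram_traffic(512, 512, 512, best))
```

In the real framework the mapper, the constraints, and the cost models are interchangeable components; the exhaustive search and the DRAM-traffic model above are placeholder choices.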
Related papers
- AsCAN: Asymmetric Convolution-Attention Networks for Efficient Recognition and Generation [48.82264764771652]
We introduce AsCAN, a hybrid architecture combining both convolutional and transformer blocks.
AsCAN supports a variety of tasks: recognition, segmentation, class-conditional image generation.
We then scale the same architecture to solve a large-scale text-to-image task and show state-of-the-art performance.
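A minimal sketch of the general convolution-plus-attention pattern (illustrative only; AsCAN's actual blocks and their asymmetric arrangement are not reproduced here):

```python
import torch
import torch.nn as nn

class HybridBlock(nn.Module):
    """Toy hybrid block: a convolutional stage for local features
    followed by a self-attention stage for global context."""
    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.GELU(),
        )
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x):                    # x: (B, C, H, W)
        x = x + self.conv(x)                 # convolutional sub-block
        b, c, h, w = x.shape
        t = x.flatten(2).transpose(1, 2)     # (B, H*W, C) token sequence
        q = self.norm(t)
        t = t + self.attn(q, q, q)[0]        # self-attention sub-block
        return t.transpose(1, 2).reshape(b, c, h, w)

print(HybridBlock(64)(torch.randn(2, 64, 16, 16)).shape)
```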
arXiv Detail & Related papers (2024-11-07T18:43:17Z)
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve a 1.45x to 9.39x speedup compared to baseline methods while ensuring convergence.
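One building block behind such compression schemes is top-k gradient sparsification; the sketch below shows the basic mechanics (hypothetical standalone code; FusionLLM's adaptive variant, which would tune the ratio to network conditions, is not modeled):

```python
import numpy as np

def topk_compress(grad: np.ndarray, ratio: float):
    """Keep only the k largest-magnitude gradient entries."""
    flat = grad.ravel()
    k = max(1, int(flat.size * ratio))
    idx = np.argpartition(np.abs(flat), -k)[-k:]   # indices of top-k values
    return idx, flat[idx]

def topk_decompress(idx, vals, shape):
    """Rebuild a dense gradient with zeros in the dropped positions."""
    out = np.zeros(int(np.prod(shape)))
    out[idx] = vals
    return out.reshape(shape)

g = np.random.randn(1024, 1024)
idx, vals = topk_compress(g, ratio=0.01)           # transmit ~1% of entries
g_hat = topk_decompress(idx, vals, g.shape)
print(f"kept {vals.size} of {g.size} values")
```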
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- Inference Optimization of Foundation Models on AI Accelerators [68.24450520773688]
Powerful foundation models, including large language models (LLMs), with Transformer architectures have ushered in a new era of Generative AI.
As the number of model parameters reaches hundreds of billions, their deployment incurs prohibitive inference costs and high latency in real-world scenarios.
This tutorial offers a comprehensive discussion on complementary inference optimization techniques using AI accelerators.
arXiv Detail & Related papers (2024-07-12T09:24:34Z)
- Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization that enables maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
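The underlying adapter technique keeps a large backbone frozen and shared across tasks while inserting small task-specific bottleneck modules, which is what makes a heterogeneous memory mapping attractive: the big static weights can sit in denser, slower memory while only the tiny adapters change per task. A generic sketch (not adapter-ALBERT's exact configuration):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Generic bottleneck adapter: a small trainable module added to a
    frozen backbone, with a residual connection around it."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)   # project down
        self.up = nn.Linear(bottleneck, dim)     # project back up
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))  # residual adapter

h = torch.randn(8, 128, 768)   # (batch, seq, hidden); 768 is ALBERT-sized
print(Adapter(768)(h).shape)
```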
arXiv Detail & Related papers (2023-03-25T14:40:59Z)
- Data-Driven Offline Optimization For Architecting Hardware Accelerators [89.68870139177785]
We develop a data-driven offline optimization method for designing hardware accelerators, dubbed PRIME.
PRIME improves performance upon state-of-the-art simulation-driven methods by about 1.54x and 1.20x, while considerably reducing the required total simulation time by 93% and 99%, respectively.
In addition, PRIME also architects effective accelerators for unseen applications in a zero-shot setting, outperforming simulation-based methods by 1.26x.
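The gist of data-driven offline optimization can be sketched as fitting a surrogate on previously logged (configuration, latency) pairs and ranking fresh candidates with the surrogate instead of the simulator; note that PRIME's actual method trains its model conservatively to avoid over-optimistic extrapolation, which this toy version omits:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Hypothetical logged data: 4 integer design knobs (e.g. PEs, buffer size)
# and a synthetic latency with noise, standing in for past simulations.
logged_configs = rng.integers(1, 9, size=(500, 4))
logged_latency = (logged_configs ** 2).sum(axis=1) + rng.normal(0, 4, 500)

# Fit the surrogate once on the offline log.
surrogate = RandomForestRegressor(n_estimators=100, random_state=0)
surrogate.fit(logged_configs, logged_latency)

# Rank many new candidates with zero additional simulator calls.
candidates = rng.integers(1, 9, size=(10_000, 4))
best = candidates[surrogate.predict(candidates).argmin()]
print("predicted-best config:", best)
```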
arXiv Detail & Related papers (2021-10-20T17:06:09Z)
- Multi-task Over-the-Air Federated Learning: A Non-Orthogonal Transmission Approach [52.85647632037537]
We propose a multi-task over-the-air federated learning (MOAFL) framework, where multiple learning tasks share edge devices for data collection and learning models under the coordination of an edge server (ES).
Both the convergence analysis and numerical results demonstrate that the MOAFL framework can significantly reduce the uplink bandwidth consumption of multiple tasks without causing substantial learning performance degradation.
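The over-the-air ingredient is that simultaneously transmitted analog updates superpose on the wireless channel, so the edge server receives their (noisy) sum without per-device uplinks; a single-task sketch with made-up numbers (MOAFL's non-orthogonal multi-task scheme is not modeled):

```python
import numpy as np

rng = np.random.default_rng(1)
updates = [rng.normal(size=1000) for _ in range(8)]   # 8 devices' updates

# The channel adds the simultaneous transmissions, plus receiver noise.
received = sum(updates) + rng.normal(scale=0.05, size=1000)
average = received / len(updates)                     # ES's aggregate estimate

true_avg = np.mean(updates, axis=0)
print("estimation error:", np.linalg.norm(average - true_avg))
```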
arXiv Detail & Related papers (2021-06-27T13:09:32Z)
- Evaluating Spatial Accelerator Architectures with Tiled Matrix-Matrix Multiplication [4.878665155352402]
We develop a framework that finds optimized mappings for a tiled GEMM for a given spatial accelerator and workload combination.
Our evaluations over five spatial accelerators demonstrate that the tiled GEMM mappings systematically generated by our framework achieve high performance.
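For reference, a plain tiled GEMM loop nest; the tile sizes and the loop order below are precisely the kind of mapping choices such a framework searches over for a given accelerator:

```python
import numpy as np

def tiled_gemm(A, B, Tm=32, Tn=32, Tk=32):
    """Tiled matrix multiply: (Tm, Tn, Tk) and the m/n/k loop order
    together define one point in the mapping space."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N))
    for m0 in range(0, M, Tm):            # tile loops = temporal mapping
        for n0 in range(0, N, Tn):
            for k0 in range(0, K, Tk):
                C[m0:m0+Tm, n0:n0+Tn] += (
                    A[m0:m0+Tm, k0:k0+Tk] @ B[k0:k0+Tk, n0:n0+Tn]
                )
    return C

A, B = np.random.rand(128, 96), np.random.rand(96, 64)
assert np.allclose(tiled_gemm(A, B), A @ B)
```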
arXiv Detail & Related papers (2021-06-19T13:53:58Z)
- Domain-specific Genetic Algorithm for Multi-tenant DNN Accelerator Scheduling [3.8530020696501794]
There is a growing trend towards building large accelerators with several sub-accelerator cores/chiplets.
This work looks at the problem of supporting multi-tenancy on such accelerators.
We develop a specialized genetic algorithm called G# with custom operators to enable structured sample-efficient exploration.
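A generic genetic-algorithm skeleton for this kind of scheduling problem (illustrative only; G#'s custom operators and encoding are not reproduced): jobs are assigned to sub-accelerator cores, and the fitness is the makespan of the busiest core.

```python
import random

JOBS = [7, 3, 9, 4, 6, 2, 8, 5]   # hypothetical job runtimes
CORES = 3

def makespan(assign):              # assign[i] = core running job i
    loads = [0] * CORES
    for job, core in zip(JOBS, assign):
        loads[core] += job
    return max(loads)

def evolve(pop_size=40, gens=60, mut=0.1):
    pop = [[random.randrange(CORES) for _ in JOBS] for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=makespan)
        parents = pop[: pop_size // 2]            # selection: keep best half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, len(JOBS))  # one-point crossover
            child = a[:cut] + b[cut:]
            if random.random() < mut:             # mutation: reassign one job
                child[random.randrange(len(JOBS))] = random.randrange(CORES)
            children.append(child)
        pop = parents + children
    return min(pop, key=makespan)

best = evolve()
print(best, "makespan:", makespan(best))
```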
arXiv Detail & Related papers (2021-04-28T19:57:55Z)
- The Programming of Deep Learning Accelerators as a Constraint Satisfaction Problem [0.0]
We propose a new approach to implementing operators efficiently with complex instructions such as matrix multiply.
By formulating the embedding as a constraint satisfaction problem over the scalar dataflow, every possible embedding solution is contained in the search space.
A detailed evaluation using the VTA hardware accelerator with the Baidu DeepBench inference benchmark suite shows that our approach can automatically generate code competitive to reference implementations.
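A toy backtracking search conveys the flavor of the formulation: dataflow nodes become variables, hardware slots become domains, and ISA rules become constraints (the ops, slots, and rule below are made up, not VTA's):

```python
# Tiny dataflow a*b + c*d: two multiplies feeding an add.
ops = ["mul0", "mul1", "add"]
slots = ["mac_a", "mac_b", "alu"]        # hypothetical hardware slots

def consistent(assign):
    # Constraint 1: each slot hosts at most one op.
    if len(set(assign.values())) != len(assign):
        return False
    # Constraint 2 (made-up ISA rule): multiplies must use MAC slots.
    return all(assign[o].startswith("mac")
               for o in assign if o.startswith("mul"))

def backtrack(assign):
    """Depth-first search over op-to-slot assignments."""
    if len(assign) == len(ops):
        return assign
    op = ops[len(assign)]
    for s in slots:
        trial = {**assign, op: s}
        if consistent(trial):
            found = backtrack(trial)
            if found:
                return found
    return None

print(backtrack({}))   # e.g. {'mul0': 'mac_a', 'mul1': 'mac_b', 'add': 'alu'}
```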
arXiv Detail & Related papers (2021-04-10T10:39:47Z)
- Hardware Acceleration of Sparse and Irregular Tensor Computations of ML Models: A Survey and Insights [18.04657939198617]
This paper provides a comprehensive survey on the efficient execution of sparse and irregular tensor computations of machine learning models on hardware accelerators.
It surveys different hardware designs and acceleration techniques and analyzes them in terms of hardware and execution costs.
The takeaways include an understanding of the key challenges in accelerating sparse, irregular-shaped, and quantized tensors.
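As a concrete instance of what makes sparse execution hard to accelerate, a CSR sparse matrix-vector multiply has an irregular, data-dependent inner loop and indirect accesses into the dense vector:

```python
import numpy as np

def csr_spmv(data, indices, indptr, x):
    """y = A @ x with A in CSR form: data holds nonzero values,
    indices their column ids, indptr the per-row ranges."""
    y = np.zeros(len(indptr) - 1)
    for row in range(len(y)):
        for j in range(indptr[row], indptr[row + 1]):  # nonzeros only
            y[row] += data[j] * x[indices[j]]          # indirect access
    return y

# 3x4 matrix [[5,0,0,2],[0,0,3,0],[1,0,0,4]] in CSR form
data    = np.array([5.0, 2.0, 3.0, 1.0, 4.0])
indices = np.array([0, 3, 2, 0, 3])
indptr  = np.array([0, 2, 3, 5])
print(csr_spmv(data, indices, indptr, np.array([1.0, 2.0, 3.0, 4.0])))
# -> [13.  9. 17.]
```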
arXiv Detail & Related papers (2020-07-02T04:08:40Z)
- Dataflow Aware Mapping of Convolutional Neural Networks Onto Many-Core Platforms With Network-on-Chip Interconnect [0.0764671395172401]
Machine intelligence, especially using convolutional neural networks (CNNs), has become a large area of research over the past years.
Many-core platforms consisting of several homogeneous cores can alleviate limitations with regard to physical implementation at the expense of an increased dataflow mapping effort.
This work presents an automated mapping strategy starting at the single-core level with different optimization targets for minimal runtime and minimal off-chip memory accesses.
The strategy is then extended towards a suitable many-core mapping scheme and evaluated using a scalable system-level simulation with a network-on-chip interconnect.
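A toy version of the mapping-evaluation step (hypothetical numbers; the paper uses a full system-level NoC simulation): assign CNN layers to cores of a small mesh and score each assignment by inter-layer traffic volume times hop distance:

```python
from itertools import permutations

CORES = {0: (0, 0), 1: (0, 1), 2: (1, 0), 3: (1, 1)}  # core id -> mesh (x, y)
LAYER_TRAFFIC = [64, 32, 16]   # KB moved between layers 0-1, 1-2, 2-3

def hops(a, b):
    (x1, y1), (x2, y2) = CORES[a], CORES[b]
    return abs(x1 - x2) + abs(y1 - y2)   # Manhattan distance on the mesh

def comm_cost(mapping):                  # mapping[i] = core running layer i
    return sum(t * hops(mapping[i], mapping[i + 1])
               for i, t in enumerate(LAYER_TRAFFIC))

best = min(permutations(range(4)), key=comm_cost)
print(best, "cost:", comm_cost(best))
```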
arXiv Detail & Related papers (2020-06-18T17:13:18Z)