Scale-out Systolic Arrays
- URL: http://arxiv.org/abs/2203.11540v1
- Date: Tue, 22 Mar 2022 08:46:11 GMT
- Title: Scale-out Systolic Arrays
- Authors: Ahmet Caner Yüzügüler, Canberk Sönmez, Mario Drumond, Yunho Oh,
Babak Falsafi, and Pascal Frossard
- Abstract summary: We study three key pillars in multi-pod systolic array designs, namely array granularity, interconnect, and tiling.
We identify optimal array granularity across workloads and show that state-of-the-art commercial accelerators use suboptimal array sizes for single-tenancy workloads.
We propose Scale-out Systolic Arrays, a multi-pod inference accelerator for both single- and multi-tenancy.
- Score: 37.398797072460034
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-pod systolic arrays are emerging as the architecture of choice in DNN
inference accelerators. Despite their potential, designing multi-pod systolic
arrays to maximize effective throughput/Watt (i.e., throughput/Watt adjusted
when accounting for array utilization) poses a unique set of challenges. In
this work, we study three key pillars in multi-pod systolic array designs,
namely array granularity, interconnect, and tiling. We identify optimal array
granularity across workloads and show that state-of-the-art commercial
accelerators use suboptimal array sizes for single-tenancy workloads. We then
evaluate the bandwidth/latency trade-offs in interconnects and show that
Butterfly networks offer a scalable topology for accelerators with a large
number of pods. Finally, we introduce a novel data tiling scheme with custom
partition size to maximize utilization in optimally sized pods. We propose
Scale-out Systolic Arrays (SOSA), a multi-pod inference accelerator for both single-
and multi-tenancy based on these three pillars. We show that SOSA exhibits
scaling of up to 600 TeraOps/s in effective throughput for state-of-the-art DNN
inference workloads, and outperforms state-of-the-art multi-pod accelerators by
a factor of 1.5x.
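For intuition about the metric and trade-offs named above, the short Python sketch below models effective throughput/Watt (peak throughput discounted by array utilization), the utilization a GEMM-shaped layer achieves on a given pod size, and the switch-count scaling of a butterfly interconnect. It is an idealized illustration with hypothetical numbers, not code or parameters from the paper.

import math

def effective_throughput_per_watt(peak_teraops, utilization, power_watts):
    # Effective throughput/Watt as defined in the abstract:
    # throughput/Watt adjusted for array utilization.
    return peak_teraops * utilization / power_watts

def gemm_utilization(m, n, pod_dim):
    # Idealized fraction of processing elements doing useful work when an
    # m x n output is tiled onto a pod_dim x pod_dim systolic array.
    # Edge tiles leave part of the array idle, which is why pod
    # granularity matters: oversized arrays waste PEs on small layers.
    tiles = math.ceil(m / pod_dim) * math.ceil(n / pod_dim)
    return (m * n) / (tiles * pod_dim * pod_dim)

def butterfly_switch_count(n_pods):
    # A 2-ary butterfly on n endpoints needs (n/2) * log2(n) 2x2 switches,
    # versus n^2 crosspoints for a crossbar -- the usual scalability
    # argument for butterfly topologies at large pod counts.
    return int(n_pods // 2 * math.log2(n_pods))

# Hypothetical example: a 96x96 layer on one 128x128 array vs. 32x32 pods.
u_large = gemm_utilization(96, 96, 128)  # 0.5625
u_small = gemm_utilization(96, 96, 32)   # 1.0
print(effective_throughput_per_watt(100.0, u_large, 50.0))  # 1.125 TeraOps/s/W
print(effective_throughput_per_watt(100.0, u_small, 50.0))  # 2.0 TeraOps/s/W
print(butterfly_switch_count(64))  # 192 switches vs. 4096 crossbar crosspoints

Under this toy model, the smaller pods nearly double effective throughput/Watt at identical peak throughput and power, mirroring the paper's finding that the large arrays in commercial accelerators are suboptimal for single-tenancy workloads.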
Related papers
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve 1.45 - 9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z) - fVDB: A Deep-Learning Framework for Sparse, Large-Scale, and High-Performance Spatial Intelligence [50.417261057533786]
fVDB is a novel framework for deep learning on large-scale 3D data.
Our framework is fully integrated with PyTorch, enabling interoperability with existing pipelines.
arXiv Detail & Related papers (2024-07-01T20:20:33Z) - A Point-Based Approach to Efficient LiDAR Multi-Task Perception [49.91741677556553]
PAttFormer is an efficient multi-task architecture for joint semantic segmentation and object detection in point clouds.
Unlike other LiDAR-based multi-task architectures, our proposed PAttFormer does not require separate feature encoders for task-specific point cloud representations.
Our evaluations show substantial gains from multi-task learning, improving LiDAR semantic segmentation by +1.7% in mIoU and 3D object detection by +1.7% in mAP.
arXiv Detail & Related papers (2024-04-19T11:24:34Z) - Reconfigurable Distributed FPGA Cluster Design for Deep Learning
Accelerators [59.11160990637615]
We propose a distributed system based on low-power embedded FPGAs designed for edge computing applications.
The proposed system can simultaneously execute diverse Neural Network (NN) models, arrange the graph in a pipeline structure, and manually allocate greater resources to the most computationally intensive layers of the NN graph.
arXiv Detail & Related papers (2023-05-24T16:08:55Z) - High-Fidelity Transport of Trapped-Ion Qubits in a Multi-Layer Array [0.0]
We present our study of shuttling single Mg$^+$ ions within a scalable trap-array architecture.
In a prototype application, we demonstrate the preservation of the coherence of superposition states of a hyperfine qubit during inter-site shuttling.
arXiv Detail & Related papers (2023-05-09T19:34:50Z) - ArrayFlex: A Systolic Array Architecture with Configurable Transparent
Pipelining [0.0]
Convolutional Neural Networks (CNNs) are the state-of-the-art solution for many deep learning applications.
In this work, we focus on the design of a systolic array with a configurable pipeline.
We show that ArrayFlex reduces the latency of state-of-the-art CNNs by 11%, on average, as compared to a traditional fixed-pipeline systolic array.
arXiv Detail & Related papers (2022-11-22T21:56:38Z) - Lightweight and Progressively-Scalable Networks for Semantic
Segmentation [100.63114424262234]
Multi-scale learning frameworks have been regarded as a capable class of models for boosting semantic segmentation.
In this paper, we thoroughly analyze the design of convolutional blocks and the ways of interactions across multiple scales.
We devise Lightweight and Progressively-Scalable Networks (LPS-Net), which expand network complexity in a greedy manner.
arXiv Detail & Related papers (2022-07-27T16:00:28Z) - Self-Adaptive Reconfigurable Arrays (SARA): Using ML to Assist Scaling
GEMM Acceleration [3.2218154783263833]
This work introduces a new class of accelerators that we call Self-Adaptive Reconfigurable Arrays (SARA).
SARA is capable of providing the same mapping flexibility as a collection of 1024 4x4 arrays working as a distributed system while achieving 3.5x more power efficiency and 3.2x higher compute density.
We develop a novel recommendation neural network called ADAPTNET, which recommends an array configuration and dataflow for the current layer parameters.
arXiv Detail & Related papers (2021-01-12T23:20:23Z) - On the Difficulty of Designing Processor Arrays for Deep Neural Networks [0.0]
Camuy is a lightweight model of a weight-stationary systolic array for linear algebra operations.
We present an analysis of popular models to illustrate how it can estimate required cycles, data movement costs, as well as systolic array utilization.
arXiv Detail & Related papers (2020-06-24T19:24:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.