Scale-out Systolic Arrays
- URL: http://arxiv.org/abs/2203.11540v1
- Date: Tue, 22 Mar 2022 08:46:11 GMT
- Title: Scale-out Systolic Arrays
- Authors: Ahmet Caner Yüzügüler, Canberk Sönmez, Mario Drumond, Yunho Oh,
Babak Falsafi, and Pascal Frossard
- Abstract summary: We study three key pillars in multi-pod systolic array designs, namely array granularity, interconnect, and tiling.
We identify optimal array granularity across workloads and show that state-of-the-art commercial accelerators use suboptimal array sizes for single-tenancy workloads.
We propose Scale-out Systolic Arrays, a multi-pod inference accelerator for both single- and multi-tenancy.
- Score: 37.398797072460034
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-pod systolic arrays are emerging as the architecture of choice in DNN
inference accelerators. Despite their potential, designing multi-pod systolic
arrays to maximize effective throughput/Watt (i.e., throughput/Watt adjusted
when accounting for array utilization) poses a unique set of challenges. In
this work, we study three key pillars in multi-pod systolic array designs,
namely array granularity, interconnect, and tiling. We identify optimal array
granularity across workloads and show that state-of-the-art commercial
accelerators use suboptimal array sizes for single-tenancy workloads. We then
evaluate the bandwidth/latency trade-offs in interconnects and show that
Butterfly networks offer a scalable topology for accelerators with a large
number of pods. Finally, we introduce a novel data tiling scheme with custom
partition size to maximize utilization in optimally sized pods. We propose
Scale-out Systolic Arrays (SOSA), a multi-pod inference accelerator for both single-
and multi-tenancy based on these three pillars. We show that SOSA exhibits
scaling of up to 600 TeraOps/s in effective throughput for state-of-the-art DNN
inference workloads, and outperforms state-of-the-art multi-pod accelerators by
a factor of 1.5x.
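For intuition about the metric and trade-offs named above, the short Python sketch below models effective throughput/Watt (peak throughput discounted by array utilization), the utilization a GEMM-shaped layer achieves on a given pod size, and the switch-count scaling of a butterfly interconnect. It is an idealized illustration with hypothetical numbers, not code or parameters from the paper.

import math

def effective_throughput_per_watt(peak_teraops, utilization, power_watts):
    # Effective throughput/Watt as defined in the abstract:
    # throughput/Watt adjusted for array utilization.
    return peak_teraops * utilization / power_watts

def gemm_utilization(m, n, pod_dim):
    # Idealized fraction of processing elements doing useful work when an
    # m x n output is tiled onto a pod_dim x pod_dim systolic array.
    # Edge tiles leave part of the array idle, which is why pod
    # granularity matters: oversized arrays waste PEs on small layers.
    tiles = math.ceil(m / pod_dim) * math.ceil(n / pod_dim)
    return (m * n) / (tiles * pod_dim * pod_dim)

def butterfly_switch_count(n_pods):
    # A 2-ary butterfly on n endpoints needs (n/2) * log2(n) 2x2 switches,
    # versus n^2 crosspoints for a crossbar -- the usual scalability
    # argument for butterfly topologies at large pod counts.
    return int(n_pods // 2 * math.log2(n_pods))

# Hypothetical example: a 96x96 layer on one 128x128 array vs. 32x32 pods.
u_large = gemm_utilization(96, 96, 128)  # 0.5625
u_small = gemm_utilization(96, 96, 32)   # 1.0
print(effective_throughput_per_watt(100.0, u_large, 50.0))  # 1.125 TeraOps/s/W
print(effective_throughput_per_watt(100.0, u_small, 50.0))  # 2.0 TeraOps/s/W
print(butterfly_switch_count(64))  # 192 switches vs. 4096 crossbar crosspoints

Under this toy model, the smaller pods nearly double effective throughput/Watt at identical peak throughput and power, mirroring the paper's finding that the large arrays in commercial accelerators are suboptimal for single-tenancy workloads.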
Related papers
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve 1.45 - 9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z) - fVDB: A Deep-Learning Framework for Sparse, Large-Scale, and High-Performance Spatial Intelligence [50.417261057533786]
fVDB is a novel framework for deep learning on large-scale 3D data.
Our framework is fully integrated with PyTorch, enabling interoperability with existing pipelines.
arXiv Detail & Related papers (2024-07-01T20:20:33Z) - A Point-Based Approach to Efficient LiDAR Multi-Task Perception [49.91741677556553]
PAttFormer is an efficient multi-task architecture for joint semantic segmentation and object detection in point clouds.
Unlike other LiDAR-based multi-task architectures, our proposed PAttFormer does not require separate feature encoders for task-specific point cloud representations.
Our evaluations show substantial gains from multi-task learning, improving LiDAR semantic segmentation by +1.7% in mIoU and 3D object detection by +1.7% in mAP.
arXiv Detail & Related papers (2024-04-19T11:24:34Z) - Reconfigurable Distributed FPGA Cluster Design for Deep Learning
Accelerators [59.11160990637615]
We propose a distributed system based on low-power embedded FPGAs designed for edge computing applications.
The proposed system can simultaneously execute diverse Neural Network (NN) models, arrange the graph in a pipeline structure, and manually allocate greater resources to the most computationally intensive layers of the NN graph.
arXiv Detail & Related papers (2023-05-24T16:08:55Z) - High-Fidelity Transport of Trapped-Ion Qubits in a Multi-Layer Array [0.0]
We present our study of shuttling single Mg$^+$ ions within a scalable trap-array architecture.
In a prototype application, we demonstrate the preservation of the coherence of superposition states of a hyperfine qubit during inter-site shuttling.
arXiv Detail & Related papers (2023-05-09T19:34:50Z) - ArrayFlex: A Systolic Array Architecture with Configurable Transparent
Pipelining [0.0]
Convolutional Neural Networks (CNNs) are the state-of-the-art solution for many deep learning applications.
In this work, we focus on the design of a systolic array with a configurable pipeline.
We show that ArrayFlex reduces the latency of state-of-the-art CNNs by 11%, on average, as compared to a traditional fixed-pipeline systolic array.
arXiv Detail & Related papers (2022-11-22T21:56:38Z) - Lightweight and Progressively-Scalable Networks for Semantic
Segmentation [100.63114424262234]
Multi-scale learning frameworks have been regarded as a capable class of models for boosting semantic segmentation.
In this paper, we thoroughly analyze the design of convolutional blocks and the ways of interactions across multiple scales.
We devise Lightweight and Progressively-Scalable Networks (LPS-Net), which expand network complexity in a greedy manner.
arXiv Detail & Related papers (2022-07-27T16:00:28Z) - Self-Adaptive Reconfigurable Arrays (SARA): Using ML to Assist Scaling
GEMM Acceleration [3.2218154783263833]
This work introduces a new class of accelerators that we call Self-Adaptive Reconfigurable Arrays (SARA).
SARA is capable of providing the same mapping flexibility as a collection of 1024 4x4 arrays working as a distributed system while achieving 3.5x more power efficiency and 3.2x higher compute density.
We develop a novel recommendation neural network called ADAPTNET, which recommends an array configuration and dataflow for the current layer parameters.
arXiv Detail & Related papers (2021-01-12T23:20:23Z) - On the Difficulty of Designing Processor Arrays for Deep Neural Networks [0.0]
Camuy is a lightweight model of a weight-stationary systolic array for linear algebra operations.
We present an analysis of popular models to illustrate how it can estimate required cycles, data movement costs, as well as systolic array utilization.
arXiv Detail & Related papers (2020-06-24T19:24:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.