ArrayFlex: A Systolic Array Architecture with Configurable Transparent
Pipelining
- URL: http://arxiv.org/abs/2211.12600v2
- Date: Tue, 6 Jun 2023 09:33:37 GMT
- Title: ArrayFlex: A Systolic Array Architecture with Configurable Transparent
Pipelining
- Authors: C. Peltekis, D. Filippas, G. Dimitrakopoulos, C. Nicopoulos, D.
Pnevmatikatos
- Abstract summary: Convolutional Neural Networks (CNNs) are the state-of-the-art solution for many deep learning applications.
In this work, we focus on the design of a systolic array with a configurable pipeline.
We show that ArrayFlex reduces the latency of state-of-the-art CNNs by 11%, on average, as compared to a traditional fixed-pipeline systolic array.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Convolutional Neural Networks (CNNs) are the state-of-the-art solution for
many deep learning applications. For maximum scalability, their computation
should combine high performance and energy efficiency. In practice, the
convolutions of each CNN layer are mapped to a matrix multiplication that
includes all input features and kernels of each layer and is computed using a
systolic array. In this work, we focus on the design of a systolic array with
a configurable pipeline, with the goal of selecting an optimal pipeline configuration
for each CNN layer. The proposed systolic array, called ArrayFlex, can operate
in normal or in shallow pipeline mode, thus balancing the execution time in
cycles and the operating clock frequency. By selecting the appropriate pipeline
configuration per CNN layer, ArrayFlex reduces the inference latency of
state-of-the-art CNNs by 11%, on average, as compared to a traditional
fixed-pipeline systolic array. Most importantly, this result is achieved while
using 13%-23% less power, for the same applications, thus offering a combined
energy-delay-product efficiency between 1.4x and 1.8x.
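As a rough illustration of the per-layer choice the abstract describes, the sketch below models a layer's latency as cycles divided by clock frequency and picks whichever pipeline mode (normal or shallow) is faster. The cost model, the `PipelineMode` fields, the clock frequencies, the fill-cycle overheads, and the example layer shape are all hypothetical assumptions for illustration, not figures from the paper.

```python
# Illustrative sketch of per-layer pipeline-mode selection: a "normal" (deep)
# pipeline runs at a higher clock but pays more fill/drain cycles, while a
# "shallow" pipeline needs fewer cycles but runs at a lower clock.
# All numbers and the cycle model below are assumptions, not paper values.
from dataclasses import dataclass

@dataclass
class PipelineMode:
    name: str
    clock_ghz: float        # assumed operating frequency in this mode
    extra_fill_cycles: int  # assumed pipeline fill/drain overhead per tile

NORMAL = PipelineMode("normal", clock_ghz=1.0, extra_fill_cycles=64)
SHALLOW = PipelineMode("shallow", clock_ghz=0.8, extra_fill_cycles=32)

def layer_cycles(m: int, n: int, k: int, array_dim: int, mode: PipelineMode) -> int:
    """Very rough cycle count for an M x K by K x N matmul tiled on an
    array_dim x array_dim systolic array (illustrative model only)."""
    tiles = -(-m // array_dim) * -(-n // array_dim)  # ceil-divide tile count
    return tiles * (k + mode.extra_fill_cycles)

def pick_mode(m: int, n: int, k: int, array_dim: int = 32) -> tuple[str, float]:
    """Return the (mode name, latency in ns) pair with the lower latency."""
    best = None
    for mode in (NORMAL, SHALLOW):
        latency_ns = layer_cycles(m, n, k, array_dim, mode) / mode.clock_ghz
        if best is None or latency_ns < best[1]:
            best = (mode.name, latency_ns)
    return best

# Example: a GEMM-mapped convolution layer with M output pixels, N filters,
# and K = C * kh * kw reduction length (hypothetical shape).
print(pick_mode(m=3136, n=64, k=576))
```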
Related papers
- Tensor Slicing and Optimization for Multicore NPUs [2.670309629218727]
This paper proposes a compiler optimization pass for Multicore NPUs, called Tensor Slicing Optimization (TSO).
TSO searches for the tensor slicing that minimizes execution time for a set of CNN models.
Results show that TSO is capable of identifying the best tensor slicing for these models.
arXiv Detail & Related papers (2023-04-06T12:03:03Z)
- An efficient and flexible inference system for serving heterogeneous ensembles of deep neural networks [0.0]
Ensembles of Deep Neural Networks (DNNs) achieve high-quality predictions but are compute- and memory-intensive.
We propose a new software layer to serve with flexibility and efficiency ensembles of DNNs.
arXiv Detail & Related papers (2022-08-30T08:05:43Z)
- Lightweight and Progressively-Scalable Networks for Semantic Segmentation [100.63114424262234]
Multi-scale learning frameworks have been regarded as a capable class of models to boost semantic segmentation.
In this paper, we thoroughly analyze the design of convolutional blocks and the ways of interactions across multiple scales.
We devise Lightweight and Progressively-Scalable Networks (LPS-Net) that expand network complexity in a greedy manner.
arXiv Detail & Related papers (2022-07-27T16:00:28Z)
- Towards a General Purpose CNN for Long Range Dependencies in $\mathrm{N}$D [49.57261544331683]
We propose a single CNN architecture equipped with continuous convolutional kernels for tasks on arbitrary resolution, dimensionality and length without structural changes.
We show the generality of our approach by applying the same CCNN to a wide set of tasks on sequential (1$\mathrm{D}$) and visual data (2$\mathrm{D}$).
Our CCNN performs competitively and often outperforms the current state-of-the-art across all tasks considered.
arXiv Detail & Related papers (2022-06-07T15:48:02Z)
- Instant Neural Graphics Primitives with a Multiresolution Hash Encoding [67.33850633281803]
We present a versatile new input encoding that permits the use of a smaller network without sacrificing quality.
A small neural network is augmented by a multiresolution hash table of trainable feature vectors whose values are optimized through gradient descent.
We achieve a combined speedup of several orders of magnitude, enabling training of high-quality neural graphics primitives in a matter of seconds.
arXiv Detail & Related papers (2022-01-16T07:22:47Z)
- Design and Scaffolded Training of an Efficient DNN Operator for Computer Vision on the Edge [3.3767251810292955]
FuSeConv is a drop-in replacement for depthwise separable convolutions.
FuSeConv factorizes convolutions fully along their spatial and depth dimensions.
Neural Operator Scaffolding scaffolds the training of FuSeConv by distilling knowledge from depthwise separable convolutions.
arXiv Detail & Related papers (2021-08-25T19:22:25Z)
- HANT: Hardware-Aware Network Transformation [82.54824188745887]
We propose hardware-aware network transformation (HANT).
HANT replaces inefficient operations with more efficient alternatives using a neural architecture search-like approach.
Our results on accelerating the EfficientNet family show that HANT can accelerate them by up to 3.6x with a 0.4% drop in the top-1 accuracy on the ImageNet dataset.
arXiv Detail & Related papers (2021-07-12T18:46:34Z)
- FuSeConv: Fully Separable Convolutions for Fast Inference on Systolic Arrays [2.8583189395674653]
We propose FuSeConv as a drop-in replacement for depth-wise separable convolution.
FuSeConv generalizes the decomposition of convolutions fully to separable 1D convolutions along spatial and depth dimensions.
We achieve a significant speed-up of 3x-7x with the MobileNet family of networks on a systolic array of size 64x64, with comparable accuracy on the ImageNet dataset.
arXiv Detail & Related papers (2021-05-27T20:19:39Z)
- Efficient and Generic 1D Dilated Convolution Layer for Deep Learning [52.899995651639436]
We introduce our efficient implementation of a generic 1D convolution layer covering a wide range of parameters.
It is optimized for x86 CPU architectures, in particular, for architectures containing Intel AVX-512 and AVX-512 BFloat16 instructions.
We demonstrate the performance of our optimized 1D convolution layer by utilizing it in the end-to-end neural network training with real genomics datasets.
arXiv Detail & Related papers (2021-04-16T09:54:30Z)
- Implicit Convex Regularizers of CNN Architectures: Convex Optimization of Two- and Three-Layer Networks in Polynomial Time [70.15611146583068]
We study training of Convolutional Neural Networks (CNNs) with ReLU activations.
We introduce exact convex optimization with polynomial complexity with respect to the number of data samples, the number of neurons, and the data dimension.
arXiv Detail & Related papers (2020-06-26T04:47:20Z)
- FlexSA: Flexible Systolic Array Architecture for Efficient Pruned DNN Model Training [1.718730454558804]
We find that pruning a model using a common training accelerator with large systolic arrays is extremely performance-inefficient.
To make a systolic array efficient for pruning and training, we propose FlexSA, a flexible systolic array architecture.
We also present a compilation technique for tiling matrix-multiplication-and-accumulation operations in a training workload to best utilize the resources of FlexSA (see the tiling sketch after this list).
arXiv Detail & Related papers (2020-04-27T15:51:20Z)
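The ArrayFlex abstract above maps each layer's convolution to a matrix multiplication computed on a systolic array, and the FlexSA entry mentions tiling those matrix-multiply-and-accumulate operations. The sketch below is an illustrative NumPy mock-up of that general idea, not code from either paper: the `im2col` and `tiled_matmul` helpers, the stride-1/no-padding convolution, the 32x32 tile size, and the example shapes are all assumptions made for illustration.

```python
# Illustrative sketch: lower a convolution to a matrix multiplication (im2col),
# then compute it one array_dim x array_dim output tile at a time, mimicking
# how a systolic array would consume the operands.  NumPy stands in for the
# hardware; all shapes and sizes here are assumptions.
import numpy as np

def im2col(x: np.ndarray, kh: int, kw: int) -> np.ndarray:
    """Unfold a (C, H, W) input into a (H_out*W_out, C*kh*kw) patch matrix
    (stride 1, no padding)."""
    c, h, w = x.shape
    h_out, w_out = h - kh + 1, w - kw + 1
    cols = np.empty((h_out * w_out, c * kh * kw), dtype=x.dtype)
    for i in range(h_out):
        for j in range(w_out):
            cols[i * w_out + j] = x[:, i:i + kh, j:j + kw].ravel()
    return cols

def tiled_matmul(a: np.ndarray, b: np.ndarray, array_dim: int = 32) -> np.ndarray:
    """Accumulate C = A @ B tile by tile, one array_dim x array_dim block
    of the output at a time."""
    m, k = a.shape
    _, n = b.shape
    c = np.zeros((m, n), dtype=a.dtype)
    for i0 in range(0, m, array_dim):
        for j0 in range(0, n, array_dim):
            for k0 in range(0, k, array_dim):
                c[i0:i0 + array_dim, j0:j0 + array_dim] += (
                    a[i0:i0 + array_dim, k0:k0 + array_dim]
                    @ b[k0:k0 + array_dim, j0:j0 + array_dim]
                )
    return c

# Example: one 3x3 convolution layer with 8 input and 16 output channels.
x = np.random.rand(8, 16, 16).astype(np.float32)        # (C, H, W) input
w = np.random.rand(16, 8 * 3 * 3).astype(np.float32)    # (out_channels, C*kh*kw) weights
patches = im2col(x, 3, 3)                                # (196, 72) patch matrix
y = tiled_matmul(patches, w.T)                           # (196, 16) output feature map
print(y.shape)
```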