Related papers: TrIM, Triangular Input Movement Systolic Array for Convolutional Neural Networks: Architecture and Hardware Implementation

TrIM, Triangular Input Movement Systolic Array for Convolutional Neural Networks: Architecture and Hardware Implementation

URL: http://arxiv.org/abs/2408.10243v2
Date: Tue, 14 Jan 2025 11:23:05 GMT
Title: TrIM, Triangular Input Movement Systolic Array for Convolutional Neural Networks: Architecture and Hardware Implementation
Authors: Cristian Sestito, Shady Agwa, Themis Prodromakis,
Abstract summary: TrIM is an innovative dataflow based on a triangular movement of inputs.<n> TrIM can reduce the number of memory accesses by one order of magnitude when compared to state-of-the-art systolic arrays.<n> architecture achieves a peak throughput of 453.6 Giga Operations per Second.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Modern hardware architectures for Convolutional Neural Networks (CNNs), other than targeting high performance, aim at dissipating limited energy. Reducing the data movement cost between the computing cores and the memory is a way to mitigate the energy consumption. Systolic arrays are suitable architectures to achieve this objective: they use multiple processing elements that communicate each other to maximize data utilization, based on proper dataflows like the weight stationary and row stationary. Motivated by this, we have proposed TrIM, an innovative dataflow based on a triangular movement of inputs, and capable to reduce the number of memory accesses by one order of magnitude when compared to state-of-the-art systolic arrays. In this paper, we present a TrIM-based hardware architecture for CNNs. As a showcase, the accelerator is implemented onto a Field Programmable Gate Array (FPGA) to execute the VGG-16 and AlexNet CNNs. The architecture achieves a peak throughput of 453.6 Giga Operations per Second, outperforming a state-of-the-art row stationary systolic array up to ~3x in terms of memory accesses, and being up to ~11.9x more energy-efficient than other FPGA accelerators.

Related papers

Efficient Message Passing Architecture for GCN Training on HBM-based FPGAs with Orthogonal Topology On-Chip Networks [0.0]
Graph Convolutional Networks (GCNs) are state-of-the-art deep learning models for representation learning on graphs. We propose a message-passing architecture that leverages NUMA-based memory access properties. We also re-engineered the backpropagation algorithm specific to GCNs within our proposed accelerator.
arXiv Detail & Related papers (2024-11-06T12:00:51Z)
FAMOUS: Flexible Accelerator for the Attention Mechanism of Transformer on UltraScale+ FPGAs [0.0]
Transformer neural networks (TNNs) are being applied across a widening range of application domains, including natural language processing (NLP), machine translation, and computer vision (CV) This paper proposes textitFAMOUS, a flexible hardware accelerator for dense multi-head attention computation of TNNs on field-programmable gate arrays (FPGAs) It is optimized for high utilization of processing elements and on-chip memories to improve parallelism and reduce latency.
arXiv Detail & Related papers (2024-09-21T05:25:46Z)
TrIM: Triangular Input Movement Systolic Array for Convolutional Neural Networks -- Part I: Dataflow and Analytical Modelling [0.0]
Convolutional Neural Networks (CNNs) are particularly susceptible to the Von Neumann bottleneck. In this paper, we propose TrIM: a novel dataflow for Systolic Arrays based on a Triangular Input Movement.
arXiv Detail & Related papers (2024-08-02T13:15:17Z)
fVDB: A Deep-Learning Framework for Sparse, Large-Scale, and High-Performance Spatial Intelligence [50.417261057533786]
fVDB is a novel framework for deep learning on large-scale 3D data. Our framework is fully integrated with PyTorch enabling interoperability with existing pipelines.
arXiv Detail & Related papers (2024-07-01T20:20:33Z)
Efficient and accurate neural field reconstruction using resistive memory [52.68088466453264]
Traditional signal reconstruction methods on digital computers face both software and hardware challenges. We propose a systematic approach with software-hardware co-optimizations for signal reconstruction from sparse inputs. This work advances the AI-driven signal restoration technology and paves the way for future efficient and robust medical AI and 3D vision applications.
arXiv Detail & Related papers (2024-04-15T09:33:09Z)
SMOF: Streaming Modern CNNs on FPGAs with Smart Off-Chip Eviction [6.800641017055453]
This paper introduces weight and activation eviction mechanisms to off-chip memory along the computational pipeline. The proposed mechanism is incorporated into an existing toolflow, expanding the design space by utilising off-chip memory as a buffer. SMOF has demonstrated the capacity to deliver competitive and, in some cases, state-of-the-art performance across a spectrum of computer vision tasks.
arXiv Detail & Related papers (2024-03-27T18:12:24Z)
INR-Arch: A Dataflow Architecture and Compiler for Arbitrary-Order Gradient Computations in Implicit Neural Representation Processing [66.00729477511219]
Given a function represented as a computation graph, traditional architectures face challenges in efficiently computing its nth-order gradient. We introduce INR-Arch, a framework that transforms the computation graph of an nth-order gradient into a hardware-optimized dataflow architecture. We present results that demonstrate 1.8-4.8x and 1.5-3.6x speedup compared to CPU and GPU baselines respectively.
arXiv Detail & Related papers (2023-08-11T04:24:39Z)
GLEAM: Greedy Learning for Large-Scale Accelerated MRI Reconstruction [50.248694764703714]
Unrolled neural networks have recently achieved state-of-the-art accelerated MRI reconstruction. These networks unroll iterative optimization algorithms by alternating between physics-based consistency and neural-network based regularization. We propose Greedy LEarning for Accelerated MRI reconstruction, an efficient training strategy for high-dimensional imaging settings.
arXiv Detail & Related papers (2022-07-18T06:01:29Z)
FPGA-optimized Hardware acceleration for Spiking Neural Networks [69.49429223251178]
This work presents the development of a hardware accelerator for an SNN, with off-line training, applied to an image recognition task. The design targets a Xilinx Artix-7 FPGA, using in total around the 40% of the available hardware resources. It reduces the classification time by three orders of magnitude, with a small 4.5% impact on the accuracy, if compared to its software, full precision counterpart.
arXiv Detail & Related papers (2022-01-18T13:59:22Z)
Reconfigurable co-processor architecture with limited numerical precision to accelerate deep convolutional neural networks [0.38848561367220275]
Convolutional Neural Networks (CNNs) are widely used in deep learning applications, e.g. visual systems, robotics etc. Here, we present a model-independent reconfigurable co-processing architecture to accelerate CNNs. In contrast to existing solutions, we introduce limited precision 32 bit Q-format fixed point quantization for arithmetic representations and operations.
arXiv Detail & Related papers (2021-08-21T09:50:54Z)
VersaGNN: a Versatile accelerator for Graph neural networks [81.1667080640009]
We propose textitVersaGNN, an ultra-efficient, systolic-array-based versatile hardware accelerator. textitVersaGNN achieves on average 3712$times$ speedup with 1301.25$times$ energy reduction on CPU, and 35.4$times$ speedup with 17.66$times$ energy reduction on GPU.
arXiv Detail & Related papers (2021-05-04T04:10:48Z)
Sparse Systolic Tensor Array for Efficient CNN Hardware Acceleration [14.958793135751149]
Convolutional neural network (CNN) inference on mobile devices demands efficient hardware acceleration of low-precision (INT8) general matrix multiplication (GEMM) Exploiting data sparsity is a common approach to further accelerate GEMM for CNN inference, and in particular, structural sparsity has the advantages of predictable load balancing and very low index overhead. We address a key architectural challenge with structural sparsity: how to provide support for a range of sparsity levels while maintaining high utilization of the hardware.
arXiv Detail & Related papers (2020-09-04T20:17:42Z)
One-step regression and classification with crosspoint resistive memory arrays [62.997667081978825]
High speed, low energy computing machines are in demand to enable real-time artificial intelligence at the edge. One-step learning is supported by simulations of the prediction of the cost of a house in Boston and the training of a 2-layer neural network for MNIST digit recognition. Results are all obtained in one computational step, thanks to the physical, parallel, and analog computing within the crosspoint array.
arXiv Detail & Related papers (2020-05-05T08:00:07Z)
PatDNN: Achieving Real-Time DNN Execution on Mobile Devices with Pattern-based Weight Pruning [57.20262984116752]
We introduce a new dimension, fine-grained pruning patterns inside the coarse-grained structures, revealing a previously unknown point in design space. With the higher accuracy enabled by fine-grained pruning patterns, the unique insight is to use the compiler to re-gain and guarantee high hardware efficiency.
arXiv Detail & Related papers (2020-01-01T04:52:07Z)

This list is automatically generated from the titles and abstracts of the papers in this site.