TrIM: Triangular Input Movement Systolic Array for Convolutional Neural Networks -- Part I: Dataflow and Analytical Modelling
- URL: http://arxiv.org/abs/2408.01254v1
- Date: Fri, 2 Aug 2024 13:15:17 GMT
- Title: TrIM: Triangular Input Movement Systolic Array for Convolutional Neural Networks -- Part I: Dataflow and Analytical Modelling
- Authors: Cristian Sestito, Shady Agwa, Themis Prodromakis,
- Abstract summary: Convolutional Neural Networks (CNNs) are particularly susceptible to the Von Neumann bottleneck.
In this paper, we propose TrIM: a novel dataflow for Systolic Arrays based on a Triangular Input Movement.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In order to follow the ever-growing computational complexity and data intensity of state-of-the-art AI models, new computing paradigms are being proposed. These paradigms aim at achieving high energy efficiency, by mitigating the Von Neumann bottleneck that relates to the energy cost of moving data between the processing cores and the memory. Convolutional Neural Networks (CNNs) are particularly susceptible to this bottleneck, given the massive data they have to manage. Systolic Arrays (SAs) are promising architectures to mitigate the data transmission cost, thanks to high data utilization carried out by an array of Processing Elements (PEs). These PEs continuously exchange and process data locally based on specific dataflows (like weight stationary and row stationary), in turn reducing the number of memory accesses to the main memory. The hardware specialization of SAs can meet different workloads, ranging from matrix multiplications to multi-dimensional convolutions. In this paper, we propose TrIM: a novel dataflow for SAs based on a Triangular Input Movement and compatible with CNN computing. When compared to state-of-the-art SA dataflows, like weight stationary and row stationary, the high data utilization offered by TrIM guarantees ~10x less memory access. Furthermore, considering that PEs continuously overlap multiplications and accumulations, TrIM achieves high throughput (up to 81.8% higher than row stationary), other than requiring a limited number of registers (up to 15.6x fewer registers than row stationary).
Related papers
- Neuromorphic Wireless Split Computing with Multi-Level Spikes [69.73249913506042]
In neuromorphic computing, spiking neural networks (SNNs) perform inference tasks, offering significant efficiency gains for workloads involving sequential data.
Recent advances in hardware and software have demonstrated that embedding a few bits of payload in each spike exchanged between the spiking neurons can further enhance inference accuracy.
This paper investigates a wireless neuromorphic split computing architecture employing multi-level SNNs.
arXiv Detail & Related papers (2024-11-07T14:08:35Z) - TrIM: Triangular Input Movement Systolic Array for Convolutional Neural Networks -- Part II: Architecture and Hardware Implementation [0.0]
TrIM is an innovative dataflow based on a triangular movement of inputs.
TrIM can reduce the number of memory accesses by one order of magnitude when compared to state-of-the-art systolic arrays.
architecture achieves a peak throughput of 453.6 Giga Operations per Second.
arXiv Detail & Related papers (2024-08-05T10:18:00Z) - Efficient and accurate neural field reconstruction using resistive memory [52.68088466453264]
Traditional signal reconstruction methods on digital computers face both software and hardware challenges.
We propose a systematic approach with software-hardware co-optimizations for signal reconstruction from sparse inputs.
This work advances the AI-driven signal restoration technology and paves the way for future efficient and robust medical AI and 3D vision applications.
arXiv Detail & Related papers (2024-04-15T09:33:09Z) - Intelligence Processing Units Accelerate Neuromorphic Learning [52.952192990802345]
Spiking neural networks (SNNs) have achieved orders of magnitude improvement in terms of energy consumption and latency.
We present an IPU-optimized release of our custom SNN Python package, snnTorch.
arXiv Detail & Related papers (2022-11-19T15:44:08Z) - GLEAM: Greedy Learning for Large-Scale Accelerated MRI Reconstruction [50.248694764703714]
Unrolled neural networks have recently achieved state-of-the-art accelerated MRI reconstruction.
These networks unroll iterative optimization algorithms by alternating between physics-based consistency and neural-network based regularization.
We propose Greedy LEarning for Accelerated MRI reconstruction, an efficient training strategy for high-dimensional imaging settings.
arXiv Detail & Related papers (2022-07-18T06:01:29Z) - Deep Neural Networks Based Weight Approximation and Computation Reuse
for 2-D Image Classification [0.9507070656654631]
Deep Neural Networks (DNNs) are computationally and memory intensive.
This paper introduces a new method to improve DNNs performance by fusing approximate computing with data reuse techniques.
It is suitable for IoT edge devices as it reduces the memory size requirement as well as the number of needed memory accesses.
arXiv Detail & Related papers (2021-04-28T10:16:53Z) - Deep Cellular Recurrent Network for Efficient Analysis of Time-Series
Data with Spatial Information [52.635997570873194]
This work proposes a novel deep cellular recurrent neural network (DCRNN) architecture to process complex multi-dimensional time series data with spatial information.
The proposed architecture achieves state-of-the-art performance while utilizing substantially less trainable parameters when compared to comparable methods in the literature.
arXiv Detail & Related papers (2021-01-12T20:08:18Z) - Robust High-dimensional Memory-augmented Neural Networks [13.82206983716435]
Memory-augmented neural networks enhance neural networks with an explicit memory to overcome these issues.
Access to this explicit memory occurs via soft read and write operations involving every individual memory entry.
We propose a robust architecture that employs a computational memory unit as the explicit memory performing analog in-memory computation on high-dimensional (HD) vectors.
arXiv Detail & Related papers (2020-10-05T12:01:56Z) - Sparse Systolic Tensor Array for Efficient CNN Hardware Acceleration [14.958793135751149]
Convolutional neural network (CNN) inference on mobile devices demands efficient hardware acceleration of low-precision (INT8) general matrix multiplication (GEMM)
Exploiting data sparsity is a common approach to further accelerate GEMM for CNN inference, and in particular, structural sparsity has the advantages of predictable load balancing and very low index overhead.
We address a key architectural challenge with structural sparsity: how to provide support for a range of sparsity levels while maintaining high utilization of the hardware.
arXiv Detail & Related papers (2020-09-04T20:17:42Z) - One-step regression and classification with crosspoint resistive memory
arrays [62.997667081978825]
High speed, low energy computing machines are in demand to enable real-time artificial intelligence at the edge.
One-step learning is supported by simulations of the prediction of the cost of a house in Boston and the training of a 2-layer neural network for MNIST digit recognition.
Results are all obtained in one computational step, thanks to the physical, parallel, and analog computing within the crosspoint array.
arXiv Detail & Related papers (2020-05-05T08:00:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.