Architectural Implications of Neural Network Inference for High Data-Rate, Low-Latency Scientific Applications
- URL: http://arxiv.org/abs/2403.08980v1
- Date: Wed, 13 Mar 2024 22:10:42 GMT
- Title: Architectural Implications of Neural Network Inference for High Data-Rate, Low-Latency Scientific Applications
- Authors: Olivia Weng, Alexander Redding, Nhan Tran, Javier Mauricio Duarte, Ryan Kastner
- Abstract summary: We show that many scientific NN applications must run fully on chip, in the extreme case requiring a custom chip to meet such stringent constraints.
- Score: 43.60059930708406
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With more scientific fields relying on neural networks (NNs) to process data incoming at extreme throughputs and latencies, it is crucial to develop NNs with all their parameters stored on-chip. In many of these applications, there is not enough time to go off-chip and retrieve weights. Even more so, off-chip memory such as DRAM does not have the bandwidth required to process these NNs as fast as the data is being produced (e.g., every 25 ns). As such, these extreme latency and bandwidth requirements have architectural implications for the hardware intended to run these NNs: 1) all NN parameters must fit on-chip, and 2) codesigning custom/reconfigurable logic is often required to meet these latency and bandwidth constraints. In our work, we show that many scientific NN applications must run fully on chip, in the extreme case requiring a custom chip to meet such stringent constraints.
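As a rough, back-of-the-envelope sketch of why off-chip weight access is infeasible at these rates (the model size, precision, and DRAM bandwidth below are illustrative assumptions, not figures from the paper):

```python
# Back-of-the-envelope check: can weights be streamed from DRAM every 25 ns?
# All numbers below (model size, precision, DRAM bandwidth) are illustrative
# assumptions, not values taken from the paper.

NUM_PARAMS = 50_000          # assumed small NN: 50k parameters
BITS_PER_PARAM = 8           # assumed 8-bit quantized weights
INFERENCE_PERIOD_NS = 25     # a new input arrives every 25 ns
DRAM_BANDWIDTH_GBPS = 50     # assumed sustained DRAM bandwidth in GB/s

weight_bytes = NUM_PARAMS * BITS_PER_PARAM / 8
required_gbps = weight_bytes / (INFERENCE_PERIOD_NS * 1e-9) / 1e9

print(f"weights: {weight_bytes / 1e3:.0f} kB")
print(f"bandwidth to re-fetch them every {INFERENCE_PERIOD_NS} ns: {required_gbps:,.0f} GB/s")
print(f"assumed DRAM bandwidth: {DRAM_BANDWIDTH_GBPS} GB/s")
print("feasible off-chip?", required_gbps <= DRAM_BANDWIDTH_GBPS)
```

Even for a modest model, re-fetching every weight for each 25 ns inference would demand terabytes per second of sustained bandwidth, which is why all parameters must stay on-chip.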
Related papers
- RNC: Efficient RRAM-aware NAS and Compilation for DNNs on Resource-Constrained Edge Devices [0.30458577208819987]
We aim to develop edge-friendly deep neural networks (DNNs) for accelerators based on resistive random-access memory (RRAM).
We propose an edge compilation and resource-constrained RRAM-aware neural architecture search (NAS) framework to search for optimized neural networks meeting specific hardware constraints.
The NAS-derived model optimized for speed achieved a 5x-30x speedup.
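As a hedged sketch of what hardware-aware search means here, a NAS loop can simply reject candidate architectures whose parameter count or estimated latency exceeds the target RRAM device's budget; the budget values and helper functions below are hypothetical, not taken from the paper.

```python
import random

# Hypothetical hardware budget for an RRAM-based edge accelerator
# (illustrative assumptions, not values from the paper).
MAX_PARAMS = 200_000      # weights that fit on the available RRAM crossbars
MAX_LATENCY_US = 500      # end-to-end latency budget in microseconds

def sample_architecture():
    """Randomly sample a small MLP-like architecture (list of layer widths)."""
    depth = random.randint(2, 5)
    return [random.choice([32, 64, 128, 256]) for _ in range(depth)]

def param_count(widths, n_in=64, n_out=10):
    dims = [n_in] + widths + [n_out]
    return sum(dims[i] * dims[i + 1] for i in range(len(dims) - 1))

def estimated_latency_us(widths):
    # Crude proxy: latency scales with total MAC count (an assumption).
    return param_count(widths) / 1_000

# Constraint-filtered random search: keep only candidates that fit the hardware.
feasible = [a for a in (sample_architecture() for _ in range(1000))
            if param_count(a) <= MAX_PARAMS and estimated_latency_us(a) <= MAX_LATENCY_US]

print(f"{len(feasible)} / 1000 sampled architectures satisfy the RRAM constraints")
```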
arXiv Detail & Related papers (2024-09-27T15:35:36Z)
- Spiker+: a framework for the generation of efficient Spiking Neural Networks FPGA accelerators for inference at the edge [49.42371633618761]
Spiker+ is a framework for generating efficient, low-power, and low-area customized Spiking Neural Network (SNN) accelerators on FPGA for inference at the edge.
Spiker+ is tested on two benchmark datasets, MNIST and the Spiking Heidelberg Digits (SHD).
arXiv Detail & Related papers (2024-01-02T10:42:42Z)
- Sparsifying Binary Networks [3.8350038566047426]
Binary neural networks (BNNs) have demonstrated their ability to solve complex tasks with accuracy comparable to full-precision deep neural networks (DNNs).
Despite recent improvements, they suffer from a fixed and limited compression factor that may prove insufficient for devices with very limited resources.
We propose sparse binary neural networks (SBNNs), a novel model and training scheme which introduces sparsity in BNNs and a new quantization function for binarizing the network's weights.
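A minimal sketch of combining binarization with sparsity: small-magnitude weights are pruned to zero and the rest keep only their sign. The threshold rule below is a generic sparse-binary heuristic, not necessarily the exact quantization function proposed in the paper.

```python
import numpy as np

def sparse_binarize(w, sparsity=0.5):
    """Map real-valued weights to {-1, 0, +1}: prune the smallest `sparsity`
    fraction of weights to 0, keep only the sign of the rest.

    Generic sparse-binary quantizer for illustration; the paper's actual
    quantization function may differ.
    """
    thresh = np.quantile(np.abs(w), sparsity)
    q = np.sign(w)
    q[np.abs(w) < thresh] = 0
    return q.astype(np.int8)

rng = np.random.default_rng(0)
w = rng.normal(size=(128, 128)).astype(np.float32)
q = sparse_binarize(w, sparsity=0.75)

print("unique values:", np.unique(q))
print("fraction of zero weights:", float(np.mean(q == 0)))
```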
arXiv Detail & Related papers (2022-07-11T15:54:41Z)
- Sub-bit Neural Networks: Learning to Compress and Accelerate Binary Neural Networks [72.81092567651395]
Sub-bit Neural Networks (SNNs) are a new type of binary quantization design tailored to compress and accelerate BNNs.
SNNs are trained with a kernel-aware optimization framework, which exploits binary quantization in the fine-grained convolutional kernel space.
Experiments on visual recognition benchmarks and hardware deployment on FPGA validate the great potential of SNNs.
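To make binary quantization in the fine-grained convolutional kernel space concrete, here is a hedged sketch of one way to drop below one bit per weight: restrict each 3x3 binary kernel to a small codebook and store only the codebook index. Building the codebook from the most frequent kernels is an illustrative stand-in for the paper's kernel-aware optimization.

```python
import numpy as np

rng = np.random.default_rng(1)

# Pretend these are the binarized 3x3 kernels of a conv layer: shape (n_kernels, 9).
kernels = np.sign(rng.normal(size=(4096, 9))).astype(np.int8)

# Build a small codebook of K binary kernels (here: the K most frequent ones).
K = 64
patterns, counts = np.unique(kernels, axis=0, return_counts=True)
codebook = patterns[np.argsort(-counts)[:K]]

# Snap every kernel to its nearest codebook entry (Hamming distance) and
# store only the index.
dists = (kernels[:, None, :] != codebook[None, :, :]).sum(axis=2)
indices = dists.argmin(axis=1).astype(np.uint8)

bits_per_weight_binary = 1.0              # plain BNN: 1 bit per weight
bits_per_weight_subbit = np.log2(K) / 9   # index bits shared by 9 weights
print(f"plain binary: {bits_per_weight_binary} bit/weight")
print(f"codebook of {K} kernels: {bits_per_weight_subbit:.2f} bit/weight")
```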
arXiv Detail & Related papers (2021-10-18T11:30:29Z)
- Towards Memory-Efficient Neural Networks via Multi-Level in situ Generation [10.563649948220371]
Deep neural networks (DNNs) have shown superior performance in a variety of tasks.
As they rapidly evolve, their escalating computation and memory demands make it challenging to deploy them on resource-constrained edge devices.
We propose a general and unified framework to trade expensive memory transactions for ultra-fast on-chip computation.
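A hedged sketch of the trade-memory-for-computation idea: store small factor matrices on-chip and regenerate the full weight matrix when it is needed, instead of fetching it from off-chip memory. The low-rank scheme below is one common way to do this; the paper's multi-level in-situ generation may differ in detail.

```python
import numpy as np

rng = np.random.default_rng(2)

N_OUT, N_IN, RANK = 512, 512, 16   # illustrative sizes

# Stored on-chip: two small factors instead of the full N_OUT x N_IN matrix.
U = rng.normal(size=(N_OUT, RANK)).astype(np.float32)
V = rng.normal(size=(RANK, N_IN)).astype(np.float32)

def generate_weights():
    """Regenerate the full weight matrix on the fly (compute instead of memory)."""
    return U @ V

x = rng.normal(size=(N_IN,)).astype(np.float32)
y = generate_weights() @ x          # weights exist only transiently

full_params = N_OUT * N_IN
stored_params = U.size + V.size
print(f"parameters stored: {stored_params} vs. {full_params} "
      f"({full_params / stored_params:.1f}x reduction)")
```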
arXiv Detail & Related papers (2021-08-25T18:50:24Z)
- Quantized Neural Networks via {-1, +1} Encoding Decomposition and Acceleration [83.84684675841167]
We propose a novel encoding scheme using {-1, +1} to decompose quantized neural networks (QNNs) into multi-branch binary networks.
We validate the effectiveness of our method on large-scale image classification, object detection, and semantic segmentation tasks.
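A worked sketch of the arithmetic identity behind the decomposition: a k-bit quantized weight restricted to odd levels can be written as a weighted sum of k binary {-1, +1} values, so a QNN layer becomes k parallel binary branches recombined with fixed power-of-two scales. This only illustrates the identity, not the paper's full training scheme.

```python
import numpy as np

K = 3  # number of binary branches (bits); illustrative

def decompose(q, k=K):
    """Decompose odd integers q in [-(2^k-1), 2^k-1] into k {-1,+1} planes
    such that q = sum_i 2**i * b_i."""
    m = (q + (2**k - 1)) // 2                 # map to unsigned 0 .. 2^k - 1
    bits = [(m >> i) & 1 for i in range(k)]   # binary expansion
    return [2 * b - 1 for b in bits]          # 0/1 -> -1/+1

rng = np.random.default_rng(3)
# Quantized weights restricted to odd levels, as in this style of decomposition.
levels = np.arange(-(2**K - 1), 2**K, 2)      # [-7, -5, ..., 5, 7] for K=3
q = rng.choice(levels, size=(4, 4))

branches = decompose(q)                        # list of K {-1,+1} matrices
recon = sum((2**i) * b for i, b in enumerate(branches))

print("levels:", levels)
print("exact reconstruction:", bool(np.all(recon == q)))
print("each branch is binary:", all(set(np.unique(b)) <= {-1, 1} for b in branches))
```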
arXiv Detail & Related papers (2021-06-18T03:11:15Z)
- Binary Graph Neural Networks [69.51765073772226]
Graph Neural Networks (GNNs) have emerged as a powerful and flexible framework for representation learning on irregular data.
In this paper, we present and evaluate different strategies for the binarization of graph neural networks.
We show that through careful design of the models, and control of the training process, binary graph neural networks can be trained at only a moderate cost in accuracy on challenging benchmarks.
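As a hedged illustration of what binarizing a GNN layer can look like, the sketch below uses sign binarization with a per-tensor scale (an XNOR-Net-style choice assumed here); the paper evaluates several such strategies.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy graph: 5 nodes, adjacency with self-loops, 8-d features, 4-d output.
A = np.array([[1, 1, 0, 0, 1],
              [1, 1, 1, 0, 0],
              [0, 1, 1, 1, 0],
              [0, 0, 1, 1, 1],
              [1, 0, 0, 1, 1]], dtype=np.float32)
X = rng.normal(size=(5, 8)).astype(np.float32)
W = rng.normal(size=(8, 4)).astype(np.float32)

def binarize(t):
    """Sign binarization with a mean-|.| scale (an assumed, XNOR-Net-style rule)."""
    return np.sign(t), np.mean(np.abs(t))

def binary_gcn_layer(A, X, W):
    Xb, sx = binarize(X)
    Wb, sw = binarize(W)
    # Aggregate neighbors, then apply the binarized transform; the matmul
    # operands are {-1,+1}, so hardware can use XNOR + popcount.
    H = A @ (Xb @ Wb) * (sx * sw)
    return np.maximum(H, 0)   # ReLU

print(binary_gcn_layer(A, X, W).shape)   # (5, 4)
```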
arXiv Detail & Related papers (2020-12-31T18:48:58Z)
- Efficient Synthesis of Compact Deep Neural Networks [17.362146401041528]
Deep neural networks (DNNs) have been deployed in myriad machine learning applications.
These large, deep models are often unsuitable for real-world applications due to their massive computational cost, high memory bandwidth requirements, and long latency.
In this paper, we review major approaches for automatically synthesizing compact, yet accurate, DNN/LSTM models suitable for real-world applications.
arXiv Detail & Related papers (2020-04-18T21:20:04Z)
- Data-Driven Neuromorphic DRAM-based CNN and RNN Accelerators [13.47462920292399]
The energy consumed by running large deep neural networks (DNNs) on hardware accelerators is dominated by the need for lots of fast memory to store both states and weights.
Although DRAM is high-throughput, low-cost memory (costing 20X less than SRAM), its long random access latency is bad for the unpredictable access patterns in spiking neural networks (SNNs).
This paper reports on our developments over the last 5 years of convolutional and recurrent deep neural network hardware accelerators that exploit either spatial or temporal sparsity similar to SNNs, but achieve state-of-the-art (SOA) throughput, power efficiency, and latency even with the use of DRAM.
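The temporal-sparsity point can be illustrated with a hedged delta-style sketch (threshold and layer sizes are assumptions): between consecutive inputs, only neurons whose values changed by more than a threshold trigger weight fetches and MACs, so slowly-changing data leaves most of the hardware idle.

```python
import numpy as np

rng = np.random.default_rng(5)

W = rng.normal(size=(256, 256)).astype(np.float32)
THETA = 0.1                     # delta threshold (assumed value)

x_prev = rng.normal(size=256).astype(np.float32)
y = W @ x_prev                  # full computation for the first input

# Next input differs only slightly (temporally sparse data, e.g. video/sensor).
x_new = x_prev + 0.05 * rng.normal(size=256).astype(np.float32)

delta = x_new - x_prev
active = np.abs(delta) > THETA      # only these inputs trigger MACs / weight fetches
y += W[:, active] @ delta[active]   # incremental update instead of full recompute

y_exact = W @ x_new
print(f"active inputs: {active.sum()} / {active.size}")
print(f"max error vs. exact recompute: {np.max(np.abs(y - y_exact)):.4f}")
```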
arXiv Detail & Related papers (2020-03-29T11:45:53Z)
- PatDNN: Achieving Real-Time DNN Execution on Mobile Devices with Pattern-based Weight Pruning [57.20262984116752]
We introduce a new dimension, fine-grained pruning patterns inside the coarse-grained structures, revealing a previously unknown point in design space.
With the higher accuracy enabled by fine-grained pruning patterns, the unique insight is to use the compiler to regain and guarantee high hardware efficiency.
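A hedged sketch of pattern-based pruning: each 3x3 kernel keeps only the nonzero positions of one pattern chosen from a small, fixed pattern set, which gives the compiler a regular structure to optimize for. The pattern library and selection rule below are illustrative, not PatDNN's exact design.

```python
import numpy as np

rng = np.random.default_rng(6)

# Illustrative pattern library: each pattern keeps 4 of the 9 positions in a 3x3 kernel.
PATTERNS = np.array([
    [[1, 1, 0], [1, 1, 0], [0, 0, 0]],
    [[0, 1, 1], [0, 1, 1], [0, 0, 0]],
    [[0, 0, 0], [1, 1, 0], [1, 1, 0]],
    [[0, 0, 0], [0, 1, 1], [0, 1, 1]],
], dtype=np.float32)

def apply_best_pattern(kernel):
    """Pick the pattern that preserves the most weight magnitude; zero the rest."""
    scores = [np.sum(np.abs(kernel) * p) for p in PATTERNS]
    return kernel * PATTERNS[int(np.argmax(scores))]

kernels = rng.normal(size=(16, 3, 3)).astype(np.float32)
pruned = np.stack([apply_best_pattern(k) for k in kernels])

print("nonzeros per kernel:", int(np.count_nonzero(pruned[0])))
print("overall sparsity:", 1.0 - np.count_nonzero(pruned) / pruned.size)
```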
arXiv Detail & Related papers (2020-01-01T04:52:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.