HASS: Hardware-Aware Sparsity Search for Dataflow DNN Accelerator
- URL: http://arxiv.org/abs/2406.03088v1
- Date: Wed, 5 Jun 2024 09:25:18 GMT
- Title: HASS: Hardware-Aware Sparsity Search for Dataflow DNN Accelerator
- Authors: Zhewen Yu, Sudarshan Sreeram, Krish Agrawal, Junyi Wu, Alexander Montgomerie-Corcoran, Cheng Zhang, Jianyi Cheng, Christos-Savvas Bouganis, Yiren Zhao
- Abstract summary: We propose a novel approach to exploit unstructured weight and activation sparsity for dataflow accelerators, using software and hardware co-optimization.
We achieve an efficiency improvement ranging from 1.3$\times$ to 4.2$\times$ compared to existing sparse designs.
- Score: 47.66463010685586
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep Neural Networks (DNNs) excel at learning hierarchical representations from raw data, such as images, audio, and text. To compute these DNN models with high performance and energy efficiency, they are usually deployed onto customized hardware accelerators. Among various accelerator designs, dataflow architectures have shown promising performance due to their layer-pipelined structure and their scalability in data parallelism. Exploiting weight and activation sparsity can further enhance memory storage and computation efficiency. However, existing approaches focus on exploiting sparsity in non-dataflow accelerators and cannot be applied to dataflow accelerators because of the large hardware design space the latter introduce. As a result, opportunities to find an optimal combination of sparsity features and hardware designs may be missed. In this paper, we propose a novel approach that exploits unstructured weight and activation sparsity for dataflow accelerators through software and hardware co-optimization. We propose a Hardware-Aware Sparsity Search (HASS) to systematically determine an efficient sparsity solution for dataflow accelerators. Over a set of models, we achieve an efficiency improvement ranging from 1.3$\times$ to 4.2$\times$ compared to existing sparse designs, which are either non-dataflow or non-hardware-aware. In particular, the throughput of MobileNetV3 can be optimized to 4895 images per second. HASS is open-source: \url{https://github.com/Yu-Zhewen/HASS}
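To make the co-optimization idea concrete, the following is a minimal Python sketch of a hardware-aware sparsity search in the spirit of the abstract: per-layer sparsity choices are filtered by an accuracy floor (the software constraint) and a DSP budget (the hardware constraint), and the feasible point with the best pipeline throughput wins. All names (`LayerConfig`, `evaluate_accuracy`) and the work/parallelism numbers are illustrative assumptions, not the actual HASS API.

```python
# A minimal sketch of a hardware-aware sparsity search for a layer-pipelined
# dataflow accelerator; hypothetical helpers, not the actual HASS code.
import itertools
from dataclasses import dataclass

@dataclass
class LayerConfig:
    sparsity: float   # fraction of weights pruned in this layer
    parallelism: int  # parallel MAC units allocated to this layer's stage

def pipeline_throughput(cfgs, work=1e6):
    # In a layer-pipelined dataflow design, the slowest stage bounds
    # throughput; sparsity shrinks the work, parallelism divides it.
    return 1.0 / max(work * (1.0 - c.sparsity) / c.parallelism for c in cfgs)

def search(num_layers, dsp_budget, accuracy_floor, evaluate_accuracy):
    best_cfgs, best_tp = None, 0.0
    for levels in itertools.product([0.0, 0.25, 0.5, 0.75], repeat=num_layers):
        cfgs = [LayerConfig(s, parallelism=4) for s in levels]
        if sum(c.parallelism for c in cfgs) > dsp_budget:
            continue  # hardware-infeasible design point
        if evaluate_accuracy(levels) < accuracy_floor:
            continue  # software-infeasible design point (too much pruning)
        tp = pipeline_throughput(cfgs)
        if tp > best_tp:
            best_cfgs, best_tp = cfgs, tp
    return best_cfgs
```

The exhaustive product over sparsity levels is only for brevity; a real search over this space would need the pruning strategy the paper describes.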
Related papers
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method achieve a 1.45x to 9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
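The FusionLLM summary centers on decentralized training over geo-distributed GPUs with adaptive compression. Below is a hedged sketch of the adaptive idea, assuming a simple top-k scheme keyed to measured link bandwidth; the linear ratio schedule is an illustrative assumption, not FusionLLM's actual policy.

```python
import torch

def adaptive_topk(grad: torch.Tensor, bandwidth_mbps: float):
    # Slower links get a smaller keep-ratio; this schedule is an assumption
    # for illustration, not the paper's compression policy.
    ratio = min(1.0, max(0.01, bandwidth_mbps / 1000.0))
    k = max(1, int(grad.numel() * ratio))
    flat = grad.flatten()
    _, indices = torch.topk(flat.abs(), k)  # largest-magnitude entries
    return flat[indices], indices           # transmit only (values, indices)
```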
- DCP: Learning Accelerator Dataflow for Neural Network via Propagation [52.06154296196845]
This work proposes an efficient data-centric approach, named Dataflow Code Propagation (DCP), to automatically find the optimal dataflow for DNN layers in seconds without human effort.
DCP learns a neural predictor to efficiently update the dataflow codes along desired gradient directions, minimizing various optimization objectives.
For example, without using additional training data, DCP surpasses the GAMMA method, which performs a full search using thousands of samples.
arXiv Detail & Related papers (2024-10-09T05:16:44Z)
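A hedged sketch of the propagation idea the DCP summary describes: a learned cost predictor supplies gradients that push a relaxed, continuous dataflow code toward lower predicted latency. The 8-dimensional code and the MLP shape are assumptions for illustration, and the predictor is presumed already trained on (code, cost) samples.

```python
import torch
import torch.nn as nn

# Cost predictor, assumed pre-trained on (dataflow code, latency) samples.
predictor = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1))
code = torch.randn(8, requires_grad=True)   # relaxed 8-dim dataflow code
optimizer = torch.optim.Adam([code], lr=0.05)

for _ in range(200):
    optimizer.zero_grad()
    loss = predictor(code).sum()  # predicted latency for this code
    loss.backward()               # only `code` is updated below; the
    optimizer.step()              # predictor's weights stay fixed
# Afterwards, the continuous code would be rounded back to legal
# tiling/loop-ordering choices for the accelerator.
```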
- FlexNN: A Dataflow-aware Flexible Deep Learning Accelerator for Energy-Efficient Edge Devices [0.6892601897291335]
This paper introduces FlexNN, a Flexible Neural Network accelerator, which adopts agile design principles.
Our design enables adaptable dataflows of any type through software descriptors.
To further enhance throughput and reduce energy consumption, we propose a novel sparsity-based acceleration logic.
arXiv Detail & Related papers (2024-03-14T01:39:12Z)
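To illustrate what "dataflows through software descriptors" plus "sparsity-based acceleration logic" could look like, here is a toy Python sketch; the descriptor fields and the zero-skipping loop are assumptions, not FlexNN's actual descriptor format.

```python
# Illustrative software descriptor and zero-skipping inner loop; the field
# names are assumptions, not FlexNN's real interface.
descriptor = {
    "loop_order": ("oc", "ic", "oy", "ox"),  # dataflow chosen in software
    "tile":       {"oc": 16, "ic": 16},
    "skip_zero_activations": True,
}

def mac_stream(weights, activations, desc):
    acc = 0
    for w, a in zip(weights, activations):
        if desc["skip_zero_activations"] and a == 0:
            continue              # sparsity logic: no MAC issued for zeros
        acc += w * a
    return acc
```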
- SATAY: A Streaming Architecture Toolflow for Accelerating YOLO Models on FPGA Devices [48.47320494918925]
This work tackles the challenges of deploying state-of-the-art object detection models onto FPGA devices for ultra-low-latency applications.
We employ a streaming architecture design for our YOLO accelerators, implementing the complete model on-chip in a deeply pipelined fashion.
We introduce novel hardware components to support the operations of YOLO models in a dataflow manner, and off-chip memory buffering to address the limited on-chip memory resources.
arXiv Detail & Related papers (2023-09-04T13:15:01Z)
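A toy software model of the streaming-architecture idea from the SATAY summary: each layer is a pipeline stage that consumes and produces a stream, with a small deque standing in for an on-chip line buffer. This is purely illustrative; the real toolflow generates FPGA hardware, not Python generators.

```python
from collections import deque

# Toy model of a deeply pipelined streaming (dataflow) architecture.
def conv_stage(stream, taps=3):
    window = deque(maxlen=taps)       # line-buffer analogue (on-chip FIFO)
    for px in stream:
        window.append(px)
        if len(window) == taps:
            yield sum(window) / taps  # stand-in for a convolution

def pipeline(pixels):
    s = iter(pixels)
    for _ in range(3):                # three chained stages, fully pipelined:
        s = conv_stage(s)             # each stage starts before the previous
    yield from s                      # one has consumed its whole input
```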
- KAPLA: Pragmatic Representation and Fast Solving of Scalable NN Accelerator Dataflow [0.0]
We build a generic, optimized, and fast dataflow solver, KAPLA, to explore the design space with effective validity check and efficiency estimation.
KAPLA keeps the energy overheads of the resulting dataflows within only 2.2% for training and 7.7% for inference.
It also outperforms random and machine-learning-based approaches, delivering more optimized results with orders-of-magnitude faster search.
arXiv Detail & Related papers (2023-06-09T03:12:42Z)
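The KAPLA summary highlights a cheap validity check followed by efficiency estimation. A minimal sketch of that solver pattern follows, with placeholder buffer and cost models that are assumptions, not the paper's actual formulas.

```python
import itertools

# Sketch of a dataflow solver: filter candidates with a cheap validity
# check, then rank survivors by an estimated cost. Numbers are placeholders.
def valid(tiling, buffer_bytes=64 * 1024):
    footprint = tiling["tile_h"] * tiling["tile_w"] * tiling["channels"] * 2
    return footprint <= buffer_bytes  # tile must fit in the on-chip buffer

def estimated_energy(tiling):
    # Toy efficiency estimate standing in for a real analytical model.
    return tiling["tile_h"] * tiling["tile_w"] / tiling["channels"]

candidates = [
    {"tile_h": h, "tile_w": w, "channels": c}
    for h, w, c in itertools.product([8, 16, 32], [8, 16, 32], [16, 64, 256])
]
best = min((t for t in candidates if valid(t)), key=estimated_energy)
```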
- Data-Model-Circuit Tri-Design for Ultra-Light Video Intelligence on Edge Devices [90.30316433184414]
We propose a data-model-hardware tri-design framework for high-throughput, low-cost, and high-accuracy multi-object tracking (MOT) on HD video streams.
Compared to the state-of-the-art MOT baseline, our tri-design approach can achieve 12.5x latency reduction, 20.9x effective frame rate improvement, 5.83x lower power, and 9.78x better energy efficiency, without much accuracy drop.
arXiv Detail & Related papers (2022-10-16T16:21:40Z)
- FastFlowNet: A Lightweight Network for Fast Optical Flow Estimation [81.76975488010213]
Dense optical flow estimation plays a key role in many robotic vision tasks.
Current networks often have a large number of parameters and incur heavy computation costs.
Our proposed FastFlowNet works in the well-known coarse-to-fine manner with the following innovations.
arXiv Detail & Related papers (2021-03-08T03:09:37Z)
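Since the summary invokes the "well-known coarse-to-fine manner", here is a generic sketch of that scheme rather than FastFlowNet's specific architecture; `estimate_residual_flow` stands in for a hypothetical per-level network.

```python
import torch
import torch.nn.functional as F

def coarse_to_fine(pyr1, pyr2, estimate_residual_flow):
    # pyr1/pyr2: image pyramids ordered fine -> coarse (N, C, H, W tensors).
    flow = None
    for i1, i2 in zip(reversed(pyr1), reversed(pyr2)):  # coarsest level first
        if flow is None:
            flow = torch.zeros(i1.shape[0], 2, *i1.shape[-2:])
        else:
            # Upsample to the finer level; flow vectors scale with resolution.
            flow = 2.0 * F.interpolate(flow, size=i1.shape[-2:],
                                       mode="bilinear", align_corners=False)
        flow = flow + estimate_residual_flow(i1, i2, flow)  # per-level residual
    return flow
```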
- Searching for Fast Model Families on Datacenter Accelerators [33.28421782921072]
We search for fast and accurate CNN model families for efficient inference on DC accelerators.
We propose a latency-aware compound scaling (LACS) method that optimizes both accuracy and latency.
Our LACS discovers that network depth should grow much faster than image size and network width.
arXiv Detail & Related papers (2021-02-10T18:15:40Z)
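A hedged sketch of latency-aware compound scaling: EfficientNet-style coefficients grown by a shared exponent, scored by an accuracy/latency objective. The coefficient values and the multiplicative reward form (in the style of MnasNet-like hardware-aware NAS objectives) are assumptions; only the "depth grows fastest" finding comes from the summary above.

```python
# EfficientNet-style compound scaling with a latency term in the objective.
def scaled_dims(phi, alpha=1.4, beta=1.2, gamma=1.1):
    # Coefficient values are illustrative; LACS finds depth (alpha) should
    # grow much faster than width (beta) and image size (gamma).
    depth = alpha ** phi
    width = beta ** phi
    resolution = gamma ** phi
    return depth, width, resolution

def objective(accuracy, latency_ms, target_ms=5.0, weight=0.07):
    # Multiplicative accuracy/latency trade-off; an assumed reward form,
    # not necessarily the paper's exact formula.
    return accuracy * (target_ms / latency_ms) ** weight
```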
- ConfuciuX: Autonomous Hardware Resource Assignment for DNN Accelerators using Reinforcement Learning [5.251940442946459]
We propose an autonomous strategy called ConfuciuX to find optimized HW resource assignments for a given model and dataflow style.
It converges to the optimized hardware configuration 4.7 to 24 times faster than alternative techniques.
arXiv Detail & Related papers (2020-09-04T04:59:26Z)
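Finally, a toy sketch of RL-driven resource assignment in the spirit of the ConfuciuX summary: an epsilon-greedy agent picks a PE count per layer and updates per-(layer, action) values from a simulated-latency reward. ConfuciuX itself combines REINFORCE with a genetic-algorithm refinement stage; this simplified loop is illustrative only, and `simulate_latency` is a hypothetical cost model.

```python
import random

ACTIONS = [16, 32, 64, 128]  # candidate PE counts per layer (assumed)

def search(num_layers, simulate_latency, episodes=500, eps=0.1):
    best, best_r = None, float("-inf")
    q = {}  # (layer, pe_count) -> running value estimate
    for _ in range(episodes):
        # Epsilon-greedy: mostly exploit the current value estimates,
        # occasionally explore a random assignment.
        assign = [max(ACTIONS, key=lambda a: q.get((l, a), 0.0))
                  if random.random() > eps else random.choice(ACTIONS)
                  for l in range(num_layers)]
        r = -simulate_latency(assign)  # lower latency -> higher reward
        for l, a in enumerate(assign):
            q[(l, a)] = q.get((l, a), 0.0) + 0.1 * (r - q.get((l, a), 0.0))
        if r > best_r:
            best, best_r = assign, r
    return best
```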