Multiply-and-Fire (MNF): An Event-driven Sparse Neural Network
Accelerator
- URL: http://arxiv.org/abs/2204.09797v1
- Date: Wed, 20 Apr 2022 21:56:50 GMT
- Title: Multiply-and-Fire (MNF): An Event-driven Sparse Neural Network
Accelerator
- Authors: Miao Yu, Tingting Xiang, Venkata Pavan Kumar Miriyala, Trevor E.
Carlson
- Abstract summary: This work takes a unique look at sparsity with an event (or activation-driven) approach to ANN acceleration.
Our analytical and experimental results show that this event-driven solution presents a new direction to enable highly efficient AI inference for both CNN and workloads.
- Score: 3.224364382976958
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Machine learning, particularly deep neural network inference, has become a
vital workload for many computing systems, from data centers and HPC systems to
edge-based computing. As advances in sparsity have helped improve the
efficiency of AI acceleration, there is a continued need for improved system
efficiency for both high-performance and system-level acceleration.
This work takes a unique look at sparsity with an event-driven (or
activation-driven) approach to ANN acceleration that aims to minimize useless
work, improve utilization, and increase performance and energy efficiency. Our
analytical and experimental results show that this event-driven solution
presents a new direction to enable highly efficient AI inference for both CNN
and MLP workloads.
This work demonstrates state-of-the-art energy efficiency and performance,
centered on activation-based sparsity and a highly parallel dataflow method
that improves overall functional unit utilization (at 30 fps). It enhances
energy efficiency over a state-of-the-art solution by 1.46$\times$.
Taken together, this methodology presents a novel direction for achieving
high-efficiency, high-performance designs for next-generation AI acceleration
platforms.
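To make the activation-driven idea concrete, here is a minimal Python sketch (an illustration of the general principle, not the MNF hardware dataflow): only nonzero activations fire events, and each event triggers just the multiply-and-accumulate work it is responsible for, so zero activations cost nothing.

```python
import numpy as np

def event_driven_layer(activations, weights, bias):
    """Conceptual sketch of activation-driven (event-based) dense-layer inference.

    Only nonzero input activations fire "events"; each event performs the
    multiplications it is responsible for, so zero activations do no work.
    This illustrates the general idea, not the MNF hardware dataflow.
    """
    out = bias.copy()
    # Iterate only over nonzero activations (the "events").
    for i in np.flatnonzero(activations):
        out += activations[i] * weights[i]  # multiply-and-accumulate per event
    return np.maximum(out, 0.0)             # ReLU keeps the next layer's input sparse

# Example: a highly sparse activation vector does proportionally little work.
rng = np.random.default_rng(0)
x = rng.random(1024) * (rng.random(1024) < 0.1)   # ~90% zeros
W = rng.standard_normal((1024, 256)) * 0.01
b = np.zeros(256)
y = event_driven_layer(x, W, b)
```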
Related papers
- big.LITTLE Vision Transformer for Efficient Visual Recognition [34.015778625984055]
big.LITTLE Vision Transformer is an innovative architecture aimed at achieving efficient visual recognition.
The system is composed of two distinct blocks: the big performance block and the LITTLE efficiency block.
When processing an image, the system determines the importance of each token and allocates it to the appropriate block.
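A rough sketch of this token-allocation idea, written as generic PyTorch and not taken from the paper, could route the top-scoring tokens through the heavy block and the rest through a cheap one:

```python
import torch
import torch.nn as nn

class BigLittleLayer(nn.Module):
    """Hypothetical sketch of importance-based token allocation (not the authors' code).

    A learned scorer ranks tokens; the top-k "important" tokens go through the
    big (high-capacity) block, the rest only through the LITTLE (cheap) block.
    """
    def __init__(self, dim, keep_ratio=0.25):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)   # per-token importance score
        self.big = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.little = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.keep_ratio = keep_ratio

    def forward(self, tokens):                            # tokens: (B, N, D)
        B, N, D = tokens.shape
        k = max(1, int(N * self.keep_ratio))
        scores = self.scorer(tokens).squeeze(-1)          # (B, N) importance per token
        top_idx = scores.topk(k, dim=1).indices           # indices of "important" tokens
        idx = top_idx.unsqueeze(-1).expand(-1, -1, D)     # (B, k, D) gather/scatter index
        little_out = self.little(tokens)                  # cheap path for every token
        big_out = self.big(torch.gather(tokens, 1, idx))  # heavy path for top-k tokens only
        return little_out.scatter(1, idx, big_out)        # big outputs overwrite their slots

# Usage sketch: 196 tokens of width 256, only ~25% take the expensive path.
layer = BigLittleLayer(dim=256)
y = layer(torch.randn(2, 196, 256))                       # (2, 196, 256)
```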
arXiv Detail & Related papers (2024-10-14T08:21:00Z)
- Efficient Federated Learning Using Dynamic Update and Adaptive Pruning with Momentum on Shared Server Data [59.6985168241067]
Federated Learning (FL) encounters two important problems, i.e., low training efficiency and limited computational resources.
We propose a new FL framework, FedDUMAP, to leverage the shared insensitive data on the server and the distributed data in edge devices.
Our proposed FL model, FedDUMAP, combines the three original techniques and achieves significantly better performance than baseline approaches.
arXiv Detail & Related papers (2024-08-11T02:59:11Z)
- Center-Sensitive Kernel Optimization for Efficient On-Device Incremental Learning [88.78080749909665]
Current on-device training methods focus only on efficient training without considering catastrophic forgetting.
This paper proposes a simple but effective edge-friendly incremental learning framework.
Our method achieves an average accuracy boost of 38.08% with even less memory and approximate computation.
arXiv Detail & Related papers (2024-06-13T05:49:29Z)
- Augmenting the FedProx Algorithm by Minimizing Convergence [0.0]
We present a novel approach called G Federated Proximity.
Our results indicate a significant increase in throughput, with approximately 90% better convergence compared to the existing model's performance.
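For background on the algorithm being augmented: standard FedProx adds a proximal term (mu/2) * ||w - w_global||^2 to each client's local objective, penalizing drift from the current global model. The sketch below shows that standard local step in PyTorch, not the proposed G Federated Proximity variant.

```python
import torch

def fedprox_local_step(model, global_params, batch, loss_fn, optimizer, mu=0.01):
    """One local client update with the standard FedProx proximal term.

    Local objective: task loss + (mu / 2) * ||w - w_global||^2.
    Background for the entry above, not the paper's modified algorithm.
    """
    inputs, targets = batch
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    # Proximal penalty keeps local weights close to the current global model.
    prox = sum(((p - g.detach()) ** 2).sum()
               for p, g in zip(model.parameters(), global_params))
    (loss + 0.5 * mu * prox).backward()
    optimizer.step()
    return loss.item()
```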
arXiv Detail & Related papers (2024-06-02T14:01:55Z)
- Accelerating Neural Network Training: A Brief Review [0.5825410941577593]
This study examines innovative approaches to expedite the training of deep neural networks (DNNs).
The research utilizes methodologies including Gradient Accumulation (GA), Automatic Mixed Precision (AMP), and Pin Memory (PM).
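All three techniques are standard PyTorch features; an illustrative sketch that combines them (not taken from the paper) looks like this:

```python
import torch
from torch.utils.data import DataLoader

def train_one_epoch(model, dataset, loss_fn, optimizer, accum_steps=4, device="cuda"):
    """Illustrative combination of Pin Memory, Automatic Mixed Precision, and
    Gradient Accumulation in PyTorch. `model` is assumed to already be on `device`."""
    # Pin Memory (PM): page-locked host buffers speed up host-to-GPU copies.
    loader = DataLoader(dataset, batch_size=64, shuffle=True,
                        pin_memory=True, num_workers=4)
    scaler = torch.cuda.amp.GradScaler()   # scales losses for fp16 numerical stability
    model.train()
    optimizer.zero_grad()
    for step, (x, y) in enumerate(loader):
        x = x.to(device, non_blocking=True)
        y = y.to(device, non_blocking=True)
        # Automatic Mixed Precision (AMP): run the forward pass in reduced precision.
        with torch.cuda.amp.autocast():
            loss = loss_fn(model(x), y) / accum_steps
        scaler.scale(loss).backward()
        # Gradient Accumulation (GA): step only every `accum_steps` batches,
        # simulating a larger effective batch size with the same memory budget.
        if (step + 1) % accum_steps == 0:
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
```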
arXiv Detail & Related papers (2023-12-15T18:43:45Z)
- Latency-aware Unified Dynamic Networks for Efficient Image Recognition [72.8951331472913]
LAUDNet is a framework to bridge the theoretical and practical efficiency gap in dynamic networks.
It integrates three primary dynamic paradigms: spatially adaptive computation, dynamic layer skipping, and dynamic channel skipping.
It can notably reduce the latency of models like ResNet by over 50% on platforms such as V100, 3090, and TX2 GPUs.
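As a generic illustration of one of these paradigms, dynamic layer skipping can be pictured as a tiny gate that decides per input whether a residual block runs at all; the sketch below is a common formulation, not LAUDNet's implementation.

```python
import torch
import torch.nn as nn

class SkippableBlock(nn.Module):
    """Generic dynamic layer skipping: a lightweight gate decides, per input, whether
    a residual block executes (illustrative sketch, not LAUDNet's design)."""
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(channels, 1)
        )

    def forward(self, x):
        # Hard per-sample decision at inference time (training would need a
        # differentiable relaxation such as Gumbel-Softmax).
        keep = (torch.sigmoid(self.gate(x)) > 0.5).squeeze(1)   # (B,) bool
        if not keep.any():                  # whole batch skips: the block costs nothing
            return x
        out = x.clone()
        out[keep] = x[keep] + self.block(x[keep])               # run only kept samples
        return out
```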
arXiv Detail & Related papers (2023-08-30T10:57:41Z)
- FedDUAP: Federated Learning with Dynamic Update and Adaptive Pruning Using Shared Data on the Server [64.94942635929284]
Federated Learning (FL) suffers from two critical challenges, i.e., limited computational resources and low training efficiency.
We propose a novel FL framework, FedDUAP, to exploit the insensitive data on the server and the decentralized data in edge devices.
By integrating the two original techniques, our proposed FL model, FedDUAP, significantly outperforms baseline approaches in terms of accuracy (up to 4.8% higher), efficiency (up to 2.8 times faster), and computational cost (up to 61.9% smaller).
arXiv Detail & Related papers (2022-04-25T10:00:00Z)
- FPGA-based AI Smart NICs for Scalable Distributed AI Training Systems [62.20308752994373]
We propose a new smart network interface card (NIC) for distributed AI training systems using field-programmable gate arrays (FPGAs).
Our proposed FPGA-based AI smart NIC enhances overall training performance by 1.6x at 6 nodes, with an estimated 2.5x performance improvement at 32 nodes, compared to the baseline system using conventional NICs.
arXiv Detail & Related papers (2022-04-22T21:57:00Z)
- Efficient Few-Shot Object Detection via Knowledge Inheritance [62.36414544915032]
Few-shot object detection (FSOD) aims at learning a generic detector that can adapt to unseen tasks with scarce training samples.
We present an efficient pretrain-transfer framework (PTF) baseline with no computational increment.
We also propose an adaptive length re-scaling (ALR) strategy to alleviate the vector length inconsistency between the predicted novel weights and the pretrained base weights.
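One plausible reading of this re-scaling is sketched below, under the assumption that it means matching the average L2 norm (vector length) of the predicted novel-class weights to that of the pretrained base-class weights; the paper may define ALR differently.

```python
import torch

def adaptive_length_rescale(novel_weights, base_weights, eps=1e-8):
    """Hypothetical sketch of adaptive length re-scaling (ALR): scale the predicted
    novel-class weight vectors so their average L2 norm matches that of the
    pretrained base-class weights. Illustrative only, not the authors' formulation."""
    novel_norm = novel_weights.norm(dim=1).mean()     # average length of novel vectors
    target_norm = base_weights.norm(dim=1).mean()     # average length of base vectors
    return novel_weights * (target_norm / (novel_norm + eps))

# Example shapes: each classifier weight row is a per-class vector.
base = torch.randn(60, 1024)          # pretrained base-class weights
novel = torch.randn(5, 1024) * 0.1    # predicted novel-class weights (shorter vectors)
aligned = adaptive_length_rescale(novel, base)
```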
arXiv Detail & Related papers (2022-03-23T06:24:31Z)
- Making EfficientNet More Efficient: Exploring Batch-Independent Normalization, Group Convolutions and Reduced Resolution Training [8.411385346896413]
We focus on improving the practical efficiency of the state-of-the-art EfficientNet models on a new class of accelerator, the Graphcore IPU.
We do this by extending this family of models in the following ways: (i) generalising depthwise convolutions to group convolutions; (ii) adding proxy-normalized activations to match batch normalization performance with batch-independent statistics; and (iii) reducing compute by lowering the training resolution and inexpensively fine-tuning at higher resolution.
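Point (i) amounts to changing the `groups` argument of the convolution: depthwise convolution is the extreme case groups = channels, while a group convolution uses fewer, wider groups. The sketch below is a plain-PyTorch illustration, not the authors' IPU code.

```python
import torch.nn as nn

def depthwise_conv(channels, kernel_size=3):
    # Depthwise convolution: one group per channel (groups == channels).
    return nn.Conv2d(channels, channels, kernel_size,
                     padding=kernel_size // 2, groups=channels)

def grouped_conv(channels, group_size=16, kernel_size=3):
    # Generalized group convolution: each group mixes `group_size` channels,
    # trading a little extra compute for better accelerator utilization.
    assert channels % group_size == 0
    return nn.Conv2d(channels, channels, kernel_size,
                     padding=kernel_size // 2, groups=channels // group_size)
```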
arXiv Detail & Related papers (2021-06-07T14:10:52Z)
- AutoScale: Optimizing Energy Efficiency of End-to-End Edge Inference under Stochastic Variance [11.093360539563657]
AutoScale is an adaptive and lightweight execution scaling engine built upon a custom-designed reinforcement learning algorithm.
This paper proposes AutoScale to enable accurate, energy-efficient deep learning inference at the edge.
arXiv Detail & Related papers (2020-05-06T00:30:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.