Multiply-and-Fire (MNF): An Event-driven Sparse Neural Network
Accelerator
- URL: http://arxiv.org/abs/2204.09797v1
- Date: Wed, 20 Apr 2022 21:56:50 GMT
- Title: Multiply-and-Fire (MNF): An Event-driven Sparse Neural Network
Accelerator
- Authors: Miao Yu, Tingting Xiang, Venkata Pavan Kumar Miriyala, Trevor E.
Carlson
- Abstract summary: This work takes a unique look at sparsity with an event (or activation-driven) approach to ANN acceleration.
Our analytical and experimental results show that this event-driven solution presents a new direction to enable highly efficient AI inference for both CNN and workloads.
- Score: 3.224364382976958
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Machine learning, particularly deep neural network inference, has become a
vital workload for many computing systems, from data centers and HPC systems to
edge-based computing. As advances in sparsity have helped improve the
efficiency of AI acceleration, there is a continued need for improved system
efficiency for both high-performance and system-level acceleration.
This work takes a unique look at sparsity with an event-driven (or
activation-driven) approach to ANN acceleration that aims to minimize useless
work, improve utilization, and increase performance and energy efficiency. Our
analytical and experimental results show that this event-driven solution
presents a new direction to enable highly efficient AI inference for both CNN
and MLP workloads.
This work demonstrates state-of-the-art energy efficiency and performance,
centered on activation-based sparsity and a highly parallel dataflow method
that improves overall functional unit utilization (at 30 fps). It enhances
energy efficiency over a state-of-the-art solution by 1.46$\times$.
Taken together, this methodology presents a novel direction for achieving
high-efficiency, high-performance designs for next-generation AI acceleration
platforms.
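To make the activation-driven idea concrete, here is a minimal Python sketch (an illustration of the general principle, not the MNF hardware dataflow): only nonzero activations fire events, and each event triggers just the multiply-and-accumulate work it is responsible for, so zero activations cost nothing.

```python
import numpy as np

def event_driven_layer(activations, weights, bias):
    """Conceptual sketch of activation-driven (event-based) dense-layer inference.

    Only nonzero input activations fire "events"; each event performs the
    multiplications it is responsible for, so zero activations do no work.
    This illustrates the general idea, not the MNF hardware dataflow.
    """
    out = bias.copy()
    # Iterate only over nonzero activations (the "events").
    for i in np.flatnonzero(activations):
        out += activations[i] * weights[i]  # multiply-and-accumulate per event
    return np.maximum(out, 0.0)             # ReLU keeps the next layer's input sparse

# Example: a highly sparse activation vector does proportionally little work.
rng = np.random.default_rng(0)
x = rng.random(1024) * (rng.random(1024) < 0.1)   # ~90% zeros
W = rng.standard_normal((1024, 256)) * 0.01
b = np.zeros(256)
y = event_driven_layer(x, W, b)
```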
Related papers
- big.LITTLE Vision Transformer for Efficient Visual Recognition [34.015778625984055]
big.LITTLE Vision Transformer is an innovative architecture aimed at achieving efficient visual recognition.
The system is composed of two distinct blocks: the big performance block and the LITTLE efficiency block.
When processing an image, the system determines the importance of each token and allocates it to the appropriate block.
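A rough sketch of this token-allocation idea, written as generic PyTorch and not taken from the paper, could route the top-scoring tokens through the heavy block and the rest through a cheap one:

```python
import torch
import torch.nn as nn

class BigLittleLayer(nn.Module):
    """Hypothetical sketch of importance-based token allocation (not the authors' code).

    A learned scorer ranks tokens; the top-k "important" tokens go through the
    big (high-capacity) block, the rest only through the LITTLE (cheap) block.
    """
    def __init__(self, dim, keep_ratio=0.25):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)   # per-token importance score
        self.big = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.little = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.keep_ratio = keep_ratio

    def forward(self, tokens):                            # tokens: (B, N, D)
        B, N, D = tokens.shape
        k = max(1, int(N * self.keep_ratio))
        scores = self.scorer(tokens).squeeze(-1)          # (B, N) importance per token
        top_idx = scores.topk(k, dim=1).indices           # indices of "important" tokens
        idx = top_idx.unsqueeze(-1).expand(-1, -1, D)     # (B, k, D) gather/scatter index
        little_out = self.little(tokens)                  # cheap path for every token
        big_out = self.big(torch.gather(tokens, 1, idx))  # heavy path for top-k tokens only
        return little_out.scatter(1, idx, big_out)        # big outputs overwrite their slots

# Usage sketch: 196 tokens of width 256, only ~25% take the expensive path.
layer = BigLittleLayer(dim=256)
y = layer(torch.randn(2, 196, 256))                       # (2, 196, 256)
```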
arXiv Detail & Related papers (2024-10-14T08:21:00Z)
- Efficient Federated Learning Using Dynamic Update and Adaptive Pruning with Momentum on Shared Server Data [59.6985168241067]
Federated Learning (FL) encounters two important problems, i.e., low training efficiency and limited computational resources.
We propose a new FL framework, FedDUMAP, to leverage the shared insensitive data on the server and the distributed data in edge devices.
Our proposed FL model, FedDUMAP, combines the three original techniques and achieves significantly better performance than baseline approaches.
arXiv Detail & Related papers (2024-08-11T02:59:11Z)
- Center-Sensitive Kernel Optimization for Efficient On-Device Incremental Learning [88.78080749909665]
Current on-device training methods focus only on efficient training without considering catastrophic forgetting.
This paper proposes a simple but effective edge-friendly incremental learning framework.
Our method achieves an average accuracy boost of 38.08% with even less memory and approximate computation.
arXiv Detail & Related papers (2024-06-13T05:49:29Z)
- Augmenting the FedProx Algorithm by Minimizing Convergence [0.0]
We present a novel approach called G Federated Proximity.
Our results indicate a significant increase in throughput, with approximately 90% better convergence compared to the existing model's performance.
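For background on the algorithm being augmented: standard FedProx adds a proximal term (mu/2) * ||w - w_global||^2 to each client's local objective, penalizing drift from the current global model. The sketch below shows that standard local step in PyTorch, not the proposed G Federated Proximity variant.

```python
import torch

def fedprox_local_step(model, global_params, batch, loss_fn, optimizer, mu=0.01):
    """One local client update with the standard FedProx proximal term.

    Local objective: task loss + (mu / 2) * ||w - w_global||^2.
    Background for the entry above, not the paper's modified algorithm.
    """
    inputs, targets = batch
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    # Proximal penalty keeps local weights close to the current global model.
    prox = sum(((p - g.detach()) ** 2).sum()
               for p, g in zip(model.parameters(), global_params))
    (loss + 0.5 * mu * prox).backward()
    optimizer.step()
    return loss.item()
```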
arXiv Detail & Related papers (2024-06-02T14:01:55Z)
- Accelerating Neural Network Training: A Brief Review [0.5825410941577593]
This study examines innovative approaches to expedite the training of deep neural networks (DNNs).
The research utilizes methodologies including Gradient Accumulation (GA), Automatic Mixed Precision (AMP), and Pin Memory (PM).
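All three techniques are standard PyTorch features; an illustrative sketch that combines them (not taken from the paper) looks like this:

```python
import torch
from torch.utils.data import DataLoader

def train_one_epoch(model, dataset, loss_fn, optimizer, accum_steps=4, device="cuda"):
    """Illustrative combination of Pin Memory, Automatic Mixed Precision, and
    Gradient Accumulation in PyTorch. `model` is assumed to already be on `device`."""
    # Pin Memory (PM): page-locked host buffers speed up host-to-GPU copies.
    loader = DataLoader(dataset, batch_size=64, shuffle=True,
                        pin_memory=True, num_workers=4)
    scaler = torch.cuda.amp.GradScaler()   # scales losses for fp16 numerical stability
    model.train()
    optimizer.zero_grad()
    for step, (x, y) in enumerate(loader):
        x = x.to(device, non_blocking=True)
        y = y.to(device, non_blocking=True)
        # Automatic Mixed Precision (AMP): run the forward pass in reduced precision.
        with torch.cuda.amp.autocast():
            loss = loss_fn(model(x), y) / accum_steps
        scaler.scale(loss).backward()
        # Gradient Accumulation (GA): step only every `accum_steps` batches,
        # simulating a larger effective batch size with the same memory budget.
        if (step + 1) % accum_steps == 0:
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
```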
arXiv Detail & Related papers (2023-12-15T18:43:45Z)
- Latency-aware Unified Dynamic Networks for Efficient Image Recognition [72.8951331472913]
LAUDNet is a framework to bridge the theoretical and practical efficiency gap in dynamic networks.
It integrates three primary dynamic paradigms: spatially adaptive computation, dynamic layer skipping, and dynamic channel skipping.
It can notably reduce the latency of models like ResNet by over 50% on platforms such as V100, 3090, and TX2 GPUs.
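As a generic illustration of one of these paradigms, dynamic layer skipping can be pictured as a tiny gate that decides per input whether a residual block runs at all; the sketch below is a common formulation, not LAUDNet's implementation.

```python
import torch
import torch.nn as nn

class SkippableBlock(nn.Module):
    """Generic dynamic layer skipping: a lightweight gate decides, per input, whether
    a residual block executes (illustrative sketch, not LAUDNet's design)."""
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(channels, 1)
        )

    def forward(self, x):
        # Hard per-sample decision at inference time (training would need a
        # differentiable relaxation such as Gumbel-Softmax).
        keep = (torch.sigmoid(self.gate(x)) > 0.5).squeeze(1)   # (B,) bool
        if not keep.any():                  # whole batch skips: the block costs nothing
            return x
        out = x.clone()
        out[keep] = x[keep] + self.block(x[keep])               # run only kept samples
        return out
```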
arXiv Detail & Related papers (2023-08-30T10:57:41Z)
- FedDUAP: Federated Learning with Dynamic Update and Adaptive Pruning Using Shared Data on the Server [64.94942635929284]
Federated Learning (FL) suffers from two critical challenges, i.e., limited computational resources and low training efficiency.
We propose a novel FL framework, FedDUAP, to exploit the insensitive data on the server and the decentralized data in edge devices.
By integrating the two original techniques, our proposed FL model, FedDUAP, significantly outperforms baseline approaches in terms of accuracy (up to 4.8% higher), efficiency (up to 2.8 times faster), and computational cost (up to 61.9% smaller).
arXiv Detail & Related papers (2022-04-25T10:00:00Z)
- FPGA-based AI Smart NICs for Scalable Distributed AI Training Systems [62.20308752994373]
We propose a new smart network interface card (NIC) for distributed AI training systems using field-programmable gate arrays (FPGAs).
Our proposed FPGA-based AI smart NIC enhances overall training performance by 1.6x at 6 nodes, with an estimated 2.5x performance improvement at 32 nodes, compared to the baseline system using conventional NICs.
arXiv Detail & Related papers (2022-04-22T21:57:00Z)
- Efficient Few-Shot Object Detection via Knowledge Inheritance [62.36414544915032]
Few-shot object detection (FSOD) aims at learning a generic detector that can adapt to unseen tasks with scarce training samples.
We present an efficient pretrain-transfer framework (PTF) baseline with no computational increment.
We also propose an adaptive length re-scaling (ALR) strategy to alleviate the vector length inconsistency between the predicted novel weights and the pretrained base weights.
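One plausible reading of this re-scaling is sketched below, under the assumption that it means matching the average L2 norm (vector length) of the predicted novel-class weights to that of the pretrained base-class weights; the paper may define ALR differently.

```python
import torch

def adaptive_length_rescale(novel_weights, base_weights, eps=1e-8):
    """Hypothetical sketch of adaptive length re-scaling (ALR): scale the predicted
    novel-class weight vectors so their average L2 norm matches that of the
    pretrained base-class weights. Illustrative only, not the authors' formulation."""
    novel_norm = novel_weights.norm(dim=1).mean()     # average length of novel vectors
    target_norm = base_weights.norm(dim=1).mean()     # average length of base vectors
    return novel_weights * (target_norm / (novel_norm + eps))

# Example shapes: each classifier weight row is a per-class vector.
base = torch.randn(60, 1024)          # pretrained base-class weights
novel = torch.randn(5, 1024) * 0.1    # predicted novel-class weights (shorter vectors)
aligned = adaptive_length_rescale(novel, base)
```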
arXiv Detail & Related papers (2022-03-23T06:24:31Z)
- Making EfficientNet More Efficient: Exploring Batch-Independent Normalization, Group Convolutions and Reduced Resolution Training [8.411385346896413]
We focus on improving the practical efficiency of the state-of-the-art EfficientNet models on a new class of accelerator, the Graphcore IPU.
We do this by extending this family of models in the following ways: (i) generalising depthwise convolutions to group convolutions; (ii) adding proxy-normalized activations to match batch normalization performance with batch-independent statistics; and (iii) reducing compute by lowering the training resolution and inexpensively fine-tuning at higher resolution.
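Point (i) amounts to changing the `groups` argument of the convolution: depthwise convolution is the extreme case groups = channels, while a group convolution uses fewer, wider groups. The sketch below is a plain-PyTorch illustration, not the authors' IPU code.

```python
import torch.nn as nn

def depthwise_conv(channels, kernel_size=3):
    # Depthwise convolution: one group per channel (groups == channels).
    return nn.Conv2d(channels, channels, kernel_size,
                     padding=kernel_size // 2, groups=channels)

def grouped_conv(channels, group_size=16, kernel_size=3):
    # Generalized group convolution: each group mixes `group_size` channels,
    # trading a little extra compute for better accelerator utilization.
    assert channels % group_size == 0
    return nn.Conv2d(channels, channels, kernel_size,
                     padding=kernel_size // 2, groups=channels // group_size)
```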
arXiv Detail & Related papers (2021-06-07T14:10:52Z)
- AutoScale: Optimizing Energy Efficiency of End-to-End Edge Inference under Stochastic Variance [11.093360539563657]
AutoScale is an adaptive and lightweight execution scaling engine built upon a custom-designed reinforcement learning algorithm.
This paper proposes AutoScale to enable accurate, energy-efficient deep learning inference at the edge.
arXiv Detail & Related papers (2020-05-06T00:30:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.