Vega: A 10-Core SoC for IoT End-Nodes with DNN Acceleration and
Cognitive Wake-Up From MRAM-Based State-Retentive Sleep Mode
- URL: http://arxiv.org/abs/2110.09101v1
- Date: Mon, 18 Oct 2021 08:47:45 GMT
- Title: Vega: A 10-Core SoC for IoT End-Nodes with DNN Acceleration and
Cognitive Wake-Up From MRAM-Based State-Retentive Sleep Mode
- Authors: Davide Rossi, Francesco Conti, Manuel Eggimann, Alfio Di Mauro,
Giuseppe Tagliavini, Stefan Mach, Marco Guermandi, Antonio Pullini, Igor Loi,
Jie Chen, Eric Flamand, Luca Benini
- Abstract summary: Vega is an IoT end-node system capable of scaling from a 1.7 µW fully retentive cognitive sleep mode up to 32.2 GOPS (@ 49.4 mW) peak performance on NSAAs.
Vega achieves SoA-leading efficiency of 615 GOPS/W on 8-bit INT and 79 and 129 GFLOPS/W on 32- and 16-bit FP, respectively.
- Score: 14.214500730272256
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Internet-of-Things requires end-nodes with ultra-low-power always-on
capability for a long battery lifetime, as well as high performance, energy
efficiency, and extreme flexibility to deal with complex and fast-evolving
near-sensor analytics algorithms (NSAAs). We present Vega, an IoT end-node SoC
capable of scaling from a 1.7 $\mathrm{\mu}$W fully retentive cognitive sleep
mode up to 32.2 GOPS (@ 49.4 mW) peak performance on NSAAs, including mobile
DNN inference, exploiting 1.6 MB of state-retentive SRAM, and 4 MB of
non-volatile MRAM. To meet the performance and flexibility requirements of
NSAAs, the SoC features 10 RISC-V cores: one core for SoC and IO management and
a 9-core cluster supporting multi-precision SIMD integer and floating-point
computation. Vega achieves SoA-leading efficiency of 615 GOPS/W on 8-bit INT
computation (boosted to 1.3 TOPS/W for 8-bit DNN inference with hardware
acceleration). On floating-point (FP) computation, it achieves SoA-leading
efficiency of 79 and 129 GFLOPS/W on 32- and 16-bit FP, respectively. Two
programmable machine-learning (ML) accelerators boost energy efficiency in
cognitive sleep and active states, respectively.
Related papers
- Spiker+: a framework for the generation of efficient Spiking Neural
Networks FPGA accelerators for inference at the edge [49.42371633618761]
Spiker+ is a framework for generating efficient, low-power, and low-area customized Spiking Neural Network (SNN) accelerators on FPGA for inference at the edge.
Spiker+ is tested on two benchmark datasets, MNIST and the Spiking Heidelberg Digits (SHD).
arXiv Detail & Related papers (2024-01-02T10:42:42Z) - Latency-aware Unified Dynamic Networks for Efficient Image Recognition [72.8951331472913]
- Latency-aware Unified Dynamic Networks for Efficient Image Recognition [72.8951331472913]
LAUDNet is a framework to bridge the theoretical and practical efficiency gap in dynamic networks.
It integrates three primary dynamic paradigms-spatially adaptive computation, dynamic layer skipping, and dynamic channel skipping.
It can notably reduce the latency of models like ResNet by over 50% on platforms such as V100, 3090, and TX2 GPUs.
arXiv Detail & Related papers (2023-08-30T10:57:41Z) - Adaptable Butterfly Accelerator for Attention-based NNs via Hardware and
- Adaptable Butterfly Accelerator for Attention-based NNs via Hardware and Algorithm Co-design [66.39546326221176]
Attention-based neural networks have become pervasive in many AI tasks.
The use of the attention mechanism and feed-forward network (FFN) demands excessive computational and memory resources.
This paper proposes a hardware-friendly variant that adopts a unified butterfly sparsity pattern to approximate both the attention mechanism and the FFNs.
arXiv Detail & Related papers (2022-09-20T09:28:26Z) - BEANNA: A Binary-Enabled Architecture for Neural Network Acceleration [0.0]
- BEANNA: A Binary-Enabled Architecture for Neural Network Acceleration [0.0]
This paper proposes and evaluates a neural network hardware accelerator capable of processing both floating point and binary network layers.
Running at a clock speed of 100 MHz, BEANNA achieves a peak throughput of 52.8 GigaOps/second.
arXiv Detail & Related papers (2021-08-04T23:17:34Z) - Efficient and Generic 1D Dilated Convolution Layer for Deep Learning [52.899995651639436]
- Efficient and Generic 1D Dilated Convolution Layer for Deep Learning [52.899995651639436]
We introduce our efficient implementation of a generic 1D convolution layer covering a wide range of parameters.
It is optimized for x86 CPU architectures, in particular, for architectures containing Intel AVX-512 and AVX-512 BFloat16 instructions.
We demonstrate the performance of our optimized 1D convolution layer by utilizing it in the end-to-end neural network training with real genomics datasets.
arXiv Detail & Related papers (2021-04-16T09:54:30Z) - AdderNet and its Minimalist Hardware Design for Energy-Efficient
- AdderNet and its Minimalist Hardware Design for Energy-Efficient Artificial Intelligence [111.09105910265154]
We present a novel minimalist hardware architecture using the adder convolutional neural network (AdderNet).
In practice, the whole AdderNet achieves a 16% speed improvement.
We conclude that AdderNet surpasses all other competitors.
arXiv Detail & Related papers (2021-01-25T11:31:52Z) - Sound Event Detection with Binary Neural Networks on Tightly
- Sound Event Detection with Binary Neural Networks on Tightly Power-Constrained IoT Devices [20.349809458335532]
Sound event detection (SED) is a hot topic in consumer and smart city applications.
Existing approaches based on Deep Neural Networks are very effective, but highly demanding in terms of memory, power, and throughput.
In this paper, we explore the combination of extreme quantization to a small-footprint binary neural network (BNN) with the highly energy-efficient, RISC-V-based (8+1)-core GAP8 microcontroller.
arXiv Detail & Related papers (2021-01-12T12:38:23Z) - EdgeBERT: Sentence-Level Energy Optimizations for Latency-Aware
- EdgeBERT: Sentence-Level Energy Optimizations for Latency-Aware Multi-Task NLP Inference [82.1584439276834]
Transformer-based language models such as BERT provide significant accuracy improvement for a multitude of natural language processing (NLP) tasks.
We present EdgeBERT, an in-depth algorithm-hardware co-design for latency-aware energy optimization for multi-task NLP.
arXiv Detail & Related papers (2020-11-28T19:21:47Z) - RNNAccel: A Fusion Recurrent Neural Network Accelerator for Edge
- RNNAccel: A Fusion Recurrent Neural Network Accelerator for Edge Intelligence [2.055204980188575]
We present an RNN deep learning accelerator, called RNNAccel.
It supports Long Short-Term Memory (LSTM) networks, Gated Recurrent Unit (GRU) networks, and Fully Connected (FC)/Multi-Layer Perceptron (MLP) layers.
The 32-MAC RNN accelerator achieves 90% MAC utilization, 1.27 TOPS/W in a 40 nm process, an 8x compression ratio, and 90% inference accuracy.
arXiv Detail & Related papers (2020-10-26T03:36:36Z) - DORY: Automatic End-to-End Deployment of Real-World DNNs on Low-Cost IoT
- DORY: Automatic End-to-End Deployment of Real-World DNNs on Low-Cost IoT MCUs [6.403349961091506]
Low-cost MCU-based end-nodes have limited on-chip memory and often replace caches with scratchpads.
DORY is an automatic tool for deploying DNNs on low-cost MCUs with typically less than 1 MB of on-chip memory.
arXiv Detail & Related papers (2020-08-17T07:30:54Z) - Q-EEGNet: an Energy-Efficient 8-bit Quantized Parallel EEGNet
- Q-EEGNet: an Energy-Efficient 8-bit Quantized Parallel EEGNet Implementation for Edge Motor-Imagery Brain-Machine Interfaces [16.381467082472515]
Motor-Imagery Brain-Machine Interfaces (MI-BMIs) promise direct and accessible communication between human brains and machines.
Deep learning models have emerged for classifying EEG signals.
These models often exceed the limitations of edge devices due to their memory and computational requirements.
arXiv Detail & Related papers (2020-04-24T12:29:03Z)