DeltaKWS: A 65nm 36nJ/Decision Bio-inspired Temporal-Sparsity-Aware Digital Keyword Spotting IC with 0.6V Near-Threshold SRAM
- URL: http://arxiv.org/abs/2405.03905v2
- Date: Tue, 26 Nov 2024 15:37:57 GMT
- Title: DeltaKWS: A 65nm 36nJ/Decision Bio-inspired Temporal-Sparsity-Aware Digital Keyword Spotting IC with 0.6V Near-Threshold SRAM
- Authors: Qinyu Chen, Kwantae Kim, Chang Gao, Sheng Zhou, Taekwang Jang, Tobi Delbruck, Shih-Chii Liu
- Abstract summary: This paper introduces the first $\Delta$RNN-enabled fine-grained temporal-sparsity-aware KWS IC for voice-controlled devices.
At 87% temporal sparsity, computing latency and energy/inference are reduced by 2.4X/3.4X, respectively.
- Score: 16.1102923955667
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces DeltaKWS, to the best of our knowledge, the first $\Delta$RNN-enabled fine-grained temporal-sparsity-aware KWS IC for voice-controlled devices. The 65 nm prototype chip features a number of techniques to enhance performance, area, and power efficiency, specifically: 1) a bio-inspired delta-gated recurrent neural network ($\Delta$RNN) classifier that leverages temporal similarities between neighboring feature vectors extracted from input frames and network hidden states, eliminating unnecessary operations and memory accesses; 2) an IIR band-pass-filter (BPF)-based feature extractor (FEx) that leverages mixed-precision quantization, a low-cost computing structure, and channel selection; 3) a 24 kB 0.6 V near-$V_\text{TH}$ weight SRAM that achieves 6.6X lower read power than the foundry-provided SRAM. From chip measurement results, we show that DeltaKWS achieves an 11/12-class GSCD accuracy of 90.5%/89.5%, respectively, and an energy consumption of 36 nJ/decision in a 65 nm CMOS process. At 87% temporal sparsity, computing latency and energy/inference are reduced by 2.4X/3.4X, respectively. The IIR BPF-based FEx, $\Delta$RNN accelerator, and 24 kB near-$V_\text{TH}$ SRAM blocks occupy 0.084 mm$^{2}$, 0.319 mm$^{2}$, and 0.381 mm$^{2}$, respectively (0.78 mm$^{2}$ in total).
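The delta-gating principle behind these numbers is compact: an input or hidden-state element triggers its multiply-accumulates only when it has changed by more than a threshold since it last fired; otherwise both the MACs and the corresponding weight-SRAM reads are skipped. The sketch below is a minimal NumPy rendering of one delta-gated matrix-vector step; the function name, threshold value, and dense boolean mask are illustrative, not the chip's datapath.

```python
import numpy as np

def delta_matvec(W, x, x_prev, y_prev, theta=0.1):
    """One delta-gated step: y = y_prev + W @ dx, where dx keeps only
    input elements that moved by at least theta since their last update."""
    dx = x - x_prev
    active = np.abs(dx) >= theta            # temporal-sparsity mask
    # Only the columns of W for active inputs contribute; in hardware,
    # only those weight columns are read from SRAM.
    y = y_prev + W[:, active] @ dx[active]
    # The reference state advances only where the change propagated, so
    # sub-threshold drift accumulates until it finally crosses theta.
    x_ref = np.where(active, x, x_prev)
    return y, x_ref
```

At the reported 87% temporal sparsity, only about 13% of the columns are touched per frame, which is where the 2.4X/3.4X latency/energy reductions come from.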
Related papers
- A 71.2-$\mu$W Speech Recognition Accelerator with Recurrent Spiking Neural Network [0.0502254944841629]
We propose a compact recurrent spiking neural network with two recurrent layers, one fully connected layer, and a low time step.
The 2.79-MB model undergoes pruning and 4-bit fixed-point quantization, shrinking it by 96.42% to 0.1 MB.
The design operates in real time at 100 kHz, consuming 71.2 $\mu$W, surpassing state-of-the-art designs.
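As a sanity check on those figures: if the 2.79 MB baseline is 32-bit, 4-bit quantization alone is an 8X cut (to roughly 0.35 MB), so pruning must remove about 71% of the remaining weights to land at 0.1 MB. Below is a minimal 4-bit fixed-point quantizer sketch; the Q1.3 format is an assumption, since the abstract does not specify the exact scheme.

```python
import numpy as np

def quantize_int4(w, frac_bits=3):
    """Round weights to the nearest signed 4-bit fixed-point value
    (Q1.3 here: 1 sign bit, 3 fractional bits), clipped to [-8, 7]."""
    step = 2.0 ** -frac_bits
    q = np.clip(np.round(w / step), -8, 7)   # stored 4-bit codes
    return q * step                          # dequantized weights
```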
arXiv Detail & Related papers (2025-03-27T10:14:00Z)
- IMAGINE: An 8-to-1b 22nm FD-SOI Compute-In-Memory CNN Accelerator With an End-to-End Analog Charge-Based 0.15-8POPS/W Macro Featuring Distribution-Aware Data Reshaping [0.6071203743728119]
We present IMAGINE, a workload-adaptive 1-to-8b CIM-CNN accelerator in 22nm FD-SOI.
It introduces a 1152x256 end-to-end charge-based macro with a multi-bit dot product (DP) based on input-serial, weight-parallel accumulation that avoids power-hungry DACs.
Measurement results showcase an 8b system-level energy efficiency of 40TOPS/W at 0.3/0.6V, with competitive accuracies on MNIST and CIFAR-10.
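The accumulation scheme is easy to picture in the digital domain, even though IMAGINE realizes it with analog charge: inputs are streamed one bit-plane per cycle against all weights in parallel, and bit significance is folded in with shifts, which is why no per-input DAC is required. A behavioral sketch, assuming unsigned integer inputs and weights (names are illustrative):

```python
import numpy as np

def input_serial_dot(w, x, n_bits=8):
    """Input-serial, weight-parallel dot product: one input bit-plane
    per cycle meets all weights at once; shifts restore significance."""
    x = x.astype(np.int64)
    w = w.astype(np.int64)
    acc = np.int64(0)
    for b in range(n_bits):                  # LSB-first bit planes
        bit_plane = (x >> b) & 1             # one bit of every input
        acc += (bit_plane @ w) << b          # 1b x multi-bit MAC + shift
    return acc
```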
arXiv Detail & Related papers (2024-12-27T17:18:15Z)
- A Heterogeneous RISC-V based SoC for Secure Nano-UAV Navigation [40.8381466360025]
Nano-UAVs face significant power and payload constraints while requiring advanced computing capabilities.
We present Shaheen, a 9 mm$^{2}$, 200 mW system-on-a-chip (SoC).
It integrates a Linux-capable RV64 core, compliant with the v1.0 ratified Hypervisor extension, along with a low-cost and low-power memory controller.
At the same time, it integrates a fully programmable energy- and area-efficient multi-core cluster of RV32 cores optimized for general-purpose DSP.
arXiv Detail & Related papers (2024-01-07T16:03:47Z)
- Quantized Neural Networks for Low-Precision Accumulation with Guaranteed Overflow Avoidance [68.8204255655161]
We introduce a quantization-aware training algorithm that guarantees avoiding numerical overflow when reducing the precision of accumulators during inference.
We evaluate our algorithm across multiple quantized models that we train for different tasks, showing that our approach can reduce the precision of accumulators while maintaining model accuracy with respect to a floating-point baseline.
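The shape of the guarantee can be reconstructed from worst cases: with unsigned $b$-bit activations, a dot product's magnitude never exceeds $\lVert w \rVert_1 (2^b - 1)$, so bounding the weights' $\ell_1$ norm during training bounds what the accumulator must hold. The calculator below sketches that reasoning; it illustrates the principle rather than the paper's exact formulation.

```python
import math

def min_accumulator_bits(weight_l1, input_bits):
    """Smallest signed accumulator width that cannot overflow when
    unsigned `input_bits`-wide activations meet weights with
    l1-norm <= weight_l1: worst case |sum| <= l1 * (2**b - 1)."""
    worst = weight_l1 * (2 ** input_bits - 1)
    return math.ceil(math.log2(worst + 1)) + 1   # +1 for the sign bit

# Read the other way: capping ||w||_1 at (2**(P-1) - 1) / (2**b - 1)
# during training guarantees a P-bit accumulator never overflows.
```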
arXiv Detail & Related papers (2023-01-31T02:46:57Z)
- RAMP: A Flat Nanosecond Optical Network and MPI Operations for Distributed Deep Learning Systems [68.8204255655161]
We introduce a near-exascale, full-bisection bandwidth, all-to-all, single-hop, all-optical network architecture with nanosecond reconfiguration called RAMP.
RAMP supports large-scale distributed and parallel computing systems (12.8 Tbps per node for up to 65,536 nodes, i.e., roughly 0.84 Eb/s in aggregate).
arXiv Detail & Related papers (2022-11-28T11:24:51Z)
- Single-Shot Optical Neural Network [55.41644538483948]
'Weight-stationary' analog optical and electronic hardware has been proposed to reduce the compute resources required by deep neural networks.
We present a scalable, single-shot-per-layer weight-stationary optical processor.
arXiv Detail & Related papers (2022-05-18T17:49:49Z)
- Vau da muntanialas: Energy-efficient multi-die scalable acceleration of RNN inference [18.50014427283814]
We present Muntaniala, an RNN accelerator architecture for LSTM inference with a silicon-measured energy efficiency of 3.25 TOP/s/W.
The scalable design of Muntaniala allows running large RNN models by combining multiple tiles in a systolic array.
We show a phoneme error rate (PER) drop of approximately 3% with respect to floating-point (FP) on a 3L-384NH-123NI LSTM network.
arXiv Detail & Related papers (2022-02-14T09:21:16Z)
- Vega: A 10-Core SoC for IoT End-Nodes with DNN Acceleration and Cognitive Wake-Up From MRAM-Based State-Retentive Sleep Mode [14.214500730272256]
Vega is an IoT end-node system capable of scaling from a 1.7 $\mu$W fully retentive cognitive sleep mode up to 32.2 GOPS (@ 49.4 mW) peak on NSAAs.
Vega achieves SoA-leading efficiency of 615 GOPS/W on 8-bit INT and 79 and 129 GFLOPS/W on 32- and 16-bit FP.
arXiv Detail & Related papers (2021-10-18T08:47:45Z)
- CAP-RAM: A Charge-Domain In-Memory Computing 6T-SRAM for Accurate and Precision-Programmable CNN Inference [27.376343943107788]
CAP-RAM is a compact, accurate, and bitwidth-programmable in-memory computing (IMC) static random-access memory (SRAM) macro.
It is presented for energy-efficient convolutional neural network (CNN) inference.
A 65-nm prototype validates the excellent linearity and computing accuracy of CAP-RAM.
arXiv Detail & Related papers (2021-07-06T04:59:16Z)
- FastFlowNet: A Lightweight Network for Fast Optical Flow Estimation [81.76975488010213]
Dense optical flow estimation plays a key role in many robotic vision tasks.
Current networks often carry a large number of parameters and incur heavy computation costs.
Our proposed FastFlowNet works in the well-known coarse-to-fine manner with the following innovations.
arXiv Detail & Related papers (2021-03-08T03:09:37Z)
- Sound Event Detection with Binary Neural Networks on Tightly Power-Constrained IoT Devices [20.349809458335532]
Sound event detection (SED) is a hot topic in consumer and smart city applications.
Existing approaches based on Deep Neural Networks are very effective, but highly demanding in terms of memory, power, and throughput.
In this paper, we explore the combination of extreme quantization to a small-print binary neural network (BNN) with the highly energy-efficient, RISC-V-based (8+1)-core GAP8 microcontroller.
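With weights and activations constrained to {-1, +1}, a dot product collapses to an XNOR followed by a popcount, which is what makes BNNs such a good fit for power-constrained integer hardware. A minimal sketch (the bit packing and the 1 -> +1 / 0 -> -1 encoding are illustrative):

```python
def binary_dot(w_bits, x_bits, n):
    """XNOR-popcount dot product over n values in {-1, +1}, packed one
    per bit (1 -> +1, 0 -> -1): result = 2 * popcount(~(w ^ x)) - n."""
    matches = bin(~(w_bits ^ x_bits) & ((1 << n) - 1)).count("1")
    return 2 * matches - n

# w = [+1, -1, +1, -1] -> 0b1010, x = [+1, +1, -1, -1] -> 0b1100
print(binary_dot(0b1010, 0b1100, 4))  # -> 0, matching the real dot product
```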
arXiv Detail & Related papers (2021-01-12T12:38:23Z)
- Always-On 674uW @ 4GOP/s Error Resilient Binary Neural Networks with Aggressive SRAM Voltage Scaling on a 22nm IoT End-Node [15.974669646920331]
Binary Neural Networks (BNNs) have been shown to be robust to random bit-level noise, making aggressive voltage scaling attractive.
We introduce the first fully programmable IoT end-node system-on-chip capable of executing hardware-accelerated BNNs at ultra-low voltage.
Our prototype performs 4 Gop/s (15.4 Inference/s on the CIFAR-10 dataset) by computing up to 13 ops per pJ, achieving 22.8 Inference/s/mW while keeping within a peak power envelope of 674 $\mu$W.
arXiv Detail & Related papers (2020-07-17T12:56:58Z)
- SmartExchange: Trading Higher-cost Memory Storage/Access for Lower-cost Computation [97.78417228445883]
We present SmartExchange, an algorithm-hardware co-design framework for energy-efficient inference of deep neural networks (DNNs).
We develop a novel algorithm to enforce a specially favorable DNN weight structure, where each layerwise weight matrix can be stored as the product of a small basis matrix and a large sparse coefficient matrix whose non-zero elements are all power-of-2.
We further design a dedicated accelerator to fully utilize the SmartExchange-enforced weights to improve both energy efficiency and latency performance.
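Because every non-zero coefficient is a signed power of two, rebuilding a weight tile from the stored factors takes only shifts and adds, trading cheap on-chip computation for costly off-chip weight traffic. A toy reconstruction sketch follows; the zero encoding and matrix shapes are illustrative assumptions, not the paper's storage format.

```python
import numpy as np

def rebuild_weights(basis, coeff_exp, coeff_sign):
    """Rebuild a layer's weights as (sparse power-of-2 coefficients) @
    (small basis). A negative exponent marks a pruned (zero) entry in
    this toy encoding; real hardware would shift-and-add instead."""
    coeff = np.where(coeff_exp >= 0,
                     coeff_sign * np.exp2(coeff_exp), 0.0)
    return coeff @ basis   # (out, k) @ (k, in) -> dense weight matrix
```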
arXiv Detail & Related papers (2020-05-07T12:12:49Z)
- Improving Efficiency in Large-Scale Decentralized Distributed Training [58.80224380923698]
We propose techniques to accelerate (A)D-PSGD based training by improving the spectral gap while minimizing the communication cost.
We demonstrate the effectiveness of our proposed techniques by running experiments on the 2000-hour Switchboard speech recognition task and the ImageNet computer vision task.
arXiv Detail & Related papers (2020-02-04T04:29:09Z)