A Low-Power Streaming Speech Enhancement Accelerator For Edge Devices
- URL: http://arxiv.org/abs/2503.21335v1
- Date: Thu, 27 Mar 2025 10:13:41 GMT
- Title: A Low-Power Streaming Speech Enhancement Accelerator For Edge Devices
- Authors: Ci-Hao Wu, Tian-Sheuan Chang
- Abstract summary: Transformer-based speech enhancement models yield impressive results, but their structure restricts model compression potential. This paper proposes a low-power streaming speech enhancement accelerator through model and hardware optimization. The proposed high-performance model is optimized for hardware execution with the co-design of model compression and the target application.
- Score: 0.0502254944841629
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Transformer-based speech enhancement models yield impressive results. However, their heterogeneous and complex structure restricts model compression potential, resulting in greater complexity and reduced hardware efficiency. Additionally, these models are not tailored for streaming and low-power applications. Addressing these challenges, this paper proposes a low-power streaming speech enhancement accelerator through model and hardware optimization. The proposed high-performance model is optimized for hardware execution by co-designing model compression with the target application, reducing model size by 93.9% through the proposed domain-aware and streaming-aware pruning techniques. The required latency is further reduced with batch-normalization-based transformers. Additionally, we employ softmax-free attention, complemented by an extra batch normalization, which enables a simpler hardware design. The tailored hardware accommodates these diverse computing patterns by breaking them down into element-wise multiply-and-accumulate (MAC) operations, executed on a 1-D processing array with configurable SRAM addressing, thereby minimizing hardware complexity and simplifying zero skipping. Implemented in a TSMC 40 nm CMOS process, the final design requires only 207.8K gates and 53.75 KB of SRAM, and consumes 8.08 mW for real-time inference at 62.5 MHz.
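To make the attention change concrete, below is a minimal PyTorch-style sketch of softmax-free attention followed by an extra batch normalization, as described in the abstract. The module name, single-head layout, 1/sqrt(d) scaling, and the exact placement of the BatchNorm layer are illustrative assumptions; the paper's streaming buffering, pruning, and quantization details are not reproduced here.

```python
import torch
import torch.nn as nn

class SoftmaxFreeAttention(nn.Module):
    """Hypothetical sketch of softmax-free single-head attention.

    Assumptions (not taken from the paper): the softmax over the score
    matrix is removed entirely, scores are only scaled by 1/sqrt(d), and
    a BatchNorm1d layer over the feature dimension stands in for the
    normalization that softmax would otherwise provide.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5
        # Extra batch normalization applied to the attention output.
        self.bn = nn.BatchNorm1d(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) -- e.g. a block of spectral frames.
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        scores = torch.matmul(q, k.transpose(-2, -1)) * self.scale  # no softmax
        out = torch.matmul(scores, v)                               # (B, T, D)
        # BatchNorm1d expects (batch, channels, time).
        return self.bn(out.transpose(1, 2)).transpose(1, 2)


if __name__ == "__main__":
    attn = SoftmaxFreeAttention(dim=64)
    frames = torch.randn(8, 100, 64)   # 8 utterances, 100 frames, 64 features
    print(attn(frames).shape)          # torch.Size([8, 100, 64])
```

At inference time the batch-norm statistics are frozen, so the extra normalization reduces to a per-channel scale and offset, which is much cheaper in hardware than the exponentiation and division a softmax unit would require.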
Related papers
- An ultra-low-power CGRA for accelerating Transformers at the edge [1.52292571922932]
This paper introduces an ultra-low-power, Coarse-Grained Reconfigurable Array (CGRA) architecture to accelerate General Matrix Multiplication (GEMM) operations in transformer models. The proposed architecture integrates a 4 x 4 array of Processing Elements (PEs) for efficient parallel computation and dedicated 4 x 2 Memory Operation Blocks (MOBs) for optimized LOAD/STORE operations. A switchless mesh torus interconnect network further minimizes power and latency by enabling direct communication between PEs and MOBs.
arXiv Detail & Related papers (2025-07-17T08:43:14Z)
- Optimising TinyML with Quantization and Distillation of Transformer and Mamba Models for Indoor Localisation on Edge Devices [7.229732269884237]
This paper proposes small and efficient machine learning models (TinyML) for resource-constrained edge devices.
The work focuses on model compression techniques, including quantization and knowledge distillation, to significantly reduce the model size.
The application of these TinyML models in healthcare has the potential to revolutionize patient monitoring.
arXiv Detail & Related papers (2024-12-12T13:59:21Z)
- MoDeGPT: Modular Decomposition for Large Language Model Compression [59.361006801465344]
This paper introduces Modular Decomposition (MoDeGPT), a novel structured compression framework. MoDeGPT partitions the Transformer block into modules comprised of matrix pairs and reduces the hidden dimensions. Our experiments show that MoDeGPT, without backward propagation, matches or surpasses previous structured compression methods.
arXiv Detail & Related papers (2024-08-19T01:30:14Z)
- Image Compression for Machine and Human Vision with Spatial-Frequency Adaptation [61.22401987355781]
Image compression for machine and human vision (ICMH) has gained increasing attention in recent years.
Existing ICMH methods are limited by high training and storage overheads due to heavy design of task-specific networks.
We develop a novel lightweight adapter-based tuning framework for ICMH, named Adapt-ICMH.
arXiv Detail & Related papers (2024-07-13T11:22:41Z)
- ITA: An Energy-Efficient Attention and Softmax Accelerator for Quantized Transformers [13.177523799771635]
Transformer networks have emerged as the state-of-the-art approach for natural language processing tasks.
The efficient hardware acceleration of transformer models poses new challenges due to their high arithmetic intensities, large memory requirements, and complex dataflow dependencies.
We propose ITA, a novel accelerator architecture for transformers and related models that targets efficient inference on embedded systems.
arXiv Detail & Related papers (2023-07-07T10:05:38Z)
- Compress, Then Prompt: Improving Accuracy-Efficiency Trade-off of LLM Inference with Transferable Prompt [96.24800696597707]
We introduce a new perspective to optimize this trade-off by prompting compressed models.
We propose a soft prompt learning method where we expose the compressed model to the prompt learning process.
Our experimental analysis suggests our soft prompt strategy greatly improves the performance of the 8x compressed LLaMA-7B model.
arXiv Detail & Related papers (2023-05-17T20:45:13Z)
- DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup table based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
arXiv Detail & Related papers (2023-04-18T15:13:10Z)
- TransCODE: Co-design of Transformers and Accelerators for Efficient Training and Inference [6.0093441900032465]
We propose a framework that simulates transformer inference and training on a design space of accelerators.
We use this simulator in conjunction with the proposed co-design technique, called TransCODE, to obtain the best-performing models.
The obtained transformer-accelerator pair achieves 0.3% higher accuracy than the state-of-the-art pair.
arXiv Detail & Related papers (2023-03-27T02:45:18Z)
- Data-Model-Circuit Tri-Design for Ultra-Light Video Intelligence on Edge Devices [90.30316433184414]
We propose a data-model-hardware tri-design framework for high-throughput, low-cost, and high-accuracy MOT on HD video streams.
Compared to the state-of-the-art MOT baseline, our tri-design approach can achieve 12.5x latency reduction, 20.9x effective frame rate improvement, 5.83x lower power, and 9.78x better energy efficiency, without much accuracy drop.
arXiv Detail & Related papers (2022-10-16T16:21:40Z)
- An Algorithm-Hardware Co-Optimized Framework for Accelerating N:M Sparse Transformers [11.811907838840712]
We propose an algorithm-hardware co-optimized framework to flexibly and efficiently accelerate Transformers by utilizing general N:M sparsity patterns.
We present a flexible and efficient hardware architecture, namely STA, to achieve significant speedup when deploying N:M sparse Transformers.
Experimental results show that, compared to other methods, N:M sparse Transformers generated using IDP achieve an average accuracy improvement of 6.7% with high training efficiency.
arXiv Detail & Related papers (2022-08-12T04:51:49Z)
- Hardware-Robust In-RRAM-Computing for Object Detection [0.15113576014047125]
In-RRAM computing (IRC) suffers from large device variation and numerous nonideal effects in hardware.
This paper proposes a joint hardware and software optimization strategy to design a hardware-robust IRC macro for object detection.
The proposed approach has been successfully applied to a complex object detection task with only 3.85% mAP drop.
arXiv Detail & Related papers (2022-05-09T01:46:24Z)
- A TinyML Platform for On-Device Continual Learning with Quantized Latent Replays [66.62377866022221]
Latent Replay-based Continual Learning (CL) techniques enable online, serverless adaptation in principle.
We introduce a HW/SW platform for end-to-end CL based on a 10-core FP32-enabled parallel ultra-low-power processor.
Our results show that by combining these techniques, continual learning can be achieved in practice using less than 64 MB of memory (a toy sketch of the quantized-replay idea follows this entry).
arXiv Detail & Related papers (2021-10-20T11:01:23Z)
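As a rough illustration of the quantized latent replay idea in the entry above, the sketch below stores activations from a frozen feature extractor as 8-bit codes and dequantizes them when they are replayed. The buffer class, per-sample affine quantization, and uint8 format are assumptions for illustration, not details taken from the paper or its platform.

```python
import torch

class LatentReplayBuffer:
    """Hypothetical sketch: store frozen-frontend activations ("latents")
    as uint8 codes so the replay memory stays small."""

    def __init__(self, capacity: int, latent_dim: int):
        self.codes = torch.zeros(capacity, latent_dim, dtype=torch.uint8)
        self.labels = torch.zeros(capacity, dtype=torch.long)
        self.scales = torch.zeros(capacity)
        self.mins = torch.zeros(capacity)
        self.size = 0
        self.capacity = capacity

    def add(self, latents: torch.Tensor, labels: torch.Tensor) -> None:
        # Per-sample affine quantization: x ~= code * scale + min.
        for z, y in zip(latents, labels):
            i = self.size % self.capacity            # overwrite oldest slot when full
            lo, hi = z.min(), z.max()
            scale = (hi - lo).clamp_min(1e-8) / 255.0
            self.codes[i] = ((z - lo) / scale).round().to(torch.uint8)
            self.scales[i], self.mins[i], self.labels[i] = scale, lo, y
            self.size += 1

    def sample(self, n: int):
        # Dequantize a random batch of stored latents for replay.
        idx = torch.randint(0, min(self.size, self.capacity), (n,))
        z = self.codes[idx].float() * self.scales[idx, None] + self.mins[idx, None]
        return z, self.labels[idx]


if __name__ == "__main__":
    buf = LatentReplayBuffer(capacity=1000, latent_dim=128)
    buf.add(torch.randn(32, 128), torch.randint(0, 10, (32,)))
    replay_z, replay_y = buf.sample(16)    # mix these with new-task latents
    print(replay_z.shape, replay_y.shape)  # torch.Size([16, 128]) torch.Size([16])
```

In latent-replay continual learning, each training batch concatenates freshly computed latents with a sampled replay batch, so only the layers above the frozen frontend are updated.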