Lightweight DNN for Full-Band Speech Denoising on Mobile Devices: Exploiting Long and Short Temporal Patterns
- URL: http://arxiv.org/abs/2509.05079v1
- Date: Fri, 05 Sep 2025 13:18:25 GMT
- Title: Lightweight DNN for Full-Band Speech Denoising on Mobile Devices: Exploiting Long and Short Temporal Patterns
- Authors: Konstantinos Drossos, Mikko Heikkinen, Paschalis Tsiaflakis,
- Abstract summary: We present a causal, low latency, and lightweight deep neural network (DNN)-based method for full-band speech denoising. The method is based on a modified UNet architecture employing look-back frames, temporal spanning of convolutional kernels, and recurrent neural networks. The proposed method is evaluated using established speech denoising metrics and publicly available datasets.
- Score: 4.121578819979242
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech denoising (SD) is an important task in many, if not all, modern signal processing chains used in devices and in everyday-life applications. While there are many published and powerful deep neural network (DNN)-based methods for SD, few are optimized for resource-constrained platforms such as mobile devices. Additionally, most DNN-based methods for SD do not focus on full-band (FB) signals, i.e., signals with a 48 kHz sampling rate, and/or on low-latency cases. In this paper we present a causal, low-latency, and lightweight DNN-based method for full-band SD, leveraging both short and long temporal patterns. The method is based on a modified UNet architecture employing look-back frames, temporal spanning of convolutional kernels, and recurrent neural networks for exploiting short and long temporal patterns in the signal and the estimated denoising mask. The DNN operates on a causal, frame-by-frame basis taking the STFT magnitude as input, utilizes inverted bottlenecks inspired by MobileNet, employs causal instance normalization for channel-wise normalization, and achieves a real-time factor below 0.02 when deployed on a modern mobile phone. The proposed method is evaluated using established speech denoising metrics and publicly available datasets, demonstrating its effectiveness in achieving an (SI-)SDR value that outperforms existing FB and low-latency SD methods.
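The abstract's "causal instance normalization for channel-wise normalization" can be illustrated with a minimal numpy sketch. This is one plausible reading of the idea (per-channel statistics accumulated only over past and current frames); the paper's exact formulation may differ, and the function name is illustrative.

```python
import numpy as np

def causal_instance_norm(x, eps=1e-5):
    """Channel-wise normalization using only past and current frames.

    x has shape (channels, frames). For frame t, each channel is normalized
    with the mean and variance accumulated over frames 0..t, so no future
    information leaks in -- keeping the frame-by-frame processing causal.
    """
    C, T = x.shape
    out = np.empty((C, T))
    run_sum = np.zeros(C)   # running sum per channel
    run_sq = np.zeros(C)    # running sum of squares per channel
    for t in range(T):
        run_sum += x[:, t]
        run_sq += x[:, t] ** 2
        n = t + 1
        mean = run_sum / n
        var = np.maximum(run_sq / n - mean ** 2, 0.0)
        out[:, t] = (x[:, t] - mean) / np.sqrt(var + eps)
    return out
```

At the final frame the statistics coincide with ordinary (full-signal) instance normalization, which is a convenient sanity check for such an implementation.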
Related papers
- Neural-HAR: A Dimension-Gated CNN Accelerator for Real-Time Radar Human Activity Recognition [5.400353553418959]
We introduce a dimension-gated CNN accelerator tailored for real-time radar HAR on resource-constrained platforms. GateCNN attains 86.4% accuracy with only 2.7k parameters and 0.28M FLOPs per inference, comparable to CNN-BiGRU at a fraction of the complexity. Our FPGA prototype on the Xilinx Zynq-7000 Z-7007S reaches 107.5 µs latency and 15 mW dynamic power using LUT-based ROM and distributed RAM only.
arXiv Detail & Related papers (2025-10-26T17:42:28Z) - WavInWav: Time-domain Speech Hiding via Invertible Neural Network [78.85443308774484]
Previous audio hiding methods often result in unsatisfactory quality when recovering secret audio. We use a flow-based invertible neural network to establish a direct link between stego audio, cover audio, and secret audio. We also add an encryption technique to protect the hidden data from unauthorized access.
arXiv Detail & Related papers (2025-10-03T11:36:16Z) - Low-power SNN-based audio source localisation using a Hilbert Transform spike encoding scheme [4.49657690895714]
Sound source localisation is used in many consumer devices to isolate audio from individual speakers and reject noise. Dense band-pass filters are often needed to obtain narrowband signal components from wideband audio. We demonstrate a novel method for sound source localisation on arbitrary microphone arrays, designed for efficient implementation in ultra-low-power spiking neural networks (SNNs). Our approach achieves state-of-the-art accuracy for SNN methods, comparable with traditional non-SNN super-resolution beamforming.
arXiv Detail & Related papers (2024-02-19T00:21:13Z) - Short-Term Memory Convolutions [0.0]
We propose a novel method for minimizing inference-time latency and memory consumption, called Short-Term Memory Convolution (STMC).
The training of STMC-based models is faster and more stable, as the method is based solely on convolutional neural networks (CNNs).
In the case of speech separation, we achieved a 5-fold reduction in inference time and a 2-fold reduction in latency without affecting the output quality.
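The latency reduction described above can be sketched with a buffer-caching streaming convolution: the last (kernel_size - 1) input samples are kept between calls so each new frame is convolved without recomputing past context. This is the general idea behind such streaming convolutions; the exact STMC mechanism in the paper may differ, and the class and function names here are illustrative.

```python
import numpy as np

def conv1d_causal(x, k):
    """Offline causal convolution: y[t] = sum_i k[i] * x[t - i]."""
    pad = len(k) - 1
    xp = np.concatenate([np.zeros(pad), x])
    return np.array([np.dot(xp[t:t + len(k)], k[::-1]) for t in range(len(x))])

class StreamingConv1d:
    """Frame-by-frame causal convolution with a cached context buffer."""

    def __init__(self, kernel):
        self.kernel = np.asarray(kernel, dtype=float)
        # short-term memory: the last (K - 1) input samples, initially zeros
        self.buffer = np.zeros(len(self.kernel) - 1)

    def step(self, frame):
        """Convolve one incoming frame, reusing the cached past samples."""
        xp = np.concatenate([self.buffer, np.asarray(frame, dtype=float)])
        K = len(self.kernel)
        out = np.array([np.dot(xp[t:t + K], self.kernel[::-1])
                        for t in range(len(frame))])
        # keep only the most recent K - 1 samples for the next call
        self.buffer = xp[len(xp) - (K - 1):]
        return out
```

Streaming the signal frame by frame through `StreamingConv1d` reproduces the offline causal convolution exactly, which is why caching trades no output quality for the latency savings.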
arXiv Detail & Related papers (2023-02-08T20:52:24Z) - Lightweight network towards real-time image denoising on mobile devices [26.130379174715742]
Deep convolutional neural networks have achieved great progress in image denoising tasks.
Their complicated architectures and heavy computational cost hinder their deployments on mobile devices.
We propose a mobile-friendly denoising network, namely MFDNet.
arXiv Detail & Related papers (2022-11-09T05:19:26Z) - Braille Letter Reading: A Benchmark for Spatio-Temporal Pattern Recognition on Neuromorphic Hardware [50.380319968947035]
Recent deep learning approaches have reached high accuracy in such tasks, but their implementation on conventional embedded solutions remains computationally and energetically expensive.
We propose a new benchmark for computing tactile pattern recognition at the edge through letters reading.
We trained and compared feed-forward and recurrent spiking neural networks (SNNs) offline using back-propagation through time with surrogate gradients, and then deployed them on the Intel Loihi neuromorphic chip for efficient inference.
Our results show that the LSTM outperforms the recurrent SNN in terms of accuracy by 14%. However, the recurrent SNN on Loihi is 237 times more energy efficient.
arXiv Detail & Related papers (2022-05-30T14:30:45Z) - Joint Superposition Coding and Training for Federated Learning over Multi-Width Neural Networks [52.93232352968347]
This paper aims to integrate two synergetic technologies: federated learning (FL) and width-adjustable slimmable neural networks (SNNs).
FL preserves data privacy by exchanging the locally trained models of mobile devices. Training SNNs is, however, non-trivial, particularly under wireless connections with time-varying channel conditions.
We propose a communication- and energy-efficient SNN-based FL framework (named SlimFL) that jointly utilizes superposition coding (SC) for global model aggregation and superposition training (ST) for updating local models.
arXiv Detail & Related papers (2021-12-05T11:17:17Z) - Two-Timescale End-to-End Learning for Channel Acquisition and Hybrid Precoding [94.40747235081466]
We propose an end-to-end deep learning-based joint transceiver design algorithm for millimeter wave (mmWave) massive multiple-input multiple-output (MIMO) systems.
We develop a DNN architecture that maps the received pilots into feedback bits at the receiver, and then further maps the feedback bits into the hybrid precoder at the transmitter.
arXiv Detail & Related papers (2021-10-22T20:49:02Z) - Deep Networks for Direction-of-Arrival Estimation in Low SNR [89.45026632977456]
We introduce a Convolutional Neural Network (CNN) that is trained from multi-channel data of the true array manifold matrix.
We train a CNN in the low-SNR regime to predict DoAs across all SNRs.
Our robust solution can be applied in several fields, ranging from wireless array sensors to acoustic microphones or sonars.
arXiv Detail & Related papers (2020-11-17T12:52:18Z) - Alignment Restricted Streaming Recurrent Neural Network Transducer [29.218353627837214]
We propose a modification to the RNN-T loss function and develop Alignment Restricted RNN-T (Ar-RNN-T) models.
The Ar-RNN-T loss provides refined control to navigate the trade-offs between token emission delays and the Word Error Rate (WER).
The Ar-RNN-T models also improve downstream applications such as the ASR End-pointing by guaranteeing token emissions within any given range of latency.
arXiv Detail & Related papers (2020-11-05T19:38:54Z) - Multi-Tones' Phase Coding (MTPC) of Interaural Time Difference by Spiking Neural Network [68.43026108936029]
We propose a pure spiking neural network (SNN) based computational model for precise sound localization in the noisy real-world environment.
We implement this algorithm in a real-time robotic system with a microphone array.
The experiment results show a mean error azimuth of 13 degrees, which surpasses the accuracy of the other biologically plausible neuromorphic approach for sound source localization.
arXiv Detail & Related papers (2020-07-07T08:22:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.