CleanUMamba: A Compact Mamba Network for Speech Denoising using Channel Pruning
- URL: http://arxiv.org/abs/2410.11062v1
- Date: Mon, 14 Oct 2024 20:18:03 GMT
- Title: CleanUMamba: A Compact Mamba Network for Speech Denoising using Channel Pruning
- Authors: Sjoerd Groot, Qinyu Chen, Jan C. van Gemert, Chang Gao
- Abstract summary: CleanUMamba is a time-domain neural network architecture designed for causal audio denoising directly applied to raw waveforms.
CleanUMamba achieves a PESQ score of 2.42 and STOI of 95.1% with only 442K parameters and 468M MACs, matching or outperforming larger models in real-time performance.
- Score: 13.632515996508172
- Abstract: This paper presents CleanUMamba, a time-domain neural network architecture designed for real-time causal audio denoising directly applied to raw waveforms. CleanUMamba leverages a U-Net encoder-decoder structure, incorporating the Mamba state-space model in the bottleneck layer. By replacing conventional self-attention and LSTM mechanisms with Mamba, our architecture offers superior denoising performance while maintaining a constant memory footprint, enabling streaming operation. To enhance efficiency, we applied structured channel pruning, achieving an 8X reduction in model size without compromising audio quality. Our model demonstrates strong results in the Interspeech 2020 Deep Noise Suppression challenge. Specifically, CleanUMamba achieves a PESQ score of 2.42 and STOI of 95.1% with only 442K parameters and 468M MACs, matching or outperforming larger models in real-time performance. Code will be available at: https://github.com/lab-emi/CleanUMamba
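The abstract mentions structured channel pruning but does not state the pruning criterion. As a rough illustration only, the sketch below removes whole output channels from one layer, along with the matching input columns of the next layer, using an L1-magnitude score. The criterion, the function names (`channel_importance`, `prune_channels`, `count_params`), and the toy weights are all assumptions made for this example and are not taken from the CleanUMamba code.

```python
# Hypothetical sketch of structured channel pruning on two stacked layers.
# The L1-magnitude criterion is an assumption, not the paper's stated method.

def channel_importance(weights):
    """Return the L1 norm of each output channel (one row per channel)."""
    return [sum(abs(w) for w in row) for row in weights]


def prune_channels(w1, w2, keep):
    """Keep the `keep` highest-scoring output channels of w1.

    w1: rows are the output channels of layer 1.
    w2: rows are the output channels of layer 2; its columns are the
        inputs fed by layer 1, so the matching columns are dropped too.
    """
    scores = channel_importance(w1)
    ranked = sorted(range(len(w1)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:keep])  # preserve the original channel order
    new_w1 = [w1[i] for i in kept]
    new_w2 = [[row[i] for i in kept] for row in w2]
    return new_w1, new_w2, kept


def count_params(*layers):
    """Count the weights in a sequence of row-major weight matrices."""
    return sum(len(row) for layer in layers for row in layer)


# Toy weights: layer 1 has 4 output channels, layer 2 consumes those 4.
w1 = [[0.9, -0.8], [0.01, 0.02], [0.5, 0.4], [-0.03, 0.0]]
w2 = [[1.0, 0.1, -1.0, 0.2], [0.3, 0.0, 0.7, -0.1]]

p1, p2, kept = prune_channels(w1, w2, keep=2)
print("kept channels:", kept)
print("parameters before:", count_params(w1, w2))
print("parameters after:", count_params(p1, p2))
```

Because entire channels are removed, the pruned layers are simply smaller dense matrices, which is why structured pruning reduces both parameter count and MACs without needing sparse kernels.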
Related papers
- Real-time Speech Enhancement on Raw Signals with Deep State-space Modeling [1.0650780147044159]
We present aTENNuate, a simple deep state-space autoencoder configured for efficient online raw speech enhancement.
We benchmark aTENNuate on the VoiceBank + DEMAND and the Microsoft DNS1 synthetic test sets.
The network outperforms previous real-time denoising models in terms of PESQ score, parameter count, MACs, and latency.
arXiv Detail & Related papers (2024-09-05T09:28:56Z)
- LaMamba-Diff: Linear-Time High-Fidelity Diffusion Models Based on Local Attention and Mamba [54.85262314960038]
Local Attentional Mamba blocks capture both global contexts and local details with linear complexity.
Our model exhibits exceptional scalability and surpasses the performance of DiT across various model scales on ImageNet at 256x256 resolution.
Compared to state-of-the-art diffusion models on ImageNet 256x256 and 512x512, our largest model presents notable advantages, such as a reduction of up to 62% GFLOPs.
arXiv Detail & Related papers (2024-08-05T16:39:39Z)
- DeepCache: Accelerating Diffusion Models for Free [65.02607075556742]
DeepCache is a training-free paradigm that accelerates diffusion models from the perspective of model architecture.
DeepCache capitalizes on the inherent temporal redundancy observed in the sequential denoising steps of diffusion models.
Under the same throughput, DeepCache effectively achieves comparable or even marginally improved results with DDIM or PLMS.
arXiv Detail & Related papers (2023-12-01T17:01:06Z)
- Lightweight network towards real-time image denoising on mobile devices [26.130379174715742]
Deep convolutional neural networks have achieved great progress in image denoising tasks.
Their complicated architectures and heavy computational cost hinder their deployments on mobile devices.
We propose a mobile-friendly denoising network, namely MFDNet.
arXiv Detail & Related papers (2022-11-09T05:19:26Z)
- Real-Time Target Sound Extraction [13.526450617545537]
We present the first neural network model to achieve real-time and streaming target sound extraction.
We propose Waveformer, an encoder-decoder architecture with a stack of dilated causal convolution layers as the encoder, and a transformer decoder layer as the decoder.
arXiv Detail & Related papers (2022-11-04T03:51:23Z)
- Simple Pooling Front-ends For Efficient Audio Classification [56.59107110017436]
We show that eliminating the temporal redundancy in the input audio features could be an effective approach for efficient audio classification.
We propose a family of simple pooling front-ends (SimPFs) which use simple non-parametric pooling operations to reduce the redundant information.
SimPFs can achieve a reduction in more than half of the number of floating point operations for off-the-shelf audio neural networks.
arXiv Detail & Related papers (2022-10-03T14:00:41Z)
- FastLTS: Non-Autoregressive End-to-End Unconstrained Lip-to-Speech Synthesis [77.06890315052563]
We propose FastLTS, a non-autoregressive end-to-end model which can directly synthesize high-quality speech audios from unconstrained talking videos with low latency.
Experiments show that our model achieves $19.76\times$ speedup for audio generation compared with the current autoregressive model on input sequences of 3 seconds.
arXiv Detail & Related papers (2022-07-08T10:10:39Z)
- Streamable Neural Audio Synthesis With Non-Causal Convolutions [1.8275108630751844]
We introduce a new method for producing non-causal streaming models.
It makes any convolutional model compatible with real-time buffer-based processing.
We show how our method can be adapted to fit complex architectures with parallel branches.
arXiv Detail & Related papers (2022-04-14T16:00:32Z)
- WNARS: WFST based Non-autoregressive Streaming End-to-End Speech Recognition [59.975078145303605]
We propose a novel framework, namely WNARS, using hybrid CTC-attention AED models and weighted finite-state transducers.
On the AISHELL-1 task, our WNARS achieves a character error rate of 5.22% with 640ms latency, which is, to the best of our knowledge, the state-of-the-art performance for online ASR.
arXiv Detail & Related papers (2021-04-08T07:56:03Z)
- Real Time Speech Enhancement in the Waveform Domain [99.02180506016721]
We present a causal speech enhancement model working on the raw waveform that runs in real-time on a laptop CPU.
The proposed model is based on an encoder-decoder architecture with skip-connections.
It is capable of removing various kinds of background noise including stationary and non-stationary noises.
arXiv Detail & Related papers (2020-06-23T09:19:13Z)
- TinyLSTMs: Efficient Neural Speech Enhancement for Hearing Aids [13.369813069254132]
We use model compression techniques to bridge the gap between large neural networks and battery powered hearing aid hardware.
We are the first to demonstrate their efficacy for RNN speech enhancement, using pruning and integer quantization of weights/activations.
Our model achieves a computational latency of 2.39ms, well within the 10ms target and 351$\times$ better than previous work.
arXiv Detail & Related papers (2020-05-20T20:37:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.