Efficient Low-Latency Speech Enhancement with Mobile Audio Streaming
Networks
- URL: http://arxiv.org/abs/2008.07244v1
- Date: Mon, 17 Aug 2020 12:18:34 GMT
- Title: Efficient Low-Latency Speech Enhancement with Mobile Audio Streaming
Networks
- Authors: Micha{\l} Romaniuk, Piotr Masztalski, Karol Piaskowski, Mateusz
Matuszewski
- Abstract summary: We propose Mobile Audio Streaming Networks (MASnet) for efficient low-latency speech enhancement.
MASnet processes linear-scale spectrograms, transforming successive noisy frames into complex-valued ratio masks.
- Score: 6.82469220191368
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose Mobile Audio Streaming Networks (MASnet) for efficient low-latency
speech enhancement, which is particularly suitable for mobile devices and other
applications where computational capacity is a limitation. MASnet processes
linear-scale spectrograms, transforming successive noisy frames into
complex-valued ratio masks which are then applied to the respective noisy
frames. MASnet can operate in a low-latency incremental inference mode which
matches the complexity of layer-by-layer batch mode. Compared to a similar
fully-convolutional architecture, MASnet incorporates depthwise and pointwise
convolutions for a large reduction in fused multiply-accumulate operations per
second (FMA/s), at the cost of some reduction in SNR.
Related papers
- ASMR: Activation-sharing Multi-resolution Coordinate Networks For Efficient Inference [6.005712471509875]
Coordinate network or implicit neural representation (INR) is a fast-emerging method for encoding natural signals.
We propose the Activation-Sharing Multi-Resolution (ASMR) coordinate network that combines multi-resolution coordinate decomposition with hierarchical modulations.
We show that ASMR can reduce the MAC of a vanilla SIREN model by up to 500x while achieving an even higher reconstruction quality than its SIREN baseline.
arXiv Detail & Related papers (2024-05-20T22:35:34Z) - Efficient Multi-scale Network with Learnable Discrete Wavelet Transform for Blind Motion Deblurring [25.36888929483233]
We propose a multi-scale network based on single-input and multiple-outputs(SIMO) for motion deblurring.
We combine the characteristics of real-world trajectories with a learnable wavelet transform module to focus on the directional continuity and frequency features of the step-by-step transitions between blurred images to sharp images.
arXiv Detail & Related papers (2023-12-29T02:59:40Z) - Joint Channel Estimation and Feedback with Masked Token Transformers in
Massive MIMO Systems [74.52117784544758]
This paper proposes an encoder-decoder based network that unveils the intrinsic frequency-domain correlation within the CSI matrix.
The entire encoder-decoder network is utilized for channel compression.
Our method outperforms state-of-the-art channel estimation and feedback techniques in joint tasks.
arXiv Detail & Related papers (2023-06-08T06:15:17Z) - Adaptive Dynamic Filtering Network for Image Denoising [8.61083713580388]
In image denoising networks, feature scaling is widely used to enlarge the receptive field size and reduce computational costs.
We propose to employ dynamic convolution to improve the learning of high-frequency and multi-scale features.
We build an efficient denoising network with the proposed DCB and MDCB, named ADFNet.
arXiv Detail & Related papers (2022-11-22T06:54:27Z) - Parallel Gated Neural Network With Attention Mechanism For Speech
Enhancement [0.0]
This paper proposes a novel monaural speech enhancement system, consisting of a Feature Extraction Block (FEB), a Compensation Enhancement Block (ComEB) and a Mask Block (MB)
Experiments are conducted on the Librispeech dataset and results show that the proposed model obtains better performance than recent models in terms of ESTOI and PESQ scores.
arXiv Detail & Related papers (2022-10-26T06:42:19Z) - Simple Pooling Front-ends For Efficient Audio Classification [56.59107110017436]
We show that eliminating the temporal redundancy in the input audio features could be an effective approach for efficient audio classification.
We propose a family of simple pooling front-ends (SimPFs) which use simple non-parametric pooling operations to reduce the redundant information.
SimPFs can achieve a reduction in more than half of the number of floating point operations for off-the-shelf audio neural networks.
arXiv Detail & Related papers (2022-10-03T14:00:41Z) - Collaborative Intelligent Reflecting Surface Networks with Multi-Agent
Reinforcement Learning [63.83425382922157]
Intelligent reflecting surface (IRS) is envisioned to be widely applied in future wireless networks.
In this paper, we investigate a multi-user communication system assisted by cooperative IRS devices with the capability of energy harvesting.
arXiv Detail & Related papers (2022-03-26T20:37:14Z) - A Study of Designing Compact Audio-Visual Wake Word Spotting System
Based on Iterative Fine-Tuning in Neural Network Pruning [57.28467469709369]
We investigate on designing a compact audio-visual wake word spotting (WWS) system by utilizing visual information.
We introduce a neural network pruning strategy via the lottery ticket hypothesis in an iterative fine-tuning manner (LTH-IF)
The proposed audio-visual system achieves significant performance improvements over the single-modality (audio-only or video-only) system under different noisy conditions.
arXiv Detail & Related papers (2022-02-17T08:26:25Z) - Learning Frequency-aware Dynamic Network for Efficient Super-Resolution [56.98668484450857]
This paper explores a novel frequency-aware dynamic network for dividing the input into multiple parts according to its coefficients in the discrete cosine transform (DCT) domain.
In practice, the high-frequency part will be processed using expensive operations and the lower-frequency part is assigned with cheap operations to relieve the computation burden.
Experiments conducted on benchmark SISR models and datasets show that the frequency-aware dynamic network can be employed for various SISR neural architectures.
arXiv Detail & Related papers (2021-03-15T12:54:26Z) - Improved MVDR Beamforming Using LSTM Speech Models to Clean Spatial
Clustering Masks [14.942060304734497]
spatial clustering techniques can achieve significant multi-channel noise reduction across relatively arbitrary microphone configurations.
LSTM neural networks have successfully been trained to recognize speech from noise on single-channel inputs, but have difficulty taking full advantage of the information in multi-channel recordings.
This paper integrates these two approaches, training LSTM speech models to clean the masks generated by the Model-based EM Source Separation and Localization (MESSL) spatial clustering method.
arXiv Detail & Related papers (2020-12-02T22:35:00Z) - Progressive Training of Multi-level Wavelet Residual Networks for Image
Denoising [80.10533234415237]
This paper presents a multi-level wavelet residual network (MWRN) architecture as well as a progressive training scheme to improve image denoising performance.
Experiments on both synthetic and real-world noisy images show that our PT-MWRN performs favorably against the state-of-the-art denoising methods.
arXiv Detail & Related papers (2020-10-23T14:14:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.