CheapNET: Improving Light-weight speech enhancement network by projected loss function
- URL: http://arxiv.org/abs/2311.15959v1
- Date: Mon, 27 Nov 2023 16:03:42 GMT
- Title: CheapNET: Improving Light-weight speech enhancement network by projected loss function
- Authors: Kaijun Tan, Benzhe Dai, Jiakui Li, Wenyu Mao
- Abstract summary: We introduce a novel projection loss function, diverging from MSE, to enhance noise suppression.
For echo cancellation, the function enables direct predictions on LAEC pre-processed outputs.
Our noise suppression model achieves near state-of-the-art results with only 3.1M parameters and 0.4GFlops/s computational load.
- Score: 0.8192907805418583
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Noise suppression and echo cancellation are critical in speech enhancement
and essential for smart devices and real-time communication. Deployed in voice
processing front-ends and edge devices, these algorithms must ensure efficient
real-time inference with low computational demands. Traditional edge-based
noise suppression often uses MSE-based amplitude spectrum mask training, but
this approach has limitations. We introduce a novel projection loss function,
diverging from MSE, to enhance noise suppression. This method uses projection
techniques to isolate key audio components from noise, significantly improving
model performance. For echo cancellation, the function enables direct
predictions on LAEC pre-processed outputs, substantially enhancing performance.
Our noise suppression model achieves near state-of-the-art results with only
3.1M parameters and 0.4GFlops/s computational load. Moreover, our echo
cancellation model outperforms replicated industry-leading models, introducing
a new perspective in speech enhancement.
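The abstract describes the projection loss only at a high level, so the following is a minimal sketch, assuming the estimate is projected onto the clean reference and the orthogonal residual is penalized (in the spirit of SI-SDR-style decompositions); the function names, tensor shapes, and exact formulation are illustrative assumptions, not the paper's definition. For contrast, the conventional MSE amplitude-spectrum mask baseline mentioned in the abstract is sketched alongside it.

```python
# Illustrative sketch only: contrasts a conventional MSE amplitude-mask loss
# with one plausible "projection" loss that penalizes the component of the
# estimate orthogonal to the clean reference. All shapes and the projection
# form are assumptions, not CheapNET's published loss.
import torch

def mse_mask_loss(pred_mask, noisy_mag, clean_mag):
    """Conventional baseline: MSE between masked noisy magnitude and clean magnitude."""
    est_mag = pred_mask * noisy_mag            # (batch, freq, frames)
    return torch.mean((est_mag - clean_mag) ** 2)

def projection_loss(est, clean, eps=1e-8):
    """Project the estimate onto the clean reference and penalize the residual
    that is orthogonal to it (the part attributable to noise/distortion)."""
    est = est.flatten(1)                        # (batch, samples) or flattened spectrum
    clean = clean.flatten(1)
    # scalar projection coefficient <est, clean> / ||clean||^2 per example
    alpha = (est * clean).sum(dim=1, keepdim=True) / (clean.pow(2).sum(dim=1, keepdim=True) + eps)
    target_part = alpha * clean                 # component aligned with clean speech
    residual = est - target_part                # component treated as noise
    # minimize residual energy relative to the preserved target energy
    return torch.mean(residual.pow(2).sum(dim=1) / (target_part.pow(2).sum(dim=1) + eps))
```

Unlike the amplitude-mask MSE baseline, the projection form scores only the residual that cannot be explained by the clean reference, which is one way to read the abstract's claim of isolating key audio components from noise.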
Related papers
- Diffusion-based speech enhancement with a weighted generative-supervised learning loss [0.0]
Diffusion-based generative models have recently gained attention in speech enhancement (SE).
We propose augmenting the original diffusion training objective with a mean squared error (MSE) loss, measuring the discrepancy between estimated enhanced speech and ground-truth clean speech.
arXiv Detail & Related papers (2023-09-19T09:13:35Z)
- Noise-aware Speech Enhancement using Diffusion Probabilistic Model [35.17225451626734]
We propose a noise-aware speech enhancement (NASE) approach that extracts noise-specific information to guide the reverse process in the diffusion model.
NASE is shown to be a plug-and-play module that can be generalized to any diffusion SE models.
arXiv Detail & Related papers (2023-07-16T12:46:11Z)
- Unifying Speech Enhancement and Separation with Gradient Modulation for End-to-End Noise-Robust Speech Separation [23.758202121043805]
We propose a novel network that unifies speech enhancement and separation with gradient modulation to improve noise-robustness.
Experimental results show that our approach achieves state-of-the-art performance on the large-scale Libri2Mix- and Libri3Mix-noisy datasets.
arXiv Detail & Related papers (2023-02-22T03:54:50Z)
- NLIP: Noise-robust Language-Image Pre-training [95.13287735264937]
We propose a principled Noise-robust Language-Image Pre-training framework (NLIP) to stabilize pre-training via two schemes: noise-harmonization and noise-completion.
Our NLIP can alleviate the common noise effects during image-text pre-training in a more efficient way.
arXiv Detail & Related papers (2022-12-14T08:19:30Z)
- Inference and Denoise: Causal Inference-based Neural Speech Enhancement [83.4641575757706]
This study addresses the speech enhancement (SE) task within the causal inference paradigm by modeling the noise presence as an intervention.
The proposed causal inference-based speech enhancement (CISE) separates clean and noisy frames in intervened noisy speech using a noise detector and assigns both sets of frames to two mask-based enhancement modules (EMs) to perform noise-conditional SE.
arXiv Detail & Related papers (2022-11-02T15:03:50Z)
- Simple Pooling Front-ends For Efficient Audio Classification [56.59107110017436]
We show that eliminating the temporal redundancy in the input audio features can be an effective approach for efficient audio classification.
We propose a family of simple pooling front-ends (SimPFs) that use simple non-parametric pooling operations to reduce the redundant information; a minimal sketch of this idea appears after this list.
SimPFs can reduce the number of floating point operations of off-the-shelf audio neural networks by more than half.
arXiv Detail & Related papers (2022-10-03T14:00:41Z)
- Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve comparable performance to the best supervised approach reported with only 16% of labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z)
- A Study on Speech Enhancement Based on Diffusion Probabilistic Model [63.38586161802788]
We propose a diffusion probabilistic model-based speech enhancement (DiffuSE) model that aims to recover clean speech signals from noisy signals.
The experimental results show that DiffuSE yields performance comparable to related audio generative models on the standardized Voice Bank corpus task.
arXiv Detail & Related papers (2021-07-25T19:23:18Z)
- Variational Autoencoder for Speech Enhancement with a Noise-Aware Encoder [30.318947721658862]
We propose to include noise information in the training phase by using a noise-aware encoder trained on noisy-clean speech pairs.
We show that our proposed noise-aware VAE outperforms the standard VAE in terms of overall distortion without increasing the number of model parameters.
arXiv Detail & Related papers (2021-02-17T11:40:42Z)
- CITISEN: A Deep Learning-Based Speech Signal-Processing Mobile Application [63.2243126704342]
This study presents a deep learning-based speech signal-processing mobile application known as CITISEN.
CITISEN provides three functions: speech enhancement (SE), model adaptation (MA), and background noise conversion (BNC).
Compared with the noisy speech signals, the enhanced speech signals achieved improvements of about 6% and 33%.
arXiv Detail & Related papers (2020-08-21T02:04:12Z)
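The SimPFs entry above describes non-parametric pooling of input audio features to cut temporal redundancy. The snippet below is a minimal sketch of that idea, average pooling over the frame axis of a spectrogram before the classifier; the pooling factor and tensor shapes are chosen purely for illustration and are not taken from that paper.

```python
# Minimal sketch of a non-parametric pooling front-end in the spirit of SimPFs:
# average-pool the time axis of a spectrogram so the downstream network sees
# fewer frames. The pooling factor and shapes are illustrative assumptions.
import torch
import torch.nn.functional as F

def pool_frontend(spec, factor=2):
    """spec: (batch, freq, frames) -> (batch, freq, frames // factor)."""
    return F.avg_pool1d(spec, kernel_size=factor, stride=factor)

spec = torch.randn(8, 64, 400)      # e.g. 8 clips, 64 mel bins, 400 frames
pooled = pool_frontend(spec, 2)     # (8, 64, 200): roughly half the frames,
                                    # so downstream FLOPs drop accordingly
```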