Inference and Denoise: Causal Inference-based Neural Speech Enhancement
- URL: http://arxiv.org/abs/2211.01189v1
- Date: Wed, 2 Nov 2022 15:03:50 GMT
- Title: Inference and Denoise: Causal Inference-based Neural Speech Enhancement
- Authors: Tsun-An Hsieh, Chao-Han Huck Yang, Pin-Yu Chen, Sabato Marco Siniscalchi, Yu Tsao
- Abstract summary: This study addresses the speech enhancement (SE) task within the causal inference paradigm by modeling the noise presence as an intervention.
The proposed causal inference-based speech enhancement (CISE) separates clean and noisy frames in an intervened noisy speech using a noise detector and assigns both sets of frames to two mask-based enhancement modules (EMs) to perform noise-conditional SE.
- Score: 83.4641575757706
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This study addresses the speech enhancement (SE) task within the causal
inference paradigm by modeling the noise presence as an intervention. Based on
the potential outcome framework, the proposed causal inference-based speech
enhancement (CISE) separates clean and noisy frames in an intervened noisy
speech using a noise detector and assigns both sets of frames to two mask-based
enhancement modules (EMs) to perform noise-conditional SE. Specifically, we use
the presence of noise as guidance for EM selection during training, and the
noise detector selects the enhancement module according to the prediction of
the presence of noise for each frame. Moreover, we derive an SE-specific
average treatment effect to quantify the causal effect adequately. Experimental
evidence demonstrates that CISE outperforms a non-causal mask-based SE approach
in the studied settings and has better performance and efficiency than more
complex SE models.
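The frame-wise routing described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the energy-threshold detector, the fixed masks, and every name below (`energy_noise_detector`, `em_clean`, `em_noisy`, `enhance`) are hypothetical stand-ins for the learned noise detector and the two mask-based EMs. The sketch only shows the control flow: ground-truth noise labels select the module when they are available (as during training), and the detector's prediction selects it otherwise.

```python
from typing import List, Optional

Frame = List[float]  # magnitude spectrum of one frame (assumed representation)


def energy_noise_detector(frame: Frame, threshold: float = 1.0) -> bool:
    """Toy stand-in for a learned noise detector (assumption): a frame is
    flagged as noisy when its mean magnitude exceeds `threshold`."""
    return sum(frame) / len(frame) > threshold


def apply_mask(frame: Frame, mask: Frame) -> Frame:
    """Mask-based enhancement: element-wise product of the frame and a mask."""
    return [x * m for x, m in zip(frame, mask)]


def em_clean(frame: Frame) -> Frame:
    """Hypothetical EM for frames treated as clean: identity mask."""
    return apply_mask(frame, [1.0] * len(frame))


def em_noisy(frame: Frame) -> Frame:
    """Hypothetical EM for frames treated as noisy: fixed 0.5 attenuation mask."""
    return apply_mask(frame, [0.5] * len(frame))


def enhance(frames: List[Frame],
            noise_labels: Optional[List[bool]] = None) -> List[Frame]:
    """Route each frame to one of the two EMs.

    If `noise_labels` is given (training-style guidance), the labels pick the
    module; otherwise the detector's per-frame prediction picks it.
    """
    enhanced = []
    for i, frame in enumerate(frames):
        if noise_labels is not None:
            is_noisy = noise_labels[i]
        else:
            is_noisy = energy_noise_detector(frame)
        module = em_noisy if is_noisy else em_clean
        enhanced.append(module(frame))
    return enhanced


frames = [[0.2, 0.4], [2.0, 4.0]]
predicted = enhance(frames)                              # detector-driven routing
guided = enhance(frames, noise_labels=[True, False])     # label-driven routing
```

In a real system both EMs and the detector would be trained networks producing time-frequency masks; the fixed masks here exist only to make the routing observable.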
Related papers
- Noise-Robust Target-Speaker Voice Activity Detection Through Self-Supervised Pretraining [21.26555178371168]
Target-Speaker Voice Activity Detection (TS-VAD) is the task of detecting the presence of speech from a known target-speaker in an audio frame.
Deep neural network-based models have shown good performance in this task.
We propose a causal, Self-Supervised Learning (SSL) pretraining framework to enhance TS-VAD performance in noisy conditions.
arXiv Detail & Related papers (2025-01-06T18:00:14Z)
- Enhance Vision-Language Alignment with Noise [59.2608298578913]
We investigate whether the frozen model can be fine-tuned by customized noise.
We propose Positive-incentive Noise (PiNI) which can fine-tune CLIP via injecting noise into both visual and text encoders.
arXiv Detail & Related papers (2024-12-14T12:58:15Z)
- Robust Active Speaker Detection in Noisy Environments [29.785749048315616]
We formulate a robust active speaker detection (rASD) problem in noisy environments.
Existing ASD approaches leverage both audio and visual modalities, but non-speech sounds in the surrounding environment can negatively impact performance.
We propose a novel framework that utilizes audio-visual speech separation as guidance to learn noise-free audio features.
arXiv Detail & Related papers (2024-03-27T20:52:30Z)
- DiffSED: Sound Event Detection with Denoising Diffusion [70.18051526555512]
We reformulate the SED problem by taking a generative learning perspective.
Specifically, we aim to generate sound temporal boundaries from noisy proposals in a denoising diffusion process.
During training, our model learns to reverse the noising process by converting noisy latent queries to the ground-truth versions.
arXiv Detail & Related papers (2023-08-14T17:29:41Z)
- Noise-aware Speech Enhancement using Diffusion Probabilistic Model [35.17225451626734]
We propose a noise-aware speech enhancement (NASE) approach that extracts noise-specific information to guide the reverse process in the diffusion model.
NASE is shown to be a plug-and-play module that can be generalized to any diffusion SE model.
arXiv Detail & Related papers (2023-07-16T12:46:11Z)
- Unsupervised speech enhancement with deep dynamical generative speech and noise models [26.051535142743166]
This work builds on previous work on unsupervised speech enhancement that uses a dynamical variational autoencoder (DVAE) as the clean speech model and non-negative matrix factorization (NMF) as the noise model.
We propose to replace the NMF noise model with a deep dynamical generative model (DDGM) depending either on the DVAE latent variables, or on the noisy observations, or on both.
arXiv Detail & Related papers (2023-06-13T14:52:35Z)
- Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve comparable performance to the best supervised approach reported with only 16% of labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z)
- CITISEN: A Deep Learning-Based Speech Signal-Processing Mobile Application [63.2243126704342]
This study presents a deep learning-based speech signal-processing mobile application known as CITISEN.
CITISEN provides three functions: speech enhancement (SE), model adaptation (MA), and background noise conversion (BNC).
Compared with the noisy speech signals, the enhanced speech signals achieved improvements of about 6% and 33%.
arXiv Detail & Related papers (2020-08-21T02:04:12Z)
- Simultaneous Denoising and Dereverberation Using Deep Embedding Features [64.58693911070228]
We propose a joint training method for simultaneous speech denoising and dereverberation using deep embedding features.
At the denoising stage, the DC network is leveraged to extract noise-free deep embedding features.
At the dereverberation stage, instead of using the unsupervised K-means clustering algorithm, another neural network is utilized to estimate the anechoic speech.
arXiv Detail & Related papers (2020-04-06T06:34:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.