CITISEN: A Deep Learning-Based Speech Signal-Processing Mobile
Application
- URL: http://arxiv.org/abs/2008.09264v5
- Date: Mon, 25 Apr 2022 14:23:41 GMT
- Title: CITISEN: A Deep Learning-Based Speech Signal-Processing Mobile
Application
- Authors: Yu-Wen Chen, Kuo-Hsuan Hung, You-Jin Li, Alexander Chao-Fu Kang,
Ya-Hsin Lai, Kai-Chun Liu, Szu-Wei Fu, Syu-Siang Wang, Yu Tsao
- Abstract summary: This study presents a deep learning-based speech signal-processing mobile application known as CITISEN.
CITISEN provides three functions: speech enhancement (SE), model adaptation (MA), and background noise conversion (BNC).
Compared with the noisy speech signals, the enhanced speech signals achieved improvements of about 6% and 33% in STOI and PESQ, respectively.
- Score: 63.2243126704342
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This study presents a deep learning-based speech signal-processing mobile
application known as CITISEN. CITISEN provides three functions: speech
enhancement (SE), model adaptation (MA), and background noise conversion (BNC),
allowing it to serve as a platform for utilizing and evaluating SE models and
for flexibly extending those models to various noise environments and users.
For SE, a pretrained SE model downloaded from the cloud server is used to
effectively reduce noise components in instant or saved recordings provided by
users. When unseen noise or speaker conditions are encountered, the MA function
is applied to improve CITISEN's performance: a few audio samples recorded in
the noisy environment are uploaded and used to adapt the pretrained SE model on
the server. Finally, for BNC, CITISEN first removes the background noise with
an SE model and then mixes the processed speech with a new background noise.
This novel BNC function can be used to evaluate SE performance under specific
conditions, to conceal a speaker's actual surroundings, and for entertainment.
The experimental results confirmed the effectiveness of the SE, MA, and BNC
functions. Compared with the noisy speech signals, the enhanced speech signals
achieved improvements of about 6\% and 33\%, respectively, in terms of
short-time objective intelligibility (STOI) and perceptual evaluation of speech
quality (PESQ). With MA, the STOI and PESQ scores were further improved by
approximately 6\% and 11\%, respectively. Finally, the BNC experiment results
indicated that speech signals converted from noisy and from silent backgrounds
yield similar scene identification accuracy and similar embeddings in an
acoustic scene classification model. Therefore, the proposed BNC can
effectively convert the background noise of a speech signal and serve as a data
augmentation method when clean speech signals are unavailable.
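The second half of the BNC pipeline (remixing already-denoised speech with a new background at a controlled level) can be sketched in plain NumPy. The function name `convert_background` and the SNR-based scaling rule are illustrative assumptions, not the paper's exact mixing procedure; the SE denoising step is assumed to have run already.

```python
import numpy as np

def convert_background(enhanced_speech, new_noise, snr_db):
    """Mix denoised speech with a new background noise at a target SNR (dB).

    Sketches the BNC remixing step: the original background is assumed to
    have been removed by an SE model, leaving `enhanced_speech`.
    """
    # Tile or truncate the noise clip to match the speech length.
    reps = int(np.ceil(len(enhanced_speech) / len(new_noise)))
    noise = np.tile(new_noise, reps)[: len(enhanced_speech)]

    # Scale the noise so the mixture reaches the requested SNR.
    speech_power = np.mean(enhanced_speech ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return enhanced_speech + scale * noise

# Toy example: a 220 Hz sine "speech" mixed with white "cafe" noise at 10 dB SNR.
rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
noise = rng.standard_normal(8000)
mixture = convert_background(speech, noise, snr_db=10.0)
```

In an evaluation setting, sweeping `snr_db` over a range of values would let the same utterance be tested under progressively harsher versions of the new background.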
Related papers
- Robust Active Speaker Detection in Noisy Environments [29.785749048315616]
We formulate a robust active speaker detection (rASD) problem in noisy environments.
Existing ASD approaches leverage both audio and visual modalities, but non-speech sounds in the surrounding environment can negatively impact performance.
We propose a novel framework that utilizes audio-visual speech separation as guidance to learn noise-free audio features.
arXiv Detail & Related papers (2024-03-27T20:52:30Z)
- Continuous Modeling of the Denoising Process for Speech Enhancement Based on Deep Learning [61.787485727134424]
We use a state variable to indicate the denoising process.
A UNet-like neural network learns to estimate every state variable sampled from the continuous denoising process.
Experimental results indicate that preserving a small amount of noise in the clean target benefits speech enhancement.
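One simple way to realize a continuous denoising state, under the assumption that intermediate targets lie on a linear interpolation between the clean and noisy signals, can be sketched as follows. The function name and the linear parameterization are illustrative; the paper's exact formulation of the state variable may differ.

```python
import numpy as np

def denoising_state(noisy, clean, t):
    """Interpolated target at state t of a continuous denoising process.

    t = 1.0 reproduces the noisy input, t = 0.0 the fully clean target;
    a small t > 0 keeps a controlled residue of noise in the target,
    matching the observation that a little residual noise can help.
    Linear interpolation is an assumption, not the paper's exact scheme.
    """
    return clean + t * (noisy - clean)

clean = np.zeros(4)
noise = np.array([0.4, -0.2, 0.1, 0.3])
noisy = clean + noise
slightly_noisy_target = denoising_state(noisy, clean, t=0.05)
```

A network trained against `denoising_state(noisy, clean, t)` for sampled values of `t` would then learn to estimate any point along this continuum rather than only the fully clean endpoint.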
arXiv Detail & Related papers (2023-09-17T13:27:11Z)
- NLIP: Noise-robust Language-Image Pre-training [95.13287735264937]
We propose a principled Noise-robust Language-Image Pre-training framework (NLIP) to stabilize pre-training via two schemes: noise-harmonization and noise-completion.
Our NLIP can alleviate the common noise effects during image-text pre-training in a more efficient way.
arXiv Detail & Related papers (2022-12-14T08:19:30Z)
- Inference and Denoise: Causal Inference-based Neural Speech Enhancement [83.4641575757706]
This study addresses the speech enhancement (SE) task within the causal inference paradigm by modeling the noise presence as an intervention.
The proposed causal inference-based speech enhancement (CISE) separates clean and noisy frames in an intervened noisy speech using a noise detector and assigns both sets of frames to two mask-based enhancement modules (EMs) to perform noise-conditional SE.
arXiv Detail & Related papers (2022-11-02T15:03:50Z)
- NASTAR: Noise Adaptive Speech Enhancement with Target-Conditional Resampling [34.565077865854484]
We propose noise adaptive speech enhancement with target-conditional resampling (NASTAR).
NASTAR uses a feedback mechanism to simulate adaptive training data via a noise extractor and a retrieval model.
Experimental results show that NASTAR can effectively use one noisy speech sample to adapt an SE model to a target condition.
arXiv Detail & Related papers (2022-06-18T00:15:48Z)
- Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve comparable performance to the best supervised approach reported with only 16% of labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z)
- Training Speech Enhancement Systems with Noisy Speech Datasets [7.157870452667369]
We propose two improvements to train SE systems on noisy speech data.
First, we propose several modifications of the loss functions, which make them robust against noisy speech targets.
We show that using our robust loss function improves PESQ by up to 0.19 compared to a system trained in the traditional way.
arXiv Detail & Related papers (2021-05-26T03:32:39Z) - Speech Enhancement for Wake-Up-Word detection in Voice Assistants [60.103753056973815]
Keyword spotting, and in particular Wake-Up-Word (WUW) detection, is a very important task for voice assistants.
This paper proposes a Speech Enhancement model adapted to the task of WUW detection.
It aims at increasing the recognition rate and reducing false alarms in the presence of background noise.
arXiv Detail & Related papers (2021-01-29T18:44:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.