Incorporating Broad Phonetic Information for Speech Enhancement
- URL: http://arxiv.org/abs/2008.07618v1
- Date: Thu, 13 Aug 2020 09:38:08 GMT
- Title: Incorporating Broad Phonetic Information for Speech Enhancement
- Authors: Yen-Ju Lu, Chien-Feng Liao, Xugang Lu, Jeih-weih Hung and Yu Tsao
- Abstract summary: In noisy conditions, knowing the speech content helps listeners suppress background noise components more effectively.
Previous studies have confirmed the benefits of incorporating phonetic information in a speech enhancement system.
This study proposes to incorporate the broad phonetic class (BPC) information into the SE process.
- Score: 23.12902068334228
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In noisy conditions, knowing the speech content helps listeners
suppress background noise components more effectively and recover the clean
speech signal. Previous studies have also confirmed the benefits of incorporating
phonetic information in a speech enhancement (SE) system to achieve better
denoising performance. To obtain the phonetic information, we usually prepare a
phoneme-based acoustic model, which is trained using speech waveforms and
phoneme labels. Although such a model performs well in moderately noisy
conditions, in very noisy conditions the recognized phonemes may be erroneous
and thus misguide the SE process. To overcome this limitation, this
study proposes to incorporate the broad phonetic class (BPC) information into
the SE process. We have investigated three criteria to build the BPC, including
two knowledge-based criteria (place and manner of articulation) and one
data-driven criterion. Moreover, the recognition accuracies of BPCs are much
higher than those of phonemes, thus providing more accurate phonetic information
to guide the SE process under very noisy conditions. Experimental results
demonstrate that the proposed BPC-informed SE framework achieves notable
performance improvements over the baseline system and an SE system using
monophonic information, in terms of both speech quality and intelligibility, on
the TIMIT dataset.
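The key mechanism is collapsing the fine-grained phoneme inventory into a few broad phonetic classes, which are far easier to recognize at low SNRs; the resulting class posteriors can then condition the SE model. The sketch below illustrates one plausible knowledge-based grouping (by manner of articulation) for TIMIT-style phoneme labels. The exact class definitions, the data-driven clustering criterion, and how the BPC information is fed to the enhancement network are not specified in the abstract, so the groupings and the helper function here are illustrative assumptions only.

```python
# Minimal sketch: collapse TIMIT-style phoneme labels into broad phonetic
# classes (BPCs) by manner of articulation. The exact partition used in the
# paper is not given in the abstract; this grouping is illustrative only.

MANNER_BPC = {
    "vowel":     ["iy", "ih", "eh", "ae", "aa", "ah", "ao", "uh", "uw", "er",
                  "ay", "ey", "oy", "aw", "ow"],
    "stop":      ["p", "b", "t", "d", "k", "g"],
    "fricative": ["f", "v", "th", "dh", "s", "z", "sh", "zh", "hh", "ch", "jh"],
    "nasal":     ["m", "n", "ng"],
    "semivowel": ["l", "r", "w", "y"],
    "silence":   ["sil", "cl", "epi"],
}

# Invert to a phoneme -> BPC lookup table.
PHONE_TO_BPC = {ph: bpc for bpc, phones in MANNER_BPC.items() for ph in phones}


def phones_to_bpc(phone_seq):
    """Map a frame-level phoneme sequence to BPC labels.

    Unknown symbols fall back to 'silence' so the output stays frame-aligned,
    which matters if the BPC labels (or posteriors from a BPC classifier) are
    later combined with the noisy spectral features fed to the SE model.
    """
    return [PHONE_TO_BPC.get(ph, "silence") for ph in phone_seq]


if __name__ == "__main__":
    # Frame-aligned phoneme labels for a short utterance (hypothetical example).
    frames = ["sil", "sh", "iy", "hh", "ae", "d", "sil"]
    print(phones_to_bpc(frames))
    # ['silence', 'fricative', 'vowel', 'fricative', 'vowel', 'stop', 'silence']
```

With only a handful of classes, a BPC classifier retains much higher frame accuracy under heavy noise than a full phoneme recognizer, which is the property the proposed framework exploits.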
Related papers
- Learning Speech Representation From Contrastive Token-Acoustic Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
arXiv Detail & Related papers (2023-09-01T12:35:43Z)
- Inference and Denoise: Causal Inference-based Neural Speech Enhancement [83.4641575757706]
This study addresses the speech enhancement (SE) task within the causal inference paradigm by modeling the noise presence as an intervention.
The proposed causal inference-based speech enhancement (CISE) separates clean and noisy frames in an intervened noisy speech using a noise detector and assigns both sets of frames to two mask-based enhancement modules (EMs) to perform noise-conditional SE.
arXiv Detail & Related papers (2022-11-02T15:03:50Z)
- Noise-robust Speech Recognition with 10 Minutes Unparalleled In-domain Data [24.512424190830828]
We propose a generative adversarial network to simulate the noisy spectrum from the clean spectrum (Simu-GAN).
We also propose a dual-path speech recognition system to improve the robustness of the system under noisy conditions.
arXiv Detail & Related papers (2022-03-29T08:06:01Z)
- Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve comparable performance to the best supervised approach reported with only 16% of labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z)
- Speech Enhancement for Wake-Up-Word detection in Voice Assistants [60.103753056973815]
Keyword spotting, and in particular Wake-Up-Word (WUW) detection, is a very important task for voice assistants.
This paper proposes a Speech Enhancement model adapted to the task of WUW detection.
It aims at increasing the recognition rate and reducing the false alarms in the presence of these types of noises.
arXiv Detail & Related papers (2021-01-29T18:44:05Z)
- Improving Speech Enhancement Performance by Leveraging Contextual Broad Phonetic Class Information [33.79855927394387]
We explore the contextual information of articulatory attributes as additional information to further benefit speech enhancement.
We propose to improve the SE performance by leveraging losses from an end-to-end automatic speech recognition model.
Experimental results from speech denoising, speech dereverberation, and impaired speech enhancement tasks confirmed that contextual BPC information improves SE performance.
arXiv Detail & Related papers (2020-11-15T03:56:37Z)
- Correlating Subword Articulation with Lip Shapes for Embedding Aware Audio-Visual Speech Enhancement [94.0676772764248]
We propose a visual embedding approach to improving embedding-aware speech enhancement (EASE).
We first extract visual embeddings from lip frames using a pre-trained phone or articulation-place recognizer for visual-only EASE (VEASE).
Next, we extract audio-visual embeddings from noisy speech and lip videos in an information-intersection manner, exploiting the complementarity of audio and visual features for multi-modal EASE (MEASE).
arXiv Detail & Related papers (2020-09-21T01:26:19Z)
- CITISEN: A Deep Learning-Based Speech Signal-Processing Mobile Application [63.2243126704342]
This study presents a deep learning-based speech signal-processing mobile application known as CITISEN.
CITISEN provides three functions: speech enhancement (SE), model adaptation (MA), and background noise conversion (BNC).
Compared with the noisy speech signals, the enhanced speech signals achieved improvements of about 6% and 33%.
arXiv Detail & Related papers (2020-08-21T02:04:12Z)
- Adversarial Feature Learning and Unsupervised Clustering based Speech Synthesis for Found Data with Acoustic and Textual Noise [18.135965605011105]
Attention-based sequence-to-sequence (seq2seq) speech synthesis has achieved extraordinary performance.
A studio-quality corpus with manual transcription is necessary to train such seq2seq systems.
We propose an approach to build a high-quality and stable seq2seq-based speech synthesis system using challenging found data.
arXiv Detail & Related papers (2020-04-28T15:32:45Z)
- Robust Speaker Recognition Using Speech Enhancement And Attention Model [37.33388614967888]
Instead of individually processing speech enhancement and speaker recognition, the two modules are integrated into one framework by a joint optimisation using deep neural networks.
To increase robustness against noise, a multi-stage attention mechanism is employed to highlight the speaker-related features learned from contextual information in the time and frequency domains.
The obtained results show that the proposed approach using speech enhancement and multi-stage attention models outperforms two strong baselines not using them in most acoustic conditions in our experiments.
arXiv Detail & Related papers (2020-01-14T20:03:07Z)