Improving Speech Enhancement Performance by Leveraging Contextual Broad
Phonetic Class Information
- URL: http://arxiv.org/abs/2011.07442v5
- Date: Sun, 18 Jun 2023 11:52:45 GMT
- Title: Improving Speech Enhancement Performance by Leveraging Contextual Broad
Phonetic Class Information
- Authors: Yen-Ju Lu, Chia-Yu Chang, Cheng Yu, Ching-Feng Liu, Jeih-weih Hung,
Shinji Watanabe, Yu Tsao
- Abstract summary: We explore the contextual information of articulatory attributes as an additional cue to further benefit speech enhancement (SE).
We propose to improve SE performance by leveraging losses from an end-to-end automatic speech recognition model.
Experimental results from speech denoising, speech dereverberation, and impaired speech enhancement tasks confirmed that contextual broad phonetic class (BPC) information improves SE performance.
- Score: 33.79855927394387
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Previous studies have confirmed that by augmenting acoustic features with the
place/manner of articulatory features, the speech enhancement (SE) process can
be guided to consider the broad phonetic properties of the input speech when
performing enhancement to attain performance improvements. In this paper, we
explore the contextual information of articulatory attributes as additional
information to further benefit SE. More specifically, we propose to improve the
SE performance by leveraging losses from an end-to-end automatic speech
recognition (E2E-ASR) model that predicts the sequence of broad phonetic
classes (BPCs). We also developed multi-objective training with ASR and
perceptual losses to train the SE system based on a BPC-based E2E-ASR.
Experimental results from speech denoising, speech dereverberation, and
impaired speech enhancement tasks confirmed that contextual BPC information
improves SE performance. Moreover, the SE model trained with the BPC-based
E2E-ASR outperforms that with the phoneme-based E2E-ASR. The results suggest
that phoneme misclassifications by the ASR system can introduce imperfect
feedback into the training objective, and that BPCs may be a better choice of
target. Finally,
it is noted that combining the most-confusable phonetic targets into the same
BPC when calculating the additional objective can effectively improve the SE
performance.
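As a rough sketch of this multi-objective training idea, the snippet below combines a signal-level SE loss with a CTC loss from a frozen E2E-ASR model that predicts BPC sequences. The module names `se_model` and `bpc_asr`, the L1/CTC loss choices, and the weight `alpha` are illustrative assumptions rather than the authors' implementation, and the perceptual loss term is omitted for brevity.

```python
import torch
import torch.nn.functional as F

# Hedged sketch: SE reconstruction loss plus a contextual BPC objective from a
# frozen CTC-based E2E-ASR model. Only the SE model is meant to be updated;
# the recognizer's parameters are assumed frozen (requires_grad=False).

def multi_objective_loss(se_model, bpc_asr, noisy, clean,
                         bpc_targets, target_lens, alpha=0.1):
    enhanced = se_model(noisy)            # (B, T, F) enhanced features

    # Signal-level SE objective: L1 distance to the clean reference.
    se_loss = F.l1_loss(enhanced, clean)

    # Contextual BPC objective: the frozen recognizer scores the enhanced
    # speech against the BPC target sequence; gradients flow back through
    # `enhanced` into the SE model.
    log_probs = bpc_asr(enhanced).log_softmax(dim=-1)   # (B, T, n_bpc + 1)
    input_lens = torch.full((noisy.size(0),), log_probs.size(1),
                            dtype=torch.long)
    asr_loss = F.ctc_loss(log_probs.transpose(0, 1), bpc_targets,
                          input_lens, target_lens, blank=0)

    return se_loss + alpha * asr_loss
```

Keeping the recognizer frozen makes the BPC objective a fixed teacher, so imperfect feedback enters only through its predictions, which is exactly where the BPC-versus-phoneme target choice matters.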
Related papers
- ROSE: A Recognition-Oriented Speech Enhancement Framework in Air Traffic Control Using Multi-Objective Learning [6.60571587618006]
Radio speech echo is a specific phenomenon in the air traffic control (ATC) domain, which degrades speech quality and impacts automatic speech recognition (ASR) accuracy.
In this work, a time-domain recognition-oriented speech enhancement framework is proposed to improve both speech intelligibility and ASR accuracy.
The framework serves as a plug-and-play tool in ATC scenarios and does not require additional retraining of the ASR model.
arXiv Detail & Related papers (2023-12-11T04:51:41Z) - Enhancing and Adversarial: Improve ASR with Speaker Labels [49.73714831258699]
We propose a novel adaptive gradient reversal layer for stable and effective adversarial training without tuning effort.
Detailed analysis and experimental verification are conducted to show the optimal positions in the ASR neural network (NN) at which to apply speaker-enhancing and adversarial training.
Our best speaker-based MTL achieves 7% relative improvement on the Switchboard Hub5'00 set.
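As a hedged illustration of the gradient-reversal mechanism behind such speaker-adversarial training, the sketch below implements a standard gradient reversal layer in PyTorch. The paper's adaptive scaling policy is not reproduced here; `lam` is a plain hyperparameter in this sketch.

```python
import torch

# Gradient reversal layer (GRL): identity in the forward pass, sign-flipped
# and scaled gradient in the backward pass, so the shared encoder is trained
# to remove speaker information that the adversarial head can exploit.

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse and scale the gradient flowing into the shared encoder.
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Hypothetical usage inside an ASR network:
#   speaker_logits = speaker_head(grad_reverse(encoder_out, lam=0.5))
```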
arXiv Detail & Related papers (2022-11-11T17:40:08Z)
- Improving Speech Enhancement through Fine-Grained Speech Characteristics [42.49874064240742]
We propose a novel approach to speech enhancement aimed at improving perceptual quality and naturalness of enhanced signals.
We first identify key acoustic parameters that have been found to correlate well with voice quality.
We then propose objective functions which are aimed at reducing the difference between clean speech and enhanced speech with respect to these features.
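A minimal sketch of this kind of objective, assuming magnitude spectrogram inputs: two simple differentiable statistics (frame energy and spectral centroid) stand in here for the voice-quality parameters the paper actually targets.

```python
import torch
import torch.nn.functional as F

# Hedged sketch: penalize the difference between acoustic parameters of the
# enhanced and clean signals. Frame energy and spectral centroid are simple
# stand-ins for the paper's fine-grained voice-quality features.

def spectral_stats(mag):                  # mag: (B, T, F) magnitude spectrogram
    energy = mag.pow(2).sum(dim=-1)       # per-frame energy
    freqs = torch.arange(mag.size(-1), device=mag.device, dtype=mag.dtype)
    centroid = (mag * freqs).sum(dim=-1) / (mag.sum(dim=-1) + 1e-8)
    return energy, centroid

def characteristics_loss(enhanced_mag, clean_mag):
    e_en, c_en = spectral_stats(enhanced_mag)
    e_cl, c_cl = spectral_stats(clean_mag)
    return F.l1_loss(e_en, e_cl) + F.l1_loss(c_en, c_cl)
```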
arXiv Detail & Related papers (2022-07-01T07:04:28Z)
- Towards Intelligibility-Oriented Audio-Visual Speech Enhancement [8.19144665585397]
We present a fully convolutional AV SE model that uses a modified short-time objective intelligibility (STOI) metric as a training cost function.
Our proposed intelligibility-oriented (I-O) AV SE framework outperforms audio-only (AO) and AV models trained with conventional distance-based loss functions.
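A simplified, differentiable stand-in for such an intelligibility-oriented cost is sketched below; the actual modified-STOI objective involves one-third-octave band analysis and envelope clipping, which are omitted here.

```python
import torch

# Hedged sketch: maximize the correlation between short-time band envelopes of
# the enhanced and clean signals, a rough proxy for a STOI-style cost.

def intelligibility_loss(enhanced_env, clean_env, eps=1e-8):
    # *_env: (B, bands, frames) short-time band envelopes
    e = enhanced_env - enhanced_env.mean(dim=-1, keepdim=True)
    c = clean_env - clean_env.mean(dim=-1, keepdim=True)
    corr = (e * c).sum(dim=-1) / (e.norm(dim=-1) * c.norm(dim=-1) + eps)
    return 1.0 - corr.mean()   # lower loss = higher intelligibility proxy
```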
arXiv Detail & Related papers (2021-11-18T11:47:37Z)
- PL-EESR: Perceptual Loss Based End-to-End Robust Speaker Representation Extraction [90.55375210094995]
Speech enhancement aims to improve the perceptual quality of the speech signal by suppressing background noise.
We propose an end-to-end deep learning framework, dubbed PL-EESR, for robust speaker representation extraction.
arXiv Detail & Related papers (2021-10-03T07:05:29Z)
- Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition [62.94773371761236]
We consider building an effective end-to-end ASR system in low-resource setups with a high out-of-vocabulary (OOV) rate.
We propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique.
Our monolingual Turkish Conformer established a competitive result with 22.2% character error rate (CER) and 38.9% word error rate (WER).
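A small sketch of this style of on-the-fly segmentation sampling with SentencePiece, which exposes BPE-dropout through its sampling interface for BPE models; the model file and dropout value are illustrative.

```python
import sentencepiece as spm

# Hedged sketch of dynamic acoustic-unit augmentation: resample the subword
# segmentation of each transcript during training so the ASR model sees many
# alternative unit sequences for the same text.

sp = spm.SentencePieceProcessor(model_file="bpe.model")  # hypothetical model

def segment(transcript, dropout=0.1):
    # enable_sampling with alpha applies merge dropout for BPE models,
    # yielding a different subword sequence across epochs.
    return sp.encode(transcript, out_type=str,
                     enable_sampling=True, alpha=dropout, nbest_size=-1)
```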
arXiv Detail & Related papers (2021-03-12T10:10:13Z)
- Improving Perceptual Quality by Phone-Fortified Perceptual Loss using Wasserstein Distance for Speech Enhancement [23.933935913913043]
We propose a phone-fortified perceptual loss (PFPL) that takes phonetic information into account for training SE models.
To effectively incorporate the phonetic information, the PFPL is computed based on latent representations of the wav2vec model.
Our experimental results first reveal that the PFPL is more correlated with the perceptual evaluation metrics, as compared to signal-level losses.
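A hedged sketch of the PFPL idea, with `wav2vec` standing in for a frozen pretrained encoder that returns (B, T, D) features, and a per-dimension sorted-match distance as a simple empirical proxy for the Wasserstein distance used in the paper.

```python
import torch

# Compare latent representations of clean and enhanced waveforms with a
# Wasserstein-style distance. Sorting frames per feature dimension and
# matching order statistics gives an empirical 1-D Wasserstein-1 proxy.

def pfpl(wav2vec, enhanced_wav, clean_wav):
    z_en = wav2vec(enhanced_wav)           # (B, T, D) latent features
    with torch.no_grad():                  # clean branch needs no gradients
        z_cl = wav2vec(clean_wav)
    en_sorted, _ = torch.sort(z_en, dim=1)
    cl_sorted, _ = torch.sort(z_cl, dim=1)
    return (en_sorted - cl_sorted).abs().mean()
```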
arXiv Detail & Related papers (2020-10-28T18:34:28Z)
- Correlating Subword Articulation with Lip Shapes for Embedding Aware Audio-Visual Speech Enhancement [94.0676772764248]
We propose a visual embedding approach to improving embedding-aware speech enhancement (EASE).
We first extract visual embedding from lip frames using a pre-trained phone or articulation place recognizer for visual-only EASE (VEASE).
Next, we extract audio-visual embedding from noisy speech and lip videos in an information intersection manner, exploiting the complementarity of audio and visual features for multi-modal EASE (MEASE).
arXiv Detail & Related papers (2020-09-21T01:26:19Z)
- Incorporating Broad Phonetic Information for Speech Enhancement [23.12902068334228]
In noisy conditions, knowing the speech content helps listeners suppress background noise components more effectively.
Previous studies have confirmed the benefits of incorporating phonetic information in a speech enhancement system.
This study proposes to incorporate the broad phonetic class (BPC) information into the SE process.
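To make the BPC idea concrete, below is a small illustrative phoneme-to-class mapping; the manner-of-articulation grouping shown is a common example, not necessarily the paper's exact class inventory.

```python
# Group phonemes into broad phonetic classes so that easily confused phonemes
# share one target. The grouping here is illustrative only.

BPC_MANNER = {
    "vowel":     ["aa", "ae", "ah", "eh", "ih", "iy", "uw"],
    "stop":      ["b", "d", "g", "k", "p", "t"],
    "fricative": ["f", "s", "sh", "v", "z", "zh"],
    "nasal":     ["m", "n", "ng"],
    "glide":     ["l", "r", "w", "y"],
}

PHONE_TO_BPC = {ph: bpc for bpc, phones in BPC_MANNER.items()
                for ph in phones}

def to_bpc_sequence(phones):
    """Collapse a phoneme sequence into broad phonetic class targets."""
    return [PHONE_TO_BPC.get(p, "other") for p in phones]

print(to_bpc_sequence(["s", "ih", "t"]))   # ['fricative', 'vowel', 'stop']
```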
arXiv Detail & Related papers (2020-08-13T09:38:08Z)
- Characterizing Speech Adversarial Examples Using Self-Attention U-Net Enhancement [102.48582597586233]
We present a U-Net based attention model, U-Net$_{At}$, to enhance adversarial speech signals.
We conduct experiments on the automatic speech recognition (ASR) task with adversarial audio attacks.
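As a loose sketch of the self-attention ingredient only (not the actual U-Net$_{At}$ architecture), here is a 1-D self-attention block of the kind that could sit in a U-Net bottleneck over time-domain feature maps.

```python
import torch
import torch.nn as nn

# Minimal 1-D self-attention block: each time step attends over all others,
# and the attended features are added residually to the input feature map.

class SelfAttention1d(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.qkv = nn.Conv1d(channels, channels * 3, kernel_size=1)
        self.proj = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x):                  # x: (B, C, T) feature map
        q, k, v = self.qkv(x).chunk(3, dim=1)
        attn = torch.softmax(q.transpose(1, 2) @ k / k.size(1) ** 0.5, dim=-1)
        out = (attn @ v.transpose(1, 2)).transpose(1, 2)
        return x + self.proj(out)
```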
arXiv Detail & Related papers (2020-03-31T02:16:34Z)