Optimization of a Real-Time Wavelet-Based Algorithm for Improving Speech
Intelligibility
- URL: http://arxiv.org/abs/2202.02545v1
- Date: Sat, 5 Feb 2022 13:03:57 GMT
- Title: Optimization of a Real-Time Wavelet-Based Algorithm for Improving Speech
Intelligibility
- Authors: Tianqu Kang, Anh-Dung Dinh, Binghong Wang, Tianyuan Du, Yijia Chen,
and Kevin Chau (Hong Kong University of Science and Technology)
- Abstract summary: The discrete-time speech signal is split into frequency sub-bands via a multi-level discrete wavelet transform.
The sub-band gains are adjusted while keeping the overall signal energy unchanged.
Speech intelligibility is enhanced under various background-interference and simulated hearing-loss conditions.
- Score: 1.0554048699217666
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The optimization of a wavelet-based algorithm to improve speech
intelligibility is reported. The discrete-time speech signal is split into
frequency sub-bands via a multi-level discrete wavelet transform. Various gains
are applied to the sub-band signals before they are recombined to form a
modified version of the speech. The sub-band gains are adjusted while keeping
the overall signal energy unchanged, and the speech intelligibility under
various background interference and simulated hearing loss conditions is
enhanced and evaluated objectively and quantitatively using Google
Speech-to-Text transcription. For English and Chinese noise-free speech,
overall intelligibility is improved, and the transcription accuracy can be
increased by as much as 80 percentage points by reallocating the spectral
energy toward the mid-frequency sub-bands, effectively increasing the
consonant-vowel intensity ratio. This is reasonable since consonants are
relatively weak and short in duration, and are therefore the most likely to
become indistinguishable in the presence of background noise or high-frequency
hearing impairment. For speech already corrupted by noise, improving
intelligibility is challenging but still realizable. The proposed algorithm can
be implemented for real-time signal processing and is simpler than
previous algorithms. Potential applications include speech enhancement, hearing
aids, machine listening, and a better understanding of speech intelligibility.
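The following is a minimal sketch of the sub-band reweighting idea described in the abstract, assuming the PyWavelets package; the mother wavelet, decomposition depth, and gain values are illustrative placeholders, not the paper's tuned settings.

```python
# A minimal sketch of wavelet sub-band reweighting with energy preservation.
# The wavelet ("db4") and the gain values are illustrative assumptions.
import numpy as np
import pywt

def reweight_subbands(x, gains, wavelet="db4"):
    """Split x into wavelet sub-bands, scale each band, and recombine.

    The output is renormalized so the overall signal energy is unchanged,
    i.e. only the relative spectral balance is modified.
    """
    level = len(gains) - 1                           # one gain per sub-band
    coeffs = pywt.wavedec(x, wavelet, level=level)   # [cA_n, cD_n, ..., cD_1]
    scaled = [g * c for g, c in zip(gains, coeffs)]
    y = pywt.waverec(scaled, wavelet)[: len(x)]
    # Keep total energy constant so the gains redistribute rather than amplify.
    y *= np.sqrt(np.sum(x ** 2) / (np.sum(y ** 2) + 1e-12))
    return y

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    speech = rng.standard_normal(16000)   # stand-in for 1 s of speech at 16 kHz
    gains = [0.5, 0.8, 1.5, 1.5, 0.8]     # hypothetical mid-band emphasis
    out = reweight_subbands(speech, gains)
    print(np.sum(speech ** 2), np.sum(out ** 2))  # energies should match
```

Per the abstract, candidate gain settings are then scored objectively by transcribing the modified speech with Google Speech-to-Text and measuring transcription accuracy.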
Related papers
- FlashSpeech: Efficient Zero-Shot Speech Synthesis [37.883762387219676]
FlashSpeech is a large-scale zero-shot speech synthesis system requiring approximately 5% of the inference time of previous work.
We show that FlashSpeech can be about 20 times faster than other zero-shot speech synthesis systems while maintaining comparable performance in terms of voice quality and similarity.
arXiv Detail & Related papers (2024-04-23T02:57:46Z)
- Continuous Modeling of the Denoising Process for Speech Enhancement Based on Deep Learning [61.787485727134424]
We use a state variable to indicate the denoising process.
A UNet-like neural network learns to estimate every state variable sampled from the continuous denoising process.
Experimental results indicate that preserving a small amount of noise in the clean target benefits speech enhancement.
arXiv Detail & Related papers (2023-09-17T13:27:11Z)
- LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders [53.30016986953206]
We propose LA-VocE, a new two-stage approach that predicts mel-spectrograms from noisy audio-visual speech via a transformer-based architecture.
We train and evaluate our framework on thousands of speakers and 11+ different languages, and study our model's ability to adapt to different levels of background noise and speech interference.
arXiv Detail & Related papers (2022-11-20T15:27:55Z)
- Improving Speech Enhancement through Fine-Grained Speech Characteristics [42.49874064240742]
We propose a novel approach to speech enhancement aimed at improving perceptual quality and naturalness of enhanced signals.
We first identify key acoustic parameters that have been found to correlate well with voice quality.
We then propose objective functions aimed at reducing the difference between clean speech and enhanced speech with respect to these features (a sketch of this idea follows this entry).
arXiv Detail & Related papers (2022-07-01T07:04:28Z)
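A rough sketch of this kind of feature-matching objective, not the paper's exact formulation: the feature extractors and weights below are hypothetical stand-ins for the acoustic parameters the authors select.

```python
# A rough sketch of a feature-matching objective: penalize differences in
# selected acoustic parameters between enhanced and clean speech. The
# features and weights here are hypothetical stand-ins.
import numpy as np

def rms_energy(x):
    return np.sqrt(np.mean(x ** 2))

def spectral_centroid(x, sr=16000):
    spec = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), 1.0 / sr)
    return np.sum(freqs * spec) / (np.sum(spec) + 1e-12)

FEATURES = {rms_energy: 1.0, spectral_centroid: 1e-4}  # extractor -> weight

def feature_matching_loss(enhanced, clean):
    loss = np.mean((enhanced - clean) ** 2)       # base waveform loss
    for f, w in FEATURES.items():                 # add per-feature penalties
        loss += w * abs(f(enhanced) - f(clean))
    return loss
```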
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique that repeatedly masks and predicts unit choices (see the sketch after this entry).
TranSpeech shows a significant improvement in inference latency, enabling a speedup of up to 21.4x over the autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
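A generic sketch of mask-predict style iterative decoding, the general technique the summary names rather than TranSpeech's exact recipe; `model` is a hypothetical function returning per-position probabilities over discrete units.

```python
# A generic sketch of mask-predict iterative decoding over discrete units.
# `model` is a hypothetical stand-in returning (length, vocab) probabilities.
import numpy as np

def mask_predict(model, length, mask_id, iterations=4):
    tokens = np.full(length, mask_id)        # start fully masked
    for it in range(1, iterations + 1):
        probs = model(tokens)                # (length, vocab) probabilities
        tokens = probs.argmax(axis=-1)       # predict every position
        conf = probs.max(axis=-1)
        n_remask = int(length * (1 - it / iterations))
        if n_remask > 0:                     # re-mask the least confident
            tokens[np.argsort(conf)[:n_remask]] = mask_id
    return tokens
```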
- Spectro-Temporal Deep Features for Disordered Speech Assessment and Recognition [65.25325641528701]
Motivated by the spectro-temporal differences between disordered and normal speech, which systematically manifest in articulatory imprecision, decreased volume and clarity, slower speaking rates, and increased dysfluencies, novel spectro-temporal subspace basis embedding deep features derived by SVD decomposition of the speech spectrum are proposed (a sketch of the SVD step follows this entry).
Experiments conducted on the UASpeech corpus suggest the proposed spectro-temporal deep feature adapted systems consistently outperformed baseline i-vector adaptation by up to 2.63% absolute (8.6% relative) reduction in word error rate (WER) with or without data augmentation.
arXiv Detail & Related papers (2022-01-14T16:56:43Z)
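A minimal sketch of deriving subspace bases by SVD of a speech spectrogram, illustrating only the decomposition step the summary describes; the spectrogram construction and the rank k are assumptions.

```python
# A minimal sketch of extracting spectro-temporal subspace bases via SVD of
# a magnitude spectrogram. The rank k is an assumed illustrative value.
import numpy as np

def subspace_bases(spectrogram, k=8):
    """Top-k bases of a (freq, time) magnitude spectrogram."""
    U, s, Vt = np.linalg.svd(spectrogram, full_matrices=False)
    spectral_basis = U[:, :k]     # (freq, k): dominant spectral shapes
    temporal_basis = Vt[:k, :]    # (k, time): dominant temporal patterns
    return spectral_basis, temporal_basis, s[:k]
```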
- High Fidelity Speech Regeneration with Application to Speech Enhancement [96.34618212590301]
We propose a wav-to-wav generative model for speech that can generate 24 kHz speech in real time.
Inspired by voice conversion methods, we train to augment the speech characteristics while preserving the identity of the source.
arXiv Detail & Related papers (2021-01-31T10:54:27Z)
- Text-to-speech for the hearing impaired [0.0]
Text-to-speech (TTS) systems can compensate for a hearing loss at the source rather than correcting for it at the receiving end.
We propose an algorithm that restores loudness to normal perception at a high resolution in time, frequency and level.
arXiv Detail & Related papers (2020-12-03T18:52:03Z)
- Gated Recurrent Fusion with Joint Training Framework for Robust End-to-End Speech Recognition [64.9317368575585]
This paper proposes a gated recurrent fusion (GRF) method with a joint training framework for robust end-to-end ASR.
The GRF algorithm is used to dynamically combine the noisy and enhanced features (see the sketch after this entry).
The proposed method achieves a relative character error rate (CER) reduction of 10.04% over the conventional joint enhancement and transformer method.
arXiv Detail & Related papers (2020-11-09T08:52:05Z)
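A toy sketch of gating between noisy and enhanced feature streams, illustrating the combination idea the summary describes; the single sigmoid gate here is a simplification of the learned recurrent GRF cell.

```python
# A toy sketch of gated fusion between noisy and enhanced feature frames.
# The actual GRF cell is recurrent and more elaborate than this gate.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_fusion(noisy, enhanced, W, b):
    """noisy, enhanced: (time, dim); W: (2*dim, dim); b: (dim,)."""
    gate = sigmoid(np.concatenate([noisy, enhanced], axis=-1) @ W + b)
    # Per-dimension soft selection between the two streams.
    return gate * enhanced + (1.0 - gate) * noisy
```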
- Detection of Glottal Closure Instants from Speech Signals: a Quantitative Review [9.351195374919365]
Five state-of-the-art GCI detection algorithms are compared using six different databases.
The efficacy of these methods is first evaluated on clean speech, in terms of both reliability and accuracy.
It is shown that for clean speech, SEDREAMS and YAGA are the best performing techniques, both in terms of identification rate and accuracy.
arXiv Detail & Related papers (2019-12-28T14:12:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.