MaskCycleGAN-based Whisper to Normal Speech Conversion
- URL: http://arxiv.org/abs/2408.14797v1
- Date: Tue, 27 Aug 2024 06:07:18 GMT
- Title: MaskCycleGAN-based Whisper to Normal Speech Conversion
- Authors: K. Rohith Gupta, K. Ramnath, S. Johanan Joysingh, P. Vijayalakshmi, T. Nagarajan,
- Abstract summary: We present a MaskCycleGAN approach for the conversion of whispered speech to normal speech.
We find that tuning the mask parameters, and pre-processing the signal with a voice activity detector provides superior performance.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Whisper to normal speech conversion is an active area of research. Various architectures based on generative adversarial networks have been proposed in the recent past. Especially, recent study shows that MaskCycleGAN, which is a mask guided, and cyclic consistency keeping, generative adversarial network, performs really well for voice conversion from spectrogram representations. In the current work we present a MaskCycleGAN approach for the conversion of whispered speech to normal speech. We find that tuning the mask parameters, and pre-processing the signal with a voice activity detector provides superior performance when compared to the existing approach. The wTIMIT dataset is used for evaluation. Objective metrics such as PESQ and G-Loss are used to evaluate the converted speech, along with subjective evaluation using mean opinion score. The results show that the proposed approach offers considerable benefits.
Related papers
- Bridge the Points: Graph-based Few-shot Segment Anything Semantically [79.1519244940518]
Recent advancements in pre-training techniques have enhanced the capabilities of vision foundation models.
Recent studies extend the SAM to Few-shot Semantic segmentation (FSS)
We propose a simple yet effective approach based on graph analysis.
arXiv Detail & Related papers (2024-10-09T15:02:28Z) - MaskSR: Masked Language Model for Full-band Speech Restoration [7.015213589171985]
Speech restoration aims at restoring high quality speech in the presence of a diverse set of distortions.
We propose MaskSR, a masked language model capable of restoring full-band 44.1 kHz speech jointly considering noise, reverb, clipping, and low bandwidth.
arXiv Detail & Related papers (2024-06-04T08:23:57Z) - Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation
and Recognition [52.11964238935099]
An audio-visual multi-channel speech separation, dereverberation and recognition approach is proposed in this paper.
Video input is consistently demonstrated in mask-based MVDR speech separation, DNN-WPE or spectral mapping (SpecM) based speech dereverberation front-end.
Experiments were conducted on the mixture overlapped and reverberant speech data constructed using simulation or replay of the Oxford LRS2 dataset.
arXiv Detail & Related papers (2023-07-06T10:50:46Z) - Prompt Tuning of Deep Neural Networks for Speaker-adaptive Visual Speech Recognition [66.94463981654216]
We propose prompt tuning methods of Deep Neural Networks (DNNs) for speaker-adaptive Visual Speech Recognition (VSR)
We finetune prompts on adaptation data of target speakers instead of modifying the pre-trained model parameters.
The effectiveness of the proposed method is evaluated on both word- and sentence-level VSR databases.
arXiv Detail & Related papers (2023-02-16T06:01:31Z) - Speech segmentation using multilevel hybrid filters [0.0]
A novel approach for speech segmentation is proposed, based on Multilevel Hybrid (mean/min) Filters (MHF)
The proposed method is based on spectral changes, with the goal of segmenting the voice into homogeneous acoustic segments.
This algorithm is being used for phoneticallysegmented speech coder, with successful results.
arXiv Detail & Related papers (2022-02-24T00:03:02Z) - Discretization and Re-synthesis: an alternative method to solve the
Cocktail Party Problem [65.25725367771075]
This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem.
Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols.
By utilizing the synthesis model with the input of discrete symbols, after the prediction of discrete symbol sequence, each target speech could be re-synthesized.
arXiv Detail & Related papers (2021-12-17T08:35:40Z) - Gated Recurrent Fusion with Joint Training Framework for Robust
End-to-End Speech Recognition [64.9317368575585]
This paper proposes a gated recurrent fusion (GRF) method with joint training framework for robust end-to-end ASR.
The GRF algorithm is used to dynamically combine the noisy and enhanced features.
The proposed method achieves the relative character error rate (CER) reduction of 10.04% over the conventional joint enhancement and transformer method.
arXiv Detail & Related papers (2020-11-09T08:52:05Z) - Are you wearing a mask? Improving mask detection from speech using
augmentation by cycle-consistent GANs [24.182791316595576]
We propose a novel data augmentation approach for mask detection from speech.
Our approach is based on (i) training Geneversarative Adrial Networks (GANs) with cycle-consistency loss to translate unpaired utterances.
We show that our data augmentation approach yields better results than other baseline and state-of-the-art augmentation methods.
arXiv Detail & Related papers (2020-06-17T20:46:50Z) - End-to-End Whisper to Natural Speech Conversion using Modified
Transformer Network [0.8399688944263843]
We introduce whisper-to-natural-speech conversion using sequence-to-sequence approach.
We investigate different features like Mel frequency cepstral coefficients and smoothed spectral features.
The proposed networks are trained end-to-end using supervised approach for feature-to-feature transformation.
arXiv Detail & Related papers (2020-04-20T14:47:46Z) - End-to-End Automatic Speech Recognition Integrated With CTC-Based Voice
Activity Detection [48.80449801938696]
This paper integrates a voice activity detection function with end-to-end automatic speech recognition.
We focus on connectionist temporal classification ( CTC) and its extension ofsynchronous/attention.
We use the labels as a cue for detecting speech segments with simple thresholding.
arXiv Detail & Related papers (2020-02-03T03:36:34Z) - Detection of Glottal Closure Instants from Speech Signals: a
Quantitative Review [9.351195374919365]
Five state-of-the-art GCI detection algorithms are compared using six different databases.
The efficacy of these methods is first evaluated on clean speech, both in terms of reliabililty and accuracy.
It is shown that for clean speech, SEDREAMS and YAGA are the best performing techniques, both in terms of identification rate and accuracy.
arXiv Detail & Related papers (2019-12-28T14:12:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.