End-to-End Whisper to Natural Speech Conversion using Modified
Transformer Network
- URL: http://arxiv.org/abs/2004.09347v3
- Date: Mon, 5 Apr 2021 09:27:12 GMT
- Title: End-to-End Whisper to Natural Speech Conversion using Modified
Transformer Network
- Authors: Abhishek Niranjan, Mukesh Sharma, Sai Bharath Chandra Gutha, M Ali
Basha Shaik
- Abstract summary: We introduce whisper-to-natural-speech conversion using a sequence-to-sequence approach.
We investigate different features such as Mel-frequency cepstral coefficients and smoothed spectral features.
The proposed networks are trained end-to-end using a supervised approach for feature-to-feature transformation.
- Score: 0.8399688944263843
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Machine recognition of atypical speech, such as whispered speech, is a
challenging task. We introduce whisper-to-natural-speech conversion using a
sequence-to-sequence approach by proposing an enhanced transformer
architecture that uses both parallel and non-parallel data. We investigate
different features, such as Mel-frequency cepstral coefficients and smoothed
spectral features. The proposed networks are trained end-to-end using a
supervised approach for feature-to-feature transformation. Further, we
investigate the effectiveness of an embedded auxiliary decoder placed after N
encoder sub-layers and trained with a frame-level objective function for
identifying source phoneme labels. We report results on the open-source
wTIMIT and CHAINS datasets, measuring word error rate with an end-to-end ASR
system as well as BLEU scores for the generated speech. We also propose a
novel method to measure the spectral shape of the generated speech by
comparing its formant distributions with those of the reference speech, as a
formant divergence metric. We find that the formant probability distribution
of the whisper-to-natural converted speech is similar to the ground-truth
distribution. To the authors' best knowledge, this is the first time an
enhanced transformer has been proposed, both with and without an auxiliary
decoder, for whisper-to-natural-speech conversion and vice versa.
Related papers
- Style Description based Text-to-Speech with Conditional Prosodic Layer Normalization based Diffusion GAN [17.876323494898536]
We present a Diffusion GAN based approach (Prosodic Diff-TTS) that takes a style description and content text as input and generates the corresponding high-fidelity speech within only 4 denoising steps.
We demonstrate the efficacy of the proposed architecture on the multi-speaker LibriTTS and PromptSpeech datasets, using multiple quantitative metrics that measure generation accuracy and MOS.
arXiv Detail & Related papers (2023-10-27T14:28:41Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Knowledge Distillation from BERT Transformer to Speech Transformer for Intent Classification [66.62686601948455]
We exploit the transformer distillation method, which is specifically designed for knowledge distillation from a transformer-based language model to a transformer-based speech model.
We achieve intent classification accuracies of 99.10% and 88.79% on the Fluent Speech corpus and the ATIS database, respectively.
arXiv Detail & Related papers (2021-08-05T13:08:13Z)
- A Deep-Bayesian Framework for Adaptive Speech Duration Modification [20.99099283004413]
We use a Bayesian framework to define a latent attention map that links frames of the input and target utterances.
We train a masked convolutional encoder-decoder network to produce this attention map via a version of the mean absolute error loss function.
We show that our technique results in a high quality of generated speech that is on par with state-of-the-art vocoders.
arXiv Detail & Related papers (2021-07-11T05:53:07Z)
- Label-Synchronous Speech-to-Text Alignment for ASR Using Forward and Backward Transformers [49.403414751667135]
This paper proposes a novel label-synchronous speech-to-text alignment technique for automatic speech recognition (ASR).
The proposed method re-defines the speech-to-text alignment as a label-synchronous text mapping problem.
Experiments using the corpus of spontaneous Japanese (CSJ) demonstrate that the proposed method provides an accurate utterance-wise alignment.
arXiv Detail & Related papers (2021-04-21T03:05:12Z)
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
- Multi-speaker Emotion Conversion via Latent Variable Regularization and a Chained Encoder-Decoder-Predictor Network [18.275646344620387]
We propose a novel method for emotion conversion in speech based on a chained encoder-decoder-predictor neural network architecture.
We show that our method outperforms existing state-of-the-art approaches in both the saliency of emotion conversion and the quality of the resynthesized speech.
arXiv Detail & Related papers (2020-07-25T13:59:22Z)
- Learning to Count Words in Fluent Speech enables Online Speech Recognition [10.74796391075403]
We introduce Taris, a Transformer-based online speech recognition system aided by an auxiliary task of incremental word counting.
Experiments performed on the LRS2, LibriSpeech, and Aishell-1 datasets show that the online system performs comparably to the offline one with a dynamic algorithmic delay of 5 segments.
arXiv Detail & Related papers (2020-06-08T20:49:39Z)
- Relative Positional Encoding for Speech Recognition and Direct Translation [72.64499573561922]
We adapt the relative position encoding scheme to the Speech Transformer.
As a result, the network can better adapt to the variable distributions present in speech data.
arXiv Detail & Related papers (2020-05-20T09:53:06Z)
- Many-to-Many Voice Transformer Network [55.17770019619078]
This paper proposes a voice conversion (VC) method based on a sequence-to-sequence (S2S) learning framework.
It enables simultaneous conversion of the voice characteristics, pitch contour, and duration of input speech.
arXiv Detail & Related papers (2020-05-18T04:02:08Z)
- Vocoder-free End-to-End Voice Conversion with Transformer Network [5.5792083698526405]
Mel-frequency filter bank (MFB) based approaches are easier to learn than raw-spectrum approaches, since MFB features have a much smaller dimensionality.
However, it is possible to use only the raw spectrum, together with the phase, to generate different styles of voices with clear pronunciation.
In this paper, we introduce a vocoder-free end-to-end voice conversion method using transformer network.
arXiv Detail & Related papers (2020-02-05T06:19:24Z)