A Deep-Bayesian Framework for Adaptive Speech Duration Modification
- URL: http://arxiv.org/abs/2107.04973v1
- Date: Sun, 11 Jul 2021 05:53:07 GMT
- Title: A Deep-Bayesian Framework for Adaptive Speech Duration Modification
- Authors: Ravi Shankar and Archana Venkataraman
- Abstract summary: We use a Bayesian framework to define a latent attention map that links frames of the input and target utterances.
We train a masked convolutional encoder-decoder network to produce this attention map via a stochastic version of the mean absolute error loss function.
We show that our technique produces high-quality generated speech that is on par with state-of-the-art vocoders.
- Score: 20.99099283004413
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose the first method to adaptively modify the duration of a given
speech signal. Our approach uses a Bayesian framework to define a latent
attention map that links frames of the input and target utterances. We train a
masked convolutional encoder-decoder network to produce this attention map via
a stochastic version of the mean absolute error loss function; our model also
predicts the length of the target speech signal using the encoder embeddings.
The predicted length determines the number of steps for the decoder operation.
During inference, we generate the attention map as a proxy for the similarity
matrix between the given input speech and an unknown target speech signal.
Using this similarity matrix, we compute a warping path of alignment between
the two signals. Our experiments demonstrate that this adaptive framework
produces similar results to dynamic time warping, which relies on a known
target signal, on both voice conversion and emotion conversion tasks. We also
show that our technique produces high-quality generated speech that is on par
with state-of-the-art vocoders.
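To make the inference step concrete, the following minimal sketch (Python with NumPy; not the authors' implementation) shows how an attention map treated as a similarity matrix between input and target frames could yield a monotonic warping path via DTW-style dynamic programming, which is then used to resample the input frames to the predicted target length. The helper names warping_path and modify_duration, and the random stand-in inputs, are hypothetical.

# Illustrative sketch only: align input frames to a predicted target length
# using an attention map as a similarity matrix.
import numpy as np

def warping_path(attention_map):
    """Monotonic alignment that maximizes accumulated similarity."""
    T_in, T_out = attention_map.shape
    acc = np.full((T_in, T_out), -np.inf)
    acc[0, 0] = attention_map[0, 0]
    for i in range(T_in):
        for j in range(T_out):
            if i == 0 and j == 0:
                continue
            prev = max(
                acc[i - 1, j] if i > 0 else -np.inf,                # input advances, target stays
                acc[i, j - 1] if j > 0 else -np.inf,                # target advances, input stays
                acc[i - 1, j - 1] if i > 0 and j > 0 else -np.inf,  # both advance
            )
            acc[i, j] = attention_map[i, j] + prev
    # Backtrack from the last cell to (0, 0).
    path, (i, j) = [], (T_in - 1, T_out - 1)
    while True:
        path.append((i, j))
        if (i, j) == (0, 0):
            break
        candidates = []
        if i > 0:
            candidates.append((acc[i - 1, j], (i - 1, j)))
        if j > 0:
            candidates.append((acc[i, j - 1], (i, j - 1)))
        if i > 0 and j > 0:
            candidates.append((acc[i - 1, j - 1], (i - 1, j - 1)))
        _, (i, j) = max(candidates)
    return path[::-1]

def modify_duration(features, attention_map):
    """Map each of the T_out target frames to an aligned input frame."""
    T_in, T_out = attention_map.shape
    aligned = np.zeros(T_out, dtype=int)
    for i, j in warping_path(attention_map):
        aligned[j] = i            # keep the last input frame aligned to target frame j
    return features[aligned]      # shape (T_out, feature_dim)

# Usage with stand-in data: stretch 120 frames of features to 150 frames.
feats = np.random.randn(120, 80)           # e.g. mel-spectrogram frames
attn = np.random.rand(120, 150)            # stand-in for the predicted attention map
stretched = modify_duration(feats, attn)   # -> (150, 80)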
Related papers
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Voice Activity Detection for Transient Noisy Environment Based on Diffusion Nets [13.558688470594674]
We address voice activity detection in acoustic environments of transients and stationary noises.
We exploit unique spatial patterns of speech and non-speech audio frames by independently learning their underlying geometric structure.
A deep neural network is trained to separate speech from non-speech frames.
arXiv Detail & Related papers (2021-06-25T17:05:26Z)
- End-to-End Video-To-Speech Synthesis using Generative Adversarial Networks [54.43697805589634]
We propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs).
Our model consists of an encoder-decoder architecture that receives raw video as input and generates speech.
We show that this model is able to reconstruct speech with remarkable realism for constrained datasets such as GRID.
arXiv Detail & Related papers (2021-04-27T17:12:30Z)
- Multi-Discriminator Sobolev Defense-GAN Against Adversarial Attacks for End-to-End Speech Systems [78.5097679815944]
This paper introduces a defense approach against end-to-end adversarial attacks developed for cutting-edge speech-to-text systems.
First, we represent speech signals with 2D spectrograms using the short-time Fourier transform.
Second, we iteratively find a safe vector using a spectrogram subspace projection operation.
Third, we synthesize a spectrogram from this safe vector using a novel GAN architecture trained with the Sobolev integral probability metric.
arXiv Detail & Related papers (2021-03-15T01:11:13Z)
- Fast End-to-End Speech Recognition via a Non-Autoregressive Model and Cross-Modal Knowledge Transferring from BERT [72.93855288283059]
We propose a non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once).
The model consists of an encoder, a decoder, and a position-dependent summarizer (PDS).
arXiv Detail & Related papers (2021-02-15T15:18:59Z)
- Multi-speaker Emotion Conversion via Latent Variable Regularization and a Chained Encoder-Decoder-Predictor Network [18.275646344620387]
We propose a novel method for emotion conversion in speech based on a chained encoder-decoder-predictor neural network architecture.
We show that our method outperforms existing state-of-the-art approaches in both the saliency of emotion conversion and the quality of the resynthesized speech.
arXiv Detail & Related papers (2020-07-25T13:59:22Z)
- Attention and Encoder-Decoder based models for transforming articulatory movements at different speaking rates [60.02121449986413]
We propose an encoder-decoder architecture using LSTMs which generates smoother predicted articulatory trajectories.
We analyze the amplitude of the transformed articulatory movements at different rates compared to their original counterparts.
We observe that AstNet could model both duration and extent of articulatory movements better than the existing transformation techniques.
arXiv Detail & Related papers (2020-06-04T19:33:26Z)
- Speech-to-Singing Conversion based on Boundary Equilibrium GAN [42.739822506085694]
This paper investigates the use of generative adversarial network (GAN)-based models for converting the spectrogram of a speech signal into that of a singing one.
The proposed model generates singing voices with much higher naturalness than an existing non-adversarially-trained baseline.
arXiv Detail & Related papers (2020-05-28T08:18:02Z)
- Relative Positional Encoding for Speech Recognition and Direct Translation [72.64499573561922]
We adapt the relative position encoding scheme to the Speech Transformer.
As a result, the network can better adapt to the variable distributions present in speech data.
arXiv Detail & Related papers (2020-05-20T09:53:06Z)
- End-to-End Whisper to Natural Speech Conversion using Modified Transformer Network [0.8399688944263843]
We introduce whisper-to-natural-speech conversion using a sequence-to-sequence approach.
We investigate different features like Mel frequency cepstral coefficients and smoothed spectral features.
The proposed networks are trained end-to-end using a supervised approach for feature-to-feature transformation.
arXiv Detail & Related papers (2020-04-20T14:47:46Z)
- Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss [14.755108017449295]
We present an end-to-end speech recognition model with Transformer encoders that can be used in a streaming speech recognition system.
Transformer computation blocks based on self-attention are used to encode both audio and label sequences independently.
We present results on the LibriSpeech dataset showing that limiting the left context for self-attention makes decoding computationally tractable for streaming.
arXiv Detail & Related papers (2020-02-07T00:04:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.