GANtron: Emotional Speech Synthesis with Generative Adversarial Networks
- URL: http://arxiv.org/abs/2110.03390v1
- Date: Wed, 6 Oct 2021 10:44:30 GMT
- Title: GANtron: Emotional Speech Synthesis with Generative Adversarial Networks
- Authors: Enrique Hortal and Rodrigo Brechard Alarcia
- Abstract summary: We propose a text-to-speech model where the inferred speech can be tuned with the desired emotions.
We use Generative Adversarial Networks (GANs) together with a sequence-to-sequence model using an attention mechanism.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Speech synthesis is used in a wide variety of industries. Nonetheless, the
synthesized speech often sounds flat or robotic. The state-of-the-art methods that allow
for prosody control are cumbersome to use and do not allow easy tuning. To tackle some of
these drawbacks, in this work we implement a text-to-speech model whose inferred speech
can be tuned with the desired emotions. To do so, we use Generative Adversarial Networks
(GANs) together with a sequence-to-sequence model using an attention mechanism. We
evaluate four configurations with different inputs and training strategies, study them,
and show that our best model can generate speech files that lie in the same distribution
as the initial training dataset. Additionally, we propose a new strategy to boost
training convergence by applying a guided attention loss.
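The abstract does not give the exact form of the guided attention loss used in GANtron, but the widely used diagonal-prior formulation (Tachibana et al., 2018) illustrates how such a term pushes the encoder-decoder alignment toward a near-monotonic diagonal and thereby speeds up convergence. Below is a minimal PyTorch sketch under that assumption; the function name, the (batch, text, mel) attention layout, and the per-utterance length tensors are illustrative, not taken from the paper.

```python
import torch

def guided_attention_loss(attn, text_lengths, mel_lengths, g=0.2):
    """Penalise attention mass that falls far from the text/mel diagonal.

    attn: (batch, max_text_len, max_mel_len) alignment weights from the
    sequence-to-sequence decoder's attention mechanism.
    """
    batch = attn.size(0)
    loss = attn.new_zeros(())
    for b in range(batch):
        N, T = int(text_lengths[b]), int(mel_lengths[b])
        n = torch.arange(N, device=attn.device, dtype=attn.dtype) / max(N, 1)
        t = torch.arange(T, device=attn.device, dtype=attn.dtype) / max(T, 1)
        # Soft mask: roughly 0 on the diagonal, approaching 1 far away from it.
        w = 1.0 - torch.exp(-((n[:, None] - t[None, :]) ** 2) / (2.0 * g ** 2))
        loss = loss + (attn[b, :N, :T] * w).mean()
    return loss / batch
```

In a setup like this, the term would typically be added to the generator's objective with a small weight, e.g. `total_loss = reconstruction_loss + adversarial_loss + 0.1 * guided_attention_loss(attn, text_lengths, mel_lengths)`.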
Related papers
- Multi-modal Adversarial Training for Zero-Shot Voice Cloning [9.823246184635103]
We propose a Transformer encoder-decoder architecture to conditionally discriminate between real and generated speech features.
We introduce our novel adversarial training technique by applying it to a FastSpeech2 acoustic model and training on Libriheavy, a large multi-speaker dataset.
Our model achieves improvements over the baseline in terms of speech quality and speaker similarity.
arXiv Detail & Related papers (2024-08-28T16:30:41Z) - Re-ENACT: Reinforcement Learning for Emotional Speech Generation using Actor-Critic Strategy [8.527959937101826]
We train a neural network to produce the variational posterior of a collection of Bernoulli random variables.
We modify the prosodic features of a masked segment to increase the score of the target emotion.
Our experiments demonstrate that this framework changes the perceived emotion of a given speech utterance to the target.
arXiv Detail & Related papers (2024-08-04T00:47:29Z) - Controlling Whisper: Universal Acoustic Adversarial Attacks to Control Speech Foundation Models [3.1511847280063696]
Speech-enabled foundation models can perform tasks other than automatic speech recognition when given an appropriate prompt.
With the development of audio-prompted large language models there is the potential for even greater control options.
We demonstrate that with this greater flexibility the systems can be susceptible to model-control adversarial attacks.
arXiv Detail & Related papers (2024-07-05T13:04:31Z) - Exploring Speech Recognition, Translation, and Understanding with
Discrete Speech Units: A Comparative Study [68.88536866933038]
Speech signals, typically sampled at rates in the tens of thousands per second, contain redundancies.
Recent investigations proposed the use of discrete speech units derived from self-supervised learning representations.
Applying various methods, such as de-duplication and subword modeling, can further compress the speech sequence length (a brief de-duplication sketch is given after this list).
arXiv Detail & Related papers (2023-09-27T17:21:13Z) - Disentanglement in a GAN for Unconditional Speech Synthesis [28.998590651956153]
We propose AudioStyleGAN -- a generative adversarial network for unconditional speech synthesis tailored to learn a disentangled latent space.
ASGAN maps sampled noise to a disentangled latent vector which is then mapped to a sequence of audio features so that signal aliasing is suppressed at every layer.
We apply it on the small-vocabulary Google Speech Commands digits dataset, where it achieves state-of-the-art results in unconditional speech synthesis.
arXiv Detail & Related papers (2023-07-04T12:06:07Z) - Co-Speech Gesture Synthesis using Discrete Gesture Token Learning [1.1694169299062596]
Synthesizing realistic co-speech gestures is an important and yet unsolved problem for creating believable motions.
One challenge in learning the co-speech gesture model is that there may be multiple viable gesture motions for the same speech utterance.
We propose a two-stage model to address this uncertainty issue in gesture synthesis by modeling the gesture segments as discrete latent codes.
arXiv Detail & Related papers (2023-03-04T01:42:09Z) - A Vector Quantized Approach for Text to Speech Synthesis on Real-World
Spontaneous Speech [94.64927912924087]
We train TTS systems using real-world speech from YouTube and podcasts.
The recent Text-to-Speech architecture is designed for multiple code generation and monotonic alignment.
We show that this architecture outperforms existing TTS systems in several objective and subjective measures.
arXiv Detail & Related papers (2023-02-08T17:34:32Z) - An Exploration of Prompt Tuning on Generative Spoken Language Model for
Speech Processing Tasks [112.1942546460814]
We report the first exploration of the prompt tuning paradigm for speech processing tasks based on the Generative Spoken Language Model (GSLM).
Experiment results show that the prompt tuning technique achieves competitive performance in speech classification tasks with fewer trainable parameters than fine-tuning specialized downstream models.
arXiv Detail & Related papers (2022-03-31T03:26:55Z) - Direct speech-to-speech translation with discrete units [64.19830539866072]
We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation.
We propose to predict the self-supervised discrete representations learned from an unlabeled speech corpus instead.
When target text transcripts are available, we design a multitask learning framework with joint speech and text training that enables the model to generate dual mode output (speech and text) simultaneously in the same inference pass.
arXiv Detail & Related papers (2021-07-12T17:40:43Z) - End-to-End Video-To-Speech Synthesis using Generative Adversarial
Networks [54.43697805589634]
We propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs).
Our model consists of an encoder-decoder architecture that receives raw video as input and generates speech.
We show that this model is able to reconstruct speech with remarkable realism for constrained datasets such as GRID.
arXiv Detail & Related papers (2021-04-27T17:12:30Z) - Incremental Text to Speech for Neural Sequence-to-Sequence Models using
Reinforcement Learning [60.20205278845412]
Modern approaches to text to speech require the entire input character sequence to be processed before any audio is synthesised.
This latency limits the suitability of such models for time-sensitive tasks like simultaneous interpretation.
We propose a reinforcement-learning-based framework to train an agent that decides when enough of the input has been processed to start synthesising audio.
arXiv Detail & Related papers (2020-08-07T11:48:05Z)
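As referenced in the discrete speech units entry above, run-length de-duplication is the simplest of the compression methods mentioned there: consecutive repeats of the same unit ID are collapsed into one, and subword modeling (e.g. BPE over the unit sequence) can then merge frequent unit n-grams. A minimal illustrative sketch in Python; the unit values are made up.

```python
from itertools import groupby

def deduplicate(units):
    """Collapse runs of identical discrete speech units,
    e.g. [5, 5, 5, 9, 9, 2] -> [5, 9, 2]."""
    return [unit for unit, _ in groupby(units)]

# Hypothetical unit IDs produced by a self-supervised quantiser.
units = [17, 17, 17, 4, 4, 92, 92, 92, 4]
print(deduplicate(units))  # -> [17, 4, 92, 4]
```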