Re-ENACT: Reinforcement Learning for Emotional Speech Generation using Actor-Critic Strategy
- URL: http://arxiv.org/abs/2408.01892v1
- Date: Sun, 4 Aug 2024 00:47:29 GMT
- Title: Re-ENACT: Reinforcement Learning for Emotional Speech Generation using Actor-Critic Strategy
- Authors: Ravi Shankar, Archana Venkataraman
- Abstract summary: We train a neural network to produce the variational posterior of a collection of Bernoulli random variables.
We modify the prosodic features of a masked segment to increase the score of the target emotion.
Our experiments demonstrate that this framework changes the perceived emotion of a given speech utterance to the target emotion.
- Score: 8.527959937101826
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we propose the first method to modify the prosodic features of a given speech signal using an actor-critic reinforcement learning strategy. Our approach uses a Bayesian framework to identify contiguous segments of importance that link segments of the given utterance to the human perception of emotion. We train a neural network to produce the variational posterior of a collection of Bernoulli random variables; our model applies a Markov prior on them to ensure continuity. A sample from this distribution is used for downstream emotion prediction. Further, we train the neural network to predict a soft assignment over emotion categories as the target variable. In the next step, we modify the prosodic features (pitch, intensity, and rhythm) of the masked segment to increase the score of the target emotion. We employ actor-critic reinforcement learning to train the prosody modifier by discretizing the space of modifications. Discretization also provides a simple solution to the problem of computing gradients through the WSOLA operation used for rhythm manipulation, since the policy never needs to backpropagate through it. Our experiments demonstrate that this framework changes the perceived emotion of a given speech utterance to the target. Further, we show that our unified technique is on par with state-of-the-art emotion conversion models from supervised and unsupervised domains that require pairwise training.
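The two stages of the method lend themselves to short sketches. First, the masking stage: below is a minimal PyTorch sketch of a variational posterior over per-frame Bernoulli mask variables, with a continuity penalty standing in for the Markov prior. All module names, dimensions, and the relaxed-Bernoulli sampling trick are illustrative assumptions, not the authors' released code.

```python
# Hypothetical sketch: variational posterior over per-frame Bernoulli masks.
# Names and dimensions are assumptions, not the paper's implementation.
import torch
import torch.nn as nn
from torch.distributions import RelaxedBernoulli

class SegmentMasker(nn.Module):
    def __init__(self, feat_dim=80, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, feats):                        # feats: (B, T, feat_dim)
        h, _ = self.rnn(feats)
        q = torch.sigmoid(self.head(h)).squeeze(-1)  # q(m_t = 1), shape (B, T)
        # Differentiable Bernoulli sample (concrete relaxation),
        # hard-thresholded with a straight-through estimator.
        soft = RelaxedBernoulli(temperature=torch.tensor(0.5), probs=q).rsample()
        mask = (soft > 0.5).float() + soft - soft.detach()
        return mask, q

def continuity_penalty(q, eps=1e-6):
    """Encourages consecutive frames to share the same mask state, mimicking
    a first-order Markov prior that favors contiguous segments."""
    agree = q[:, 1:] * q[:, :-1] + (1 - q[:, 1:]) * (1 - q[:, :-1])
    return -torch.log(agree.clamp_min(eps)).mean()
```

Second, the modification stage: because the space of prosody modifications is discretized, WSOLA can be treated as a black-box step inside the reward computation, so no gradient ever flows through it. The action grids, reward definition, and network shapes below are likewise assumptions for illustration, not the paper's exact configuration.

```python
# Hypothetical sketch: discretized prosody actions trained with a one-step
# actor-critic update. The emotion classifier's score on the target emotion
# serves as the reward; WSOLA is only called inside the non-differentiable
# reward function, so no gradient flows through it.
import itertools
import torch
import torch.nn as nn

PITCH_FACTORS  = [0.8, 0.9, 1.0, 1.1, 1.2]   # multiplicative F0 scaling
ENERGY_FACTORS = [0.8, 1.0, 1.25]            # intensity scaling
TEMPO_FACTORS  = [0.75, 1.0, 1.33]           # WSOLA time-stretch ratios
ACTIONS = list(itertools.product(PITCH_FACTORS, ENERGY_FACTORS, TEMPO_FACTORS))

class ActorCritic(nn.Module):
    def __init__(self, state_dim=256, n_actions=len(ACTIONS)):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU())
        self.policy = nn.Linear(128, n_actions)   # actor head
        self.value = nn.Linear(128, 1)            # critic head

    def forward(self, state):
        h = self.backbone(state)
        return torch.distributions.Categorical(logits=self.policy(h)), self.value(h)

def update(model, optimizer, state, reward_fn):
    """One-step actor-critic update for a single utterance.
    state: (state_dim,) features of the masked segment.
    reward_fn: applies the chosen pitch/energy/tempo modification (WSOLA
    included) and returns the classifier's target-emotion score as a float."""
    dist, value = model(state)
    action = dist.sample()
    reward = reward_fn(ACTIONS[action.item()])     # black-box environment step
    advantage = reward - value.squeeze(-1)
    loss = -dist.log_prob(action) * advantage.detach() + advantage.pow(2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```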
Related papers
- Boosting Continuous Emotion Recognition with Self-Pretraining using Masked Autoencoders, Temporal Convolutional Networks, and Transformers [3.951847822557829]
We tackle the Valence-Arousal (VA) Estimation Challenge, Expression (Expr) Classification Challenge, and Action Unit (AU) Detection Challenge.
Our study advocates a novel approach aimed at refining continuous emotion recognition.
We achieve this by pre-training with Masked Autoencoders (MAE) on facial datasets, followed by fine-tuning on the Aff-Wild2 dataset annotated with expression (Expr) labels.
arXiv Detail & Related papers (2024-03-18T03:28:01Z)
- Attention-based Interactive Disentangling Network for Instance-level Emotional Voice Conversion [81.1492897350032]
Emotional voice conversion aims to manipulate a speech signal according to a given emotion while preserving the non-emotion components.
We propose an Attention-based Interactive diseNtangling Network (AINN) that leverages instance-wise emotional knowledge for voice conversion.
arXiv Detail & Related papers (2023-12-29T08:06:45Z)
- Co-Speech Gesture Detection through Multi-Phase Sequence Labeling [3.924524252255593]
We introduce a novel framework that reframes the task as a multi-phase sequence labeling problem.
We evaluate our proposal on a large dataset of diverse co-speech gestures in task-oriented face-to-face dialogues.
arXiv Detail & Related papers (2023-08-21T12:27:18Z)
- Multimodal Emotion Recognition using Transfer Learning from Speaker Recognition and BERT-based models [53.31917090073727]
We propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities.
We evaluate the effectiveness of our proposed multimodal approach on the interactive emotional dyadic motion capture (IEMOCAP) dataset.
arXiv Detail & Related papers (2022-02-16T00:23:42Z)
- Textless Speech Emotion Conversion using Decomposed and Discrete Representations [49.55101900501656]
We decompose speech into discrete and disentangled learned representations, consisting of content units, F0, speaker, and emotion.
First, we modify the speech content by translating the content units to a target emotion, and then predict the prosodic features based on these units.
Finally, the speech waveform is generated by feeding the predicted representations into a neural vocoder.
arXiv Detail & Related papers (2021-11-14T18:16:42Z)
- GANtron: Emotional Speech Synthesis with Generative Adversarial Networks [0.0]
We propose a text-to-speech model where the synthesized speech can be tuned with the desired emotions.
We use Generative Adversarial Networks (GANs) together with a sequence-to-sequence model using an attention mechanism.
arXiv Detail & Related papers (2021-10-06T10:44:30Z)
- Improved Speech Emotion Recognition using Transfer Learning and Spectrogram Augmentation [56.264157127549446]
Speech emotion recognition (SER) is a challenging task that plays a crucial role in natural human-computer interaction.
One of the main challenges in SER is data scarcity.
We propose a transfer learning strategy combined with spectrogram augmentation; a minimal sketch of this style of augmentation appears after this list.
arXiv Detail & Related papers (2021-08-05T10:39:39Z)
- An Attribute-Aligned Strategy for Learning Speech Representation [57.891727280493015]
We propose an attribute-aligned learning strategy to derive a speech representation that can flexibly address these issues through an attribute-selection mechanism.
Specifically, we propose a layered-representation variational autoencoder (LR-VAE), which factorizes speech representation into attribute-sensitive nodes.
Our proposed method achieves competitive performance on identity-free speech emotion recognition (SER) and better performance on emotionless speaker verification (SV).
arXiv Detail & Related papers (2021-06-05T06:19:14Z)
- Emotion Transfer Using Vector-Valued Infinite Task Learning [2.588412672658578]
We present a novel style transfer framework building upon infinite task learning and vector-valued reproducing kernel Hilbert spaces.
We instantiate the idea in emotion transfer where the goal is to transform facial images to different target emotions.
arXiv Detail & Related papers (2021-02-09T19:05:56Z)
- Seen and Unseen emotional style transfer for voice conversion with a new emotional speech dataset [84.53659233967225]
Emotional voice conversion aims to transform emotional prosody in speech while preserving the linguistic content and speaker identity.
We propose a novel framework based on a variational auto-encoding Wasserstein generative adversarial network (VAW-GAN).
We show that the proposed framework achieves remarkable performance by consistently outperforming the baseline framework.
arXiv Detail & Related papers (2020-10-28T07:16:18Z)
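As referenced in the transfer-learning entry above, here is a minimal sketch of SpecAugment-style spectrogram augmentation (random time and frequency masking), the kind such SER pipelines commonly use. The mask widths and counts are illustrative assumptions, not the cited paper's exact settings.

```python
# Hypothetical sketch of SpecAugment-style augmentation for SER training.
# Mask widths and counts are illustrative, not the cited paper's settings.
import torch

def augment_spectrogram(spec, freq_width=8, time_width=20, n_masks=2):
    """spec: (n_mels, n_frames) log-mel spectrogram; returns a masked copy."""
    spec = spec.clone()
    n_mels, n_frames = spec.shape
    for _ in range(n_masks):
        f = torch.randint(0, freq_width + 1, (1,)).item()
        f0 = torch.randint(0, max(1, n_mels - f), (1,)).item()
        spec[f0:f0 + f, :] = 0.0                   # frequency mask
        t = torch.randint(0, time_width + 1, (1,)).item()
        t0 = torch.randint(0, max(1, n_frames - t), (1,)).item()
        spec[:, t0:t0 + t] = 0.0                   # time mask
    return spec
```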