Zero-Shot Style Transfer for Gesture Animation driven by Text and Speech
using Adversarial Disentanglement of Multimodal Style Encoding
- URL: http://arxiv.org/abs/2208.01917v1
- Date: Wed, 3 Aug 2022 08:49:55 GMT
- Title: Zero-Shot Style Transfer for Gesture Animation driven by Text and Speech
using Adversarial Disentanglement of Multimodal Style Encoding
- Authors: Mireille Fares, Michele Grimaldi, Catherine Pelachaud, Nicolas Obin
- Abstract summary: We propose an efficient yet effective machine learning approach to synthesize gestures driven by prosodic features and text in the style of different speakers.
Our model performs zero-shot multimodal style transfer using multimodal data from the PATS database, which contains videos of various speakers.
- Score: 3.2116198597240846
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modeling virtual agents with behavior style is one factor in personalizing
human-agent interaction. We propose an efficient yet effective machine learning
approach to synthesize gestures driven by prosodic features and text in the
style of different speakers, including those unseen during training. Our model
performs zero-shot multimodal style transfer driven by multimodal data from the
PATS database, which contains videos of various speakers. We view style as
pervasive while speaking: it colors the expressivity of communicative
behaviors, while speech content is carried by multimodal signals and text. This
disentanglement of content and style allows us to directly infer the style
embedding even of speakers whose data are not part of the training phase,
without requiring any further training or fine-tuning. The first goal of our
model is to generate the gestures of a source speaker based on the content of
two modalities, audio and text. The second goal is to condition the source
speaker's predicted gestures on the multimodal behavior style embedding of a
target speaker. The third goal is to allow zero-shot style transfer of speakers
unseen during training without retraining the model. Our system consists of:
(1) a speaker style encoder network that learns to generate a fixed-dimensional
speaker style embedding from a target speaker's multimodal data, and (2) a
sequence-to-sequence synthesis network that synthesizes gestures based on the
content of the input modalities of a source speaker, conditioned on the speaker
style embedding. We show that our model can synthesize the gestures of a source
speaker and transfer the knowledge of the target speaker's style variability
to the gesture generation task in a zero-shot setup. We convert the 2D gestures
to 3D poses and produce 3D animations. We conduct objective and subjective
evaluations to validate our approach and compare it with a baseline.
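
To make the two-network design described in the abstract concrete, here is a minimal PyTorch sketch of a speaker style encoder, a sequence-to-sequence gesture generator conditioned on the style embedding, and a gradient-reversal speaker classifier standing in for the adversarial disentanglement. All class names, layer choices, and dimensions are illustrative assumptions, not the authors' implementation.

```python
# Minimal, hypothetical sketch of the system described above; names, layers,
# and dimensions are assumptions, not the authors' code.
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, reversed gradient in the backward pass;
    used here to push speaker identity out of the content representation."""
    @staticmethod
    def forward(ctx, x):
        return x

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output


class StyleEncoder(nn.Module):
    """Maps a target speaker's multimodal sequence to a fixed-size style embedding."""
    def __init__(self, in_dim=128, style_dim=64):
        super().__init__()
        self.rnn = nn.GRU(in_dim, style_dim, batch_first=True)

    def forward(self, multimodal_seq):             # (B, T, in_dim)
        _, h = self.rnn(multimodal_seq)
        return h[-1]                               # (B, style_dim)


class GestureGenerator(nn.Module):
    """Sequence-to-sequence synthesis: content encoder + style-conditioned decoder,
    with an adversarial speaker classifier on the content representation."""
    def __init__(self, content_dim=128, style_dim=64, hidden=256,
                 pose_dim=50, n_speakers=16):
        super().__init__()
        self.content_enc = nn.GRU(content_dim, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden + style_dim, hidden, batch_first=True)
        self.to_pose = nn.Linear(hidden, pose_dim)
        self.spk_clf = nn.Linear(hidden, n_speakers)   # adversary head

    def forward(self, content_seq, style_emb):     # (B, T, content_dim), (B, style_dim)
        content, _ = self.content_enc(content_seq)
        # Speaker logits trained with cross-entropy; gradient reversal encourages
        # the content encoding to become speaker/style-agnostic.
        spk_logits = self.spk_clf(GradReverse.apply(content.mean(dim=1)))
        style = style_emb.unsqueeze(1).expand(-1, content.size(1), -1)
        out, _ = self.decoder(torch.cat([content, style], dim=-1))
        return self.to_pose(out), spk_logits       # 2D pose sequence, adversary logits
```

In this reading, zero-shot transfer amounts to a single forward pass of an unseen target speaker's clip through the style encoder; the resulting embedding conditions the generator without any retraining or fine-tuning.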
Related papers
- ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis [50.69464138626748]
We present ConvoFusion, a diffusion-based approach for multi-modal gesture synthesis.
Our method proposes two guidance objectives that allow the users to modulate the impact of different conditioning modalities.
Our method is versatile in that it can be trained to generate either monologue gestures or conversational gestures.
arXiv Detail & Related papers (2024-03-26T17:59:52Z) - Freetalker: Controllable Speech and Text-Driven Gesture Generation Based
on Diffusion Models for Enhanced Speaker Naturalness [45.90256126021112]
We introduce FreeTalker, the first framework for generating both spontaneous (e.g., co-speech gestures) and non-spontaneous (e.g., moving around the podium) speaker motions.
Specifically, we train a diffusion-based model for speaker motion generation that employs unified representations of both speech-driven gestures and text-driven motions.
arXiv Detail & Related papers (2024-01-07T13:01:29Z) - Speech-to-Speech Translation with Discrete-Unit-Based Style Transfer [53.72998363956454]
Direct speech-to-speech translation (S2ST) with discrete self-supervised representations has achieved remarkable accuracy.
The scarcity of high-quality speaker-parallel data poses a challenge for learning style transfer during translation.
We design an S2ST pipeline with style-transfer capability on the basis of discrete self-supervised speech representations and timbre units.
arXiv Detail & Related papers (2023-09-14T09:52:08Z) - Audio is all in one: speech-driven gesture synthetics using WavLM pre-trained model [2.827070255699381]
diffmotion-v2 is a speech-conditioned, diffusion-based generative model built on the pre-trained WavLM model.
It can produce individual and stylized full-body co-speech gestures from raw speech audio alone.
arXiv Detail & Related papers (2023-08-11T08:03:28Z) - ZS-MSTM: Zero-Shot Style Transfer for Gesture Animation driven by Text
and Speech using Adversarial Disentanglement of Multimodal Style Encoding [3.609538870261841]
We propose a machine learning approach to synthesize gestures, driven by prosodic features and text, in the style of different speakers.
Our model incorporates zero-shot multimodal style transfer using multimodal data from the PATS database.
arXiv Detail & Related papers (2023-05-22T10:10:35Z) - GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain
Text-to-Speech Synthesis [68.42632589736881]
This paper proposes GenerSpeech, a text-to-speech model towards high-fidelity zero-shot style transfer of OOD custom voice.
GenerSpeech decomposes the speech variation into the style-agnostic and style-specific parts by introducing two components.
Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses the state-of-the-art models in terms of audio quality and style similarity.
arXiv Detail & Related papers (2022-05-15T08:16:02Z) - Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion [89.01668641930206]
We present a framework for modeling interactional communication in dyadic conversations.
We autoregressively output multiple possibilities of corresponding listener motion.
Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions.
arXiv Detail & Related papers (2022-04-18T17:58:04Z) - Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation [63.561944239071615]
StyleSpeech is a new TTS model which synthesizes high-quality speech and adapts to new speakers.
With SALN, our model effectively synthesizes speech in the style of the target speaker even from a single speech audio sample.
We extend it to Meta-StyleSpeech by introducing two discriminators trained with style prototypes, and performing episodic training.
arXiv Detail & Related papers (2021-06-06T15:34:11Z) - Speech Gesture Generation from the Trimodal Context of Text, Audio, and
Speaker Identity [21.61168067832304]
We present an automatic gesture generation model that uses the multimodal context of speech text, audio, and speaker identity to reliably generate gestures.
Experiments with the introduced metric and subjective human evaluation showed that the proposed gesture generation model is better than existing end-to-end generation models.
arXiv Detail & Related papers (2020-09-04T11:42:45Z) - Style Transfer for Co-Speech Gesture Animation: A Multi-Speaker
Conditional-Mixture Approach [46.50460811211031]
A key challenge is to learn a model that generates gestures for a speaking agent 'A' in the gesturing style of a target speaker 'B'.
We propose Mix-StAGE, which trains a single model for multiple speakers while learning unique style embeddings for each speaker's gestures.
As Mix-StAGE disentangles style and content of gestures, gesturing styles for the same input speech can be altered by simply switching the style embeddings.
arXiv Detail & Related papers (2020-07-24T15:01:02Z)
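
Mix-StAGE's summary highlights the same property the paper above relies on: once style and content are disentangled, the gesturing style of a fixed utterance can be changed by swapping the style embedding. Reusing the illustrative StyleEncoder and GestureGenerator sketched after the abstract (still purely hypothetical, with untrained weights), that swap looks roughly like this:

```python
# Same speech/text content, two different styles: only the style embedding changes.
import torch

gen = GestureGenerator()
content = torch.randn(1, 120, 128)                  # fixed source content features
style_a = StyleEncoder()(torch.randn(1, 200, 128))  # style inferred from speaker A's clip
style_b = StyleEncoder()(torch.randn(1, 200, 128))  # style inferred from speaker B's clip
poses_a, _ = gen(content, style_a)
poses_b, _ = gen(content, style_b)                  # same content, speaker B's style
```

With random weights this only exercises the tensor shapes; the point is that nothing else in the pipeline needs to change in order to alter the gesturing style.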