ZS-MSTM: Zero-Shot Style Transfer for Gesture Animation driven by Text
and Speech using Adversarial Disentanglement of Multimodal Style Encoding
- URL: http://arxiv.org/abs/2305.12887v1
- Date: Mon, 22 May 2023 10:10:35 GMT
- Title: ZS-MSTM: Zero-Shot Style Transfer for Gesture Animation driven by Text
and Speech using Adversarial Disentanglement of Multimodal Style Encoding
- Authors: Mireille Fares, Catherine Pelachaud, Nicolas Obin
- Abstract summary: We propose a machine learning approach to synthesize gestures, driven by prosodic features and text, in the style of different speakers.
Our model incorporates zero-shot multimodal style transfer using multimodal data from the PATS database.
- Score: 3.609538870261841
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this study, we address the importance of modeling behavior style in
virtual agents for personalized human-agent interaction. We propose a machine
learning approach to synthesize gestures, driven by prosodic features and text,
in the style of different speakers, even those unseen during training. Our
model incorporates zero-shot multimodal style transfer using multimodal data
from the PATS database, which contains videos of diverse speakers. We recognize
style as a pervasive element during speech, influencing the expressivity of
communicative behaviors, while content is conveyed through multimodal signals
and text. By disentangling content and style, we directly infer the style
embedding, even for speakers not included in the training phase, without the
need for additional training or fine-tuning. Objective and subjective
evaluations are conducted to validate our approach and compare it against two
baseline methods.
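As a rough illustration of the adversarial content/style disentanglement described in the abstract, the sketch below pairs a content encoder (prosody and text) with a style encoder and a speaker discriminator placed behind a gradient-reversal layer. This is a minimal sketch of the general idea only: the module names, dimensions, speaker count, and the use of gradient reversal are illustrative assumptions, not the authors' released code.

```python
# Hypothetical sketch of adversarial content/style disentanglement.
# All names, shapes, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips the gradient sign in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None


class ContentEncoder(nn.Module):
    """Encodes prosodic + text features into a content sequence."""
    def __init__(self, in_dim=128, hid=256):
        super().__init__()
        self.rnn = nn.GRU(in_dim, hid, batch_first=True, bidirectional=True)

    def forward(self, x):               # x: (B, T, in_dim)
        h, _ = self.rnn(x)
        return h                        # (B, T, 2 * hid)


class StyleEncoder(nn.Module):
    """Maps a reference speaker's multimodal clip to one style embedding,
    which can be computed for speakers unseen during training (zero-shot)."""
    def __init__(self, in_dim=128, style_dim=64):
        super().__init__()
        self.rnn = nn.GRU(in_dim, style_dim, batch_first=True)

    def forward(self, ref):             # ref: (B, T, in_dim)
        _, h = self.rnn(ref)
        return h[-1]                    # (B, style_dim)


class SpeakerDiscriminator(nn.Module):
    """Adversarial head: predicts the speaker from content features.
    Gradient reversal pushes the content encoder to discard style cues."""
    def __init__(self, content_dim=512, n_speakers=25, lambd=1.0):
        super().__init__()
        self.lambd = lambd
        self.cls = nn.Sequential(nn.Linear(content_dim, 128), nn.ReLU(),
                                 nn.Linear(128, n_speakers))

    def forward(self, content):         # content: (B, T, content_dim)
        pooled = content.mean(dim=1)
        reversed_feat = GradReverse.apply(pooled, self.lambd)
        return self.cls(reversed_feat)  # speaker logits
```

In such a setup, a gesture decoder would condition on the content sequence and the style embedding; a cross-entropy loss on the discriminator logits, reversed through the gradient-reversal layer, would encourage speaker-agnostic content features, while a new speaker's style embedding is obtained from a reference clip without any fine-tuning.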
Related papers
- ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis [50.69464138626748]
We present ConvoFusion, a diffusion-based approach for multi-modal gesture synthesis.
Our method proposes two guidance objectives that allow the users to modulate the impact of different conditioning modalities.
Our method is versatile in that it can be trained to generate either monologue gestures or conversational gestures.
arXiv Detail & Related papers (2024-03-26T17:59:52Z)
- Audio is all in one: speech-driven gesture synthetics using WavLM pre-trained model [2.827070255699381]
diffmotion-v2 is a speech-conditional diffusion-based generative model built on the WavLM pre-trained model.
It can produce individual and stylized full-body co-speech gestures using only raw speech audio.
arXiv Detail & Related papers (2023-08-11T08:03:28Z)
- TranSTYLer: Multimodal Behavioral Style Transfer for Facial and Body Gestures Generation [2.7317088388886384]
This paper addresses the challenge of transferring the behavior expressivity style from one virtual agent to another.
We propose a multimodal transformer-based model that synthesizes the multimodal behaviors of a source speaker with the style of a target speaker.
arXiv Detail & Related papers (2023-08-08T15:42:35Z)
- Conversation Style Transfer using Few-Shot Learning [56.43383396058639]
In this paper, we introduce conversation style transfer as a few-shot learning problem.
We propose a novel in-context learning approach to solve the task with style-free dialogues as a pivot.
We show that conversation style transfer can also benefit downstream tasks.
arXiv Detail & Related papers (2023-02-16T15:27:00Z)
- Channel-aware Decoupling Network for Multi-turn Dialogue Comprehension [81.47133615169203]
We propose compositional learning for holistic interaction across utterances beyond the sequential contextualization from PrLMs.
We employ domain-adaptive training strategies to help the model adapt to the dialogue domains.
Experimental results show that our method substantially boosts the strong PrLM baselines on four public benchmark datasets.
arXiv Detail & Related papers (2023-01-10T13:18:25Z)
- VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning [119.49605266839053]
We propose a unified cross-modal representation learning framework, VATLM (Visual-Audio-Text Language Model).
The proposed VATLM employs a unified backbone network to model the modality-independent information.
In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of unified tokens.
arXiv Detail & Related papers (2022-11-21T09:10:10Z)
- Zero-Shot Style Transfer for Gesture Animation driven by Text and Speech using Adversarial Disentanglement of Multimodal Style Encoding [3.2116198597240846]
We propose an efficient yet effective machine learning approach to synthesize gestures driven by prosodic features and text in the style of different speakers.
Our model performs zero-shot multimodal style transfer driven by multimodal data from the PATS database, which contains videos of various speakers.
arXiv Detail & Related papers (2022-08-03T08:49:55Z)
- Text-driven Emotional Style Control and Cross-speaker Style Transfer in Neural TTS [7.384726530165295]
Style control of synthetic speech is often restricted to discrete emotion categories.
We propose a text-based interface for emotional style control and cross-speaker style transfer in multi-speaker TTS.
arXiv Detail & Related papers (2022-07-13T07:05:44Z)
- Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion [89.01668641930206]
We present a framework for modeling interactional communication in dyadic conversations.
We autoregressively output multiple possibilities of corresponding listener motion.
Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions.
arXiv Detail & Related papers (2022-04-18T17:58:04Z)
- Towards Multi-Scale Style Control for Expressive Speech Synthesis [60.08928435252417]
The proposed method employs a multi-scale reference encoder to extract both the global-scale utterance-level and the local-scale quasi-phoneme-level style features of the target speech.
The multi-scale style model can be jointly trained with the speech synthesis model in an end-to-end fashion.
arXiv Detail & Related papers (2021-04-08T05:50:09Z)
- Style Transfer for Co-Speech Gesture Animation: A Multi-Speaker Conditional-Mixture Approach [46.50460811211031]
The key challenge is to learn a model that generates gestures for a speaking agent 'A' in the gesturing style of a target speaker 'B'.
We propose Mix-StAGE, which trains a single model for multiple speakers while learning unique style embeddings for each speaker's gestures.
As Mix-StAGE disentangles style and content of gestures, gesturing styles for the same input speech can be altered by simply switching the style embeddings.
arXiv Detail & Related papers (2020-07-24T15:01:02Z)