Style Transfer for Co-Speech Gesture Animation: A Multi-Speaker
Conditional-Mixture Approach
- URL: http://arxiv.org/abs/2007.12553v1
- Date: Fri, 24 Jul 2020 15:01:02 GMT
- Title: Style Transfer for Co-Speech Gesture Animation: A Multi-Speaker
Conditional-Mixture Approach
- Authors: Chaitanya Ahuja, Dong Won Lee, Yukiko I. Nakano, Louis-Philippe
Morency
- Abstract summary: A key challenge is to learn a model that generates gestures for a speaking agent 'A' in the gesturing style of a target speaker 'B'.
We propose Mix-StAGE, which trains a single model for multiple speakers while learning unique style embeddings for each speaker's gestures.
As Mix-StAGE disentangles style and content of gestures, gesturing styles for the same input speech can be altered by simply switching the style embeddings.
- Score: 46.50460811211031
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: How can we teach robots or virtual assistants to gesture naturally? Can we go
further and adapt the gesturing style to follow a specific speaker? Gestures
that are naturally timed with corresponding speech during human communication
are called co-speech gestures. A key challenge, called gesture style transfer,
is to learn a model that generates these gestures for a speaking agent 'A' in
the gesturing style of a target speaker 'B'. A secondary goal is to
simultaneously learn to generate co-speech gestures for multiple speakers while
remembering what is unique about each speaker. We call this challenge style
preservation. In this paper, we propose a new model, named Mix-StAGE, which
trains a single model for multiple speakers while learning unique style
embeddings for each speaker's gestures in an end-to-end manner. A novelty of
Mix-StAGE is to learn a mixture of generative models which allows for
conditioning on the unique gesture style of each speaker. As Mix-StAGE
disentangles style and content of gestures, gesturing styles for the same input
speech can be altered by simply switching the style embeddings. Mix-StAGE also
allows for style preservation when learning simultaneously from multiple
speakers. We also introduce a new dataset, Pose-Audio-Transcript-Style (PATS),
designed to study gesture generation and style transfer. Our proposed Mix-StAGE
model significantly outperforms the previous state-of-the-art approach for
gesture generation and provides a path towards performing gesture style
transfer across multiple speakers. Link to code, data, and videos:
http://chahuja.com/mix-stage
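To make the conditional-mixture idea concrete, below is a minimal PyTorch-style sketch of a generator that learns one style embedding per speaker and uses it to weight a small bank of generators over a shared, style-agnostic content code. This is not the authors' Mix-StAGE implementation; the module choices, layer sizes, and dimensions are illustrative assumptions. Swapping the speaker id at inference time is the style-transfer operation described in the abstract.

```python
import torch
import torch.nn as nn

class ConditionalMixtureSketch(nn.Module):
    """Illustrative sketch of a conditional-mixture gesture generator.

    NOT the authors' Mix-StAGE code: the audio encoder, layer sizes, and
    pose dimensionality are placeholder assumptions for exposition only.
    """
    def __init__(self, num_speakers, num_generators=8,
                 audio_dim=128, style_dim=16, pose_dim=104):
        super().__init__()
        # One learnable style embedding per speaker.
        self.style_emb = nn.Embedding(num_speakers, style_dim)
        # Shared content encoder: audio features -> style-agnostic content code.
        self.content_enc = nn.GRU(audio_dim, 256, batch_first=True)
        # A bank ("mixture") of pose generators over the content code.
        self.generators = nn.ModuleList(
            [nn.Linear(256, pose_dim) for _ in range(num_generators)]
        )
        # Style embedding -> mixture weights over the generator bank.
        self.mixture_weights = nn.Linear(style_dim, num_generators)

    def forward(self, audio_feats, speaker_id):
        # audio_feats: (batch, time, audio_dim); speaker_id: (batch,)
        content, _ = self.content_enc(audio_feats)            # (B, T, 256)
        style = self.style_emb(speaker_id)                    # (B, style_dim)
        w = torch.softmax(self.mixture_weights(style), dim=-1)  # (B, G)
        # Each generator proposes a pose sequence; combine with style weights.
        poses = torch.stack([g(content) for g in self.generators], dim=1)  # (B, G, T, pose_dim)
        return (w[:, :, None, None] * poses).sum(dim=1)       # (B, T, pose_dim)

# Style transfer: same input speech, different speaker id -> different gesturing style.
model = ConditionalMixtureSketch(num_speakers=25)
audio = torch.randn(1, 64, 128)
gestures_style_a = model(audio, torch.tensor([3]))  # rendered in speaker 3's style
gestures_style_b = model(audio, torch.tensor([7]))  # same speech, speaker 7's style
```

Because the mixture weights depend only on the style embedding while the content code depends only on the speech, changing the embedding re-weights the generators without altering the speech-driven content, which is the disentanglement the paper relies on for style transfer and style preservation.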
Related papers
- ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis [50.69464138626748]
We present ConvoFusion, a diffusion-based approach for multi-modal gesture synthesis.
Our method proposes two guidance objectives that allow the users to modulate the impact of different conditioning modalities.
Our method is versatile in that it can be trained to generate either monologue gestures or conversational gestures.
arXiv Detail & Related papers (2024-03-26T17:59:52Z) - Freetalker: Controllable Speech and Text-Driven Gesture Generation Based
on Diffusion Models for Enhanced Speaker Naturalness [45.90256126021112]
We introduce FreeTalker, the first framework for generating both spontaneous (e.g., co-speech gestures) and non-spontaneous (e.g., moving around the podium) speaker motions.
Specifically, we train a diffusion-based model for speaker motion generation that employs unified representations of both speech-driven gestures and text-driven motions.
arXiv Detail & Related papers (2024-01-07T13:01:29Z) - EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling [57.08286593059137]
We propose EMAGE, a framework to generate full-body human gestures from audio and masked gestures.
We first introduce BEAT2 (BEAT-SMPLX-FLAME), a new mesh-level holistic co-speech dataset.
Experiments demonstrate that EMAGE generates holistic gestures with state-of-the-art performance.
arXiv Detail & Related papers (2023-12-31T02:25:41Z) - Audio is all in one: speech-driven gesture synthetics using WavLM pre-trained model [2.827070255699381]
diffmotion-v2 is a speech-conditional, diffusion-based generative model built on the pre-trained WavLM model.
It can produce individual and stylized full-body co-speech gestures using only raw speech audio.
arXiv Detail & Related papers (2023-08-11T08:03:28Z) - ZS-MSTM: Zero-Shot Style Transfer for Gesture Animation driven by Text
and Speech using Adversarial Disentanglement of Multimodal Style Encoding [3.609538870261841]
We propose a machine learning approach to synthesize gestures, driven by prosodic features and text, in the style of different speakers.
Our model incorporates zero-shot multimodal style transfer using multimodal data from the PATS database.
arXiv Detail & Related papers (2023-05-22T10:10:35Z) - Zero-Shot Style Transfer for Gesture Animation driven by Text and Speech
using Adversarial Disentanglement of Multimodal Style Encoding [3.2116198597240846]
We propose an efficient yet effective machine learning approach to synthesize gestures driven by prosodic features and text in the style of different speakers.
Our model performs zero-shot multimodal style transfer driven by multimodal data from the PATS database, which contains videos of various speakers.
arXiv Detail & Related papers (2022-08-03T08:49:55Z) - GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain
Text-to-Speech Synthesis [68.42632589736881]
This paper proposes GenerSpeech, a text-to-speech model towards high-fidelity zero-shot style transfer of OOD custom voice.
GenerSpeech decomposes the speech variation into the style-agnostic and style-specific parts by introducing two components.
Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses the state-of-the-art models in terms of audio quality and style similarity.
arXiv Detail & Related papers (2022-05-15T08:16:02Z) - Learning Hierarchical Cross-Modal Association for Co-Speech Gesture
Generation [107.10239561664496]
We propose a novel framework named Hierarchical Audio-to-Gesture (HA2G) for co-speech gesture generation.
The proposed method renders realistic co-speech gestures and outperforms previous methods by a clear margin.
arXiv Detail & Related papers (2022-03-24T16:33:29Z) - Speech Gesture Generation from the Trimodal Context of Text, Audio, and
Speaker Identity [21.61168067832304]
We present an automatic gesture generation model that uses the multimodal context of speech text, audio, and speaker identity to reliably generate gestures.
Experiments with the introduced metric and subjective human evaluation showed that the proposed gesture generation model is better than existing end-to-end generation models.
arXiv Detail & Related papers (2020-09-04T11:42:45Z)