BodyFormer: Semantics-guided 3D Body Gesture Synthesis with Transformer
- URL: http://arxiv.org/abs/2310.06851v1
- Date: Thu, 7 Sep 2023 01:11:11 GMT
- Title: BodyFormer: Semantics-guided 3D Body Gesture Synthesis with Transformer
- Authors: Kunkun Pang, Dafei Qin, Yingruo Fan, Julian Habekost, Takaaki
Shiratori, Junichi Yamagishi, Taku Komura
- Abstract summary: We propose a novel framework for automatic 3D body gesture synthesis from speech.
Our system is trained with either the Trinity speech-gesture dataset or the Talking With Hands 16.2M dataset.
The results show that our system can produce more realistic, appropriate, and diverse body gestures compared to existing state-of-the-art approaches.
- Score: 42.87095473590205
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic gesture synthesis from speech is a topic that has attracted
researchers for applications in remote communication, video games and
Metaverse. Learning the mapping between speech and 3D full-body gestures is
difficult due to the stochastic nature of the problem and the lack of a rich
cross-modal dataset that is needed for training. In this paper, we propose a
novel transformer-based framework for automatic 3D body gesture synthesis from
speech. To learn the stochastic nature of the body gesture during speech, we
propose a variational transformer to effectively model a probabilistic
distribution over gestures, which can produce diverse gestures during
inference. Furthermore, we introduce a mode positional embedding layer to
capture the different motion speeds in different speaking modes. To cope with
the scarcity of data, we design an intra-modal pre-training scheme that can
learn the complex mapping between the speech and the 3D gesture from a limited
amount of data. Our system is trained with either the Trinity speech-gesture
dataset or the Talking With Hands 16.2M dataset. The results show that our
system can produce more realistic, appropriate, and diverse body gestures
compared to existing state-of-the-art approaches.
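To make the two architectural ideas in the abstract concrete, the sketch below shows one plausible way to combine a mode positional embedding with a variational (Gaussian latent) transformer encoder. It is a minimal illustration under assumed names and sizes (d_model, num_modes, latent_dim, per-frame speaking_mode labels), not the authors' released implementation, and the gesture decoder is omitted.

```python
# Illustrative sketch only (not BodyFormer's code): a mode positional embedding
# adds a learned per-speaking-mode offset on top of standard sinusoidal frame
# positions, and a Gaussian latent head gives the "variational" behaviour.
import math
import torch
import torch.nn as nn

class ModePositionalEmbedding(nn.Module):
    def __init__(self, d_model=256, num_modes=2, max_len=512):
        super().__init__()
        pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                        * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)                    # standard frame positions
        self.mode_emb = nn.Embedding(num_modes, d_model)  # learned per-mode offset

    def forward(self, x, speaking_mode):
        # x: (batch, frames, d_model); speaking_mode: (batch, frames) int labels
        return x + self.pe[: x.size(1)] + self.mode_emb(speaking_mode)

class VariationalGestureEncoder(nn.Module):
    def __init__(self, d_model=256, latent_dim=64):
        super().__init__()
        self.embed = ModePositionalEmbedding(d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.to_mu = nn.Linear(d_model, latent_dim)
        self.to_logvar = nn.Linear(d_model, latent_dim)

    def forward(self, speech_feats, speaking_mode):
        h = self.encoder(self.embed(speech_feats, speaking_mode))
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization: sampling z at inference yields diverse gestures.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return z  # a gesture decoder (omitted) would map z to joint rotations

# Toy usage with random stand-in speech features: 4 clips, 120 frames each.
feats = torch.randn(4, 120, 256)
modes = torch.randint(0, 2, (4, 120))
z = VariationalGestureEncoder()(feats, modes)
```

Sampling the latent z at inference time is what produces different gestures for the same speech input, while the per-mode embedding lets positional information shift with the motion speed of each speaking mode.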
Related papers
- CoCoGesture: Toward Coherent Co-speech 3D Gesture Generation in the Wild [44.401536230814465]
CoCoGesture is a novel framework enabling vivid and diverse gesture synthesis from unseen human speech prompts.
Our key insight is built upon a custom-designed pretrain-finetune training paradigm.
Our proposed CoCoGesture outperforms state-of-the-art methods on zero-shot speech-to-gesture generation.
arXiv Detail & Related papers (2024-05-27T06:47:14Z)
- ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis [50.69464138626748]
We present ConvoFusion, a diffusion-based approach for multi-modal gesture synthesis.
Our method proposes two guidance objectives that allow the users to modulate the impact of different conditioning modalities.
Our method is versatile in that it can be trained to generate either monologue gestures or conversational gestures.
arXiv Detail & Related papers (2024-03-26T17:59:52Z)
- Probabilistic Speech-Driven 3D Facial Motion Synthesis: New Benchmarks, Methods, and Applications [20.842799581850617]
We consider the task of animating 3D facial geometry from speech signal.
Existing works are primarily deterministic, focusing on learning a one-to-one mapping from speech signal to 3D face meshes on small datasets with limited speakers.
arXiv Detail & Related papers (2023-11-30T01:14:43Z)
- Co-Speech Gesture Synthesis using Discrete Gesture Token Learning [1.1694169299062596]
Synthesizing realistic co-speech gestures is an important and yet unsolved problem for creating believable motions.
One challenge in learning the co-speech gesture model is that there may be multiple viable gesture motions for the same speech utterance.
We propose a two-stage model that addresses this uncertainty in gesture synthesis by modeling gesture segments as discrete latent codes; a generic sketch of this idea appears after this list.
arXiv Detail & Related papers (2023-03-04T01:42:09Z)
- Generating Holistic 3D Human Motion from Speech [97.11392166257791]
We build a high-quality dataset of 3D holistic body meshes with synchronous speech.
We then define a novel speech-to-motion generation framework in which the face, body, and hands are modeled separately.
arXiv Detail & Related papers (2022-12-08T17:25:19Z)
- Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion [89.01668641930206]
We present a framework for modeling interactional communication in dyadic conversations.
We autoregressively output multiple possibilities of corresponding listener motion.
Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions.
arXiv Detail & Related papers (2022-04-18T17:58:04Z)
- Learning Speech-driven 3D Conversational Gestures from Video [106.15628979352738]
We propose the first approach to automatically and jointly synthesize both the synchronous 3D conversational body and hand gestures.
Our algorithm uses a CNN architecture that leverages the inherent correlation between facial expression and hand gestures.
We also contribute a new way to create a large corpus of more than 33 hours of annotated body, hand, and face data from in-the-wild videos of talking people.
arXiv Detail & Related papers (2021-02-13T01:05:39Z)
- Body2Hands: Learning to Infer 3D Hands from Conversational Gesture Body Dynamics [87.17505994436308]
We build upon the insight that body motion and hand gestures are strongly correlated in non-verbal communication settings.
We formulate the learning of this prior as a prediction task of 3D hand shape over time given body motion input alone.
Our hand prediction model produces convincing 3D hand gestures given only the 3D motion of the speaker's arms as input.
arXiv Detail & Related papers (2020-07-23T22:58:15Z)
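As a side note to the Co-Speech Gesture Synthesis using Discrete Gesture Token Learning entry above, one common way to realize "gesture segments as discrete latent codes" is a learned vector-quantization codebook. The sketch below is a generic illustration under that assumption (invented codebook size and tensor shapes), not that paper's actual architecture.

```python
# Generic VQ-style codebook sketch: snap continuous gesture-segment encodings
# to their nearest codebook entries, yielding discrete gesture tokens.
import torch
import torch.nn as nn

class GestureCodebook(nn.Module):
    def __init__(self, num_codes=512, code_dim=128):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)

    def forward(self, z):
        # z: (batch, segments, code_dim) continuous encodings of gesture segments.
        # Squared distance to every codebook entry, then pick the nearest one.
        dists = ((z.unsqueeze(-2) - self.codebook.weight) ** 2).sum(-1)
        tokens = dists.argmin(dim=-1)       # discrete gesture tokens (stage one)
        quantized = self.codebook(tokens)   # embeddings a stage-two speech model predicts
        return tokens, quantized

# Toy usage: 2 clips, 30 gesture segments each.
tokens, zq = GestureCodebook()(torch.randn(2, 30, 128))
```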