MPE4G: Multimodal Pretrained Encoder for Co-Speech Gesture Generation
- URL: http://arxiv.org/abs/2305.15740v1
- Date: Thu, 25 May 2023 05:42:58 GMT
- Title: MPE4G: Multimodal Pretrained Encoder for Co-Speech Gesture Generation
- Authors: Gwantae Kim, Seonghyeok Noh, Insung Ham and Hanseok Ko
- Abstract summary: We propose a novel framework with a multimodal pre-trained encoder for co-speech gesture generation.
The proposed method renders realistic co-speech gestures not only when all input modalities are given but also when some input modalities are missing or noisy.
- Score: 18.349024345195318
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: When virtual agents interact with humans, gestures are crucial for conveying their intentions alongside speech. Previous multimodal co-speech gesture generation models required encoded features of all modalities to generate gestures; if some input modalities are removed or contain noise, such models may fail to generate gestures properly. To acquire robust and generalized encodings, we propose a novel framework with a multimodal pre-trained encoder for co-speech gesture generation. In the proposed method, a multi-head-attention-based encoder is trained with self-supervised learning to capture the information of each modality. Moreover, we collect full-body gestures consisting of 3D joint rotations to improve visualization and to apply the gestures to an extensible body model. Through a series of experiments and human evaluation, the proposed method renders realistic co-speech gestures not only when all input modalities are given but also when some input modalities are missing or noisy.
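The abstract describes a multi-head-attention-based encoder pretrained with self-supervised learning so that the encodings stay usable when a modality is missing or noisy. As a rough illustration only, the following PyTorch sketch shows one common way such an encoder could be realized; the module names, dimensions, and modality-dropout scheme are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class MultimodalEncoder(nn.Module):
    """Sketch of a multi-head-attention encoder over text, audio, and gesture features.

    Assumptions (not taken from the paper): each modality has already been
    projected to frame-level features of size d_model, and robustness to
    missing or noisy inputs is encouraged by randomly masking whole
    modalities during self-supervised pretraining (modality dropout).
    """

    def __init__(self, d_model: int = 256, n_heads: int = 4, n_layers: int = 4):
        super().__init__()
        # One learnable embedding per modality so the encoder can tell them apart.
        self.modality_embed = nn.Embedding(3, d_model)  # 0=text, 1=audio, 2=gesture
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        # Learned placeholder used when a modality is dropped or missing.
        self.mask_token = nn.Parameter(torch.zeros(1, 1, d_model))

    def forward(self, text, audio, gesture, drop_prob: float = 0.0):
        inputs = [text, audio, gesture]  # each: (batch, frames, d_model) or None
        batch = next(x.shape[0] for x in inputs if x is not None)
        feats = []
        for idx, x in enumerate(inputs):
            if x is None or (self.training and torch.rand(()).item() < drop_prob):
                # Missing or dropped modality: substitute a single learned mask token.
                x = self.mask_token.expand(batch, 1, -1)
            feats.append(x + self.modality_embed.weight[idx])
        fused = torch.cat(feats, dim=1)   # concatenate modality sequences along time
        return self.encoder(fused)        # (batch, total_length, d_model)
```

During pretraining, the fused representations could be fed to per-modality decoders that reconstruct the masked inputs, which is the general self-supervised recipe the abstract alludes to; the actual objectives, feature extractors, and gesture decoder in MPE4G may differ.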
Related papers
- Speech2UnifiedExpressions: Synchronous Synthesis of Co-Speech Affective Face and Body Expressions from Affordable Inputs [67.27840327499625]
We present a multimodal learning-based method to simultaneously synthesize co-speech facial expressions and upper-body gestures for digital characters.
Our approach learns from sparse face landmarks and upper-body joints, estimated directly from video data, to generate plausible emotive character motions.
arXiv Detail & Related papers (2024-06-26T04:53:11Z) - CoCoGesture: Toward Coherent Co-speech 3D Gesture Generation in the Wild [44.401536230814465]
CoCoGesture is a novel framework enabling vivid and diverse gesture synthesis from unseen human speech prompts.
Our key insight is built upon a custom-designed pretrain-finetune training paradigm.
Our proposed CoCoGesture outperforms state-of-the-art methods on zero-shot speech-to-gesture generation.
arXiv Detail & Related papers (2024-05-27T06:47:14Z) - ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis [50.69464138626748]
We present ConvoFusion, a diffusion-based approach for multi-modal gesture synthesis.
Our method proposes two guidance objectives that allow the users to modulate the impact of different conditioning modalities.
Our method is versatile in that it can be trained to generate either monologue gestures or conversational gestures.
arXiv Detail & Related papers (2024-03-26T17:59:52Z) - Co-Speech Gesture Synthesis using Discrete Gesture Token Learning [1.1694169299062596]
Synthesizing realistic co-speech gestures is an important and yet unsolved problem for creating believable motions.
One challenge in learning the co-speech gesture model is that there may be multiple viable gesture motions for the same speech utterance.
We propose a two-stage model that addresses this uncertainty in gesture synthesis by modeling gesture segments as discrete latent codes.
arXiv Detail & Related papers (2023-03-04T01:42:09Z) - SDFusion: Multimodal 3D Shape Completion, Reconstruction, and Generation [89.47132156950194]
We present a novel framework built to simplify 3D asset generation for amateur users.
Our method supports a variety of input modalities that can be easily provided by a human.
Our model combines all of these tasks into one Swiss-army-knife tool.
arXiv Detail & Related papers (2022-12-08T18:59:05Z) - i-Code: An Integrative and Composable Multimodal Learning Framework [99.56065789066027]
i-Code is a self-supervised pretraining framework where users may flexibly combine the modalities of vision, speech, and language into unified and general-purpose vector representations.
The entire system is pretrained end-to-end with new objectives, including masked modality unit modeling and cross-modality contrastive learning (a generic sketch of such a contrastive objective appears after this list).
Experimental results demonstrate how i-Code can outperform state-of-the-art techniques on five video understanding tasks and the GLUE NLP benchmark, improving by as much as 11%.
arXiv Detail & Related papers (2022-05-03T23:38:50Z) - Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation [107.10239561664496]
We propose a novel framework named Hierarchical Audio-to-Gesture (HA2G) for co-speech gesture generation.
The proposed method renders realistic co-speech gestures and outperforms previous methods by a clear margin.
arXiv Detail & Related papers (2022-03-24T16:33:29Z) - VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs [103.99315770490163]
We present a framework for text generation from multimodal inputs consisting of video plus text, speech, or audio.
Experiments demonstrate that our approach based on a single architecture outperforms the state-of-the-art on three video-based text-generation tasks.
arXiv Detail & Related papers (2021-01-28T15:22:36Z) - Speech Gesture Generation from the Trimodal Context of Text, Audio, and Speaker Identity [21.61168067832304]
We present an automatic gesture generation model that uses the multimodal context of speech text, audio, and speaker identity to reliably generate gestures.
Experiments with the introduced metric and subjective human evaluation showed that the proposed gesture generation model is better than existing end-to-end generation models.
arXiv Detail & Related papers (2020-09-04T11:42:45Z) - Gesticulator: A framework for semantically-aware speech-driven gesture generation [17.284154896176553]
We present a model designed to produce arbitrary beat and semantic gestures together.
Our deep-learning-based model takes both acoustic and semantic representations of speech as input, and generates gestures as a sequence of joint-angle rotations as output.
The resulting gestures can be applied to both virtual agents and humanoid robots.
arXiv Detail & Related papers (2020-01-25T14:42:23Z)
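The i-Code entry above mentions cross-modality contrastive learning as one of its pretraining objectives. As a generic illustration of what such an objective typically looks like, here is a minimal PyTorch sketch of a symmetric InfoNCE loss between paired modality embeddings; the function name, inputs, and temperature value are assumptions, and the exact losses used by i-Code or MPE4G may differ.

```python
import torch
import torch.nn.functional as F


def cross_modality_contrastive_loss(speech_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss between two modalities (textbook formulation).

    speech_emb, text_emb: (batch, dim) pooled representations of the same clips,
    where row i of each tensor comes from the same underlying sample.
    """
    speech_emb = F.normalize(speech_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = speech_emb @ text_emb.t() / temperature          # (batch, batch) similarities
    targets = torch.arange(logits.shape[0], device=logits.device)
    # Matched pairs (the diagonal) are positives; all other pairs in the batch are negatives.
    loss_s2t = F.cross_entropy(logits, targets)
    loss_t2s = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_s2t + loss_t2s)
```

In a multimodal pretraining setup, a loss like this would typically be minimized jointly with a masked-modality reconstruction objective over batches of paired clips.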
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.