Chain of Generation: Multi-Modal Gesture Synthesis via Cascaded Conditional Control
- URL: http://arxiv.org/abs/2312.15900v1
- Date: Tue, 26 Dec 2023 06:30:14 GMT
- Title: Chain of Generation: Multi-Modal Gesture Synthesis via Cascaded Conditional Control
- Authors: Zunnan Xu, Yachao Zhang, Sicheng Yang, Ronghui Li, Xiu Li
- Abstract summary: This study aims to improve the generation of 3D gestures by utilizing multimodal information from human speech.
We introduce a novel method that separates priors from speech and employs multimodal priors as constraints for generating gestures.
- Score: 26.31638205831119
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This study aims to improve the generation of 3D gestures by utilizing
multimodal information from human speech. Previous studies have focused on
incorporating additional modalities to enhance the quality of generated
gestures. However, these methods perform poorly when certain modalities are
missing during inference. To address this problem, we suggest using
speech-derived multimodal priors to improve gesture generation. We introduce a
novel method that separates priors from speech and employs multimodal priors as
constraints for generating gestures. Our approach utilizes a chain-like
modeling method to generate facial blendshapes, body movements, and hand
gestures sequentially. Specifically, we incorporate rhythm cues derived from
facial deformation and a stylization prior based on speech emotion into the
process of generating gestures. By incorporating multimodal priors, our method
improves the quality of generated gestures and eliminates the need for expensive
setup preparation during inference. Extensive experiments and user studies
confirm that our proposed approach achieves state-of-the-art performance.
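To make the cascaded conditioning concrete, here is a minimal sketch of the chain-like structure the abstract describes: facial blendshapes are generated first, then body motion conditioned on the face, then hands conditioned on both, with speech features and a speech-derived style prior threaded through every stage. All module names, dimensions, and the style prior's shape are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class Stage(nn.Module):
    """One link in the generation chain: maps conditioning features to a
    motion representation for a single body part (hypothetical module)."""
    def __init__(self, cond_dim, out_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, cond):
        return self.net(cond)

class ChainOfGeneration(nn.Module):
    """Sketch of cascaded conditional control: face -> body -> hands.
    Each later stage is conditioned on the speech features, the
    speech-derived style prior, and the outputs of earlier stages."""
    def __init__(self, speech_dim=128, style_dim=8,
                 face_dim=52, body_dim=63, hand_dim=90):
        super().__init__()
        self.face = Stage(speech_dim + style_dim, face_dim)
        self.body = Stage(speech_dim + style_dim + face_dim, body_dim)
        self.hands = Stage(speech_dim + style_dim + face_dim + body_dim, hand_dim)

    def forward(self, speech_feat, style_prior):
        cond = torch.cat([speech_feat, style_prior], dim=-1)
        face = self.face(cond)                              # blendshapes first
        body = self.body(torch.cat([cond, face], dim=-1))   # body sees face rhythm
        hands = self.hands(torch.cat([cond, face, body], dim=-1))
        return face, body, hands

# Usage: per-frame features for a batch of 2 sequences of 30 frames.
model = ChainOfGeneration()
speech = torch.randn(2, 30, 128)   # e.g. audio encoder output (assumed)
style = torch.randn(2, 30, 8)      # emotion/style prior derived from speech
face, body, hands = model(speech, style)
print(face.shape, body.shape, hands.shape)  # (2,30,52) (2,30,63) (2,30,90)
```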
Related papers
- ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis [50.69464138626748]
We present ConvoFusion, a diffusion-based approach for multi-modal gesture synthesis.
Our method introduces two guidance objectives that allow users to modulate the impact of different conditioning modalities.
Our method is versatile in that it can be trained to generate either monologue gestures or conversational gestures.
arXiv Detail & Related papers (2024-03-26T17:59:52Z)
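ConvoFusion's two guidance objectives are not spelled out in the summary above; the sketch below shows one common way modality-wise control is realized at sampling time, by weighting per-modality guidance directions in a classifier-free-guidance style. Everything here (function names, the dummy denoiser) is an illustrative assumption, not the paper's actual formulation.

```python
import torch

def guided_noise_estimate(model, x_t, t, conds, weights):
    """Combine per-modality guidance directions, classifier-free style.
    model(x_t, t, cond) -> predicted noise; cond=None means unconditional.
    conds: modality name -> conditioning input; weights: modality name ->
    guidance scale (0 effectively mutes that modality)."""
    eps_uncond = model(x_t, t, None)
    eps = eps_uncond.clone()
    for name, cond in conds.items():
        eps_cond = model(x_t, t, {name: cond})  # condition on one modality
        eps = eps + weights[name] * (eps_cond - eps_uncond)
    return eps

# Stand-in denoiser so the sketch runs end to end (hypothetical).
def dummy_model(x, t, cond):
    return 0.1 * x if cond is None else 0.1 * x + 0.01

x = torch.randn(1, 16)
eps = guided_noise_estimate(
    dummy_model, x, 0,
    conds={"audio": torch.randn(1, 8), "text": torch.randn(1, 8)},
    weights={"audio": 1.5, "text": 0.5},  # per-modality impact knobs
)
print(eps.shape)  # (1, 16)
```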
- DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-driven Holistic 3D Expression and Gesture Generation [72.85685916829321]
DiffSHEG is a diffusion-based approach for speech-driven holistic 3D expression and gesture generation of arbitrary length.
By enabling the real-time generation of expressive and synchronized motions, DiffSHEG showcases its potential for various applications in the development of digital humans and embodied agents.
arXiv Detail & Related papers (2024-01-09T11:38:18Z)
- Customizing Motion in Text-to-Video Diffusion Models [79.4121510826141]
We introduce an approach for augmenting text-to-video generation models with customized motions.
By leveraging a few video samples demonstrating specific movements as input, our method learns and generalizes the input motion patterns for diverse, text-specified scenarios.
arXiv Detail & Related papers (2023-12-07T18:59:03Z)
- A Unified Framework for Multimodal, Multi-Part Human Motion Synthesis [17.45562922442149]
We introduce a cohesive and scalable approach that consolidates multimodal (text, music, speech) and multi-part (hand, torso) human motion generation.
Our method frames the multimodal motion generation challenge as a token prediction task, drawing from specialized codebooks based on the modality of the control signal.
arXiv Detail & Related papers (2023-11-28T04:13:49Z)
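The entry above frames motion generation as token prediction over modality-specific codebooks. Below is a minimal sketch of that routing idea; the codebook sizes, the shared-backbone layout, and all names are assumptions for illustration rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class ModalityRoutedTokenPredictor(nn.Module):
    """Illustrative: one shared transformer predicts motion tokens, but the
    embedding/codebook used for the control signal depends on its modality."""
    def __init__(self, vocab=512, dim=256):
        super().__init__()
        # Separate learned codebooks per control modality (assumed sizes).
        self.codebooks = nn.ModuleDict({
            "text": nn.Embedding(vocab, dim),
            "music": nn.Embedding(vocab, dim),
            "speech": nn.Embedding(vocab, dim),
        })
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, vocab)  # logits over motion tokens

    def forward(self, control_tokens, modality):
        h = self.codebooks[modality](control_tokens)  # route by modality
        return self.head(self.backbone(h))

model = ModalityRoutedTokenPredictor()
tokens = torch.randint(0, 512, (2, 20))  # toy control-signal token ids
logits = model(tokens, "speech")
print(logits.shape)  # (2, 20, 512)
```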
- Co-Speech Gesture Detection through Multi-Phase Sequence Labeling [3.924524252255593]
We introduce a novel framework that reframes the task as a multi-phase sequence labeling problem.
We evaluate our proposal on a large dataset of diverse co-speech gestures in task-oriented face-to-face dialogues.
arXiv Detail & Related papers (2023-08-21T12:27:18Z)
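One concrete way to set up multi-phase sequence labeling for gesture detection is to assign each video frame one of several gesture-phase labels, so contiguous spans of phases delimit gestures. The phase inventory below is a Kendon-style assumption and the model is a generic labeler, not the paper's method.

```python
import torch
import torch.nn as nn

# Kendon-style phases (assumed); the paper's exact inventory may differ.
PHASES = ["outside", "preparation", "stroke", "hold", "retraction"]

class PhaseLabeler(nn.Module):
    """Illustrative multi-phase sequence labeler: per-frame skeletal
    features in, one gesture-phase label per frame out."""
    def __init__(self, feat_dim=75, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, len(PHASES))

    def forward(self, frames):        # frames: (batch, time, feat_dim)
        h, _ = self.rnn(frames)
        return self.head(h)           # per-frame phase logits

labeler = PhaseLabeler()
logits = labeler(torch.randn(1, 100, 75))
phases = logits.argmax(-1)            # a gesture = a contiguous phase span
print(phases.shape)                   # (1, 100)
```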
- Synthesizing Long-Term Human Motions with Diffusion Models via Coherent Sampling [74.62570964142063]
Text-to-motion generation has gained increasing attention, but most existing methods are limited to generating short-term motions.
We propose a novel approach that utilizes a past-conditioned diffusion model with two optional coherent sampling methods.
Our proposed method is capable of generating compositional and coherent long-term 3D human motions controlled by a user-instructed long text stream.
arXiv Detail & Related papers (2023-08-03T16:18:32Z)
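A past-conditioned diffusion model with coherent sampling can be pictured in a few lines: generate the motion segment by segment and, at each denoising step, clamp the frames that overlap the previous segment to the already-generated motion. This inpainting-style rule is one simple reading of "coherent sampling", sketched under assumptions; it is not the paper's exact procedure.

```python
import torch

def sample_segment(denoise_step, seg_len, dim, past_tail=None, overlap=8, steps=50):
    """Illustrative past-conditioned sampling: the frames overlapping the
    previous segment are re-imposed after every denoising step."""
    x = torch.randn(seg_len, dim)
    for t in reversed(range(steps)):
        x = denoise_step(x, t)
        if past_tail is not None:
            x[:overlap] = past_tail       # clamp known overlap frames
    return x

def generate_long_motion(denoise_step, n_segments=4, seg_len=64, dim=75, overlap=8):
    motion, tail = [], None
    for _ in range(n_segments):
        seg = sample_segment(denoise_step, seg_len, dim, past_tail=tail, overlap=overlap)
        motion.append(seg if tail is None else seg[overlap:])  # drop duplicate overlap
        tail = seg[-overlap:]
    return torch.cat(motion, dim=0)

# Stand-in denoiser so the sketch runs; a real model would predict noise.
motion = generate_long_motion(lambda x, t: 0.9 * x)
print(motion.shape)  # (232, 75): 64 + 3 * 56 frames
```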
- MPE4G: Multimodal Pretrained Encoder for Co-Speech Gesture Generation [18.349024345195318]
We propose a novel framework with a multimodal pre-trained encoder for co-speech gesture generation.
The proposed method renders realistic co-speech gestures not only when all input modalities are given but also when the input modalities are missing or noisy.
arXiv Detail & Related papers (2023-05-25T05:42:58Z)
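Robustness to missing or noisy modalities is often trained in via modality dropout: whole input modalities are randomly zeroed during training so the encoder cannot over-rely on any one of them. The sketch below shows this generic recipe under assumed shapes; it is not necessarily MPE4G's exact scheme.

```python
import torch

def mask_modalities(feats, p_drop=0.3, training=True):
    """Illustrative modality dropout.
    feats: dict of modality name -> (batch, time, dim) tensor."""
    if not training:
        return feats  # at inference, absent modalities are simply zeros
    out = {}
    for name, x in feats.items():
        # Drop the whole modality per sample with probability p_drop.
        keep = (torch.rand(x.shape[0], 1, 1) > p_drop).float()
        out[name] = x * keep
    return out

feats = {"audio": torch.randn(4, 30, 128), "text": torch.randn(4, 30, 64)}
masked = mask_modalities(feats)
# Which samples kept each modality this pass:
print({k: v.abs().sum(dim=(1, 2)).gt(0).tolist() for k, v in masked.items()})
```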
- Diffusion Action Segmentation [63.061058214427085]
We propose a novel framework based on denoising diffusion models, which share the same inherent spirit of iterative refinement.
In this framework, action predictions are iteratively generated from random noise with input video features as conditions.
arXiv Detail & Related papers (2023-03-31T10:53:24Z)
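The iterative-refinement idea in the entry above can be pictured as starting from random per-frame action "predictions" and repeatedly denoising them conditioned on video features. The loop below only shows the shape of that refinement; a real diffusion sampler would also feed in the timestep and follow a noise schedule, and all names here are assumptions.

```python
import torch
import torch.nn as nn

class ActionDenoiser(nn.Module):
    """Illustrative denoiser: refines noisy per-frame action logits,
    conditioned on per-frame video features (names/shapes assumed)."""
    def __init__(self, n_classes=10, feat_dim=64, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_classes + feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, noisy_logits, video_feats):
        return self.net(torch.cat([noisy_logits, video_feats], dim=-1))

model = ActionDenoiser()
video = torch.randn(1, 200, 64)   # conditioning: per-frame video features
x = torch.randn(1, 200, 10)       # start from random-noise "predictions"
for _ in range(25):               # fixed-step refinement loop (simplified:
    x = model(x, video)           # a real sampler also uses the timestep)
labels = x.argmax(-1)             # final per-frame action labels
print(labels.shape)               # torch.Size([1, 200])
```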
- Unified Discrete Diffusion for Simultaneous Vision-Language Generation [78.21352271140472]
We present a unified multimodal generation model that can conduct both the "modality translation" and "multi-modality generation" tasks.
Specifically, we unify the discrete diffusion process for multimodal signals by proposing a unified transition matrix.
Our proposed method performs comparably to state-of-the-art solutions across various generation tasks.
arXiv Detail & Related papers (2022-11-27T14:46:01Z)
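A discrete diffusion process is specified by a transition matrix Q_t over token states, and a "unified" matrix can act on one joint vocabulary covering both image and text tokens. The toy below builds an absorbing-state ([MASK]) transition matrix, a common choice in discrete diffusion; the details are assumptions, not the paper's exact Q_t.

```python
import torch

def make_absorbing_transition(vocab_size, mask_id, beta):
    """Illustrative absorbing-state transition matrix: with probability
    beta a token jumps to [MASK], otherwise it stays put. Applied to a
    joint image+text vocabulary, this is the rough shape of a unified
    transition matrix (assumed, not the paper's formulation)."""
    Q = torch.eye(vocab_size) * (1.0 - beta)
    Q[:, mask_id] += beta
    Q[mask_id] = 0.0
    Q[mask_id, mask_id] = 1.0  # [MASK] is absorbing
    return Q

vocab = 8                      # toy joint vocabulary; last id is [MASK]
Q = make_absorbing_transition(vocab, mask_id=vocab - 1, beta=0.1)
x = torch.nn.functional.one_hot(torch.tensor([2, 5]), vocab).float()
probs = x @ Q                  # one forward noising step: q(x_t | x_{t-1})
print(probs)                   # each row sums to 1
```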
- Multi-modal Fusion for Single-Stage Continuous Gesture Recognition [45.19890687786009]
We introduce a single-stage continuous gesture recognition framework called Temporal Multi-Modal Fusion (TMMF).
TMMF can detect and classify multiple gestures in a video via a single model.
This approach learns the natural transitions between gestures and non-gestures without the need for a pre-processing segmentation step.
arXiv Detail & Related papers (2020-11-10T07:09:35Z)
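A single-stage formulation can collapse detection and classification into one per-frame labeling problem by adding a "no gesture" class, so transitions between gestures and non-gestures are learned rather than handled by a separate segmentation step. The fusion and backbone below are illustrative assumptions, not TMMF's actual architecture.

```python
import torch
import torch.nn as nn

class SingleStageGestureRecognizer(nn.Module):
    """Illustrative single-stage setup: fuse per-frame features from two
    modalities and classify every frame over gesture classes plus a
    'no gesture' class, in one pass and one model."""
    def __init__(self, rgb_dim=512, depth_dim=512, n_gestures=10, hidden=256):
        super().__init__()
        self.fuse = nn.Linear(rgb_dim + depth_dim, hidden)
        self.temporal = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_gestures + 1)  # +1 = "no gesture"

    def forward(self, rgb, depth):
        h = torch.relu(self.fuse(torch.cat([rgb, depth], dim=-1)))
        h, _ = self.temporal(h)
        return self.head(h)  # per-frame logits; class 0 can mean "no gesture"

model = SingleStageGestureRecognizer()
logits = model(torch.randn(1, 120, 512), torch.randn(1, 120, 512))
print(logits.shape)  # (1, 120, 11)
```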
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.