Chain of Generation: Multi-Modal Gesture Synthesis via Cascaded
Conditional Control
- URL: http://arxiv.org/abs/2312.15900v1
- Date: Tue, 26 Dec 2023 06:30:14 GMT
- Title: Chain of Generation: Multi-Modal Gesture Synthesis via Cascaded
Conditional Control
- Authors: Zunnan Xu, Yachao Zhang, Sicheng Yang, Ronghui Li, Xiu Li
- Abstract summary: This study aims to improve the generation of 3D gestures by utilizing multimodal information from human speech.
We introduce a novel method that separates priors from speech and employs multimodal priors as constraints for generating gestures.
- Score: 26.31638205831119
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This study aims to improve the generation of 3D gestures by utilizing
multimodal information from human speech. Previous studies have focused on
incorporating additional modalities to enhance the quality of generated
gestures. However, these methods perform poorly when certain modalities are
missing during inference. To address this problem, we suggest using
speech-derived multimodal priors to improve gesture generation. We introduce a
novel method that separates priors from speech and employs multimodal priors as
constraints for generating gestures. Our approach utilizes a chain-like
modeling method to generate facial blendshapes, body movements, and hand
gestures sequentially. Specifically, we incorporate rhythm cues derived from
facial deformation and a stylization prior based on speech emotions into the
process of generating gestures. By incorporating multimodal priors, our method
improves the quality of generated gestures and eliminates the need for expensive
setup preparation during inference. Extensive experiments and user studies
confirm that our proposed approach achieves state-of-the-art performance.
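The chain-like modeling described in the abstract can be read as a cascade in which each stage consumes speech features together with the outputs of earlier stages, so facial deformation can act as a rhythm cue and the emotion-derived style prior constrains every stage. Below is a minimal PyTorch-style sketch of such a cascade under assumed interfaces; the module names, feature dimensions (e.g. 52 blendshapes), and the simple MLP stages are illustrative assumptions, not the authors' architecture.

```python
# Minimal sketch of chain-like ("chain of generation") gesture synthesis.
# All module names, dimensions, and interfaces are illustrative assumptions,
# not the paper's actual implementation.
import torch
import torch.nn as nn


class StageGenerator(nn.Module):
    """One link in the chain: maps speech features plus the outputs of
    earlier stages (the priors) to one motion modality."""

    def __init__(self, speech_dim: int, prior_dim: int, out_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(speech_dim + prior_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, speech_feats: torch.Tensor, priors: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([speech_feats, priors], dim=-1))


class ChainOfGeneration(nn.Module):
    """Generate face -> body -> hands sequentially, each stage conditioned on
    speech features, an emotion-based style prior, and earlier outputs."""

    def __init__(self, speech_dim=128, style_dim=16, face_dim=52, body_dim=63, hand_dim=90):
        super().__init__()
        self.face_stage = StageGenerator(speech_dim, style_dim, face_dim)
        self.body_stage = StageGenerator(speech_dim, style_dim + face_dim, body_dim)
        self.hand_stage = StageGenerator(speech_dim, style_dim + face_dim + body_dim, hand_dim)

    def forward(self, speech_feats: torch.Tensor, style_prior: torch.Tensor):
        # Stage 1: facial blendshapes, driven by speech and the style prior.
        face = self.face_stage(speech_feats, style_prior)
        # Stage 2: body motion; facial deformation acts as a rhythm cue.
        body = self.body_stage(speech_feats, torch.cat([style_prior, face], dim=-1))
        # Stage 3: hand gestures, conditioned on everything generated so far.
        hands = self.hand_stage(speech_feats, torch.cat([style_prior, face, body], dim=-1))
        return face, body, hands


if __name__ == "__main__":
    T = 120  # number of frames (assumed)
    model = ChainOfGeneration()
    speech = torch.randn(1, T, 128)  # per-frame speech features (assumed)
    style = torch.randn(1, T, 16)    # emotion-based style prior from speech (assumed)
    face, body, hands = model(speech, style)
    print(face.shape, body.shape, hands.shape)
```

Note that at inference time every input to this sketch is derived from speech, which mirrors the abstract's point that the priors come from speech itself rather than from additional capture setup.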
Related papers
- HOP: Heterogeneous Topology-based Multimodal Entanglement for Co-Speech Gesture Generation [42.30003982604611]
Co-speech gestures are crucial non-verbal cues that enhance speech clarity and expressiveness in human communication.
We propose a novel method named HOP for co-speech gesture generation, capturing heterogeneous entanglement between gesture motion, audio rhythm, and text semantics.
HOP achieves state-of-the-art performance, offering more natural and expressive co-speech gesture generation.
arXiv Detail & Related papers (2025-03-03T04:47:39Z)
- Is Contrastive Distillation Enough for Learning Comprehensive 3D Representations? [55.99654128127689]
Cross-modal contrastive distillation has recently been explored for learning effective 3D representations.
Existing methods focus primarily on modality-shared features, neglecting the modality-specific features during the pre-training process.
We propose a new framework, namely CMCR, to address these shortcomings.
arXiv Detail & Related papers (2024-12-12T06:09:49Z)
- Retrieving Semantics from the Deep: an RAG Solution for Gesture Synthesis [55.45253486141108]
RAG-Gesture is a diffusion-based gesture generation approach that produces semantically rich gestures.
We achieve this by using explicit domain knowledge to retrieve motions from a database of co-speech gestures.
We propose a control paradigm for guidance that allows users to modulate the amount of influence each retrieval insertion has over the generated sequence.
arXiv Detail & Related papers (2024-12-09T18:59:46Z)
- MotionRL: Align Text-to-Motion Generation to Human Preferences with Multi-Reward Reinforcement Learning [99.09906827676748]
We introduce MotionRL, the first approach to utilize Multi-Reward Reinforcement Learning (RL) for optimizing text-to-motion generation tasks.
Our novel approach uses reinforcement learning to fine-tune the motion generator based on human preferences, drawing on prior knowledge of the human perception model.
In addition, MotionRL introduces a novel multi-objective optimization strategy to approximate optimality between text adherence, motion quality, and human preferences.
arXiv Detail & Related papers (2024-10-09T03:27:14Z)
- KMTalk: Speech-Driven 3D Facial Animation with Key Motion Embedding [19.15471840100407]
We present a novel approach for synthesizing 3D facial motions from audio sequences using key motion embeddings.
Our method integrates linguistic and data-driven priors through two modules: the linguistic-based key motion acquisition and the cross-modal motion completion.
The latter extends key motions into a full sequence of 3D talking faces guided by audio features, improving temporal coherence and audio-visual consistency.
arXiv Detail & Related papers (2024-09-02T09:41:24Z)
- High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model [89.29655924125461]
We propose a novel landmark-based diffusion model for talking face generation.
We first establish the less ambiguous mapping from audio to landmark motion of lip and jaw.
Then, we introduce an innovative conditioning module called TalkFormer to align the synthesized motion with the motion represented by landmarks.
arXiv Detail & Related papers (2024-08-10T02:58:28Z)
- ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis [50.69464138626748]
We present ConvoFusion, a diffusion-based approach for multi-modal gesture synthesis.
Our method proposes two guidance objectives that allow users to modulate the impact of different conditioning modalities (a generic illustration of such per-modality guidance weighting appears in the sketch after this list).
Our method is versatile in that it can be trained to generate either monologue gestures or conversational gestures.
arXiv Detail & Related papers (2024-03-26T17:59:52Z)
- DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-driven Holistic 3D Expression and Gesture Generation [72.85685916829321]
DiffSHEG is a Diffusion-based approach for Speech-driven Holistic 3D Expression and Gesture generation with arbitrary length.
By enabling the real-time generation of expressive and synchronized motions, DiffSHEG showcases its potential for various applications in the development of digital humans and embodied agents.
arXiv Detail & Related papers (2024-01-09T11:38:18Z)
- Customizing Motion in Text-to-Video Diffusion Models [79.4121510826141]
We introduce an approach for augmenting text-to-video generation models with customized motions.
By leveraging a few video samples demonstrating specific movements as input, our method learns and generalizes the input motion patterns for diverse, text-specified scenarios.
arXiv Detail & Related papers (2023-12-07T18:59:03Z)
- A Unified Framework for Multimodal, Multi-Part Human Motion Synthesis [17.45562922442149]
We introduce a cohesive and scalable approach that consolidates multimodal (text, music, speech) and multi-part (hand, torso) human motion generation.
Our method frames the multimodal motion generation challenge as a token prediction task, drawing from specialized codebooks based on the modality of the control signal.
arXiv Detail & Related papers (2023-11-28T04:13:49Z)
- MPE4G: Multimodal Pretrained Encoder for Co-Speech Gesture Generation [18.349024345195318]
We propose a novel framework with a multimodal pre-trained encoder for co-speech gesture generation.
The proposed method renders realistic co-speech gestures not only when all input modalities are given but also when the input modalities are missing or noisy.
arXiv Detail & Related papers (2023-05-25T05:42:58Z)
- Unified Discrete Diffusion for Simultaneous Vision-Language Generation [78.21352271140472]
We present a unified multimodal generation model that can conduct both the "modality translation" and "multi-modality generation" tasks.
Specifically, we unify the discrete diffusion process for multimodal signals by proposing a unified transition matrix.
Our proposed method can perform comparably to the state-of-the-art solutions in various generation tasks.
arXiv Detail & Related papers (2022-11-27T14:46:01Z)
- Multi-modal Fusion for Single-Stage Continuous Gesture Recognition [45.19890687786009]
We introduce a single-stage continuous gesture recognition framework, called Temporal Multi-Modal Fusion (TMMF).
TMMF can detect and classify multiple gestures in a video via a single model.
This approach learns the natural transitions between gestures and non-gestures without the need for a pre-processing segmentation step.
arXiv Detail & Related papers (2020-11-10T07:09:35Z)
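The ConvoFusion entry above mentions guidance objectives that let users modulate the impact of each conditioning modality, which also echoes the main paper's use of multimodal priors as constraints. The snippet below is a generic, classifier-free-guidance-style illustration of per-modality guidance weighting at diffusion sampling time; the `denoiser` interface, tensor shapes, and weighting scheme are assumptions for illustration and do not reproduce ConvoFusion's actual objectives.

```python
# Generic sketch of multi-condition guidance at diffusion sampling time:
# combine an unconditional prediction with per-modality conditional
# predictions, each scaled by a user-chosen weight. Illustrative only;
# the `denoiser` signature is an assumption, not any paper's actual API.
import torch


def guided_noise_prediction(denoiser, x_t, t, conditions, weights):
    """conditions: dict mapping modality name -> conditioning tensor (or None).
    weights: dict mapping modality name -> guidance scale."""
    # Unconditional pass: all conditioning dropped.
    eps_uncond = denoiser(x_t, t, {name: None for name in conditions})

    eps = eps_uncond
    for name, cond in conditions.items():
        if cond is None:
            continue  # modality missing at inference time; simply skip it
        # Conditional pass with only this modality active.
        single = {n: (c if n == name else None) for n, c in conditions.items()}
        eps_cond = denoiser(x_t, t, single)
        # Push the prediction toward this modality by its guidance weight.
        eps = eps + weights.get(name, 0.0) * (eps_cond - eps_uncond)
    return eps


if __name__ == "__main__":
    # Toy denoiser standing in for a trained diffusion model.
    def toy_denoiser(x_t, t, conds):
        active = sum(c.mean() for c in conds.values() if c is not None)
        return x_t * 0.1 + active

    x_t = torch.randn(1, 60, 63)  # noisy motion: 60 frames x 63 pose dims (assumed)
    conds = {"audio": torch.randn(1, 60, 128), "text": torch.randn(1, 77, 256)}
    eps = guided_noise_prediction(toy_denoiser, x_t, torch.tensor([10]),
                                  conds, weights={"audio": 2.0, "text": 1.0})
    print(eps.shape)
```

Setting a modality's weight to zero (or passing None for it) removes its influence entirely, which is one simple way such guidance schemes cope with missing conditioning signals at inference time.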