Enabling Synergistic Full-Body Control in Prompt-Based Co-Speech Motion Generation
- URL: http://arxiv.org/abs/2410.00464v1
- Date: Tue, 1 Oct 2024 07:46:05 GMT
- Title: Enabling Synergistic Full-Body Control in Prompt-Based Co-Speech Motion Generation
- Authors: Bohong Chen, Yumeng Li, Yao-Xiang Ding, Tianjia Shao, Kun Zhou
- Abstract summary: Co-speech motion generation approaches usually focus on upper-body gestures that follow speech content only.
Existing speech-to-motion datasets involve only highly limited full-body motions.
We propose SynTalker, which utilizes an off-the-shelf text-to-motion dataset as an auxiliary source of the missing full-body motions and prompts.
- Score: 32.70952356211433
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current co-speech motion generation approaches usually focus on upper-body gestures that follow speech content only, while lacking support for elaborate control of synergistic full-body motion based on text prompts, such as talking while walking. The major challenges are that 1) existing speech-to-motion datasets involve only highly limited full-body motions, leaving a wide range of common human activities outside the training distribution; and 2) these datasets also lack annotated user prompts. To address these challenges, we propose SynTalker, which utilizes an off-the-shelf text-to-motion dataset as an auxiliary source to supplement the missing full-body motions and prompts. The core technical contributions are two-fold. One is a multi-stage training process that obtains an aligned embedding space of motion, speech, and prompts despite the significant distributional mismatch in motion between the speech-to-motion and text-to-motion datasets. The other is a diffusion-based conditional inference process that uses a separate-then-combine strategy to realize fine-grained control of local body parts. Extensive experiments verify that our approach supports precise and flexible control of synergistic full-body motion generation based on both speech and user prompts, which is beyond the ability of existing approaches.
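To make the separate-then-combine idea more concrete, below is a minimal, hypothetical sketch of one denoising step, not the authors' released code: the same denoiser is run once under the speech condition and once under the text-prompt condition, and the two predictions are merged with a body-part mask. The function name, the `upper_body_mask` layout, and the denoiser signature are all assumptions made for illustration.

```python
# Minimal sketch of a "separate-then-combine" denoising step (illustrative only).
# Assumes a boolean channel mask marking which motion features belong to the upper body;
# the denoiser signature and mask layout are assumptions, not the paper's actual API.
import torch

def separate_then_combine_step(denoiser, x_t, t, speech_cond, prompt_cond, upper_body_mask):
    # Run the denoiser separately under each condition.
    pred_speech = denoiser(x_t, t, speech_cond)   # speech drives upper-body gestures
    pred_prompt = denoiser(x_t, t, prompt_cond)   # text prompt drives the remaining body parts
    # Combine per body part: upper-body channels come from the speech-conditioned pass,
    # everything else (e.g., locomotion) from the prompt-conditioned pass.
    return torch.where(upper_body_mask, pred_speech, pred_prompt)
```

In the actual method this combination would sit inside a full diffusion sampling loop, with both conditions embedded in the aligned motion-speech-prompt space described above; the sketch only illustrates the per-part masking idea.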
Related papers
- It Takes Two: Real-time Co-Speech Two-person's Interaction Generation via Reactive Auto-regressive Diffusion Model [34.94330722832987]
We introduce an audio-driven, auto-regressive system designed to synthesize dynamic movements for two characters during a conversation.
To the best of our knowledge, this is the first system capable of generating interactive full-body motions for two characters from speech in an online manner.
arXiv Detail & Related papers (2024-12-03T12:31:44Z) - BiPO: Bidirectional Partial Occlusion Network for Text-to-Motion Synthesis [0.4893345190925178]
BiPO is a novel model that enhances text-to-motion synthesis.
It integrates part-based generation with a bidirectional autoregressive architecture.
BiPO achieves state-of-the-art performance on the HumanML3D dataset.
arXiv Detail & Related papers (2024-11-28T05:42:47Z) - KinMo: Kinematic-aware Human Motion Understanding and Generation [6.962697597686156]
Controlling human motion based on text presents an important challenge in computer vision.
Traditional approaches often rely on holistic action descriptions for motion synthesis.
We propose a novel motion representation that decomposes motion into distinct body joint group movements.
arXiv Detail & Related papers (2024-11-23T06:50:11Z) - MotionGPT-2: A General-Purpose Motion-Language Model for Motion Generation and Understanding [76.30210465222218]
MotionGPT-2 is a unified Large Motion-Language Model (LMLM).
It supports multimodal control conditions through pre-trained Large Language Models (LLMs).
It is highly adaptable to the challenging 3D holistic motion generation task.
arXiv Detail & Related papers (2024-10-29T05:25:34Z) - Autonomous Character-Scene Interaction Synthesis from Text Instruction [45.255215402142596]
We introduce a framework for synthesizing multi-stage scene-aware interaction motions directly from a single text instruction and goal location.
Our approach employs an auto-regressive diffusion model to synthesize the next motion segment, along with an autonomous scheduler predicting the transition for each action stage.
We present a comprehensive motion-captured dataset comprising 16 hours of motion sequences in 120 indoor scenes covering 40 types of motions, each annotated with precise language descriptions.
arXiv Detail & Related papers (2024-10-04T06:58:45Z) - Learning Generalizable Human Motion Generator with Reinforcement Learning [95.62084727984808]
Text-driven human motion generation is one of the vital tasks in computer-aided content creation.
Existing methods often overfit specific motion expressions in the training data, hindering their ability to generalize.
We present InstructMotion, which incorporates the trial-and-error paradigm of reinforcement learning for generalizable human motion generation.
arXiv Detail & Related papers (2024-05-24T13:29:12Z) - ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis [50.69464138626748]
We present ConvoFusion, a diffusion-based approach for multi-modal gesture synthesis.
Our method proposes two guidance objectives that allow the users to modulate the impact of different conditioning modalities.
Our method is versatile in that it can be trained to generate either monologue gestures or conversational gestures.
arXiv Detail & Related papers (2024-03-26T17:59:52Z) - Towards Event Extraction from Speech with Contextual Clues [61.164413398231254]
We introduce the Speech Event Extraction (SpeechEE) task and construct three synthetic training sets and one human-spoken test set.
Compared to event extraction from text, SpeechEE poses greater challenges mainly due to complex speech signals that are continuous and have no word boundaries.
Our method brings significant improvements on all datasets, achieving a maximum F1 gain of 10.7%.
arXiv Detail & Related papers (2024-01-27T11:07:19Z) - Freetalker: Controllable Speech and Text-Driven Gesture Generation Based on Diffusion Models for Enhanced Speaker Naturalness [45.90256126021112]
We introduce FreeTalker, which is the first framework for the generation of both spontaneous (e.g., co-speech gesture) and non-spontaneous (e.g., moving around the podium) speaker motions.
Specifically, we train a diffusion-based model for speaker motion generation that employs unified representations of both speech-driven gestures and text-driven motions.
arXiv Detail & Related papers (2024-01-07T13:01:29Z) - SemanticBoost: Elevating Motion Generation with Augmented Textual Cues [73.83255805408126]
Our framework comprises a Semantic Enhancement module and a Context-Attuned Motion Denoiser (CAMD).
The CAMD approach provides an all-encompassing solution for generating high-quality, semantically consistent motion sequences.
Our experimental results demonstrate that SemanticBoost, as a diffusion-based method, outperforms auto-regressive-based techniques.
arXiv Detail & Related papers (2023-10-31T09:58:11Z) - DiverseMotion: Towards Diverse Human Motion Generation via Discrete Diffusion [70.33381660741861]
We present DiverseMotion, a new approach for synthesizing high-quality human motions conditioned on textual descriptions.
We show that our DiverseMotion achieves the state-of-the-art motion quality and competitive motion diversity.
arXiv Detail & Related papers (2023-09-04T05:43:48Z) - The GENEA Challenge 2023: A large scale evaluation of gesture generation models in monadic and dyadic settings [8.527975206444742]
This paper reports on the GENEA Challenge 2023, in which participating teams built speech-driven gesture-generation systems.
We evaluated 12 submissions and 2 baselines together with held-out motion-capture data in several large-scale user studies.
We found a large span in human-likeness between challenge submissions, with a few systems rated close to human mocap.
arXiv Detail & Related papers (2023-08-24T08:42:06Z)