A Unified Framework for Multimodal, Multi-Part Human Motion Synthesis
- URL: http://arxiv.org/abs/2311.16471v1
- Date: Tue, 28 Nov 2023 04:13:49 GMT
- Title: A Unified Framework for Multimodal, Multi-Part Human Motion Synthesis
- Authors: Zixiang Zhou, Yu Wan, Baoyuan Wang
- Abstract summary: We introduce a cohesive and scalable approach that consolidates multimodal (text, music, speech) and multi-part (hand, torso) human motion generation.
Our method frames the multimodal motion generation challenge as a token prediction task, drawing from specialized codebooks based on the modality of the control signal.
- Score: 17.45562922442149
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The field has made significant progress in synthesizing realistic human motion driven by various modalities. Yet, the need for different methods to animate various body parts according to different control signals limits the scalability of these techniques in practical scenarios. In this paper, we introduce a cohesive and scalable approach that consolidates multimodal (text, music, speech) and multi-part (hand, torso) human motion generation. Our methodology unfolds in several steps: We begin by quantizing the motions of diverse body parts into separate codebooks tailored to their respective domains. Next, we harness the robust capabilities of pre-trained models to transcode multimodal signals into a shared latent space. We then translate these signals into discrete motion tokens by iteratively predicting subsequent tokens to form a complete sequence. Finally, we reconstruct the continuous actual motion from this tokenized sequence. Our method frames the multimodal motion generation challenge as a token prediction task, drawing from specialized codebooks based on the modality of the control signal. This approach is inherently scalable, allowing for the easy integration of new modalities. Extensive experiments demonstrated the effectiveness of our design, emphasizing its potential for broad application.
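The pipeline described in the abstract reduces to four stages: per-part motion quantization into codebooks, encoding the control signal with a pretrained model, autoregressive prediction of discrete motion tokens, and decoding tokens back to continuous motion. The sketch below illustrates only the token-prediction stage. It is not the authors' code: every module, dimension, and name (MotionTokenGenerator, the codebook sizes, the linear stand-in for the pretrained modality encoder) is a hypothetical assumption.

```python
import torch
import torch.nn as nn


class MotionTokenGenerator(nn.Module):
    """Hypothetical sketch: next-token prediction over per-part codebooks."""

    def __init__(self, codebook_sizes=None, d_model=256, n_heads=4,
                 n_layers=4, bos_id=0):
        super().__init__()
        codebook_sizes = codebook_sizes or {"torso": 512, "hand": 512}
        self.bos_id = bos_id
        # Stage 1 (assumed): one embedding table per body part, mirroring
        # the separate codebooks produced by a VQ-style motion quantizer.
        self.token_emb = nn.ModuleDict(
            {p: nn.Embedding(n + 1, d_model) for p, n in codebook_sizes.items()})
        # Stage 2 (assumed): a pretrained text/music/speech encoder would
        # supply a control latent; a linear stub stands in for it here.
        self.control_proj = nn.Linear(512, d_model)
        # Stage 3: autoregressive decoder conditioned on the control latent
        # through cross-attention.
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.heads = nn.ModuleDict(
            {p: nn.Linear(d_model, n + 1) for p, n in codebook_sizes.items()})

    @torch.no_grad()
    def generate(self, control_latent, part="torso", steps=64):
        """Greedy next-token decoding for one body part."""
        memory = self.control_proj(control_latent).unsqueeze(1)      # (B, 1, D)
        tokens = torch.full((control_latent.size(0), 1), self.bos_id,
                            dtype=torch.long, device=control_latent.device)
        for _ in range(steps):
            x = self.token_emb[part](tokens)                         # (B, T, D)
            mask = nn.Transformer.generate_square_subsequent_mask(
                x.size(1)).to(x.device)
            h = self.decoder(x, memory, tgt_mask=mask)
            next_tok = self.heads[part](h[:, -1]).argmax(-1, keepdim=True)
            tokens = torch.cat([tokens, next_tok], dim=1)
        # Stage 4 would feed tokens[:, 1:] to the part's VQ decoder to
        # reconstruct continuous motion; that decoder is omitted here.
        return tokens[:, 1:]


# Usage with a random stand-in for a pooled control feature (e.g. text).
model = MotionTokenGenerator().eval()
ctrl = torch.randn(2, 512)
print(model.generate(ctrl, part="hand", steps=16).shape)  # torch.Size([2, 16])
```

In the paper's framing, the codebook (and hence the output head) would be chosen per body part and per control modality, which is what makes the design easy to extend with new modalities.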
Related papers
- Multi-Resolution Generative Modeling of Human Motion from Limited Data [3.5229503563299915]
We present a generative model that learns to synthesize human motion from limited training sequences.
The model adeptly captures human motion patterns by integrating skeletal convolution layers and a multi-scale architecture.
arXiv Detail & Related papers (2024-11-25T15:36:29Z)
- Dynamic Motion Synthesis: Masked Audio-Text Conditioned Spatio-Temporal Transformers [13.665279127648658]
This research presents a novel motion generation framework designed to produce whole-body motion sequences conditioned on multiple modalities simultaneously.
By integrating spatial attention mechanisms and a token critic, we ensure consistency and naturalness in the generated motions.
arXiv Detail & Related papers (2024-09-03T04:19:27Z)
- FreeMotion: A Unified Framework for Number-free Text-to-Motion Synthesis [65.85686550683806]
This paper reconsiders motion generation and proposes to unify single-person and multi-person motion through a conditional motion distribution.
Based on our framework, existing single-person spatial control methods can be seamlessly integrated, achieving precise control of multi-person motion.
arXiv Detail & Related papers (2024-05-24T17:57:57Z)
- MambaTalk: Efficient Holistic Gesture Synthesis with Selective State Space Models [22.044020889631188]
We introduce MambaTalk, enhancing gesture diversity and rhythm through multimodal integration.
Our method matches or exceeds the performance of state-of-the-art models.
arXiv Detail & Related papers (2024-03-14T15:10:54Z)
- InterControl: Zero-shot Human Interaction Generation by Controlling Every Joint [67.6297384588837]
We introduce InterControl, a novel controllable motion generation method that encourages synthesized motions to maintain the desired distance between joint pairs.
We demonstrate that the joint-pair distances needed for human interactions can be generated by an off-the-shelf Large Language Model; a generic sketch of such a distance constraint appears after this list.
arXiv Detail & Related papers (2023-11-27T14:32:33Z)
- DiverseMotion: Towards Diverse Human Motion Generation via Discrete Diffusion [70.33381660741861]
We present DiverseMotion, a new approach for synthesizing high-quality human motions conditioned on textual descriptions.
We show that our DiverseMotion achieves state-of-the-art motion quality and competitive motion diversity.
arXiv Detail & Related papers (2023-09-04T05:43:48Z)
- Executing your Commands via Motion Diffusion in Latent Space [51.64652463205012]
We propose a Motion Latent-based Diffusion model (MLD) to produce vivid motion sequences conforming to the given conditional inputs.
Our MLD achieves significant improvements over state-of-the-art methods across a wide range of human motion generation tasks.
arXiv Detail & Related papers (2022-12-08T03:07:00Z)
- MoDi: Unconditional Motion Synthesis from Diverse Data [51.676055380546494]
We present MoDi, an unconditional generative model that synthesizes diverse motions.
Our model is trained in a completely unsupervised setting from a diverse, unstructured and unlabeled motion dataset.
We show that despite the lack of any structure in the dataset, the latent space can be semantically clustered.
arXiv Detail & Related papers (2022-06-16T09:06:25Z)
- Recurrent Transformer Variational Autoencoders for Multi-Action Motion Synthesis [17.15415641710113]
We consider the problem of synthesizing multi-action human motion sequences of arbitrary lengths.
Existing approaches have mastered motion sequence generation in single-action scenarios, but fail to generalize to multi-action and arbitrary-length sequences.
We propose a novel, efficient approach that leverages the expressive power of Recurrent Transformers and the generative richness of conditional Variational Autoencoders.
arXiv Detail & Related papers (2022-06-14T10:40:16Z)
- Towards Diverse and Natural Scene-aware 3D Human Motion Synthesis [117.15586710830489]
We focus on the problem of synthesizing diverse scene-aware human motions under the guidance of target action sequences.
The synthesis problem is factorized into several aspects, and a hierarchical framework is proposed in which each sub-module is responsible for modeling one of them.
Experiment results show that the proposed framework remarkably outperforms previous methods in terms of diversity and naturalness.
arXiv Detail & Related papers (2022-05-25T18:20:01Z)
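As referenced in the InterControl entry above, a joint-pair distance constraint can be expressed as a simple differentiable penalty. The sketch below is a generic illustration of that idea, not InterControl's actual formulation; the joint indices, target distances, and the function name joint_distance_loss are all hypothetical.

```python
import torch


def joint_distance_loss(motion, pairs, targets):
    """Penalize deviation of selected joint-pair distances from targets.

    motion:  (T, J, 3) joint positions over T frames.
    pairs:   list of (i, j) joint-index pairs to constrain (hypothetical).
    targets: iterable of desired distances, e.g. values produced by
             prompting an off-the-shelf LLM, as the summary suggests.
    """
    losses = []
    for (i, j), d_target in zip(pairs, targets):
        d = torch.linalg.norm(motion[:, i] - motion[:, j], dim=-1)   # (T,)
        losses.append(((d - d_target) ** 2).mean())
    return torch.stack(losses).mean()


# Usage with random stand-in data: 60 frames, 22 joints, keep joints 9
# and 20 roughly 0.3 units apart over the whole sequence.
motion = torch.randn(60, 22, 3, requires_grad=True)
loss = joint_distance_loss(motion, pairs=[(9, 20)], targets=[0.3])
loss.backward()   # such gradients could steer an optimization or guidance step
```

Gradients of a penalty like this could be used to steer an optimization or diffusion-guidance step so that generated joints respect the prescribed distances.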