GUESS: GradUally Enriching SyntheSis for Text-Driven Human Motion
Generation
- URL: http://arxiv.org/abs/2401.02142v2
- Date: Sat, 6 Jan 2024 03:17:55 GMT
- Title: GUESS: GradUally Enriching SyntheSis for Text-Driven Human Motion
Generation
- Authors: Xuehao Gao, Yang Yang, Zhenyu Xie, Shaoyi Du, Zhongqian Sun, and Yang
Wu
- Abstract summary: We propose a novel cascaded diffusion-based generative framework for text-driven human motion synthesis.
The framework exploits a strategy named GradUally Enriching SyntheSis (GUESS).
We show that GUESS outperforms existing state-of-the-art methods by large margins in terms of accuracy, realism, and diversity.
- Score: 23.435588151215594
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a novel cascaded diffusion-based
generative framework for text-driven human motion synthesis, which exploits a
strategy named GradUally Enriching SyntheSis (GUESS). The strategy sets up
generation objectives by grouping semantically proximate body joints of a
detailed skeleton together and then replacing each such joint group with a
single body-part node. This operation recursively abstracts a human pose into
coarser and coarser skeletons at multiple granularity levels. As the
abstraction level increases, the human motion becomes increasingly concise and
stable, which significantly benefits the cross-modal motion synthesis task.
The whole text-driven human motion synthesis problem is then divided into
multiple abstraction levels and solved with a multi-stage generation framework
built on a cascaded latent diffusion model: an initial generator first
produces the coarsest human motion guess from a given text description; then,
a series of successive generators gradually enriches the motion details based
on the textual description and the previously synthesized results. Notably, we
further integrate GUESS with a proposed dynamic multi-condition fusion
mechanism that dynamically balances the cooperative effects of the given
textual condition and the synthesized coarse motion prompt at different
generation stages. Extensive experiments on large-scale datasets verify that
GUESS outperforms existing state-of-the-art methods by large margins in terms
of accuracy, realism, and diversity. Code is available at
https://github.com/Xuehao-Gao/GUESS.
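To make the coarse-to-fine scheme concrete, below is a minimal sketch of a GUESS-style cascade. The joint grouping, the two-level stage schedule, the fusion weights, and the denoise_stage stub are illustrative assumptions, not the released implementation; see the repository above for the actual model.

```python
import numpy as np

# Assumed grouping of a 22-joint skeleton into 6 body-part nodes;
# the paper's exact grouping may differ.
JOINT_GROUPS = [
    [0, 1, 2, 3],           # pelvis / spine
    [4, 5, 6],              # left leg
    [7, 8, 9],              # right leg
    [10, 11, 12],           # neck / head
    [13, 14, 15, 16],       # left arm
    [17, 18, 19, 20, 21],   # right arm
]

def abstract_pose(pose):
    """Replace each joint group with a single body-part node (its centroid)."""
    return np.stack([pose[g].mean(axis=0) for g in JOINT_GROUPS])

def denoise_stage(text_emb, prev_motion, w_text, w_motion, out_joints):
    """Stand-in for one latent-diffusion stage. A real stage iteratively
    denoises a latent conditioned on both signals; this stub only mimics
    the output shape so the sketch stays runnable."""
    bias = w_text * text_emb.mean() + w_motion * prev_motion.mean()
    return bias + 0.1 * np.random.randn(prev_motion.shape[0], out_joints, 3)

# Training targets at a coarse level come from abstracting real poses:
coarse_target = abstract_pose(np.random.randn(22, 3))        # (6, 3)

T = 60                                  # frames to generate
text_emb = np.random.randn(512)         # pooled text feature (stub)
motion = denoise_stage(text_emb, np.zeros((T, 1, 3)), 1.0, 0.0,
                       out_joints=len(JOINT_GROUPS))         # coarsest guess
for out_joints, (w_text, w_motion) in [(22, (0.5, 0.5))]:
    # A dynamic multi-condition fusion module would predict (w_text,
    # w_motion) per stage; fixed weights stand in for it here.
    motion = denoise_stage(text_emb, motion, w_text, w_motion, out_joints)
print(motion.shape)   # (60, 22, 3): the fully detailed skeleton
```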
Related papers
- Learning Generalizable Human Motion Generator with Reinforcement Learning [95.62084727984808]
Text-driven human motion generation is one of the vital tasks in computer-aided content creation.
Existing methods often overfit specific motion expressions in the training data, hindering their ability to generalize.
We present InstructMotion, which incorporates the trial-and-error paradigm of reinforcement learning for generalizable human motion generation.
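The summary names only the trial-and-error paradigm, so the following is a generic, assumed REINFORCE-style fine-tuning loop: the Gaussian motion "policy", the cosine-similarity reward, and all dimensions are hypothetical stand-ins rather than details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def reward(motion, text_feat):
    """Hypothetical reward: cosine similarity of motion and text features."""
    return float(motion @ text_feat /
                 (np.linalg.norm(motion) * np.linalg.norm(text_feat)))

theta = rng.normal(size=64)        # mean of a toy Gaussian motion "policy"
text_feat = rng.normal(size=64)    # embedding of the text prompt (stub)
lr = 0.05
for _ in range(200):               # the trial-and-error loop
    noise = rng.normal(size=64)
    motion = theta + noise         # trial: sample a motion code
    r = reward(motion, text_feat)  # feedback: how well it matches the text
    # REINFORCE: for N(theta, I), the log-prob gradient wrt theta is `noise`.
    theta += lr * r * noise
print(round(reward(theta, text_feat), 3))   # alignment rises over updates
```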
arXiv Detail & Related papers (2024-05-24T13:29:12Z)
- Scaling Up Dynamic Human-Scene Interaction Modeling [58.032368564071895]
TRUMANS is the most comprehensive motion-captured human-scene interaction (HSI) dataset currently available.
It captures whole-body human motions together with part-level object dynamics.
We devise a diffusion-based autoregressive model that efficiently generates HSI sequences of any length.
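One common way to realize an autoregressive diffusion model that produces sequences of any length is windowed generation with overlap conditioning; the sketch below assumes that reading. denoise_window is a stub for the actual (scene-conditioned) diffusion model, and the window, overlap, and 66-dim pose are made up.

```python
import numpy as np

def denoise_window(context, length, dim):
    """Stub for one diffusion call: generates `length` frames conditioned on
    the trailing `context` frames (a real model would iterate over noise
    levels and also condition on the scene and objects)."""
    base = context[-1] if len(context) else np.zeros(dim)
    return base + 0.05 * np.cumsum(np.random.randn(length, dim), axis=0)

def generate(total_frames, window=32, overlap=8, dim=66):
    """Arbitrary-length synthesis: each new window is seeded by the overlap
    region of the previous one, keeping the sequence continuous."""
    frames = denoise_window(np.empty((0, dim)), window, dim)
    while len(frames) < total_frames:
        new = denoise_window(frames[-overlap:], window - overlap, dim)
        frames = np.concatenate([frames, new])
    return frames[:total_frames]

print(generate(200).shape)   # (200, 66): any requested length
```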
arXiv Detail & Related papers (2024-03-13T15:45:04Z)
- AMD: Anatomical Motion Diffusion with Interpretable Motion Decomposition and Fusion [11.689663297469945]
We propose the Adaptable Motion Diffusion (AMD) model.
It exploits a Large Language Model (LLM) to parse the input text into a sequence of concise and interpretable anatomical scripts.
We then devise a two-branch fusion scheme that balances the influence of the input text and the anatomical scripts on the inverse diffusion process.
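The two-branch fusion can be pictured as blending two conditional noise estimates inside each reverse-diffusion step. Everything below (the two branch stubs, the fixed alpha, and the schematic update rule) is an assumed illustration, not the paper's sampler; in the paper the balance would be learned rather than fixed.

```python
import numpy as np

def eps_text(x_t, t):     # stub: denoiser branch conditioned on the raw text
    return 0.9 * x_t

def eps_script(x_t, t):   # stub: branch conditioned on the anatomical scripts
    return 1.1 * x_t

def fused_denoise_step(x_t, t, alpha=0.5):
    """One reverse step whose noise estimate blends the two branches;
    `alpha` plays the role of the learned text/script balance."""
    eps = alpha * eps_text(x_t, t) + (1.0 - alpha) * eps_script(x_t, t)
    return x_t - 0.1 * eps            # schematic update, not a real schedule

x = np.random.randn(60, 66)           # noisy motion: 60 frames, 66-dim pose
for t in reversed(range(50)):         # inverse diffusion process
    x = fused_denoise_step(x, t)
print(x.shape)
```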
arXiv Detail & Related papers (2023-12-20T04:49:45Z)
- A Unified Framework for Multimodal, Multi-Part Human Motion Synthesis [17.45562922442149]
We introduce a cohesive and scalable approach that consolidates multimodal (text, music, speech) and multi-part (hand, torso) human motion generation.
Our method frames the multimodal motion generation challenge as a token prediction task, drawing from specialized codebooks based on the modality of the control signal.
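A minimal sketch of the codebook idea: the modality of the control signal selects which codebook the token predictor draws from. The codebook sizes, the scoring rule, and the non-autoregressive sampling here are simplifying assumptions; the paper's predictor decodes tokens autoregressively.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-modality codebooks mapping token ids to motion codes.
CODEBOOKS = {
    "text":   rng.normal(size=(512, 64)),
    "music":  rng.normal(size=(512, 64)),
    "speech": rng.normal(size=(512, 64)),
}

def predict_tokens(signal_feat, modality, length):
    """Stub token predictor: scores each code against the control signal
    and samples, instead of decoding autoregressively."""
    book = CODEBOOKS[modality]
    scores = book @ signal_feat                        # (512,)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return rng.choice(len(book), size=length, p=probs)

def decode(tokens, modality):
    """Look up the motion code for each predicted token."""
    return CODEBOOKS[modality][tokens]

tokens = predict_tokens(rng.normal(size=64), "music", length=30)
print(decode(tokens, "music").shape)   # (30, 64): a sequence of motion codes
```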
arXiv Detail & Related papers (2023-11-28T04:13:49Z)
- Act As You Wish: Fine-Grained Control of Motion Diffusion Model with Hierarchical Semantic Graphs [31.244039305932287]
We propose hierarchical semantic graphs for fine-grained control over motion generation.
We disentangle motion descriptions into hierarchical semantic graphs with three levels: motions, actions, and specifics.
Our method can continuously refine the generated motion, which may have a far-reaching impact on the community.
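The three named levels suggest a simple tree-shaped structure. The sketch below is an assumed illustration of such a graph and of the node-level editing that fine-grained control implies; it is not the paper's actual graph format.

```python
from dataclasses import dataclass, field

@dataclass
class Specific:              # lowest level: a detail of one action
    text: str                # e.g. "slowly", "with the left hand"

@dataclass
class Action:                # middle level: one atomic action
    verb: str
    specifics: list[Specific] = field(default_factory=list)

@dataclass
class Motion:                # top level: the whole described motion
    summary: str
    actions: list[Action] = field(default_factory=list)

# Assumed parse of "a person waves slowly, then sits down":
graph = Motion(
    summary="wave then sit",
    actions=[Action("wave", [Specific("slowly")]), Action("sit down")],
)
# Fine-grained control: edit one node, then regenerate from the graph.
graph.actions[0].specifics[0] = Specific("quickly")
```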
arXiv Detail & Related papers (2023-11-02T06:20:23Z)
- SemanticBoost: Elevating Motion Generation with Augmented Textual Cues [73.83255805408126]
Our framework comprises a Semantic Enhancement module and a Context-Attuned Motion Denoiser (CAMD).
The CAMD approach provides a comprehensive solution for generating high-quality, semantically consistent motion sequences.
Our experimental results demonstrate that SemanticBoost, as a diffusion-based method, outperforms autoregressive techniques.
arXiv Detail & Related papers (2023-10-31T09:58:11Z)
- Hierarchical Generation of Human-Object Interactions with Diffusion Probabilistic Models [71.64318025625833]
This paper presents a novel approach to generating the 3D motion of a human interacting with a target object.
Our framework first generates a set of milestones and then synthesizes the motion along them.
The experiments on the NSM, COUCH, and SAMP datasets show that our approach outperforms previous methods by a large margin in both quality and diversity.
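A sketch of the milestone-then-infill scheme: key poses toward the target object are generated first, and dense motion is then synthesized between consecutive milestones. Both functions below are linear placeholders for the paper's learned generators, and the 66-dim pose vector is an assumption.

```python
import numpy as np

def generate_milestones(start, goal, n):
    """Placeholder milestone generator: noisy key poses on the way to the
    object (a real model samples these conditioned on the target object)."""
    ts = np.linspace(0.0, 1.0, n)[:, None]
    return (1 - ts) * start + ts * goal + 0.05 * np.random.randn(n, start.size)

def synthesize_segment(a, b, frames):
    """Placeholder infilling: dense motion between two milestones."""
    ts = np.linspace(0.0, 1.0, frames, endpoint=False)[:, None]
    return (1 - ts) * a + ts * b

start, goal = np.zeros(66), np.ones(66)    # start pose, pose at the object
keys = generate_milestones(start, goal, n=5)
motion = np.concatenate([synthesize_segment(keys[i], keys[i + 1], 20)
                         for i in range(len(keys) - 1)])
print(motion.shape)   # (80, 66): milestones first, then motion along them
```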
arXiv Detail & Related papers (2023-10-03T17:50:23Z)
- MoDi: Unconditional Motion Synthesis from Diverse Data [51.676055380546494]
We present MoDi, an unconditional generative model that synthesizes diverse motions.
Our model is trained in a completely unsupervised setting from a diverse, unstructured and unlabeled motion dataset.
We show that despite the lack of any structure in the dataset, the latent space can be semantically clustered.
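The clustering finding can be illustrated by sampling latent codes and probing them with plain k-means; the toy generator, latent size, and probe below are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(32, 66))          # toy decoder weights

def generator(z):
    """Hypothetical unconditional generator mapping latent codes to poses."""
    return np.tanh(z @ W)

# Sample latents, then run k-means to probe the latent-space structure.
Z = rng.normal(size=(500, 32))
k = 8
centers = Z[rng.choice(len(Z), k, replace=False)]
for _ in range(20):
    labels = np.argmin(((Z[:, None, :] - centers) ** 2).sum(-1), axis=1)
    centers = np.stack([Z[labels == j].mean(0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
# Motions decoded from one cluster should share semantics (e.g. a gait type).
print(generator(Z[labels == 0]).shape)
```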
arXiv Detail & Related papers (2022-06-16T09:06:25Z)
- GANimator: Neural Motion Synthesis from a Single Sequence [38.361579401046875]
We present GANimator, a generative model that learns to synthesize novel motions from a single, short motion sequence.
GANimator generates motions that resemble the core elements of the original motion, while simultaneously synthesizing novel and diverse movements.
We show a number of applications, including crowd simulation, key-frame editing, style transfer, and interactive control, which all learn from a single input sequence.
arXiv Detail & Related papers (2022-05-05T13:04:14Z)
- Generative Adversarial Graph Convolutional Networks for Human Action Synthesis [3.0664963196464448]
We propose Kinetic-GAN, a novel architecture to synthesise the kinetics of the human body.
The proposed adversarial architecture can condition up to 120 different actions over local and global body movements.
Our experiments were carried out on three well-known datasets.
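Conditioning a generator on up to 120 actions is commonly done by injecting a learned label embedding into the latent input; the sketch assumes that mechanism, with a toy linear decoder standing in for the paper's graph-convolutional stack and an assumed 25-joint skeleton.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_ACTIONS, Z_DIM, EMB = 120, 64, 16            # up to 120 action classes
LABEL_EMB = rng.normal(size=(NUM_ACTIONS, EMB))  # learned in practice
W = rng.normal(size=(Z_DIM + EMB, 25 * 3))       # toy decoder weights

def generate(action_id, frames=30):
    """Hypothetical conditional generator: the action embedding is
    concatenated to the latent, steering both local and global movement
    (the real model stacks graph convolutions over the skeleton)."""
    poses = []
    for _ in range(frames):
        z = rng.normal(size=Z_DIM)
        pose = np.tanh(np.concatenate([z, LABEL_EMB[action_id]]) @ W)
        poses.append(pose.reshape(25, 3))
    return np.stack(poses)

print(generate(action_id=42).shape)   # (30, 25, 3)
```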
arXiv Detail & Related papers (2021-10-21T15:01:32Z)
- Hierarchical Style-based Networks for Motion Synthesis [150.226137503563]
We propose a self-supervised method for generating long-range, diverse and plausible behaviors to achieve a specific goal location.
Our proposed method learns to model human motion by decomposing the long-range generation task hierarchically.
On a large-scale skeleton dataset, we show that the proposed method is able to synthesise long-range, diverse, and plausible motion.
arXiv Detail & Related papers (2020-08-24T02:11:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.