SemanticBoost: Elevating Motion Generation with Augmented Textual Cues
- URL: http://arxiv.org/abs/2310.20323v2
- Date: Tue, 28 Nov 2023 06:18:33 GMT
- Title: SemanticBoost: Elevating Motion Generation with Augmented Textual Cues
- Authors: Xin He, Shaoli Huang, Xiaohang Zhan, Chao Weng, Ying Shan
- Abstract summary: Our framework comprises a Semantic Enhancement module and a Context-Attuned Motion Denoiser (CAMD).
The CAMD approach provides an all-encompassing solution for generating high-quality, semantically consistent motion sequences.
Our experimental results demonstrate that SemanticBoost, as a diffusion-based method, outperforms auto-regressive-based techniques.
- Score: 73.83255805408126
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current techniques face difficulties in generating motions from intricate
semantic descriptions, primarily due to insufficient semantic annotations in
datasets and weak contextual understanding. To address these issues, we present
SemanticBoost, a novel framework that tackles both challenges simultaneously.
Our framework comprises a Semantic Enhancement module and a Context-Attuned
Motion Denoiser (CAMD). The Semantic Enhancement module extracts supplementary
semantics from motion data, enriching the dataset's textual description and
ensuring precise alignment between text and motion data without depending on
large language models. On the other hand, the CAMD approach provides an
all-encompassing solution for generating high-quality, semantically consistent
motion sequences by effectively capturing context information and aligning the
generated motion with the given textual descriptions. Distinct from existing
methods, our approach can synthesize accurate orientational movements, combined
motions based on specific body part descriptions, and motions generated from
complex, extended sentences. Our experimental results demonstrate that
SemanticBoost, as a diffusion-based method, outperforms auto-regressive-based
techniques, achieving cutting-edge performance on the HumanML3D dataset while
maintaining realistic and smooth motion generation quality.
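
As an illustration of the general recipe the abstract describes, the sketch below shows how a text-conditioned motion diffusion denoiser can be trained with a standard DDPM-style noise-prediction objective. All module names, feature dimensions, and the conditioning scheme are illustrative assumptions, not the paper's actual CAMD implementation.

```python
# Minimal sketch of a text-conditioned motion diffusion training step
# (illustrative only; not the paper's CAMD architecture).
import torch
import torch.nn as nn

class MotionDenoiser(nn.Module):
    """Predicts the noise added to a motion sequence, conditioned on a text embedding."""
    def __init__(self, motion_dim=263, text_dim=512, hidden=512, layers=4):
        super().__init__()
        self.in_proj = nn.Linear(motion_dim, hidden)      # motion_dim=263 matches HumanML3D pose features
        self.text_proj = nn.Linear(text_dim, hidden)
        self.time_embed = nn.Sequential(nn.Linear(1, hidden), nn.SiLU(), nn.Linear(hidden, hidden))
        enc_layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.out_proj = nn.Linear(hidden, motion_dim)

    def forward(self, noisy_motion, t, text_emb):
        # noisy_motion: (B, T, motion_dim), t: (B,) timesteps, text_emb: (B, text_dim)
        h = self.in_proj(noisy_motion)
        cond = self.text_proj(text_emb) + self.time_embed(t.float().unsqueeze(-1))
        h = torch.cat([cond.unsqueeze(1), h], dim=1)      # prepend a conditioning token
        h = self.backbone(h)[:, 1:]                       # drop it again after attention
        return self.out_proj(h)

def diffusion_training_step(model, motion, text_emb, alphas_cumprod):
    """One DDPM-style step: noise the motion at a random timestep, predict the noise."""
    B = motion.size(0)
    t = torch.randint(0, alphas_cumprod.size(0), (B,), device=motion.device)
    a_bar = alphas_cumprod[t].view(B, 1, 1)
    noise = torch.randn_like(motion)
    noisy = a_bar.sqrt() * motion + (1 - a_bar).sqrt() * noise
    return nn.functional.mse_loss(model(noisy, t, text_emb), noise)
```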
Related papers
- Joint-Dataset Learning and Cross-Consistent Regularization for Text-to-Motion Retrieval [4.454835029368504]
We focus on the recently introduced text-motion retrieval task, which aims to search for motion sequences that are most relevant to a natural language description.
Despite recent efforts to explore these promising avenues, a primary challenge remains the insufficient data available to train robust text-motion models.
We propose to investigate joint-dataset learning - where we train on multiple text-motion datasets simultaneously.
We also introduce a transformer-based motion encoder, called MoT++, which employs spatio-temporal attention to process sequences of skeleton data.
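
A hedged sketch of what a spatio-temporal transformer encoder over skeleton sequences can look like: joints are attended within each frame, then frames are attended over time. Layer sizes, pooling, and the joint representation below are assumptions, not MoT++'s actual design.

```python
# Generic spatio-temporal transformer encoder for skeleton sequences (illustrative).
import torch
import torch.nn as nn

class SpatioTemporalMotionEncoder(nn.Module):
    def __init__(self, joint_dim=3, d_model=256, n_layers=4):
        super().__init__()
        self.joint_proj = nn.Linear(joint_dim, d_model)
        spatial = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        temporal = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.spatial = nn.TransformerEncoder(spatial, n_layers)
        self.temporal = nn.TransformerEncoder(temporal, n_layers)

    def forward(self, skeleton):
        # skeleton: (B, T, J, 3) joint positions for T frames and J joints
        B, T, J, _ = skeleton.shape
        x = self.joint_proj(skeleton)                 # (B, T, J, d_model)
        x = self.spatial(x.reshape(B * T, J, -1))     # attend across joints within a frame
        x = x.mean(dim=1).reshape(B, T, -1)           # pool joints into per-frame features
        x = self.temporal(x)                          # attend across frames
        return x.mean(dim=1)                          # (B, d_model) sequence embedding
```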
arXiv Detail & Related papers (2024-07-02T09:43:47Z)
- Learning Generalizable Human Motion Generator with Reinforcement Learning [95.62084727984808]
Text-driven human motion generation is one of the vital tasks in computer-aided content creation.
Existing methods often overfit specific motion expressions in the training data, hindering their ability to generalize.
We present InstructMotion, which incorporates the trial-and-error paradigm of reinforcement learning for generalizable human motion generation.
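
The trial-and-error idea can be illustrated with a generic REINFORCE-style fine-tuning step, where sampled motions are scored by a reward model and the generator is pushed toward higher-reward samples. The `generator.sample` and `reward_model` interfaces below are hypothetical placeholders, not InstructMotion's actual objective.

```python
# Generic policy-gradient fine-tuning step for a text-to-motion generator (illustrative).
import torch

def rl_finetune_step(generator, reward_model, optimizer, text_batch):
    # Sample motions and their log-probabilities under the current policy.
    motions, log_probs = generator.sample(text_batch)   # hypothetical: (B, T, D), (B,)
    with torch.no_grad():
        rewards = reward_model(text_batch, motions)      # hypothetical text-motion alignment score, (B,)
    baseline = rewards.mean()                            # simple variance-reduction baseline
    loss = -((rewards - baseline) * log_probs).mean()    # REINFORCE objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), rewards.mean().item()
```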
arXiv Detail & Related papers (2024-05-24T13:29:12Z)
- Dynamic Typography: Bringing Text to Life via Video Diffusion Prior [73.72522617586593]
We present an automated text animation scheme, termed "Dynamic Typography".
It deforms letters to convey semantic meaning and infuses them with vibrant movements based on user prompts.
Our technique harnesses vector graphics representations and an end-to-end optimization-based framework.
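
A rough sketch of the optimization-based framing: per-frame offsets of a glyph's vector control points are optimized so that the rasterized frames score well under a video diffusion prior. `rasterize` and `video_diffusion_loss` are hypothetical stand-ins for a differentiable rasterizer and a score-distillation-style loss, not the paper's pipeline.

```python
# Illustrative optimization loop for letter deformation under a video prior.
import torch

def animate_letter(control_points, prompt, rasterize, video_diffusion_loss, steps=500):
    # control_points: (T, N, 2) per-frame vector control points of the glyph outline
    offsets = torch.zeros_like(control_points, requires_grad=True)
    opt = torch.optim.Adam([offsets], lr=1e-2)
    for _ in range(steps):
        frames = rasterize(control_points + offsets)       # hypothetical differentiable rasterizer
        loss = video_diffusion_loss(frames, prompt)        # hypothetical prior-guided loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return control_points + offsets.detach()
```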
arXiv Detail & Related papers (2024-04-17T17:59:55Z)
- Act As You Wish: Fine-Grained Control of Motion Diffusion Model with Hierarchical Semantic Graphs [31.244039305932287]
We propose hierarchical semantic graphs for fine-grained control over motion generation.
We disentangle motion descriptions into hierarchical semantic graphs with three levels: motions, actions, and specifics.
Our method can continuously refine the generated motion, which may have a far-reaching impact on the community.
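
The three-level hierarchy can be pictured as a simple data structure, sketched below with illustrative field names (not the paper's actual graph construction): an overall motion node, its constituent actions, and the specifics attached to each action.

```python
# Toy representation of a motion -> actions -> specifics hierarchy (illustrative).
from dataclasses import dataclass, field
from typing import List

@dataclass
class Specific:
    text: str                      # e.g. "with the left hand"

@dataclass
class Action:
    verb: str                      # e.g. "wave"
    specifics: List[Specific] = field(default_factory=list)

@dataclass
class MotionGraph:
    description: str               # the full motion description
    actions: List[Action] = field(default_factory=list)

graph = MotionGraph(
    description="a person waves with the left hand while walking forward",
    actions=[
        Action("wave", [Specific("with the left hand")]),
        Action("walk", [Specific("forward")]),
    ],
)
```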
arXiv Detail & Related papers (2023-11-02T06:20:23Z)
- Fg-T2M: Fine-Grained Text-Driven Human Motion Generation via Diffusion Model [11.873294782380984]
We propose a fine-grained method for generating high-quality, conditional human motion sequences supporting precise text description.
Our approach consists of two key components: 1) a linguistics-structure assisted module that constructs accurate and complete language features to fully utilize text information; and 2) a context-aware progressive reasoning module that learns neighborhood and overall semantic linguistic features from shallow and deep graph neural networks to achieve multi-step inference.
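
The shallow-versus-deep reasoning idea can be sketched as message passing over a sentence graph, keeping features from both an early and a late propagation step. This is a generic graph-reasoning illustration, not Fg-T2M's code; the adjacency matrix is assumed to come from, for example, a dependency parse.

```python
# Generic graph propagation over word features, keeping shallow and deep outputs (illustrative).
import torch
import torch.nn as nn

class GraphReasoner(nn.Module):
    def __init__(self, dim=256, depth=4):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(depth)])

    def forward(self, word_feats, adj):
        # word_feats: (N, dim) word embeddings, adj: (N, N) normalized adjacency matrix
        h, shallow = word_feats, None
        for i, layer in enumerate(self.layers):
            h = torch.relu(layer(adj @ h))   # one round of neighborhood aggregation
            if i == 0:
                shallow = h                  # neighborhood-level (shallow) features
        return shallow, h                    # shallow and overall (deep) features
```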
arXiv Detail & Related papers (2023-09-12T14:43:47Z)
- DiverseMotion: Towards Diverse Human Motion Generation via Discrete Diffusion [70.33381660741861]
We present DiverseMotion, a new approach for synthesizing high-quality human motions conditioned on textual descriptions.
We show that our DiverseMotion achieves the state-of-the-art motion quality and competitive motion diversity.
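
Discrete diffusion over quantized motion tokens can be sketched as an absorbing-state (mask-and-predict) corruption process; the masking schedule, mask token id, and `model` interface below are assumptions, not the paper's exact formulation.

```python
# Schematic mask-and-predict training step for discrete diffusion over motion tokens (illustrative).
import torch
import torch.nn.functional as F

def discrete_diffusion_step(model, motion_tokens, text_emb, num_steps=100, mask_id=1024):
    # motion_tokens: (B, T) indices from a motion VQ codebook
    B, T = motion_tokens.shape
    t = torch.randint(1, num_steps + 1, (B, 1), device=motion_tokens.device)
    mask_prob = t.float() / num_steps                              # more masking at later timesteps
    mask = torch.rand(B, T, device=motion_tokens.device) < mask_prob
    corrupted = motion_tokens.masked_fill(mask, mask_id)           # replace with the mask token
    logits = model(corrupted, t.squeeze(1), text_emb)              # hypothetical denoiser, (B, T, vocab)
    return F.cross_entropy(logits[mask], motion_tokens[mask])      # recover the masked tokens
```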
arXiv Detail & Related papers (2023-09-04T05:43:48Z)
- Text-to-Motion Retrieval: Towards Joint Understanding of Human Motion Data and Natural Language [4.86658723641864]
We propose a novel text-to-motion retrieval task, which aims at retrieving relevant motions based on a specified natural language description.
Inspired by the recent progress in text-to-image/video matching, we experiment with two widely-adopted metric-learning loss functions.
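
One widely adopted metric-learning choice for such retrieval is a symmetric contrastive (InfoNCE-style) loss between paired text and motion embeddings, sketched below in a generic form; the encoders that produce the embeddings are assumed to exist separately.

```python
# Symmetric contrastive loss between paired text and motion embeddings (generic formulation).
import torch
import torch.nn.functional as F

def contrastive_loss(text_emb, motion_emb, temperature=0.07):
    # text_emb, motion_emb: (B, D); row i of each is a matching text-motion pair
    text_emb = F.normalize(text_emb, dim=-1)
    motion_emb = F.normalize(motion_emb, dim=-1)
    logits = text_emb @ motion_emb.t() / temperature     # (B, B) similarity matrix
    targets = torch.arange(text_emb.size(0), device=text_emb.device)
    # Match each text to its motion and each motion to its text.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```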
arXiv Detail & Related papers (2023-05-25T08:32:41Z)
- Text-driven Video Prediction [83.04845684117835]
We propose a new task called Text-driven Video Prediction (TVP).
Taking the first frame and text caption as inputs, this task aims to synthesize the following frames.
To investigate the capability of text in causal inference for progressive motion information, our TVP framework contains a Text Inference Module (TIM).
arXiv Detail & Related papers (2022-10-06T12:43:07Z)
- MoDi: Unconditional Motion Synthesis from Diverse Data [51.676055380546494]
We present MoDi, an unconditional generative model that synthesizes diverse motions.
Our model is trained in a completely unsupervised setting from a diverse, unstructured and unlabeled motion dataset.
We show that despite the lack of any structure in the dataset, the latent space can be semantically clustered.
arXiv Detail & Related papers (2022-06-16T09:06:25Z)