SemanticBoost: Elevating Motion Generation with Augmented Textual Cues
- URL: http://arxiv.org/abs/2310.20323v2
- Date: Tue, 28 Nov 2023 06:18:33 GMT
- Title: SemanticBoost: Elevating Motion Generation with Augmented Textual Cues
- Authors: Xin He, Shaoli Huang, Xiaohang Zhan, Chao Weng, Ying Shan
- Abstract summary: Our framework comprises a Semantic Enhancement module and a Context-Attuned Motion Denoiser (CAMD).
The CAMD approach provides an all-encompassing solution for generating high-quality, semantically consistent motion sequences.
Our experimental results demonstrate that SemanticBoost, as a diffusion-based method, outperforms auto-regressive-based techniques.
- Score: 73.83255805408126
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current techniques face difficulties in generating motions from intricate
semantic descriptions, primarily due to insufficient semantic annotations in
datasets and weak contextual understanding. To address these issues, we present
SemanticBoost, a novel framework that tackles both challenges simultaneously.
Our framework comprises a Semantic Enhancement module and a Context-Attuned
Motion Denoiser (CAMD). The Semantic Enhancement module extracts supplementary
semantics from motion data, enriching the dataset's textual description and
ensuring precise alignment between text and motion data without depending on
large language models. On the other hand, the CAMD approach provides an
all-encompassing solution for generating high-quality, semantically consistent
motion sequences by effectively capturing context information and aligning the
generated motion with the given textual descriptions. Distinct from existing
methods, our approach can synthesize accurate orientational movements, combined
motions based on specific body part descriptions, and motions generated from
complex, extended sentences. Our experimental results demonstrate that
SemanticBoost, as a diffusion-based method, outperforms auto-regressive-based
techniques, achieving cutting-edge performance on the HumanML3D dataset while
maintaining realistic and smooth motion generation quality.
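The Semantic Enhancement module is only sketched at a high level in the abstract. As a rough illustration of the idea, mining supplementary semantics directly from the motion data without a large language model, the hypothetical snippet below derives a coarse heading cue from a root-joint trajectory and appends it to the caption; the function name, the 30-degree threshold, and the coordinate layout are all assumptions, not the paper's actual rules.

```python
import numpy as np

def enrich_caption(caption: str, root_positions: np.ndarray) -> str:
    """Append a coarse heading cue derived from the root trajectory.

    root_positions: (T, 3) pelvis positions over T frames, assuming
    x = lateral and z = forward (a hypothetical skeleton layout).
    """
    dx, _, dz = root_positions[-1] - root_positions[0]
    heading = np.degrees(np.arctan2(dx, dz))  # angle off the forward axis
    if abs(heading) < 30:
        cue = "walking forward"
    elif heading > 0:
        cue = "veering to the right"
    else:
        cue = "veering to the left"
    return f"{caption}, {cue}"

# Example: a trajectory that drifts right faster than it advances.
traj = np.stack([np.linspace(0, 2, 60),      # x: strong rightward drift
                 np.zeros(60),               # y: height, unused here
                 np.linspace(0, 1, 60)], 1)  # z: forward progress
print(enrich_caption("a person walks", traj))
# -> "a person walks, veering to the right"
```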
Related papers
- CASIM: Composite Aware Semantic Injection for Text to Motion Generation [15.53049009014166]
We propose a composite-aware semantic injection mechanism that learns the dynamic correspondence between text and motion tokens.
Experiments on HumanML3D and KIT benchmarks demonstrate that CASIM consistently improves motion quality, text-motion alignment, and retrieval scores.
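The summary does not spell out the injection mechanism, but a minimal way to let every motion token attend over all text tokens is standard cross-attention; the sketch below is that generic baseline with made-up sizes, not CASIM's actual composite-aware design.

```python
import torch
import torch.nn as nn

class SemanticInjection(nn.Module):
    """Minimal sketch: inject text semantics into motion tokens via
    cross-attention, so the text-motion correspondence is learned
    rather than fixed. Module name and sizes are assumptions."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, motion_tokens, text_tokens):
        # Each motion token queries all text tokens for relevant semantics.
        injected, _ = self.attn(motion_tokens, text_tokens, text_tokens)
        return self.norm(motion_tokens + injected)

motion = torch.randn(2, 60, 256)  # (batch, frames, dim)
text = torch.randn(2, 12, 256)    # (batch, words, dim)
print(SemanticInjection()(motion, text).shape)  # torch.Size([2, 60, 256])
```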
arXiv Detail & Related papers (2025-02-04T07:22:07Z)
- Joint-Dataset Learning and Cross-Consistent Regularization for Text-to-Motion Retrieval [4.454835029368504]
We focus on the recently introduced text-motion retrieval task, which aims to search for the motion sequences most relevant to a natural-language description.
Despite recent efforts to explore these promising avenues, a primary challenge remains the insufficient data available to train robust text-motion models.
We propose to investigate joint-dataset learning, where we train on multiple text-motion datasets simultaneously.
We also introduce a transformer-based motion encoder, called MoT++, which employs spatio-temporal attention to process sequences of skeleton data.
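In its simplest form, joint-dataset learning just pools the corpora and samples batches across them; the toy sketch below does exactly that, glossing over skeleton unification and the paper's cross-consistent regularization. Dataset contents and shapes are placeholders.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, Dataset

class TextMotionDataset(Dataset):
    """Stand-in for a text-motion corpus such as HumanML3D or KIT;
    a real loader would return tokenized captions and joint data."""
    def __init__(self, pairs):
        self.pairs = pairs  # list of (caption, motion_tensor)
    def __len__(self):
        return len(self.pairs)
    def __getitem__(self, i):
        return self.pairs[i]

# Pool both corpora so the encoder sees both annotation styles in every
# epoch (skeleton differences between datasets are ignored here).
humanml = TextMotionDataset([("a person waves", torch.zeros(60, 22, 3))])
kit = TextMotionDataset([("someone jumps in place", torch.zeros(60, 22, 3))])
loader = DataLoader(ConcatDataset([humanml, kit]), batch_size=2, shuffle=True)
for captions, motions in loader:
    print(len(captions), motions.shape)  # 2 torch.Size([2, 60, 22, 3])
```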
arXiv Detail & Related papers (2024-07-02T09:43:47Z)
- Learning Generalizable Human Motion Generator with Reinforcement Learning [95.62084727984808]
Text-driven human motion generation is one of the vital tasks in computer-aided content creation.
Existing methods often overfit specific motion expressions in the training data, hindering their ability to generalize.
We present InstructMotion, which incorporates the trial-and-error paradigm of reinforcement learning for generalizable human motion generation.
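The summary names the trial-and-error paradigm without detail; the toy below shows the bare REINFORCE loop that phrase usually implies: sample an output, score it with a reward, and reinforce above-baseline samples. The linear policy, the token-3 reward, and all sizes are invented for illustration and are not InstructMotion's actual setup.

```python
import torch
import torch.nn as nn

# Toy policy: map a caption embedding to logits over 16 "motion tokens".
policy = nn.Linear(8, 16)
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

for step in range(200):
    caption_emb = torch.randn(32, 8)                 # fake caption batch
    dist = torch.distributions.Categorical(logits=policy(caption_emb))
    tokens = dist.sample()                           # trial
    rewards = (tokens == 3).float()                  # error signal (toy)
    advantage = rewards - rewards.mean()             # baseline subtraction
    loss = -(advantage * dist.log_prob(tokens)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```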
arXiv Detail & Related papers (2024-05-24T13:29:12Z)
- Dynamic Typography: Bringing Text to Life via Video Diffusion Prior [73.72522617586593]
We present an automated text animation scheme, termed "Dynamic Typography".
It deforms letters to convey semantic meaning and infuses them with vibrant movements based on user prompts.
Our technique harnesses vector graphics representations and an end-to-end optimization-based framework.
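As a caricature of the optimization-based framework, the sketch below treats a letter's outline as vector control points and optimizes per-frame displacements against a stand-in smoothness loss; the real system scores rendered frames with a pretrained video diffusion prior, which is omitted here. All names and sizes are assumptions.

```python
import torch

# Letter outline as vector control points; the per-frame displacements
# are the trainable parameters of the animation.
control_points = torch.randn(50, 2)                         # one outline
displacements = torch.zeros(16, 50, 2, requires_grad=True)  # 16 frames
opt = torch.optim.Adam([displacements], lr=0.05)

def prior_loss(frames):
    # Placeholder: penalize jitter between consecutive frames. The real
    # objective guides frames with a pretrained video diffusion model.
    return ((frames[1:] - frames[:-1]) ** 2).mean()

for _ in range(100):
    frames = control_points + displacements     # animated outlines
    loss = prior_loss(frames)
    opt.zero_grad()
    loss.backward()
    opt.step()
```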
arXiv Detail & Related papers (2024-04-17T17:59:55Z)
- Text2Data: Low-Resource Data Generation with Textual Control [100.5970757736845]
Text2Data is a novel approach that utilizes unlabeled data to understand the underlying data distribution.
It undergoes finetuning via a novel constraint optimization-based learning objective that ensures controllability and effectively counteracts catastrophic forgetting.
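The exact constraint optimization objective is not given in the summary; one common way to keep controllability while counteracting catastrophic forgetting is to penalize drift from the weights pretrained on unlabeled data while fitting the labeled data. The sketch below uses that simple parameter-space penalty as a stand-in; `lam` and all names are assumptions.

```python
import torch
import torch.nn as nn

def constrained_finetune_loss(model, pretrained_params, task_loss, lam=0.1):
    """Fit the labeled data (task_loss) while staying close to the
    parameters learned from unlabeled data. A generic penalty, not
    Text2Data's actual constraint formulation."""
    drift = sum(((p - p0) ** 2).sum()
                for p, p0 in zip(model.parameters(), pretrained_params))
    return task_loss + lam * drift

model = nn.Linear(4, 2)
frozen = [p.detach().clone() for p in model.parameters()]  # pretrained copy
x, y = torch.randn(8, 4), torch.randn(8, 2)
loss = constrained_finetune_loss(model, frozen, ((model(x) - y) ** 2).mean())
loss.backward()
```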
arXiv Detail & Related papers (2024-02-08T03:41:39Z)
- Act As You Wish: Fine-Grained Control of Motion Diffusion Model with Hierarchical Semantic Graphs [31.244039305932287]
We propose hierarchical semantic graphs for fine-grained control over motion generation.
We disentangle motion descriptions into hierarchical semantic graphs including three levels of motions, actions, and specifics.
Our method can continuously refine the generated motion, which may have a far-reaching impact on the community.
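The three-level hierarchy (motions, actions, specifics) maps naturally onto a small tree structure; a hypothetical encoding is shown below, with field names that are guesses rather than the paper's schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Specific:
    attribute: str           # e.g., "left arm", "slowly"

@dataclass
class Action:
    verb: str                # e.g., "raise"
    specifics: List[Specific] = field(default_factory=list)

@dataclass
class MotionGraph:
    description: str         # the overall motion (full sentence)
    actions: List[Action] = field(default_factory=list)

graph = MotionGraph(
    description="a person raises the left arm slowly, then turns",
    actions=[
        Action("raise", [Specific("left arm"), Specific("slowly")]),
        Action("turn"),
    ],
)
```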
arXiv Detail & Related papers (2023-11-02T06:20:23Z)
- Fg-T2M: Fine-Grained Text-Driven Human Motion Generation via Diffusion Model [11.873294782380984]
We propose a fine-grained method for generating high-quality, conditional human motion sequences supporting precise text description.
Our approach consists of two key components: 1) a linguistics-structure assisted module that constructs accurate and complete language feature to fully utilize text information; and 2) a context-aware progressive reasoning module that learns neighborhood and overall semantic linguistics features from shallow and deep graph neural networks to achieve a multi-step inference.
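To make the shallow-versus-deep distinction concrete, the sketch below runs word-node features through one round of generic mean-aggregation message passing (neighborhood structure) and then through a deeper stack (overall sentence context); it is a textbook GNN, not Fg-T2M's architecture, and the dependency adjacency is stubbed out.

```python
import torch
import torch.nn as nn

class GraphLayer(nn.Module):
    """One round of mean-aggregation message passing with a residual
    connection; a generic GNN layer, not Fg-T2M's exact design."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
    def forward(self, x, adj):
        deg = adj.sum(-1, keepdim=True).clamp(min=1)
        return torch.relu(self.proj(adj @ x / deg) + x)

dim, words = 64, 10
x = torch.randn(words, dim)   # word node features
adj = torch.eye(words)        # dependency-parse edges (stubbed out)
shallow = GraphLayer(dim)(x, adj)            # neighborhood features
deep = shallow
for layer in [GraphLayer(dim) for _ in range(4)]:
    deep = layer(deep, adj)                  # overall sentence features
# Both feature sets would then condition the diffusion denoiser.
```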
arXiv Detail & Related papers (2023-09-12T14:43:47Z)
- DiverseMotion: Towards Diverse Human Motion Generation via Discrete Diffusion [70.33381660741861]
We present DiverseMotion, a new approach for synthesizing high-quality human motions conditioned on textual descriptions.
We show that our DiverseMotion achieves the state-of-the-art motion quality and competitive motion diversity.
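Discrete diffusion over motion tokens typically starts from a fully masked sequence and iteratively commits the denoiser's most confident predictions; the sketch below implements one such reverse step, with random logits standing in for the denoiser. The MASK-token convention, commit fraction, and vocabulary size are assumptions, not DiverseMotion's actual schedule.

```python
import torch

VOCAB, MASK, LENGTH = 512, 512, 60
tokens = torch.full((LENGTH,), MASK)          # fully masked start

def reverse_step(tokens, logits, frac=0.1):
    """Commit the most confident predictions; keep the rest masked."""
    masked = (tokens == MASK).nonzero().squeeze(-1)
    probs = torch.softmax(logits[masked], dim=-1)
    conf, pred = probs.max(-1)
    k = max(1, int(frac * len(masked)))
    keep = conf.topk(k).indices
    tokens = tokens.clone()
    tokens[masked[keep]] = pred[keep]
    return tokens

logits = torch.randn(LENGTH, VOCAB)           # stand-in for the denoiser
tokens = reverse_step(tokens, logits)         # one reverse step
```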
arXiv Detail & Related papers (2023-09-04T05:43:48Z)
- Text-driven Video Prediction [83.04845684117835]
We propose a new task called Text-driven Video Prediction (TVP).
Taking the first frame and text caption as inputs, this task aims to synthesize the following frames.
To investigate the capability of text in causal inference for progressive motion information, our TVP framework contains a Text Inference Module (TIM).
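The summary only names the Text Inference Module; purely as an interface sketch, the snippet below fuses a caption feature with the first frame's feature into a motion cue that a downstream video decoder could condition on. Every name and shape here is hypothetical.

```python
import torch
import torch.nn as nn

class TextInferenceModule(nn.Module):
    """Hypothetical sketch of TVP's interface: fuse the caption with the
    first frame's features to infer a progressive motion cue. The real
    TIM's internals are not specified by the summary above."""
    def __init__(self, dim=128):
        super().__init__()
        self.fuse = nn.Linear(dim * 2, dim)
    def forward(self, text_feat, frame_feat):
        return torch.tanh(self.fuse(torch.cat([text_feat, frame_feat], -1)))

tim = TextInferenceModule()
motion_cue = tim(torch.randn(1, 128), torch.randn(1, 128))
# A video decoder (not shown) would condition on `motion_cue` and the
# first frame to synthesize the following frames.
```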
arXiv Detail & Related papers (2022-10-06T12:43:07Z)