LGTM: Local-to-Global Text-Driven Human Motion Diffusion Model
- URL: http://arxiv.org/abs/2405.03485v1
- Date: Mon, 6 May 2024 13:56:56 GMT
- Title: LGTM: Local-to-Global Text-Driven Human Motion Diffusion Model
- Authors: Haowen Sun, Ruikun Zheng, Haibin Huang, Chongyang Ma, Hui Huang, Ruizhen Hu
- Abstract summary: We introduce LGTM, a novel Local-to-Global pipeline for Text-to-Motion generation.
It aims to address the challenge of accurately translating textual descriptions into semantically coherent human motion in computer animation.
Our experiments demonstrate that LGTM achieves significant improvements in generating locally semantically-aligned human motion.
- Score: 23.864126853396527
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: In this paper, we introduce LGTM, a novel Local-to-Global pipeline for Text-to-Motion generation. LGTM utilizes a diffusion-based architecture and aims to address the challenge of accurately translating textual descriptions into semantically coherent human motion in computer animation. Specifically, traditional methods often struggle with semantic discrepancies, particularly in aligning specific motions to the correct body parts. To overcome this challenge, we propose a two-stage pipeline: it first employs large language models (LLMs) to decompose global motion descriptions into part-specific narratives, which are then processed by independent body-part motion encoders to ensure precise local semantic alignment. Finally, an attention-based full-body optimizer refines the motion generation results and guarantees the overall coherence. Our experiments demonstrate that LGTM achieves significant improvements in generating locally accurate, semantically-aligned human motion, marking a notable advancement in text-to-motion applications. Code and data for this paper are available at https://github.com/L-Sun/LGTM
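To make the pipeline concrete, below is a minimal, hypothetical sketch of the local-to-global flow described in the abstract: an LLM decomposition step (stubbed here), independent per-part encoders, and an attention-based full-body refiner. The body-part names, class names, prompt wording, and tensor sizes are illustrative assumptions, the diffusion denoising loop itself is omitted, and none of this reflects the actual code in the authors' repository.

```python
# Minimal, hypothetical sketch of a local-to-global text-to-motion flow
# (illustration only; not the authors' implementation).
import torch
import torch.nn as nn

def decompose_with_llm(global_text: str) -> dict:
    # Stage 1 (stubbed): an LLM would rewrite the global description into
    # part-specific narratives, e.g. via a prompt such as
    # "Split this motion description into per-body-part instructions: ..."
    return {
        "torso": "leans forward slightly",
        "left_arm": "swings naturally at the side",
        "right_arm": "waves above the head",
        "left_leg": "steps forward",
        "right_leg": "pushes off the ground",
    }

class PartEncoder(nn.Module):
    # Stage 2a: an independent encoder maps a part-specific text embedding
    # to a local motion feature for that body part.
    def __init__(self, text_dim=64, motion_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim, 128), nn.ReLU(), nn.Linear(128, motion_dim)
        )

    def forward(self, text_emb):
        return self.net(text_emb)

class FullBodyOptimizer(nn.Module):
    # Stage 2b: attention across the per-part features enforces whole-body coherence.
    def __init__(self, motion_dim=32, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(motion_dim, heads, batch_first=True)
        self.out = nn.Linear(motion_dim, motion_dim)

    def forward(self, part_feats):            # (batch, n_parts, motion_dim)
        fused, _ = self.attn(part_feats, part_feats, part_feats)
        return self.out(fused)                # refined per-part features

# Toy forward pass with random vectors standing in for a real text encoder.
narratives = decompose_with_llm("a person walks forward while waving the right hand")
encoders = {part: PartEncoder() for part in narratives}
text_embs = {part: torch.randn(1, 64) for part in narratives}  # placeholder embeddings
local_feats = torch.stack([encoders[p](text_embs[p]) for p in narratives], dim=1)
refined = FullBodyOptimizer()(local_feats)
print(refined.shape)                          # torch.Size([1, 5, 32])
```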
Related papers
- Absolute Coordinates Make Motion Generation Easy [8.153961351540834]
State-of-the-art text-to-motion generation models rely on the kinematic-aware, local-relative motion representation popularized by HumanML3D.
We propose a radically simplified and long-abandoned alternative for text-to-motion generation: absolute joint coordinates in global space.
arXiv Detail & Related papers (2025-05-26T00:36:00Z)
- UniHM: Universal Human Motion Generation with Object Interactions in Indoor Scenes [26.71077287710599]
We propose UniHM, a unified motion language model that leverages diffusion-based generation for scene-aware human motion.
UniHM is the first framework to support both Text-to-Motion and Text-to-Human-Object Interaction (HOI) in complex 3D scenes.
Our approach introduces three key contributions: (1) a mixed-motion representation that fuses continuous 6DoF motion with discrete local motion tokens to improve motion realism; (2) a novel Look-Up-Free Quantization VAE that surpasses traditional VQ-VAEs in both reconstruction accuracy and generative performance; and (3) an enriched version of
arXiv Detail & Related papers (2025-05-19T07:02:12Z)
- ReAlign: Bilingual Text-to-Motion Generation via Step-Aware Reward-Guided Alignment [48.894439350114396]
We propose a novel bilingual human motion dataset, BiHumanML3D, which establishes a crucial benchmark for bilingual text-to-motion generation models.
We also propose a Bilingual Motion Diffusion model (BiMD), which leverages cross-lingual aligned representations to capture semantics.
We show that our approach significantly improves text-motion alignment and motion quality compared to existing state-of-the-art methods.
arXiv Detail & Related papers (2025-05-08T06:19:18Z)
- FTMoMamba: Motion Generation with Frequency and Text State Space Models [53.60865359814126]
We propose a novel diffusion-based FTMoMamba framework equipped with a Frequency State Space Model and a Text State Space Model.
To learn fine-grained representations, FreqSSM decomposes motion sequences into low-frequency and high-frequency components (a conceptual sketch of such a split appears after this list).
To ensure the consistency between text and motion, TextSSM encodes text features at the sentence level.
arXiv Detail & Related papers (2024-11-26T15:48:12Z)
- KinMo: Kinematic-aware Human Motion Understanding and Generation [6.962697597686156]
Controlling human motion based on text presents an important challenge in computer vision.
Traditional approaches often rely on holistic action descriptions for motion synthesis.
We propose a novel motion representation that decomposes motion into distinct body joint group movements.
arXiv Detail & Related papers (2024-11-23T06:50:11Z)
- DART: A Diffusion-Based Autoregressive Motion Model for Real-Time Text-Driven Motion Control [12.465927271402442]
Text-conditioned human motion generation allows for user interaction through natural language.
DART is a Diffusion-based Autoregressive motion primitive model for Real-time Text-driven motion control.
We present effective algorithms for both approaches, demonstrating our model's versatility and superior performance in various motion synthesis tasks.
arXiv Detail & Related papers (2024-10-07T17:58:22Z)
- Local Action-Guided Motion Diffusion Model for Text-to-Motion Generation [52.87672306545577]
Existing motion generation methods primarily focus on the direct synthesis of global motions.
We propose the local action-guided motion diffusion model, which facilitates global motion generation by utilizing local actions as fine-grained control signals.
Our method provides flexibility in seamlessly combining various local actions and continuous guiding weight adjustment.
arXiv Detail & Related papers (2024-07-15T08:35:00Z)
- Seamless Human Motion Composition with Blended Positional Encodings [38.85158088021282]
We introduce FlowMDM, the first diffusion-based model that generates seamless Human Motion Compositions (HMC) without postprocessing or redundant denoising steps.
We achieve state-of-the-art results in terms of accuracy, realism, and smoothness on the Babel and HumanML3D datasets.
arXiv Detail & Related papers (2024-02-23T18:59:40Z)
- BOTH2Hands: Inferring 3D Hands from Both Text Prompts and Body Dynamics [50.88842027976421]
We propose BOTH57M, a novel multi-modal dataset for two-hand motion generation.
Our dataset includes accurate motion tracking for the human body and hands.
We also provide a strong baseline method, BOTH2Hands, for the novel task.
arXiv Detail & Related papers (2023-12-13T07:30:19Z)
- FG-MDM: Towards Zero-Shot Human Motion Generation via ChatGPT-Refined Descriptions [19.695991127631974]
We propose a new framework named Fine-Grained Human Motion Diffusion Model (FG-MDM) for zero-shot human motion generation.
Specifically, we first parse previous vague textual annotations into fine-grained descriptions of different body parts.
FG-MDM can generate human motions beyond the scope of the original datasets, owing to fine-grained descriptions that more closely capture the essence of the motion.
arXiv Detail & Related papers (2023-12-05T14:01:43Z)
- TLControl: Trajectory and Language Control for Human Motion Synthesis [68.09806223962323]
We present TLControl, a novel method for realistic human motion synthesis.
It incorporates both low-level Trajectory and high-level Language semantic controls.
It is practical for interactive and high-quality animation generation.
arXiv Detail & Related papers (2023-11-28T18:54:16Z)
- Act As You Wish: Fine-Grained Control of Motion Diffusion Model with Hierarchical Semantic Graphs [31.244039305932287]
We propose hierarchical semantic graphs for fine-grained control over motion generation.
We disentangle motion descriptions into hierarchical semantic graphs comprising three levels: motions, actions, and specifics.
Our method can continuously refine the generated motion, which may have a far-reaching impact on the community.
arXiv Detail & Related papers (2023-11-02T06:20:23Z)
- Priority-Centric Human Motion Generation in Discrete Latent Space [59.401128190423535]
We introduce a Priority-Centric Motion Discrete Diffusion Model (M2DM) for text-to-motion generation.
M2DM incorporates a global self-attention mechanism and a regularization term to counteract code collapse.
We also present a motion discrete diffusion model that employs an innovative noise schedule, determined by the significance of each motion token.
arXiv Detail & Related papers (2023-08-28T10:40:16Z)
- REMOT: A Region-to-Whole Framework for Realistic Human Motion Transfer [96.64111294772141]
Given an image of a source person, Human Video Motion Transfer (HVMT) aims to generate a video of that person imitating the motion of a driving person.
Existing methods for HVMT mainly exploit Generative Adversarial Networks (GANs) to perform the warping operation.
This paper presents a novel REgion-to-whole human MOtion Transfer framework based on GANs.
arXiv Detail & Related papers (2022-09-01T14:03:51Z)
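Several entries above separate coarse, slowly varying structure from fine-grained detail; the FTMoMamba summary, for example, describes decomposing motion sequences into low-frequency and high-frequency components. As a purely conceptual illustration of that split, the following sketch uses a simple moving-average filter rather than the learned FreqSSM state-space module from the paper; the function name, window size, and toy data are assumptions made here for illustration.

```python
# Conceptual low/high-frequency split of a motion sequence
# (illustration only; not FTMoMamba's learned FreqSSM module).
import numpy as np

def split_frequencies(motion, window=9):
    # motion: (frames, features) array of joint positions or rotations.
    # Returns (low, high): a moving-average smoothing of each channel
    # and the residual fine-grained detail.
    kernel = np.ones(window) / window
    pad = window // 2
    padded = np.pad(motion, ((pad, pad), (0, 0)), mode="edge")  # keep original length
    low = np.stack(
        [np.convolve(padded[:, c], kernel, mode="valid") for c in range(motion.shape[1])],
        axis=1,
    )
    high = motion - low
    return low, high

# Toy example: 120 frames of a 66-dimensional pose vector (22 joints x 3 coordinates).
motion = np.cumsum(0.01 * np.random.randn(120, 66), axis=0)
low, high = split_frequencies(motion)
print(low.shape, high.shape)   # (120, 66) (120, 66)
```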