BiPO: Bidirectional Partial Occlusion Network for Text-to-Motion Synthesis
- URL: http://arxiv.org/abs/2412.00112v1
- Date: Thu, 28 Nov 2024 05:42:47 GMT
- Title: BiPO: Bidirectional Partial Occlusion Network for Text-to-Motion Synthesis
- Authors: Seong-Eun Hong, Soobin Lim, Juyeong Hwang, Minwook Chang, Hyeongyeop Kang
- Abstract summary: BiPO is a novel model that enhances text-to-motion synthesis.
It integrates part-based generation with a bidirectional autoregressive architecture.
BiPO achieves state-of-the-art performance on the HumanML3D dataset.
- Score: 0.4893345190925178
- Abstract: Generating natural and expressive human motions from textual descriptions is challenging due to the complexity of coordinating full-body dynamics and capturing nuanced motion patterns over extended sequences that accurately reflect the given text. To address this, we introduce BiPO, a Bidirectional Partial Occlusion Network for Text-to-Motion Synthesis, a novel model that enhances text-to-motion synthesis by integrating part-based generation with a bidirectional autoregressive architecture. This integration allows BiPO to consider both past and future contexts during generation while enhancing detailed control over individual body parts, without requiring ground-truth motion length. To relax the interdependency among body parts caused by the integration, we devise the Partial Occlusion technique, which probabilistically occludes certain motion-part information during training. In our comprehensive experiments, BiPO achieves state-of-the-art performance on the HumanML3D dataset, outperforming recent methods such as ParCo, MoMask, and BAMM in terms of FID scores and overall motion quality. Notably, BiPO excels not only in the text-to-motion generation task but also in motion editing tasks that synthesize motion from partially generated motion sequences and textual descriptions. These results demonstrate BiPO's effectiveness in advancing text-to-motion synthesis and its potential for practical applications.
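For intuition, the Partial Occlusion technique described above can be pictured as part-level dropout: during training, entire body-part feature tracks are randomly hidden so that no part generator learns to depend on always seeing the others. The sketch below is a minimal reading of the abstract, not BiPO's released code; the tensor layout and function name are assumptions.

```python
import torch

def partially_occlude(part_feats: torch.Tensor, occlude_prob: float = 0.3) -> torch.Tensor:
    """Randomly hide whole body-part feature tracks during training.

    part_feats: (batch, num_parts, seq_len, dim) per-part motion features.
    Each part is occluded independently with probability `occlude_prob`.
    """
    batch, num_parts = part_feats.shape[:2]
    # keep-mask sampled per (sample, part): 1 = visible, 0 = occluded
    keep = torch.rand(batch, num_parts, device=part_feats.device) > occlude_prob
    return part_feats * keep[:, :, None, None].to(part_feats.dtype)
```

Because any part's context can vanish at training time, each part generator must stay predictive on its own, which is one way to realize "relaxing the interdependency among body parts".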
Related papers
- Fg-T2M++: LLMs-Augmented Fine-Grained Text Driven Human Motion Generation [19.094098673523263]
We propose a novel framework for fine-grained text-driven human motion generation.
Fg-T2M++ consists of: (1) an LLM-based semantic parsing module that extracts body-part descriptions and semantics from text, (2) a hyperbolic text representation module that encodes relational information between text units, and (3) a multi-modal fusion module that hierarchically fuses text and motion features.
arXiv Detail & Related papers (2025-02-08T11:38:12Z)
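On the hyperbolic text representation in Fg-T2M++ above: hierarchical relations between text units (e.g. body, arm, and wrist phrases) embed with low distortion in hyperbolic space. As a hedged illustration of the underlying geometry only, not the paper's actual encoder, here is the standard Poincare-ball distance:

```python
import torch

def poincare_distance(u: torch.Tensor, v: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Geodesic distance in the Poincare ball; inputs must have norm < 1.

    Tree-like hierarchies embed in this space with low distortion, which
    is the usual motivation for hyperbolic text representations.
    """
    sq_dist = (u - v).pow(2).sum(dim=-1)
    denom_u = (1.0 - u.pow(2).sum(dim=-1)).clamp_min(eps)
    denom_v = (1.0 - v.pow(2).sum(dim=-1)).clamp_min(eps)
    x = 1.0 + 2.0 * sq_dist / (denom_u * denom_v)
    return torch.acosh(x.clamp_min(1.0 + eps))
```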
- CASIM: Composite Aware Semantic Injection for Text to Motion Generation [15.53049009014166]
We propose a composite-aware semantic injection mechanism that learns the dynamic correspondence between text and motion tokens.
Experiments on HumanML3D and KIT benchmarks demonstrate that CASIM consistently improves motion quality, text-motion alignment, and retrieval scores.
arXiv Detail & Related papers (2025-02-04T07:22:07Z)
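One plausible realization of CASIM's token-level text-motion correspondence is cross-attention from motion tokens to per-word text tokens, rather than conditioning on a single pooled sentence vector. A minimal sketch; the module name and dimensions are assumptions, not the published architecture:

```python
import torch
import torch.nn as nn

class SemanticInjection(nn.Module):
    """Each motion token attends over per-word text tokens, so individual
    frames can bind to the words that describe them."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, motion_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # motion_tokens: (B, T, dim); text_tokens: (B, L, dim)
        injected, _ = self.attn(motion_tokens, text_tokens, text_tokens)
        return self.norm(motion_tokens + injected)
```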
- Learning Generalizable Human Motion Generator with Reinforcement Learning [95.62084727984808]
Text-driven human motion generation is one of the vital tasks in computer-aided content creation.
Existing methods often overfit specific motion expressions in the training data, hindering their ability to generalize.
We present InstructMotion, which incorporates the trial-and-error paradigm of reinforcement learning for generalizable human motion generation.
arXiv Detail & Related papers (2024-05-24T13:29:12Z)
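The trial-and-error paradigm maps naturally onto a policy-gradient update: sample motions, score how well they match the caption, and reinforce high-reward samples. A minimal REINFORCE-style sketch, where `generator.sample` and `reward_fn` are hypothetical interfaces rather than the paper's API:

```python
import torch

def reinforce_step(generator, reward_fn, captions, optimizer) -> float:
    """One policy-gradient update on a motion generator."""
    motions, log_probs = generator.sample(captions)   # (B, T, D), (B,)
    with torch.no_grad():
        rewards = reward_fn(motions, captions)        # (B,) alignment scores
        baseline = rewards.mean()                     # simple variance reduction
    loss = -((rewards - baseline) * log_probs).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Rewarding text-motion agreement rather than imitating reference clips is what lets such training generalize beyond the motion expressions seen in the dataset.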
- ParCo: Part-Coordinating Text-to-Motion Synthesis [48.67225204910634]
We propose Part-Coordinating Text-to-Motion Synthesis (ParCo).
ParCo is endowed with enhanced capabilities for understanding part motions and for communication among the different part-motion generators.
Our approach demonstrates superior performance on common benchmarks at low computational cost.
arXiv Detail & Related papers (2024-03-27T12:41:30Z)
- THOR: Text to Human-Object Interaction Diffusion via Relation Intervention [51.02435289160616]
We propose a novel Text-guided Human-Object Interaction diffusion model with Relation Intervention (THOR).
In each diffusion step, we initiate text-guided human and object motion and then leverage human-object relations to intervene in object motion.
We construct Text-BEHAVE, a Text2HOI dataset that seamlessly integrates textual descriptions with the currently largest publicly available 3D HOI dataset.
arXiv Detail & Related papers (2024-03-17T13:17:25Z)
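THOR's per-step intervention can be pictured as a correction applied to the object trajectory inside the sampling loop. A hedged sketch; `denoiser` and `relate` are hypothetical callables standing in for the paper's networks:

```python
def sample_with_relation_intervention(denoiser, relate, x_human, x_object, text_emb, steps: int):
    """Reverse-diffusion loop with relation-based object correction."""
    for t in reversed(range(steps)):
        # jointly denoise human and object motion, conditioned on the text
        x_human, x_object = denoiser(x_human, x_object, text_emb, t)
        # intervene: nudge the object toward relation-consistent placement
        x_object = x_object + relate(x_human, x_object)
    return x_human, x_object
```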
- GUESS: GradUally Enriching SyntheSis for Text-Driven Human Motion Generation [23.435588151215594]
We propose a novel cascaded diffusion-based generative framework for text-driven human motion synthesis.
The framework follows a strategy named GradUally Enriching SyntheSis, from which the abbreviation GUESS derives.
We show that GUESS outperforms existing state-of-the-art methods by large margins in terms of accuracy, realism, and diversity.
arXiv Detail & Related papers (2024-01-04T08:48:21Z)
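The gradual enrichment in GUESS amounts to a coarse-to-fine cascade, each stage conditioned on the text and on the coarser motion from the stage before. A minimal sketch; the `stage.sample` interface and the stage granularities are assumptions:

```python
def guess_cascade(stages, text_emb):
    """Coarse-to-fine sampling: e.g. torso-level -> limb-level -> full skeleton."""
    motion = None
    for stage in stages:
        # the first stage sees only the text; later stages also see
        # the coarser motion synthesized so far
        motion = stage.sample(text_emb, prev=motion)
    return motion
```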
- AMD: Anatomical Motion Diffusion with Interpretable Motion Decomposition and Fusion [11.689663297469945]
We propose the Adaptable Motion Diffusion model.
It exploits a Large Language Model (LLM) to parse the input text into a sequence of concise and interpretable anatomical scripts.
We then devise a two-branch fusion scheme that balances the influence of the input text and the anatomical scripts on the inverse diffusion process.
arXiv Detail & Related papers (2023-12-20T04:49:45Z)
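AMD's two-branch fusion can be read as a learned gate mixing the raw-text condition with the anatomical-script condition before it reaches the denoiser. A hedged sketch; the gating form and layer sizes are assumptions, not the paper's exact scheme:

```python
import torch
import torch.nn as nn

class TwoBranchFusion(nn.Module):
    """Gate that balances text conditioning against anatomical-script conditioning."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, text_cond: torch.Tensor, script_cond: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([text_cond, script_cond], dim=-1))  # per-channel weight in [0, 1]
        return g * text_cond + (1.0 - g) * script_cond
```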
- SemanticBoost: Elevating Motion Generation with Augmented Textual Cues [73.83255805408126]
Our framework comprises a Semantic Enhancement module and a Context-Attuned Motion Denoiser (CAMD).
The CAMD approach provides an all-encompassing solution for generating high-quality, semantically consistent motion sequences.
Our experimental results demonstrate that SemanticBoost, as a diffusion-based method, outperforms auto-regressive-based techniques.
arXiv Detail & Related papers (2023-10-31T09:58:11Z)
- SINC: Spatial Composition of 3D Human Motions for Simultaneous Action Generation [58.25766404147109]
Our goal is to synthesize 3D human motions given textual inputs describing simultaneous actions.
We refer to generating such simultaneous movements as performing 'spatial compositions'.
arXiv Detail & Related papers (2023-04-20T16:01:55Z)
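A naive data-level baseline for spatial composition grafts the joints governed by one action onto a motion performing the other; SINC itself learns the composition with a generative model, so the sketch below only conveys the intuition (the skeleton mapping is hypothetical):

```python
import torch

def compose_spatially(motion_a: torch.Tensor, motion_b: torch.Tensor,
                      parts_b: list, part_to_joints: dict) -> torch.Tensor:
    """Take the joints that action B controls from motion_b, the rest from motion_a.

    motion_a, motion_b: (T, J, 3) joint positions for the two actions.
    """
    out = motion_a.clone()
    for part in parts_b:          # e.g. ["left_arm"] for "waving"
        joints = part_to_joints[part]
        out[:, joints] = motion_b[:, joints]
    return out
```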
- Text-driven Video Prediction [83.04845684117835]
We propose a new task called Text-driven Video Prediction (TVP).
Taking the first frame and text caption as inputs, this task aims to synthesize the following frames.
To investigate the capability of text in causal inference for progressive motion information, our TVP framework contains a Text Inference Module (TIM).
arXiv Detail & Related papers (2022-10-06T12:43:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.