CASIM: Composite Aware Semantic Injection for Text to Motion Generation
- URL: http://arxiv.org/abs/2502.02063v1
- Date: Tue, 04 Feb 2025 07:22:07 GMT
- Title: CASIM: Composite Aware Semantic Injection for Text to Motion Generation
- Authors: Che-Jui Chang, Qingze Tony Liu, Honglu Zhou, Vladimir Pavlovic, Mubbasir Kapadia
- Abstract summary: We propose a composite-aware semantic injection mechanism that learns the dynamic correspondence between text and motion tokens.
Experiments on HumanML3D and KIT benchmarks demonstrate that CASIM consistently improves motion quality, text-motion alignment, and retrieval scores.
- Score: 15.53049009014166
- Abstract: Recent advances in generative modeling and tokenization have driven significant progress in text-to-motion generation, leading to enhanced quality and realism in generated motions. However, effectively leveraging textual information for conditional motion generation remains an open challenge. We observe that current approaches, primarily relying on fixed-length text embeddings (e.g., CLIP) for global semantic injection, struggle to capture the composite nature of human motion, resulting in suboptimal motion quality and controllability. To address this limitation, we propose the Composite Aware Semantic Injection Mechanism (CASIM), comprising a composite-aware semantic encoder and a text-motion aligner that learns the dynamic correspondence between text and motion tokens. Notably, CASIM is model and representation-agnostic, readily integrating with both autoregressive and diffusion-based methods. Experiments on HumanML3D and KIT benchmarks demonstrate that CASIM consistently improves motion quality, text-motion alignment, and retrieval scores across state-of-the-art methods. Qualitative analyses further highlight the superiority of our composite-aware approach over fixed-length semantic injection, enabling precise motion control from text prompts and stronger generalization to unseen text inputs.
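The abstract specifies the mechanism only at a high level: motion tokens attend over per-word text tokens through a learned aligner, instead of receiving one pooled CLIP vector. Below is a minimal sketch of one plausible realization using standard cross-attention; the module name, dimensions, and residual wiring are our illustration, not the paper's released code.
```python
# Hypothetical sketch of composite-aware semantic injection via cross-attention.
# All names and hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn

class CompositeSemanticInjection(nn.Module):
    """Motion tokens attend over per-word text tokens instead of one pooled vector."""
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.aligner = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, motion_tokens, text_tokens, text_padding_mask=None):
        # motion_tokens: (B, T, d_model); text_tokens: (B, L, d_model)
        injected, attn = self.aligner(
            query=motion_tokens, key=text_tokens, value=text_tokens,
            key_padding_mask=text_padding_mask,  # True marks padded words
        )
        # Residual injection keeps the host generator's backbone unchanged.
        return self.norm(motion_tokens + injected), attn

# Toy usage: 8 motion tokens attending over a 6-word prompt.
inject = CompositeSemanticInjection()
out, attn = inject(torch.randn(2, 8, 256), torch.randn(2, 6, 256))
print(out.shape, attn.shape)  # torch.Size([2, 8, 256]) torch.Size([2, 8, 6])
```
The attention map plays the role of the dynamic text-motion correspondence described above: each motion token can weight different words at different timesteps, and the residual form is what would make such injection model-agnostic, since it can wrap either autoregressive or diffusion backbones.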
Related papers
- BiPO: Bidirectional Partial Occlusion Network for Text-to-Motion Synthesis [0.4893345190925178]
BiPO is a text-to-motion synthesis model.
It integrates part-based generation with a bidirectional autoregressive architecture.
BiPO achieves state-of-the-art performance on the HumanML3D dataset.
arXiv Detail & Related papers (2024-11-28T05:42:47Z)
- MASA: Motion-aware Masked Autoencoder with Semantic Alignment for Sign Language Recognition [94.56755080185732]
We propose a Motion-Aware masked autoencoder with Semantic Alignment (MASA) that integrates rich motion cues and global semantic information.
Our framework can simultaneously learn local motion cues and global semantic features for comprehensive sign language representation.
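The summary names the ingredients (masked autoencoding plus semantic alignment) without the wiring. A minimal sketch of that recipe, assuming frame-level masking, a reconstruction loss for local motion cues, and a cosine alignment term on a pooled feature; the architectural details are our assumption, not MASA's specification.
```python
# Illustrative masked autoencoder over pose sequences with a semantic alignment
# term; layer sizes and the loss form are our assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedMotionAE(nn.Module):
    def __init__(self, pose_dim: int = 63, d_model: int = 128):
        super().__init__()
        self.embed = nn.Linear(pose_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.decode = nn.Linear(d_model, pose_dim)
        self.mask_token = nn.Parameter(torch.zeros(d_model))

    def forward(self, poses, mask):
        # poses: (B, T, pose_dim); mask: (B, T) bool, True = frame hidden from encoder
        x = self.embed(poses)
        x[mask] = self.mask_token
        h = self.encoder(x)
        recon = self.decode(h)       # local motion cues: rebuild the masked frames
        return recon, h.mean(dim=1)  # pooled feature for global semantic alignment

model = MaskedMotionAE()
poses, mask = torch.randn(2, 16, 63), torch.rand(2, 16) < 0.5
sem = torch.randn(2, 128)  # stand-in for a gloss/text embedding
recon, g = model(poses, mask)
loss = F.mse_loss(recon[mask], poses[mask]) + (1 - F.cosine_similarity(g, sem)).mean()
```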
arXiv Detail & Related papers (2024-05-31T08:06:05Z)
- Learning Generalizable Human Motion Generator with Reinforcement Learning [95.62084727984808]
Text-driven human motion generation is one of the vital tasks in computer-aided content creation.
Existing methods often overfit specific motion expressions in the training data, hindering their ability to generalize.
We present InstructMotion, which incorporates the trial-and-error paradigm of reinforcement learning for generalizable human motion generation.
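A trial-and-error loop of this kind can be sketched as REINFORCE with a text-motion similarity reward; `sample_with_log_probs` and `reward_model` below are assumed interfaces for illustration, not the paper's API.
```python
# Hypothetical policy-gradient fine-tuning step for a motion generator.
import torch

def rl_finetune_step(generator, reward_model, prompts, optimizer):
    # Sample motions and their per-token log-probs from the current policy (assumed API).
    motions, log_probs = generator.sample_with_log_probs(prompts)  # log_probs: (B, T)
    with torch.no_grad():
        rewards = reward_model(prompts, motions)  # e.g. text-motion similarity, (B,)
        advantage = rewards - rewards.mean()      # simple mean baseline
    # Push up log-probs of motions that scored above average.
    loss = -(advantage * log_probs.sum(dim=1)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```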
arXiv Detail & Related papers (2024-05-24T13:29:12Z)
- AMD: Anatomical Motion Diffusion with Interpretable Motion Decomposition and Fusion [11.689663297469945]
We propose the Adaptable Motion Diffusion model.
It exploits a Large Language Model (LLM) to parse the input text into a sequence of concise and interpretable anatomical scripts.
We then devise a two-branch fusion scheme that balances the influence of the input text and the anatomical scripts on the inverse diffusion process.
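One plausible form of such a fusion is a learned gate that balances the raw text embedding against the anatomical-script embedding before it conditions the denoiser; the scalar-gate design below is our illustrative choice, not AMD's published scheme.
```python
# Sketch of two-branch conditioning fusion for a diffusion denoiser; the gating
# mechanism is an illustrative assumption.
import torch
import torch.nn as nn

class TwoBranchFusion(nn.Module):
    def __init__(self, d_cond: int = 256):
        super().__init__()
        self.text_proj = nn.Linear(d_cond, d_cond)
        self.script_proj = nn.Linear(d_cond, d_cond)
        # Learned per-sample gate balancing the two branches.
        self.gate = nn.Sequential(nn.Linear(2 * d_cond, 1), nn.Sigmoid())

    def forward(self, text_emb, script_emb):
        # text_emb, script_emb: (B, d_cond) global conditions
        t, s = self.text_proj(text_emb), self.script_proj(script_emb)
        w = self.gate(torch.cat([t, s], dim=-1))  # (B, 1) in [0, 1]
        return w * t + (1 - w) * s                # condition for the inverse diffusion step

fusion = TwoBranchFusion()
print(fusion(torch.randn(4, 256), torch.randn(4, 256)).shape)  # torch.Size([4, 256])
```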
arXiv Detail & Related papers (2023-12-20T04:49:45Z)
- SemanticBoost: Elevating Motion Generation with Augmented Textual Cues [73.83255805408126]
Our framework comprises a Semantic Enhancement module and a Context-Attuned Motion Denoiser (CAMD).
The CAMD generates high-quality, semantically consistent motion sequences.
Our experimental results demonstrate that SemanticBoost, as a diffusion-based method, outperforms auto-regressive-based techniques.
arXiv Detail & Related papers (2023-10-31T09:58:11Z)
- Fg-T2M: Fine-Grained Text-Driven Human Motion Generation via Diffusion Model [11.873294782380984]
We propose a fine-grained method for generating high-quality human motion sequences conditioned on precise text descriptions.
Our approach consists of two key components: 1) a linguistics-structure assisted module that constructs accurate and complete language features to fully utilize the text information; and 2) a context-aware progressive reasoning module that learns neighborhood and overall semantic linguistic features from shallow and deep graph neural networks to achieve multi-step inference.
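The shallow-vs-deep idea can be sketched as follows: early message-passing rounds over word nodes capture neighborhood semantics, later rounds capture sentence-level semantics, and the two are fused. The mean-aggregation layers, depth, and fusion below are our assumptions, not Fg-T2M's exact design.
```python
# Illustrative shallow/deep GNN features over word nodes (e.g. a dependency graph).
import torch
import torch.nn as nn

class GraphLayer(nn.Module):
    """One round of message passing with a row-normalized adjacency."""
    def __init__(self, d: int):
        super().__init__()
        self.lin = nn.Linear(d, d)

    def forward(self, x, adj):
        # x: (B, N, d) word features; adj: (B, N, N), rows sum to 1
        return torch.relu(self.lin(adj @ x))

class ProgressiveReasoning(nn.Module):
    def __init__(self, d: int = 128, shallow: int = 1, deep: int = 4):
        super().__init__()
        self.layers = nn.ModuleList(GraphLayer(d) for _ in range(deep))
        self.shallow = shallow
        self.fuse = nn.Linear(2 * d, d)

    def forward(self, x, adj):
        feats = []
        for i, layer in enumerate(self.layers):
            x = layer(x, adj)
            if i + 1 == self.shallow:
                feats.append(x)  # neighborhood-level semantics
        feats.append(x)          # overall (deep) semantics
        return self.fuse(torch.cat(feats, dim=-1))

model = ProgressiveReasoning()
words = torch.randn(2, 7, 128)
adj = torch.softmax(torch.randn(2, 7, 7), dim=-1)  # toy row-normalized graph
print(model(words, adj).shape)  # torch.Size([2, 7, 128])
```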
arXiv Detail & Related papers (2023-09-12T14:43:47Z)
- ESTextSpotter: Towards Better Scene Text Spotting with Explicit Synergy in Transformer [88.61312640540902]
We introduce the Explicit Synergy-based Text Spotting Transformer framework (ESTextSpotter).
Our model achieves explicit synergy by modeling discriminative and interactive features for text detection and recognition within a single decoder.
Experimental results demonstrate that our model significantly outperforms previous state-of-the-art methods.
arXiv Detail & Related papers (2023-08-20T03:22:23Z)
- X-Mesh: Towards Fast and Accurate Text-driven 3D Stylization via Dynamic Textual Guidance [70.08635216710967]
X-Mesh is a text-driven 3D stylization framework that incorporates a novel Text-guided Dynamic Attention Module.
We introduce a new standard text-mesh benchmark, MIT-30, and two automated metrics, which will enable future research to achieve fair and objective comparisons.
arXiv Detail & Related papers (2023-03-28T06:45:31Z)
- Enhanced Modality Transition for Image Captioning [51.72997126838352]
We build a Modality Transition Module (MTM) to transfer visual features into semantic representations before forwarding them to the language model.
During the training phase, the modality transition network is optimised by the proposed modality loss.
Experiments on the MS-COCO dataset demonstrate the effectiveness of the proposed framework.
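A minimal reading of this design is an MLP that maps visual features toward the caption-embedding space, trained with a distance-based modality loss; the MLP shape and the cosine form of the loss are our assumptions, not the paper's exact formulation.
```python
# Sketch of a Modality Transition Module for image captioning.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityTransition(nn.Module):
    def __init__(self, d_visual: int = 2048, d_semantic: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_visual, d_semantic), nn.ReLU(),
            nn.Linear(d_semantic, d_semantic),
        )

    def forward(self, visual_feats):
        # Map visual features into the semantic space the language model consumes.
        return self.net(visual_feats)

mtm = ModalityTransition()
s_hat = mtm(torch.randn(8, 2048))  # e.g. pooled CNN features
s_gt = torch.randn(8, 512)         # stand-in caption embedding
modality_loss = (1 - F.cosine_similarity(s_hat, s_gt)).mean()
```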
arXiv Detail & Related papers (2021-02-23T07:20:12Z)