Text-controlled Motion Mamba: Text-Instructed Temporal Grounding of Human Motion
- URL: http://arxiv.org/abs/2404.11375v1
- Date: Wed, 17 Apr 2024 13:33:09 GMT
- Title: Text-controlled Motion Mamba: Text-Instructed Temporal Grounding of Human Motion
- Authors: Xinghan Wang, Zixi Kang, Yadong Mu
- Abstract summary: We introduce the novel task of text-based human motion grounding (THMG).
TM-Mamba is a unified model that integrates temporal global context, language query control, and spatial graph topology with only linear memory cost.
For evaluation, we introduce BABEL-Grounding, the first text-motion dataset that provides detailed textual descriptions of human actions along with their corresponding temporal segments.
- Score: 21.750804738752105
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human motion understanding is a fundamental task with diverse practical applications, facilitated by the availability of large-scale motion capture datasets. Recent studies focus on text-motion tasks, such as text-based motion generation, editing and question answering. In this study, we introduce the novel task of text-based human motion grounding (THMG), aimed at precisely localizing temporal segments corresponding to given textual descriptions within untrimmed motion sequences. Capturing global temporal information is crucial for the THMG task. However, transformer-based models that rely on global temporal self-attention face challenges when handling long untrimmed sequences due to the quadratic computational cost. We address these challenges by proposing Text-controlled Motion Mamba (TM-Mamba), a unified model that integrates temporal global context, language query control, and spatial graph topology with only linear memory cost. The core of the model is a text-controlled selection mechanism which dynamically incorporates global temporal information based on text query. The model is further enhanced to be topology-aware through the integration of relational embeddings. For evaluation, we introduce BABEL-Grounding, the first text-motion dataset that provides detailed textual descriptions of human actions along with their corresponding temporal segments. Extensive evaluations demonstrate the effectiveness of TM-Mamba on BABEL-Grounding.
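The abstract gives no implementation details, so the following is only a minimal sketch of what a text-controlled selection mechanism in a selective state-space (Mamba-style) recurrence could look like: the input-dependent parameters that gate what enters the recurrent state are predicted from each motion frame feature together with a pooled text-query embedding. All module names, dimensions, and the plain sequential scan are assumptions for illustration, not TM-Mamba's actual code.

```python
# Minimal illustrative sketch of a text-controlled selective SSM, NOT the authors'
# implementation. The selection parameters (delta, B, C) are predicted from the
# motion frame feature concatenated with a text-query embedding, so the text
# controls what enters the recurrent state. All dimensions are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TextControlledSSM(nn.Module):
    def __init__(self, d_model: int = 64, d_state: int = 16, text_dim: int = 128):
        super().__init__()
        # Diagonal state matrix A, log-parameterized and kept negative for stability.
        self.A_log = nn.Parameter(torch.log(torch.arange(1, d_state + 1).float()))
        cond_dim = d_model + text_dim
        self.to_delta = nn.Linear(cond_dim, d_model)  # per-channel step size
        self.to_B = nn.Linear(cond_dim, d_state)      # input projection into the state
        self.to_C = nn.Linear(cond_dim, d_state)      # readout from the state

    def forward(self, x: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, d_model) per-frame motion features; text: (batch, text_dim).
        Bsz, T, D = x.shape
        cond = torch.cat([x, text.unsqueeze(1).expand(-1, T, -1)], dim=-1)
        delta = F.softplus(self.to_delta(cond))        # (Bsz, T, D), positive
        B_in = self.to_B(cond)                         # (Bsz, T, N)
        C_out = self.to_C(cond)                        # (Bsz, T, N)
        A = -torch.exp(self.A_log)                     # (N,), negative real poles

        h = x.new_zeros(Bsz, D, A.shape[0])            # recurrent state, fixed size
        ys = []
        for t in range(T):                             # linear-time scan over frames
            dA = torch.exp(delta[:, t, :, None] * A)                  # (Bsz, D, N)
            dB = delta[:, t, :, None] * B_in[:, t, None, :]           # (Bsz, D, N)
            h = dA * h + dB * x[:, t, :, None]
            ys.append((h * C_out[:, t, None, :]).sum(dim=-1))         # (Bsz, D)
        return torch.stack(ys, dim=1)                  # (Bsz, T, d_model)
```

Because only the fixed-size state h is carried across frames, memory grows linearly with sequence length rather than quadratically as with a full self-attention map, which is the property the abstract highlights for long untrimmed sequences.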
Related papers
- FTMoMamba: Motion Generation with Frequency and Text State Space Models [53.60865359814126]
We propose a novel diffusion-based FTMoMamba framework equipped with a Frequency State Space Model (FreqSSM) and a Text State Space Model (TextSSM).
To learn fine-grained representation, FreqSSM decomposes sequences into low-frequency and high-frequency components.
To ensure the consistency between text and motion, TextSSM encodes text features at the sentence level.
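As a rough illustration of the low/high-frequency split described above (not FreqSSM itself, whose formulation is not given here), a motion feature sequence can be separated along the time axis with an FFT; the cutoff index below is an arbitrary assumption.

```python
# Illustrative frequency split of a motion sequence along time, assuming a
# (T, D) array of per-frame features; the cutoff index is an arbitrary choice.
import numpy as np

def split_frequencies(motion: np.ndarray, cutoff: int = 4):
    """Return (low_freq, high_freq) components with the same shape as `motion`."""
    spec = np.fft.rfft(motion, axis=0)          # (T//2 + 1, D) complex spectrum
    low_spec = spec.copy()
    low_spec[cutoff:] = 0                       # keep only slow components
    low = np.fft.irfft(low_spec, n=motion.shape[0], axis=0)
    return low, motion - low                    # high frequencies = residual detail

low, high = split_frequencies(np.random.randn(120, 66))  # 120 frames, 22 joints x 3
```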
arXiv Detail & Related papers (2024-11-26T15:48:12Z)
- KinMo: Kinematic-aware Human Motion Understanding and Generation [6.962697597686156]
Controlling human motion based on text presents an important challenge in computer vision.
Traditional approaches often rely on holistic action descriptions for motion synthesis.
We propose a novel motion representation that decomposes motion into distinct body joint group movements.
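The decomposition into joint-group movements can be pictured as partitioning the skeleton's joints into named groups and slicing the motion tensor accordingly; the grouping and joint indices below are made up for illustration and are not KinMo's actual partition.

```python
# Toy example of decomposing a motion tensor into body-joint-group movements.
# The joint indices and group names are illustrative, not KinMo's definition.
import numpy as np

JOINT_GROUPS = {
    "torso":     [0, 1, 2, 3],
    "left_arm":  [4, 5, 6],
    "right_arm": [7, 8, 9],
    "left_leg":  [10, 11, 12],
    "right_leg": [13, 14, 15],
}

def decompose(motion: np.ndarray) -> dict[str, np.ndarray]:
    """motion: (T, J, 3) joint positions -> {group: (T, len(group), 3)}."""
    return {name: motion[:, idx, :] for name, idx in JOINT_GROUPS.items()}

parts = decompose(np.random.randn(60, 16, 3))   # parts["left_arm"].shape == (60, 3, 3)
```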
arXiv Detail & Related papers (2024-11-23T06:50:11Z)
- Chronologically Accurate Retrieval for Temporal Grounding of Motion-Language Models [12.221087476416056]
We propose Chronologically Accurate Retrieval to evaluate the chronological understanding of motion-language models.
We decompose textual descriptions into events, and prepare negative text samples by shuffling the order of events in compound action descriptions.
We then design a simple task for motion-language models to retrieve the more likely text from the ground truth and its chronologically shuffled version.
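A naive sketch of the negative-sample construction is shown below: split a compound description into events and re-order them. Splitting on ", then " is an assumption about how the compound descriptions are phrased, not the paper's exact procedure.

```python
# Naive sketch of building a chronologically shuffled negative caption.
# Splitting on ", then " is an assumption about how compound descriptions look.
import random

def shuffled_negative(description: str, seed: int = 0) -> str:
    events = [e.strip() for e in description.split(", then ")]
    if len(events) < 2:
        return description             # nothing to shuffle for a single event
    rng = random.Random(seed)
    shuffled = events[:]
    for _ in range(10):                # a few attempts to get a different order
        rng.shuffle(shuffled)
        if shuffled != events:
            break
    return ", then ".join(shuffled)

print(shuffled_negative("a person sits down, then stands up, then walks forward"))
```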
arXiv Detail & Related papers (2024-07-22T06:25:21Z)
- HIMO: A New Benchmark for Full-Body Human Interacting with Multiple Objects [86.86284624825356]
HIMO is a dataset of full-body human interactions with multiple objects.
HIMO contains 3.3K 4D HOI sequences and 4.08M 3D HOI frames.
arXiv Detail & Related papers (2024-07-17T07:47:34Z)
- Joint-Dataset Learning and Cross-Consistent Regularization for Text-to-Motion Retrieval [4.454835029368504]
We focus on the recently introduced text-motion retrieval task, which aims to search for the motion sequences most relevant to a natural language description.
Despite recent efforts to explore these promising avenues, a primary challenge remains the insufficient data available to train robust text-motion models.
We propose to investigate joint-dataset learning - where we train on multiple text-motion datasets simultaneously.
We also introduce a transformer-based motion encoder, called MoT++, which employs spatio-temporal attention to process sequences of skeleton data.
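The summary does not spell out the attention design; one common way to realize spatio-temporal attention over skeleton sequences is to factorize it into attention across joints within a frame followed by attention across time per joint. The sketch below shows that generic factorization with assumed sizes, not MoT++'s actual block.

```python
# Illustrative factorized "spatial then temporal" attention over skeleton data.
# This is a generic pattern with assumed sizes, not MoT++'s actual architecture.
import torch
import torch.nn as nn

class SpatioTemporalBlock(nn.Module):
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, joints, dim) per-joint features
        B, T, J, D = x.shape
        s = x.reshape(B * T, J, D)                 # attend across joints per frame
        s = s + self.spatial(s, s, s)[0]
        t = s.reshape(B, T, J, D).permute(0, 2, 1, 3).reshape(B * J, T, D)
        t = t + self.temporal(t, t, t)[0]          # attend across time per joint
        return t.reshape(B, J, T, D).permute(0, 2, 1, 3)

out = SpatioTemporalBlock()(torch.randn(2, 30, 22, 64))   # (2, 30, 22, 64)
```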
arXiv Detail & Related papers (2024-07-02T09:43:47Z)
- Learning Generalizable Human Motion Generator with Reinforcement Learning [95.62084727984808]
Text-driven human motion generation is one of the vital tasks in computer-aided content creation.
Existing methods often overfit specific motion expressions in the training data, hindering their ability to generalize.
We present InstructMotion, which incorporates the trial-and-error paradigm of reinforcement learning for generalizable human motion generation.
arXiv Detail & Related papers (2024-05-24T13:29:12Z)
- Text2Data: Low-Resource Data Generation with Textual Control [104.38011760992637]
Natural language serves as a common and straightforward control signal for humans to interact seamlessly with machines.
We propose Text2Data, a novel approach that utilizes unlabeled data to understand the underlying data distribution through an unsupervised diffusion model.
It undergoes controllable finetuning via a novel constraint optimization-based learning objective that ensures controllability and effectively counteracts catastrophic forgetting.
arXiv Detail & Related papers (2024-02-08T03:41:39Z)
- SemanticBoost: Elevating Motion Generation with Augmented Textual Cues [73.83255805408126]
Our framework comprises a Semantic Enhancement module and a Context-Attuned Motion Denoiser (CAMD).
The CAMD approach provides an all-encompassing solution for generating high-quality, semantically consistent motion sequences.
Our experimental results demonstrate that SemanticBoost, as a diffusion-based method, outperforms auto-regressive-based techniques.
arXiv Detail & Related papers (2023-10-31T09:58:11Z)
- Text-to-Motion Retrieval: Towards Joint Understanding of Human Motion Data and Natural Language [4.86658723641864]
We propose a novel text-to-motion retrieval task, which aims at retrieving relevant motions based on a specified natural description.
Inspired by the recent progress in text-to-image/video matching, we experiment with two widely-adopted metric-learning loss functions.
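The two loss functions are not named in the summary; a symmetric InfoNCE-style contrastive loss is one widely adopted metric-learning choice in text-to-image/video matching, sketched below for paired text and motion embeddings (the temperature and batch layout are assumptions).

```python
# A common metric-learning objective for retrieval: symmetric InfoNCE over a
# batch of paired text and motion embeddings. This is one plausible choice,
# not necessarily one of the two losses used in the paper.
import torch
import torch.nn.functional as F

def symmetric_info_nce(text_emb: torch.Tensor, motion_emb: torch.Tensor,
                       temperature: float = 0.07) -> torch.Tensor:
    # text_emb, motion_emb: (batch, dim); row i of each forms a matching pair.
    text_emb = F.normalize(text_emb, dim=-1)
    motion_emb = F.normalize(motion_emb, dim=-1)
    logits = text_emb @ motion_emb.t() / temperature      # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +      # text -> motion
                  F.cross_entropy(logits.t(), targets))   # motion -> text

loss = symmetric_info_nce(torch.randn(8, 256), torch.randn(8, 256))
```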
arXiv Detail & Related papers (2023-05-25T08:32:41Z)
- Local-Global Video-Text Interactions for Temporal Grounding [77.5114709695216]
This paper addresses the problem of text-to-video temporal grounding, which aims to identify the time interval in a video semantically relevant to a text query.
We tackle this problem using a novel regression-based model that learns to extract a collection of mid-level features for semantic phrases in a text query.
The proposed method effectively predicts the target time interval by exploiting contextual information from local to global.
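A minimal picture of regression-based temporal grounding is a small head that maps a fused video-and-query feature to a normalized (start, end) interval; the sketch below uses arbitrary sizes and is not the paper's actual architecture.

```python
# Toy regression head for temporal grounding: predicts a normalized (start, end)
# interval from a fused video-and-query feature. Sizes and design are assumptions.
import torch
import torch.nn as nn

class IntervalRegressor(nn.Module):
    def __init__(self, feat_dim: int = 512):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 2),                    # raw (start, length) values
        )

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        s, l = self.head(fused).sigmoid().unbind(dim=-1)   # both squashed into [0, 1]
        start = s * (1 - l)                                 # keep interval inside [0, 1]
        return torch.stack([start, start + l], dim=-1)      # (batch, 2): start <= end

pred = IntervalRegressor()(torch.randn(4, 512))   # e.g. trained with an L1 or IoU loss
```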
arXiv Detail & Related papers (2020-04-16T08:10:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.