Related papers: FTMoMamba: Motion Generation with Frequency and Text State Space Models

FTMoMamba: Motion Generation with Frequency and Text State Space Models

URL: http://arxiv.org/abs/2411.17532v1
Date: Tue, 26 Nov 2024 15:48:12 GMT
Title: FTMoMamba: Motion Generation with Frequency and Text State Space Models
Authors: Chengjian Li, Xiangbo Shu, Qiongjie Cui, Yazhou Yao, Jinhui Tang,
Abstract summary: We propose a novel diffusion-based FTMoMamba framework equipped with a Frequency State Space Model and a Text State Space Model. To learn fine-grained representation, FreqSSM decomposes sequences into low-frequency and high-frequency components. To ensure the consistency between text and motion, TextSSM encodes text features at the sentence level.
Score: 53.60865359814126
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Diffusion models achieve impressive performance in human motion generation. However, current approaches typically ignore the significance of frequency-domain information in capturing fine-grained motions within the latent space (e.g., low frequencies correlate with static poses, and high frequencies align with fine-grained motions). Additionally, there is a semantic discrepancy between text and motion, leading to inconsistency between the generated motions and the text descriptions. In this work, we propose a novel diffusion-based FTMoMamba framework equipped with a Frequency State Space Model (FreqSSM) and a Text State Space Model (TextSSM). Specifically, to learn fine-grained representation, FreqSSM decomposes sequences into low-frequency and high-frequency components, guiding the generation of static pose (e.g., sits, lay) and fine-grained motions (e.g., transition, stumble), respectively. To ensure the consistency between text and motion, TextSSM encodes text features at the sentence level, aligning textual semantics with sequential features. Extensive experiments show that FTMoMamba achieves superior performance on the text-to-motion generation task, especially gaining the lowest FID of 0.181 (rather lower than 0.421 of MLD) on the HumanML3D dataset.

Related papers

PackDiT: Joint Human Motion and Text Generation via Mutual Prompting [22.53146582495341]
PackDiT is the first diffusion-based generative model capable of performing various tasks simultaneously. We train PackDiT on the HumanML3D dataset, achieving state-of-the-art text-to-motion performance with an FID score of 0.106. Our experiments further demonstrate that diffusion models are effective for motion-to-text generation, achieving performance comparable to that of autoregressive models.
arXiv Detail & Related papers (2025-01-27T22:51:45Z)
MoTe: Learning Motion-Text Diffusion Model for Multiple Generation Tasks [30.333659816277823]
We presenttextbfMoTe, a unified multi-modal model that could handle diverse tasks by learning the marginal, conditional, and joint distributions of motion and text simultaneously. MoTe is composed of three components: Motion-Decoder (MED), Text-Decoder (TED), and Moti-on-Text Diffusion Model (MTDM)
arXiv Detail & Related papers (2024-11-29T15:48:24Z)
DART: A Diffusion-Based Autoregressive Motion Model for Real-Time Text-Driven Motion Control [12.465927271402442]
Text-conditioned human motion generation allows for user interaction through natural language. DART is a Diffusion-based Autoregressive motion primitive model for Real-time Text-driven motion control. We present effective algorithms for both approaches, demonstrating our model's versatility and superior performance in various motion synthesis tasks.
arXiv Detail & Related papers (2024-10-07T17:58:22Z)
Seamless Human Motion Composition with Blended Positional Encodings [38.85158088021282]
We introduce FlowMDM, the first diffusion-based model that generates seamless Human Motion Compositions (HMC) without postprocessing or redundant denoising steps. We achieve state-of-the-art results in terms of accuracy, realism, and smoothness on the Babel and HumanML3D datasets.
arXiv Detail & Related papers (2024-02-23T18:59:40Z)
DiffusionPhase: Motion Diffusion in Frequency Domain [69.811762407278]
We introduce a learning-based method for generating high-quality human motion sequences from text descriptions. Existing techniques struggle with motion diversity and smooth transitions in generating arbitrary-length motion sequences. We develop a network encoder that converts the motion space into a compact yet expressive parameterized phase space.
arXiv Detail & Related papers (2023-12-07T04:39:22Z)
SemanticBoost: Elevating Motion Generation with Augmented Textual Cues [73.83255805408126]
Our framework comprises a Semantic Enhancement module and a Context-Attuned Motion Denoiser (CAMD) The CAMD approach provides an all-encompassing solution for generating high-quality, semantically consistent motion sequences. Our experimental results demonstrate that SemanticBoost, as a diffusion-based method, outperforms auto-regressive-based techniques.
arXiv Detail & Related papers (2023-10-31T09:58:11Z)
Priority-Centric Human Motion Generation in Discrete Latent Space [59.401128190423535]
We introduce a Priority-Centric Motion Discrete Diffusion Model (M2DM) for text-to-motion generation. M2DM incorporates a global self-attention mechanism and a regularization term to counteract code collapse. We also present a motion discrete diffusion model that employs an innovative noise schedule, determined by the significance of each motion token.
arXiv Detail & Related papers (2023-08-28T10:40:16Z)
Text-to-Motion Retrieval: Towards Joint Understanding of Human Motion Data and Natural Language [4.86658723641864]
We propose a novel text-to-motion retrieval task, which aims at retrieving relevant motions based on a specified natural description. Inspired by the recent progress in text-to-image/video matching, we experiment with two widely-adopted metric-learning loss functions.
arXiv Detail & Related papers (2023-05-25T08:32:41Z)
MoFusion: A Framework for Denoising-Diffusion-based Motion Synthesis [73.52948992990191]
MoFusion is a new denoising-diffusion-based framework for high-quality conditional human motion synthesis. We present ways to introduce well-known kinematic losses for motion plausibility within the motion diffusion framework. We demonstrate the effectiveness of MoFusion compared to the state of the art on established benchmarks in the literature.
arXiv Detail & Related papers (2022-12-08T18:59:48Z)
Text-driven Video Prediction [83.04845684117835]
We propose a new task called Text-driven Video Prediction (TVP) Taking the first frame and text caption as inputs, this task aims to synthesize the following frames. To investigate the capability of text in causal inference for progressive motion information, our TVP framework contains a Text Inference Module (TIM)
arXiv Detail & Related papers (2022-10-06T12:43:07Z)
TM2T: Stochastic and Tokenized Modeling for the Reciprocal Generation of 3D Human Motions and Texts [20.336481832461168]
Inspired by the strong ties between vision and language, our paper aims to explore the generation of 3D human full-body motions from texts. We propose the use of motion token, a discrete and compact motion representation. Our approach is flexible, could be used for both text2motion and motion2text tasks.
arXiv Detail & Related papers (2022-07-04T19:52:18Z)

This list is automatically generated from the titles and abstracts of the papers in this site.