FlowMotion: Target-Predictive Conditional Flow Matching for Jitter-Reduced Text-Driven Human Motion Generation
- URL: http://arxiv.org/abs/2504.01338v3
- Date: Fri, 25 Apr 2025 02:39:26 GMT
- Title: FlowMotion: Target-Predictive Conditional Flow Matching for Jitter-Reduced Text-Driven Human Motion Generation
- Authors: Manolo Canales Cuba, Vinícius do Carmo Melício, João Paulo Gois,
- Abstract summary: FlowMotion incorporates a training objective that focuses on more accurately predicting target motion in 3D human motion generation.<n>FlowMotion achieves state-of-the-art jitter performance, achieving the best jitter in the KIT dataset and the second-best jitter in the HumanML3D dataset.
- Score: 0.6554326244334868
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Achieving high-fidelity and temporally smooth 3D human motion generation remains a challenge, particularly within resource-constrained environments. We introduce FlowMotion, a novel method leveraging Conditional Flow Matching (CFM). FlowMotion incorporates a training objective within CFM that focuses on more accurately predicting target motion in 3D human motion generation, resulting in enhanced generation fidelity and temporal smoothness while maintaining the fast synthesis times characteristic of flow-matching-based methods. FlowMotion achieves state-of-the-art jitter performance, achieving the best jitter in the KIT dataset and the second-best jitter in the HumanML3D dataset, and a competitive FID value in both datasets. This combination provides robust and natural motion sequences, offering a promising equilibrium between generation quality and temporal naturalness.
Related papers
- ReMoMask: Retrieval-Augmented Masked Motion Generation [8.471755159366221]
Text-to-Motion (T2M) generation aims to synthesize realistic and semantically aligned human motion sequences from natural language descriptions.<n>We propose ReMoMask, a unified framework integrating three key innovations.<n>A Bidirectional Momentum Text-Motion Model decouples negative sample scale from batch size via momentum queues, substantially improving cross-modal retrieval precision.<n>A Semantic Spatio-temporal Attention mechanism enforces biomechanical constraints during part-level fusion to eliminate asynchronous artifacts.
arXiv Detail & Related papers (2025-08-04T16:56:35Z) - MoDiT: Learning Highly Consistent 3D Motion Coefficients with Diffusion Transformer for Talking Head Generation [16.202732894319084]
MoDiT is a novel framework that combines the 3D Morphable Model (3DMM) with a Diffusion-based Transformer.<n>Our contributions include: (i) A hierarchical denoising strategy with revised temporal attention and biased self/cross-attention mechanisms, enabling the model to refine lip synchronization.<n> (ii) The integration of 3DMM coefficients to provide explicit spatial constraints, ensuring accurate 3D-informed optical flow prediction.
arXiv Detail & Related papers (2025-07-07T15:13:46Z) - MOGO: Residual Quantized Hierarchical Causal Transformer for High-Quality and Real-Time 3D Human Motion Generation [3.6669020073583756]
MOGO is a novel autoregressive framework tailored for efficient and real-time 3D motion generation.<n>MoGO comprises two key components: MoSA-VQ, a motion scale-adaptive residual vector quantization module, and RQHC-Transformer, a residual quantized hierarchical causal transformer.<n>To enhance semantic fidelity, we introduce a text condition alignment mechanism that improves motion decoding under textual control.
arXiv Detail & Related papers (2025-06-06T10:26:54Z) - SViMo: Synchronized Diffusion for Video and Motion Generation in Hand-object Interaction Scenarios [48.09735396455107]
Hand-Object Interaction (HOI) generation has significant application potential.<n>Current 3D HOI motion generation approaches heavily rely on predefined 3D object models and lab-captured motion data.<n>We propose a novel framework that combines visual priors and dynamic constraints within a synchronized diffusion process to generate the HOI video and motion simultaneously.
arXiv Detail & Related papers (2025-06-03T05:04:29Z) - ReCoM: Realistic Co-Speech Motion Generation with Recurrent Embedded Transformer [58.49950218437718]
We present ReCoM, an efficient framework for generating high-fidelity and generalizable human body motions synchronized with speech.
The core innovation lies in the Recurrent Embedded Transformer (RET), which integrates Dynamic Embedding Regularization (DER) into a Vision Transformer (ViT) core architecture.
To enhance model robustness, we incorporate the proposed DER strategy, which equips the model with dual capabilities of noise resistance and cross-domain generalization.
arXiv Detail & Related papers (2025-03-27T16:39:40Z) - SuperFlow++: Enhanced Spatiotemporal Consistency for Cross-Modal Data Pretraining [62.433137130087445]
SuperFlow++ is a novel framework that integrates pretraining and downstream tasks using consecutive camera pairs.
We show that SuperFlow++ outperforms state-of-the-art methods across diverse tasks and driving conditions.
With strong generalizability and computational efficiency, SuperFlow++ establishes a new benchmark for data-efficient LiDAR-based perception in autonomous driving.
arXiv Detail & Related papers (2025-03-25T17:59:57Z) - MotionDiff: Training-free Zero-shot Interactive Motion Editing via Flow-assisted Multi-view Diffusion [20.142107033583027]
MotionDiff is a training-free zero-shot diffusion method that leverages optical flow for complex multi-view motion editing.<n>It outperforms other physics-based generative motion editing methods in achieving high-quality multi-view consistent motion results.<n>MotionDiff does not require retraining, enabling users to conveniently adapt it for various down-stream tasks.
arXiv Detail & Related papers (2025-03-22T08:32:56Z) - FRMD: Fast Robot Motion Diffusion with Consistency-Distilled Movement Primitives for Smooth Action Generation [3.7351623987275873]
We propose Fast Robot Motion Diffusion to generate smooth, temporally consistent robot motions.
Our method integrates Movement Primitives (MPs) with Consistency Models to enable efficient, single-step trajectory generation.
Our results show that FRMD generates significantly faster, smoother trajectories while achieving higher success rates.
arXiv Detail & Related papers (2025-03-03T20:56:39Z) - Motion-Aware Generative Frame Interpolation [23.380470636851022]
Flow-based frame methods ensure motion stability through estimated intermediate flow but often introduce severe artifacts in complex motion regions.<n>Recent generative approaches, boosted by large-scale pre-trained video generation models, show promise in handling intricate scenes.<n>We propose Motion-aware Generative frame (MoG) that synergizes intermediate flow guidance with generative capacities to enhance fidelity.
arXiv Detail & Related papers (2025-01-07T11:03:43Z) - A Plug-and-Play Physical Motion Restoration Approach for In-the-Wild High-Difficulty Motions [56.709280823844374]
We introduce a mask-based motion correction module (MCM) that leverages motion context and video mask to repair flawed motions.<n>We also propose a physics-based motion transfer module (PTM), which employs a pretrain and adapt approach for motion imitation.<n>Our approach is designed as a plug-and-play module to physically refine the video motion capture results, including high-difficulty in-the-wild motions.
arXiv Detail & Related papers (2024-12-23T08:26:00Z) - ALOcc: Adaptive Lifting-based 3D Semantic Occupancy and Cost Volume-based Flow Prediction [89.89610257714006]
Existing methods prioritize higher accuracy to cater to the demands of these tasks.
We introduce a series of targeted improvements for 3D semantic occupancy prediction and flow estimation.
Our purelytemporalal architecture framework, named ALOcc, achieves an optimal tradeoff between speed and accuracy.
arXiv Detail & Related papers (2024-11-12T11:32:56Z) - Generalizable Implicit Motion Modeling for Video Frame Interpolation [51.966062283735596]
Motion is critical in flow-based Video Frame Interpolation (VFI)
We introduce General Implicit Motion Modeling (IMM), a novel and effective approach to motion modeling VFI.
Our GIMM can be easily integrated with existing flow-based VFI works by supplying accurately modeled motion.
arXiv Detail & Related papers (2024-07-11T17:13:15Z) - BAMM: Bidirectional Autoregressive Motion Model [14.668729995275807]
Bidirectional Autoregressive Motion Model (BAMM) is a novel text-to-motion generation framework.
BAMM consists of two key components: a motion tokenizer that transforms 3D human motion into discrete tokens in latent space, and a masked self-attention transformer that autoregressively predicts randomly masked tokens.
This feature enables BAMM to simultaneously achieving high-quality motion generation with enhanced usability and built-in motion editability.
arXiv Detail & Related papers (2024-03-28T14:04:17Z) - Spectral Motion Alignment for Video Motion Transfer using Diffusion Models [54.32923808964701]
Spectral Motion Alignment (SMA) is a framework that refines and aligns motion vectors using Fourier and wavelet transforms.<n> SMA learns motion patterns by incorporating frequency-domain regularization, facilitating the learning of whole-frame global motion dynamics.<n>Extensive experiments demonstrate SMA's efficacy in improving motion transfer while maintaining computational efficiency and compatibility across various video customization frameworks.
arXiv Detail & Related papers (2024-03-22T14:47:18Z) - Seamless Human Motion Composition with Blended Positional Encodings [38.85158088021282]
We introduce FlowMDM, the first diffusion-based model that generates seamless Human Motion Compositions (HMC) without postprocessing or redundant denoising steps.
We achieve state-of-the-art results in terms of accuracy, realism, and smoothness on the Babel and HumanML3D datasets.
arXiv Detail & Related papers (2024-02-23T18:59:40Z) - ALERT-Transformer: Bridging Asynchronous and Synchronous Machine Learning for Real-Time Event-based Spatio-Temporal Data [8.660721666999718]
We propose a hybrid pipeline composed of asynchronous sensing and synchronous processing.
We achieve performances state-of-the-art with a lower latency than competitors.
arXiv Detail & Related papers (2024-02-02T13:17:19Z) - Motion Flow Matching for Human Motion Synthesis and Editing [75.13665467944314]
We propose emphMotion Flow Matching, a novel generative model for human motion generation featuring efficient sampling and effectiveness in motion editing applications.
Our method reduces the sampling complexity from thousand steps in previous diffusion models to just ten steps, while achieving comparable performance in text-to-motion and action-to-motion generation benchmarks.
arXiv Detail & Related papers (2023-12-14T12:57:35Z) - AAMDM: Accelerated Auto-regressive Motion Diffusion Model [10.94879097495769]
This paper introduces the Accelerated Auto-regressive Motion Diffusion Model (AAMDM)
AAMDM is a novel motion synthesis framework designed to achieve quality, diversity, and efficiency all together.
We show that AAMDM outperforms existing methods in motion quality, diversity, and runtime efficiency.
arXiv Detail & Related papers (2023-12-02T23:52:21Z) - Human MotionFormer: Transferring Human Motions with Vision Transformers [73.48118882676276]
Human motion transfer aims to transfer motions from a target dynamic person to a source static one for motion synthesis.
We propose Human MotionFormer, a hierarchical ViT framework that leverages global and local perceptions to capture large and subtle motion matching.
Experiments show that our Human MotionFormer sets the new state-of-the-art performance both qualitatively and quantitatively.
arXiv Detail & Related papers (2023-02-22T11:42:44Z) - MoFusion: A Framework for Denoising-Diffusion-based Motion Synthesis [73.52948992990191]
MoFusion is a new denoising-diffusion-based framework for high-quality conditional human motion synthesis.
We present ways to introduce well-known kinematic losses for motion plausibility within the motion diffusion framework.
We demonstrate the effectiveness of MoFusion compared to the state of the art on established benchmarks in the literature.
arXiv Detail & Related papers (2022-12-08T18:59:48Z) - GMFlow: Learning Optical Flow via Global Matching [124.57850500778277]
We propose a GMFlow framework for learning optical flow estimation.
It consists of three main components: a customized Transformer for feature enhancement, a correlation and softmax layer for global feature matching, and a self-attention layer for flow propagation.
Our new framework outperforms 32-iteration RAFT's performance on the challenging Sintel benchmark.
arXiv Detail & Related papers (2021-11-26T18:59:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.