Unified Multi-Modal Interactive & Reactive 3D Motion Generation via Rectified Flow
- URL: http://arxiv.org/abs/2509.24099v2
- Date: Mon, 13 Oct 2025 18:22:56 GMT
- Title: Unified Multi-Modal Interactive & Reactive 3D Motion Generation via Rectified Flow
- Authors: Prerit Gupta, Shourya Verma, Ananth Grama, Aniket Bera
- Abstract summary: We introduce DualFlow, a framework for multi-modal two-person motion generation. It synthesizes motion conditioned on diverse inputs, including text, music, and prior motion sequences. It produces temporally coherent and rhythmically synchronized motions, setting a new state of the art in multi-modal human motion generation.
- Score: 17.95248351806955
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating realistic, context-aware two-person motion conditioned on diverse modalities remains a central challenge in computer graphics, animation, and human-computer interaction. We introduce DualFlow, a unified and efficient framework for multi-modal two-person motion generation. DualFlow conditions 3D motion synthesis on diverse inputs, including text, music, and prior motion sequences. Leveraging rectified flow, it achieves deterministic straight-line sampling paths between noise and data, reducing inference time and mitigating the error accumulation common in diffusion-based models. To enhance semantic grounding, DualFlow employs a Retrieval-Augmented Generation (RAG) module that retrieves motion exemplars using music features and LLM-based text decompositions of spatial relations, body movements, and rhythmic patterns. We use a contrastive objective that further strengthens alignment with the conditioning signals and introduce a synchronization loss that improves inter-person coordination. Extensive evaluations across text-to-motion, music-to-motion, and multi-modal interactive benchmarks show consistent gains in motion quality, responsiveness, and efficiency. DualFlow produces temporally coherent and rhythmically synchronized motions, setting a new state of the art in multi-modal human motion generation.
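The rectified-flow idea the abstract leans on (straight-line paths between noise and data, deterministic few-step sampling) can be sketched minimally. Everything below is illustrative, not DualFlow's actual model: the toy 2D "ring" data stands in for motion features, and the oracle velocity field replaces a learned network.

```python
# Minimal rectified-flow sketch (illustrative only, not DualFlow's code).
# Rectified flow trains a velocity field v(x_t, t) to match the constant
# direction (x1 - x0) along the straight line x_t = (1 - t)*x0 + t*x1,
# so sampling can follow near-straight paths with few Euler steps.
import numpy as np

rng = np.random.default_rng(0)

def sample_data(n):
    # Toy data: points on the unit circle stand in for motion features.
    theta = rng.uniform(0, 2 * np.pi, n)
    return np.stack([np.cos(theta), np.sin(theta)], axis=1)

def train_pairs(n):
    # What a training batch would look like for a velocity-field network.
    x0 = rng.standard_normal((n, 2))   # noise endpoint
    x1 = sample_data(n)                # data endpoint
    t = rng.uniform(0, 1, (n, 1))
    xt = (1 - t) * x0 + t * x1         # straight-line interpolant
    v_target = x1 - x0                 # constant velocity regression target
    return xt, t, v_target

def euler_sample(v_fn, n, steps=8):
    # Deterministic sampling: a few Euler steps along the learned field.
    x = rng.standard_normal((n, 2))
    for i in range(steps):
        t = i / steps
        x = x + v_fn(x, t) / steps
    return x

def oracle_v(x, t):
    # Stand-in for a trained network: point at the nearest spot on the
    # ring, scaled to cover the remaining path (assumes steps=8).
    norm = np.linalg.norm(x, axis=1, keepdims=True)
    target = x / np.maximum(norm, 1e-8)
    return (target - x) / max(1 - t, 1 / 8)

samples = euler_sample(oracle_v, 256)
radii = np.linalg.norm(samples, axis=1)  # all close to 1.0 in this toy
```

Because the target velocity is constant along each training path, a well-fit field is nearly straight and Euler integration with very few steps stays accurate, which is the efficiency argument the abstract makes against many-step diffusion sampling.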
Related papers
- T2M Mamba: Motion Periodicity-Saliency Coupling Approach for Stable Text-Driven Motion Generation [3.6564162676635363]
Text-to-motion generation has attracted increasing attention in fields such as avatar animation and humanoid robot interaction. Existing models treat motion periodicity and saliency as independent factors, overlooking their coupling and causing generation drift in long sequences. We propose T2M Mamba to address these limitations by (i) proposing a Periodicity-Saliency Aware Mamba.
arXiv Detail & Related papers (2026-02-01T17:42:53Z) - EchoMotion: Unified Human Video and Motion Generation via Dual-Modality Diffusion Transformer [64.69014756863331]
We introduce EchoMotion, a framework designed to model the joint distribution of appearance and human motion. We also propose MVS-RoPE, which offers unified 3D positional encoding for both video and motion tokens. Our findings reveal that explicitly representing human motion is complementary to appearance, significantly boosting the coherence and plausibility of human-centric video generation.
arXiv Detail & Related papers (2025-12-21T17:08:14Z) - OmniMotion-X: Versatile Multimodal Whole-Body Motion Generation [52.579531290307926]
This paper introduces OmniMotion-X, a versatile framework for whole-body human motion generation. OmniMotion-X efficiently supports diverse multimodal tasks, including text-to-motion, music-to-dance, and speech-to-gesture. To enable high-quality multimodal training, we construct OmniMoCap-X, the largest unified multimodal motion dataset to date.
arXiv Detail & Related papers (2025-10-22T17:25:33Z) - MotionVerse: A Unified Multimodal Framework for Motion Comprehension, Generation and Editing [53.98607267063729]
MotionVerse is a framework to comprehend, generate, and edit human motion in both single-person and multi-person scenarios. We employ a motion tokenizer with residual quantization, which converts continuous motion sequences into multi-stream discrete tokens. We also introduce a Delay Parallel Modeling strategy, which temporally staggers the encoding of residual token streams.
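The residual-quantization tokenizer described above can be sketched as follows. This is a generic residual vector quantization (RVQ) illustration, not MotionVerse's implementation; the codebook count, codebook size, and feature dimension are made up for the example, and real tokenizers learn the codebooks rather than sampling them randomly.

```python
# Illustrative residual vector quantization (RVQ): each stream quantizes
# the residual left over by the previous stream, yielding multi-stream
# discrete tokens for one continuous feature vector.
import numpy as np

rng = np.random.default_rng(1)

def rvq_encode(x, codebooks):
    """Quantize vector x into one token index per residual stream."""
    residual = x.copy()
    tokens = []
    for cb in codebooks:                      # one codebook per stream
        d = np.linalg.norm(cb - residual, axis=1)
        idx = int(np.argmin(d))               # nearest code to this residual
        tokens.append(idx)
        residual = residual - cb[idx]         # pass the remainder down
    return tokens

def rvq_decode(tokens, codebooks):
    """Sum the selected codes across streams to reconstruct the vector."""
    return sum(cb[i] for cb, i in zip(codebooks, tokens))

# Toy setup: 3 residual streams, 32 codes each, 4-dim features.
codebooks = [rng.standard_normal((32, 4)) for _ in range(3)]
x = rng.standard_normal(4)
tokens = rvq_encode(x, codebooks)             # e.g. one index per stream
x_hat = rvq_decode(tokens, codebooks)
```

With trained codebooks, each additional stream refines the reconstruction, so a sequence model can emit the streams in parallel with a temporal stagger, which is the shape of the "Delay Parallel Modeling" idea the summary mentions.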
arXiv Detail & Related papers (2025-09-28T04:20:56Z) - M2DAO-Talker: Harmonizing Multi-granular Motion Decoupling and Alternating Optimization for Talking-head Generation [65.48046909056468]
We reformulate talking-head generation into a unified framework comprising video preprocessing, motion representation, and rendering reconstruction. M2DAO-Talker achieves state-of-the-art performance, with a 2.43 dB PSNR improvement in generation quality and a 0.64 gain in user-evaluated video realness.
arXiv Detail & Related papers (2025-07-11T04:48:12Z) - MotionGPT3: Human Motion as a Second Modality [28.616340011811843]
MotionGPT3 is a bimodal motion-language model for both understanding and generation. A dual-stream Transformer with shared attention preserves modality-specific routes while enabling controlled, bidirectional information flow. Experiments show that MotionGPT3 achieves 2x faster convergence in training loss and up to 4x faster convergence in validation.
arXiv Detail & Related papers (2025-06-30T17:42:22Z) - AsynFusion: Towards Asynchronous Latent Consistency Models for Decoupled Whole-Body Audio-Driven Avatars [65.53676584955686]
Whole-body audio-driven avatar pose and expression generation is a critical task for creating lifelike digital humans. We propose AsynFusion, a novel framework that leverages diffusion transformers to achieve cohesive expression and gesture synthesis. AsynFusion achieves state-of-the-art performance in generating real-time, synchronized whole-body animations.
arXiv Detail & Related papers (2025-05-21T03:28:53Z) - FlowMotion: Target-Predictive Conditional Flow Matching for Jitter-Reduced Text-Driven Human Motion Generation [0.6554326244334868]
FlowMotion incorporates a training objective that more accurately predicts the target motion in 3D human motion generation. FlowMotion achieves state-of-the-art jitter performance, with the best jitter on the KIT dataset and the second-best on the HumanML3D dataset.
arXiv Detail & Related papers (2025-04-02T03:55:21Z) - DiverseMotion: Towards Diverse Human Motion Generation via Discrete Diffusion [70.33381660741861]
We present DiverseMotion, a new approach for synthesizing high-quality human motions conditioned on textual descriptions.
We show that our DiverseMotion achieves the state-of-the-art motion quality and competitive motion diversity.
arXiv Detail & Related papers (2023-09-04T05:43:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.