Related papers: CoMoVi: Co-Generation of 3D Human Motions and Realistic Videos

URL: http://arxiv.org/abs/2601.10632v1
Date: Thu, 15 Jan 2026 17:52:29 GMT
Title: CoMoVi: Co-Generation of 3D Human Motions and Realistic Videos
Authors: Chengfeng Zhao, Jiazhi Shu, Yubo Zhao, Tianyu Huang, Jiahao Lu, Zekai Gu, Chengwei Ren, Zhiyang Dou, Qing Shuai, Yuan Liu,
Abstract summary: CoMoVi is a co-generative framework that couples two video diffusion models (VDMs) to generate 3D human motions and videos synchronously within a single diffusion denoising loop. In this paper, we propose an effective 2D human motion representation that can inherit the powerful prior of pre-trained VDMs. We then design a dual-branch diffusion model to couple human motion and video generation process with mutual feature interaction and 3D-2D cross attentions.
Score: 34.06338037793912
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In this paper, we find that the generation of 3D human motions and 2D human videos is intrinsically coupled. 3D motions provide the structural prior for plausibility and consistency in videos, while pre-trained video models offer strong generalization capabilities for motions, which necessitate coupling their generation processes. Based on this, we present CoMoVi, a co-generative framework that couples two video diffusion models (VDMs) to generate 3D human motions and videos synchronously within a single diffusion denoising loop. To achieve this, we first propose an effective 2D human motion representation that can inherit the powerful prior of pre-trained VDMs. Then, we design a dual-branch diffusion model to couple human motion and video generation process with mutual feature interaction and 3D-2D cross attentions. Moreover, we curate CoMoVi Dataset, a large-scale real-world human video dataset with text and motion annotations, covering diverse and challenging human motions. Extensive experiments demonstrate the effectiveness of our method in both 3D human motion and video generation tasks.

Related papers

EchoMotion: Unified Human Video and Motion Generation via Dual-Modality Diffusion Transformer [64.69014756863331]
We introduce EchoMotion, a framework designed to model the joint distribution of appearance and human motion. We also propose MVS-RoPE, which offers unified 3D positional encoding for both video and motion tokens. Our findings reveal that explicitly representing human motion is complementary to appearance, significantly boosting the coherence and plausibility of human-centric video generation.
arXiv Detail & Related papers (2025-12-21T17:08:14Z)
UniMo: Unifying 2D Video and 3D Human Motion with an Autoregressive Framework [54.337290937468175]
We propose UniMo, an autoregressive model for joint modeling of 2D human videos and 3D human motions within a unified framework. We show that our method simultaneously generates corresponding videos and motions while performing accurate motion capture.
arXiv Detail & Related papers (2025-12-03T16:03:18Z)
MoSA: Motion-Coherent Human Video Generation via Structure-Appearance Decoupling [107.8379802891245]
We propose MoSA, which decouples the process of human video generation into two components, i.e. structure generation and appearance generation. MoSA substantially outperforms existing approaches across the majority of evaluation metrics. This paper also contributes a large-scale human video dataset, which features more complex and diverse motions than existing human video datasets.
arXiv Detail & Related papers (2025-08-24T15:20:24Z)
VimoRAG: Video-based Retrieval-augmented 3D Motion Generation for Motion Language Models [110.32291962407078]
VimoRAG is a video-based retrieval-augmented motion generation framework for motion large language models. We develop an effective motion-centered video retrieval model and mitigate the issue of error propagation caused by suboptimal retrieval results. Experimental results show that VimoRAG significantly boosts the performance of motion LLMs constrained to text-only input.
arXiv Detail & Related papers (2025-08-16T15:31:14Z)
Multi-identity Human Image Animation with Structural Video Diffusion [73.38728096088732]
Structural Video Diffusion is a novel framework for generating realistic multi-human videos. Our approach introduces identity-specific embeddings to maintain consistent appearances across individuals. We expand an existing human video dataset with 25K new videos featuring diverse multi-human and object interaction scenarios.
arXiv Detail & Related papers (2025-04-05T10:03:49Z)
Move-in-2D: 2D-Conditioned Human Motion Generation [54.067588636155115]
We propose Move-in-2D, a novel approach to generate human motion sequences conditioned on a scene image. Our approach accepts both a scene image and text prompt as inputs, producing a motion sequence tailored to the scene.
arXiv Detail & Related papers (2024-12-17T18:58:07Z)
HumanVid: Demystifying Training Data for Camera-controllable Human Image Animation [64.37874983401221]
We present HumanVid, the first large-scale high-quality dataset tailored for human image animation. For the real-world data, we compile a vast collection of real-world videos from the internet. For the synthetic data, we collected 10K 3D avatar assets and leveraged existing assets of body shapes, skin textures and clothing.
arXiv Detail & Related papers (2024-07-24T17:15:58Z)
Learning Human Motion from Monocular Videos via Cross-Modal Manifold Alignment [45.74813582690906]
Learning 3D human motion from 2D inputs is a fundamental task in the realms of computer vision and computer graphics. We present the Video-to-Motion Generator (VTM), which leverages motion priors through cross-modal latent feature space alignment. The VTM showcases state-of-the-art performance in reconstructing 3D human motion from monocular videos.
arXiv Detail & Related papers (2024-04-15T06:38:09Z)
Realistic Human Motion Generation with Cross-Diffusion Models [30.854425772128568]
The Cross Human Motion Diffusion Model (CrossDiff) integrates 3D and 2D information using a shared transformer network within the training of the diffusion model. CrossDiff effectively combines the strengths of both representations to generate more realistic motion sequences.
arXiv Detail & Related papers (2023-12-18T07:44:40Z)
Action2video: Generating Videos of Human 3D Actions [31.665831044217363]
We aim to tackle the interesting yet challenging problem of generating videos of diverse and natural human motions from prescribed action categories. The key issue lies in the ability to synthesize multiple distinct motion sequences that are realistic in their visual appearances. Action2motion stochastically generates plausible 3D pose sequences of a prescribed action category, which are processed and rendered by motion2video to form 2D videos.
arXiv Detail & Related papers (2021-11-12T20:20:37Z)

This list is automatically generated from the titles and abstracts of the papers on this site.