Toward Rich Video Human-Motion2D Generation
- URL: http://arxiv.org/abs/2506.14428v1
- Date: Tue, 17 Jun 2025 11:45:33 GMT
- Title: Toward Rich Video Human-Motion2D Generation
- Authors: Ruihao Xi, Xuekuan Wang, Yongcheng Li, Shuhua Li, Zichen Wang, Yiwei Wang, Feng Wei, Cairong Zhao,
- Abstract summary: We introduce a new large-scale rich video human motion 2D dataset (Motion2D-Video-150K) comprising 150,000 video sequences.<n>Motion2D-Video-150K features a balanced distribution of diverse single-character and, crucially, double-character interactive actions.<n>We propose a novel diffusion-based rich video human motion2D generation model (RVHM2D) model.
- Score: 16.58311138197227
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating realistic and controllable human motions, particularly those involving rich multi-character interactions, remains a significant challenge due to data scarcity and the complexities of modeling inter-personal dynamics. To address these limitations, we first introduce a new large-scale rich video human motion 2D dataset (Motion2D-Video-150K) comprising 150,000 video sequences. Motion2D-Video-150K features a balanced distribution of diverse single-character and, crucially, double-character interactive actions, each paired with detailed textual descriptions. Building upon this dataset, we propose a novel diffusion-based rich video human motion2D generation (RVHM2D) model. RVHM2D incorporates an enhanced textual conditioning mechanism utilizing either dual text encoders (CLIP-L/B) or T5-XXL with both global and local features. We devise a two-stage training strategy: the model is first trained with a standard diffusion objective, and then fine-tuned using reinforcement learning with an FID-based reward to further enhance motion realism and text alignment. Extensive experiments demonstrate that RVHM2D achieves leading performance on the Motion2D-Video-150K benchmark in generating both single and interactive double-character scenarios.
Related papers
- EchoMotion: Unified Human Video and Motion Generation via Dual-Modality Diffusion Transformer [64.69014756863331]
We introduce EchoMotion, a framework designed to model the joint distribution of appearance and human motion.<n>We also propose MVS-RoPE, which offers unified 3D positional encoding for both video and motion tokens.<n>Our findings reveal that explicitly representing human motion is to appearance, significantly boosting the coherence and plausibility of human-centric video generation.
arXiv Detail & Related papers (2025-12-21T17:08:14Z) - FoundationMotion: Auto-Labeling and Reasoning about Spatial Movement in Videos [109.99404241220039]
We introduce FoundationMotion, a fully automated data curation pipeline that constructs large-scale motion datasets.<n>Our approach first detects and tracks objects in videos to extract their trajectories, then leverages these trajectories and video frames with Large Language Models.<n>We fine-tune open-source models including NVILA-Video-15B and Qwen2.5-7B, achieving substantial improvements in motion understanding without compromising performance.
arXiv Detail & Related papers (2025-12-11T18:53:15Z) - UniMo: Unifying 2D Video and 3D Human Motion with an Autoregressive Framework [54.337290937468175]
We propose UniMo, an autoregressive model for joint modeling of 2D human videos and 3D human motions within a unified framework.<n>We show that our method simultaneously generates corresponding videos and motions while performing accurate motion capture.
arXiv Detail & Related papers (2025-12-03T16:03:18Z) - DynamicVerse: A Physically-Aware Multimodal Framework for 4D World Modeling [67.95038177144554]
We introduce DynamicVerse, a physical-scale, multimodal 4D world modeling framework for dynamic real-world video.<n>We employ vision, geometric, and multimodal models to interpret metric-scale static geometry, real-world dynamic motion, instance-level masks, and holistic captions.<n> DynamicVerse delivers a large-scale dataset consisting of 100K+ videos with 800K+ annotated masks and 10M+ frames from internet videos.
arXiv Detail & Related papers (2025-12-02T18:24:27Z) - Generating Human Motion Videos using a Cascaded Text-to-Video Framework [27.77921324288557]
We propose CAMEO, a cascaded framework for general human motion video generation.<n>It seamlessly bridges Text-to-Motion (T2M) models and conditional VDMs.<n>We demonstrate the effectiveness of our approach on both the MovieGen benchmark and a newly introduced benchmark tailored to the T2M-VDM combination.
arXiv Detail & Related papers (2025-10-04T19:16:28Z) - Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos [66.62109400603394]
We introduce Being-H0, a dexterous Vision-Language-Action model trained on large-scale human videos.<n>Our approach centers on physical instruction tuning, a novel training paradigm that combines large-scale VLA pretraining from human videos, physical space alignment for 3D reasoning, and post-training adaptation for robotic tasks.<n>We empirically show the excellence of Being-H0 in hand motion generation and instruction following, and it also scales well with model and data sizes.
arXiv Detail & Related papers (2025-07-21T13:19:09Z) - GENMO: A GENeralist Model for Human MOtion [64.16188966024542]
We present GENMO, a unified Generalist Model for Human Motion that bridges motion estimation and generation in a single framework.<n>Our key insight is to reformulate motion estimation as constrained motion generation, where the output motion must precisely satisfy observed conditioning signals.<n>Our novel architecture handles variable-length motions and mixed multimodal conditions (text, audio, video) at different time intervals, offering flexible control.
arXiv Detail & Related papers (2025-05-02T17:59:55Z) - SkyReels-V2: Infinite-length Film Generative Model [35.00453687783287]
We propose SkyReels-V2, an Infinite-length Film Generative Model, that synergizes Multi-modal Large Language Model (MLLM), Multi-stage Pretraining, Reinforcement Learning, and Diffusion Forcing Framework.<n>We establish progressive-resolution pretraining for the fundamental video generation, followed by a four-stage post-training enhancement.
arXiv Detail & Related papers (2025-04-17T16:37:27Z) - GenM$^3$: Generative Pretrained Multi-path Motion Model for Text Conditional Human Motion Generation [19.2804620329011]
Generative Pretrained Multi-path Motion Model (GenM$3$) is a framework designed to learn unified motion representations.<n>To enable large-scale training, we integrate and unify 11 high-quality motion datasets.<n>GenM$3$ achieves a state-of-the-art FID of 0.035 on the HumanML3D benchmark, surpassing state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2025-03-19T05:56:52Z) - AnyMoLe: Any Character Motion In-betweening Leveraging Video Diffusion Models [5.224806515926022]
We introduce AnyMoLe, a novel method to generate motion in-between frames for arbitrary characters without external data.<n>Our approach employs a two-stage frame generation process to enhance contextual understanding.
arXiv Detail & Related papers (2025-03-11T13:28:59Z) - xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations [120.52120919834988]
xGen-SynVideo-1 is a text-to-video (T2V) generation model capable of producing realistic scenes from textual descriptions.
VidVAE compresses video data both spatially and temporally, significantly reducing the length of visual tokens.
DiT model incorporates spatial and temporal self-attention layers, enabling robust generalization across different timeframes and aspect ratios.
arXiv Detail & Related papers (2024-08-22T17:55:22Z) - M2D2M: Multi-Motion Generation from Text with Discrete Diffusion Models [18.125860678409804]
We introduce the Multi-Motion Discrete Diffusion Models (M2D2M), a novel approach for human motion generation from text descriptions.
M2D2M adeptly addresses the challenge of generating multi-motion sequences, ensuring seamless transitions of motions and coherence across a series of actions.
arXiv Detail & Related papers (2024-07-19T17:57:33Z) - Vivid-ZOO: Multi-View Video Generation with Diffusion Model [76.96449336578286]
New challenges lie in the lack of massive captioned multi-view videos and the complexity of modeling such multi-dimensional distribution.
We propose a novel diffusion-based pipeline that generates high-quality multi-view videos centered around a dynamic 3D object from text.
arXiv Detail & Related papers (2024-06-12T21:44:04Z) - BOTH2Hands: Inferring 3D Hands from Both Text Prompts and Body Dynamics [50.88842027976421]
We propose BOTH57M, a novel multi-modal dataset for two-hand motion generation.
Our dataset includes accurate motion tracking for the human body and hands.
We also provide a strong baseline method, BOTH2Hands, for the novel task.
arXiv Detail & Related papers (2023-12-13T07:30:19Z) - Dysen-VDM: Empowering Dynamics-aware Text-to-Video Diffusion with LLMs [112.39389727164594]
Text-to-video (T2V) synthesis has gained increasing attention in the community, in which the recently emerged diffusion models (DMs) have promisingly shown stronger performance than the past approaches.
While existing state-of-the-art DMs are competent to achieve high-resolution video generation, they may largely suffer from key limitations (e.g., action occurrence disorders, crude video motions) with respect to the temporal dynamics modeling, one of the crux of video synthesis.
In this work, we investigate strengthening awareness of video dynamics for DMs, for high-quality T2V generation
arXiv Detail & Related papers (2023-08-26T08:31:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.