The Quest for Generalizable Motion Generation: Data, Model, and Evaluation
- URL: http://arxiv.org/abs/2510.26794v1
- Date: Thu, 30 Oct 2025 17:59:27 GMT
- Title: The Quest for Generalizable Motion Generation: Data, Model, and Evaluation
- Authors: Jing Lin, Ruisi Wang, Junzhe Lu, Ziqi Huang, Guorui Song, Ailing Zeng, Xian Liu, Chen Wei, Wanqi Yin, Qingping Sun, Zhongang Cai, Lei Yang, Ziwei Liu
- Abstract summary: We present a framework that systematically transfers knowledge from ViGen to MoGen across three key pillars: data, modeling, and evaluation. First, we introduce ViMoGen-228K, a large-scale dataset comprising 228,000 high-quality motion samples. Second, we propose ViMoGen, a flow-matching-based diffusion transformer that unifies priors from MoCap data and ViGen models through gated multimodal conditioning. Third, we present MBench, a hierarchical benchmark designed for fine-grained evaluation across motion quality, prompt fidelity, and generalization ability.
- Score: 66.57596758773309
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite recent advances in 3D human motion generation (MoGen) on standard benchmarks, existing models still face a fundamental bottleneck in their generalization capability. In contrast, adjacent generative fields, most notably video generation (ViGen), have demonstrated remarkable generalization in modeling human behaviors, highlighting transferable insights that MoGen can leverage. Motivated by this observation, we present a comprehensive framework that systematically transfers knowledge from ViGen to MoGen across three key pillars: data, modeling, and evaluation. First, we introduce ViMoGen-228K, a large-scale dataset comprising 228,000 high-quality motion samples that integrates high-fidelity optical MoCap data with semantically annotated motions from web videos and synthesized samples generated by state-of-the-art ViGen models. The dataset includes both text-motion pairs and text-video-motion triplets, substantially expanding semantic diversity. Second, we propose ViMoGen, a flow-matching-based diffusion transformer that unifies priors from MoCap data and ViGen models through gated multimodal conditioning. To enhance efficiency, we further develop ViMoGen-light, a distilled variant that eliminates video generation dependencies while preserving strong generalization. Finally, we present MBench, a hierarchical benchmark designed for fine-grained evaluation across motion quality, prompt fidelity, and generalization ability. Extensive experiments show that our framework significantly outperforms existing approaches in both automatic and human evaluations. The code, data, and benchmark will be made publicly available.
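To make the modeling pillar concrete, below is a minimal sketch of the two ingredients named in the abstract: gated multimodal conditioning and a flow-matching objective. All module names, shapes, and the sigmoid-gate formulation are illustrative assumptions, not ViMoGen's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedMultimodalConditioning(nn.Module):
    """Hypothetical stand-in for ViMoGen's gated multimodal conditioning:
    project text and video-prior features to the model width, then mix
    them with a learned per-channel sigmoid gate."""

    def __init__(self, d_model: int, d_text: int, d_video: int):
        super().__init__()
        self.text_proj = nn.Linear(d_text, d_model)
        self.video_proj = nn.Linear(d_video, d_model)
        self.gate = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Sigmoid())

    def forward(self, text_feat: torch.Tensor, video_feat: torch.Tensor) -> torch.Tensor:
        t = self.text_proj(text_feat)    # (B, d_model)
        v = self.video_proj(video_feat)  # (B, d_model)
        g = self.gate(torch.cat([t, v], dim=-1))
        return g * t + (1.0 - g) * v     # gate decides the text/video mix


def flow_matching_loss(velocity_net, x1, cond):
    """Standard conditional flow-matching objective with straight paths:
    regress the constant velocity (x1 - x0) at a random time t.

    velocity_net: hypothetical transformer, (x_t, t, cond) -> velocity
    x1:           clean motion batch, shape (B, T, D)
    """
    x0 = torch.randn_like(x1)                        # noise endpoint
    t = torch.rand(x1.shape[0], 1, 1, device=x1.device)
    xt = (1.0 - t) * x0 + t * x1                     # linear interpolant
    v_pred = velocity_net(xt, t.view(-1), cond)
    return F.mse_loss(v_pred, x1 - x0)
```

In a full model the fused conditioning vector would be injected into every transformer block (e.g., via cross-attention or adaptive layer norm); the straight-path interpolant above is the common rectified-flow choice for flow-matching training.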
Related papers
- MSVBench: Towards Human-Level Evaluation of Multi-Shot Video Generation [48.84450712826316]
MSVBench is the first comprehensive benchmark featuring hierarchical scripts and reference images tailored for Multi-Shot Video generation. We propose a hybrid evaluation framework that synergizes the high-level semantic reasoning of Large Multimodal Models with the fine-grained perceptual rigor of domain-specific expert models.
arXiv Detail & Related papers (2026-02-27T12:26:34Z) - HY-Motion 1.0: Scaling Flow Matching Models for Text-To-Motion Generation [63.04826523091837]
HY-Motion 1.0 is a series of state-of-the-art, large-scale motion generation models capable of generating 3D human motions from textual descriptions. We introduce a comprehensive, full-stage training paradigm, including large-scale pretraining on over 3,000 hours of motion data. Our model achieves the most extensive coverage, spanning over 200 motion categories across 6 major classes.
arXiv Detail & Related papers (2025-12-29T13:46:24Z) - IRG-MotionLLM: Interleaving Motion Generation, Assessment and Refinement for Text-to-Motion Generation [54.36300724708094]
Assessment and refinement tasks act as crucial bridges to enable bidirectional knowledge flow between understanding and generation. We introduce IRG-MotionLLM, the first model that seamlessly interleaves motion generation, assessment, and refinement to improve generation performance.
arXiv Detail & Related papers (2025-12-11T15:16:06Z) - HiMoE-VLA: Hierarchical Mixture-of-Experts for Generalist Vision-Language-Action Policies [83.41714103649751]
Development of embodied intelligence models depends on access to high-quality robot demonstration data. We present HiMoE-VLA, a novel vision-language-action framework tailored to handle diverse, heterogeneous robotic data. HiMoE-VLA demonstrates a consistent performance boost over existing VLA baselines, achieving higher accuracy and more robust generalization.
arXiv Detail & Related papers (2025-12-05T13:21:05Z) - GENMO: A GENeralist Model for Human MOtion [64.16188966024542]
We present GENMO, a unified Generalist Model for Human Motion that bridges motion estimation and generation in a single framework. Our key insight is to reformulate motion estimation as constrained motion generation, where the output motion must precisely satisfy observed conditioning signals. Our novel architecture handles variable-length motions and mixed multimodal conditions (text, audio, video) at different time intervals, offering flexible control.
arXiv Detail & Related papers (2025-05-02T17:59:55Z) - GenM$^3$: Generative Pretrained Multi-path Motion Model for Text Conditional Human Motion Generation [19.2804620329011]
The Generative Pretrained Multi-path Motion Model (GenM$^3$) is a comprehensive framework designed to learn unified motion representations. To enable large-scale training, we integrate and unify 11 high-quality motion datasets. GenM$^3$ achieves a state-of-the-art FID of 0.035 on the HumanML3D benchmark, surpassing prior methods by a large margin.
arXiv Detail & Related papers (2025-03-19T05:56:52Z) - WeGen: A Unified Model for Interactive Multimodal Generation as We Chat [51.78489661490396]
We introduce WeGen, a model that unifies multimodal generation and understanding. It generates diverse, highly creative results even from less detailed instructions. We show it achieves state-of-the-art performance across various visual generation benchmarks.
arXiv Detail & Related papers (2025-03-03T02:50:07Z) - MVGenMaster: Scaling Multi-View Generation from Any Image via 3D Priors Enhanced Diffusion Model [87.71060849866093]
We introduce MVGenMaster, a multi-view diffusion model enhanced with 3D priors to address versatile Novel View Synthesis (NVS) tasks. Our model features a simple yet effective pipeline that can generate up to 100 novel views conditioned on variable reference views and camera poses. We present several training and model modifications to strengthen the model with scaled-up datasets.
arXiv Detail & Related papers (2024-11-25T07:34:23Z) - G-NeuroDAVIS: A Neural Network model for generalized embedding, data visualization and sample generation [0.0]
G-NeuroDAVIS is a novel generative model capable of visualizing high-dimensional data through a generalized embedding.
G-NeuroDAVIS can be trained in both supervised and unsupervised settings.
arXiv Detail & Related papers (2024-10-18T07:14:08Z) - Scaling Large Motion Models with Million-Level Human Motions [67.40066387326141]
We present MotionLib, the first million-level dataset for motion generation. We train a large motion model on it, demonstrating robust performance across a wide range of human activities.
arXiv Detail & Related papers (2024-10-04T10:48:54Z) - Large Motion Model for Unified Multi-Modal Motion Generation [50.56268006354396]
Large Motion Model (LMM) is a motion-centric, multi-modal framework that unifies mainstream motion generation tasks into a generalist model.
LMM tackles the challenges of unifying diverse motion generation tasks from three principled aspects.
arXiv Detail & Related papers (2024-04-01T17:55:11Z) - ReMoDiffuse: Retrieval-Augmented Motion Diffusion Model [33.64263969970544]
3D human motion generation is crucial for the creative industry.
Recent advances rely on generative models with domain knowledge for text-driven motion generation.
We propose ReMoDiffuse, a diffusion-model-based motion generation framework that augments the diffusion process with retrieved text-motion examples (a generic retrieval step is sketched after this list).
arXiv Detail & Related papers (2023-04-03T16:29:00Z) - Hierarchical Graph-Convolutional Variational AutoEncoding for Generative Modelling of Human Motion [1.2599533416395767]
Models of human motion commonly focus on either trajectory prediction or action classification, but rarely both.
Here we propose a novel architecture based on hierarchical variational autoencoders and deep graph convolutional neural networks for generating a holistic model of action over multiple time-scales.
We show this Hierarchical Graph-Convolutional Variational Autoencoder (HG-VAE) to be capable of generating coherent actions, detecting out-of-distribution data, and imputing missing data by gradient ascent on the model's posterior (a minimal imputation sketch follows this list).
arXiv Detail & Related papers (2021-11-24T16:21:07Z)
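The retrieval-augmented idea in the ReMoDiffuse entry above can be illustrated with a generic nearest-neighbour lookup over a text-motion database; the interfaces and shapes here are assumptions for illustration, not the paper's actual pipeline.

```python
import torch
import torch.nn.functional as F


def retrieve_neighbours(prompt_emb, db_text_embs, db_motion_feats, k=4):
    """Generic retrieval step for retrieval-augmented motion diffusion
    (illustrative only; not ReMoDiffuse's actual code).

    prompt_emb:      (D,)      embedding of the input text prompt
    db_text_embs:    (N, D)    embeddings of database text descriptions
    db_motion_feats: (N, T, F) motion features paired with those texts
    """
    sims = F.cosine_similarity(prompt_emb.unsqueeze(0), db_text_embs, dim=-1)
    idx = sims.topk(k).indices
    # The retrieved motions would then condition the denoiser,
    # e.g. as extra cross-attention context alongside the text prompt.
    return db_motion_feats[idx]  # (k, T, F)
```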
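Likewise, the HG-VAE entry's imputation-by-gradient-ascent is simple to sketch in a generic VAE setting; the decoder interface, Gaussian prior, and step count below are all assumptions rather than the paper's implementation.

```python
import torch


def impute_missing(decoder, z_init, x_obs, mask, steps=200, lr=1e-2):
    """Fill in missing motion data by gradient ascent on an (approximate)
    log posterior over the latent code, generic VAE-style sketch.

    decoder: hypothetical callable, latent z -> reconstructed motion
    x_obs:   motion tensor with arbitrary values at missing positions
    mask:    1.0 where observed, 0.0 where missing (same shape as x_obs)
    """
    z = z_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        x_hat = decoder(z)
        recon = ((x_hat - x_obs) ** 2 * mask).sum()  # fit observed entries
        prior = (z ** 2).sum()                       # standard-normal prior
        # Minimizing recon + prior ascends the log posterior (up to constants).
        (recon + prior).backward()
        opt.step()
    with torch.no_grad():
        x_hat = decoder(z)
    return mask * x_obs + (1.0 - mask) * x_hat       # keep observed values
```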