Go to Zero: Towards Zero-shot Motion Generation with Million-scale Data
- URL: http://arxiv.org/abs/2507.07095v1
- Date: Wed, 09 Jul 2025 17:52:04 GMT
- Title: Go to Zero: Towards Zero-shot Motion Generation with Million-scale Data
- Authors: Ke Fan, Shunlin Lu, Minyue Dai, Runyi Yu, Lixing Xiao, Zhiyang Dou, Junting Dong, Lizhuang Ma, Jingbo Wang
- Abstract summary: We aim to push text-to-motion into a new era by achieving zero-shot generalization. We introduce MotionMillion, the largest human motion dataset to date, featuring over 2,000 hours and 2 million high-quality motion sequences. We propose MotionMillion-Eval, the most comprehensive benchmark for evaluating zero-shot motion generation.
- Score: 26.595803661584032
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating diverse and natural human motion sequences based on textual descriptions constitutes a fundamental and challenging research area within the domains of computer vision, graphics, and robotics. Despite significant advancements in this field, current methodologies often face challenges regarding zero-shot generalization capabilities, largely attributable to the limited size of training datasets. Moreover, the lack of a comprehensive evaluation framework impedes the advancement of this task by failing to identify directions for improvement. In this work, we aim to push text-to-motion into a new era by achieving zero-shot generalization. To this end, we first develop an efficient annotation pipeline and introduce MotionMillion, the largest human motion dataset to date, featuring over 2,000 hours and 2 million high-quality motion sequences. Additionally, we propose MotionMillion-Eval, the most comprehensive benchmark for evaluating zero-shot motion generation. Leveraging a scalable architecture, we scale our model to 7B parameters and validate its performance on MotionMillion-Eval. Our results demonstrate strong generalization to out-of-domain and complex compositional motions, marking a significant step toward zero-shot human motion generation. The code is available at https://github.com/VankouF/MotionMillion-Codes.
Related papers
- HY-Motion 1.0: Scaling Flow Matching Models for Text-To-Motion Generation [63.04826523091837]
HY-Motion 1.0 is a series of state-of-the-art, large-scale, motion generation models capable of generating 3D human motions from textual descriptions. We introduce a comprehensive, full-stage training paradigm, including large-scale pretraining on over 3,000 hours of motion data. Our model achieves the most extensive coverage, spanning over 200 motion categories across 6 major classes.
arXiv Detail & Related papers (2025-12-29T13:46:24Z)
- SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control [85.91101551600978]
We show that scaling up model capacity, data, and compute yields a generalist humanoid controller capable of creating natural and robust whole-body movements. We build a foundation model for motion tracking by scaling along three axes: network size, dataset volume, and compute. We show the practical utility of our model through two mechanisms: (1) a real-time universal kinematic planner that bridges motion tracking to downstream task execution, enabling natural and interactive control, and (2) a unified token space that supports various motion input interfaces.
arXiv Detail & Related papers (2025-11-11T04:37:40Z)
- Being-M0.5: A Real-Time Controllable Vision-Language-Motion Model [67.8026841949812]
Being-M0.5 is the first real-time, controllable vision-language-motion model that achieves performance across multiple motion generation tasks. Our approach is built upon HuMo100M, the largest and most comprehensive human motion dataset to date. We introduce a novel part-aware residual quantization technique for motion tokenization that enables precise, granular control over individual body parts during generation.
arXiv Detail & Related papers (2025-08-11T11:26:10Z)
- Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos [66.62109400603394]
We introduce Being-H0, a dexterous Vision-Language-Action model trained on large-scale human videos. Our approach centers on physical instruction tuning, a novel training paradigm that combines large-scale VLA pretraining from human videos, physical space alignment for 3D reasoning, and post-training adaptation for robotic tasks. We empirically show the excellence of Being-H0 in hand motion generation and instruction following, and it also scales well with model and data sizes.
arXiv Detail & Related papers (2025-07-21T13:19:09Z)
- ZeroGrasp: Zero-Shot Shape Reconstruction Enabled Robotic Grasping [40.288085021667065]
We introduce ZeroGrasp, a framework that simultaneously performs 3D reconstruction and grasp pose prediction in near real-time. We evaluate ZeroGrasp on the GraspNet-1B benchmark as well as through real-world robot experiments.
arXiv Detail & Related papers (2025-04-15T04:37:39Z)
- MotionBank: A Large-scale Video Motion Benchmark with Disentangled Rule-based Annotations [85.85596165472663]
We build MotionBank, which comprises 13 video action datasets, 1.24M motion sequences, and 132.9M frames of natural and diverse human motions.
Our MotionBank is beneficial for general motion-related tasks of human motion generation, motion in-context generation, and motion understanding.
arXiv Detail & Related papers (2024-10-17T17:31:24Z)
- Scaling Large Motion Models with Million-Level Human Motions [67.40066387326141]
We present MotionLib, the first million-level dataset for motion generation. We train a large motion model, demonstrating robust performance across a wide range of human activities.
arXiv Detail & Related papers (2024-10-04T10:48:54Z)
- Aligning Human Motion Generation with Human Perceptions [51.831338643012444]
We propose a data-driven approach to bridge the gap by introducing a large-scale human perceptual evaluation dataset, MotionPercept, and a human motion critic model, MotionCritic. Our critic model offers a more accurate metric for assessing motion quality and could be readily integrated into the motion generation pipeline.
arXiv Detail & Related papers (2024-07-02T14:01:59Z)
- Learning Generalizable Human Motion Generator with Reinforcement Learning [95.62084727984808]
Text-driven human motion generation is one of the vital tasks in computer-aided content creation.
Existing methods often overfit specific motion expressions in the training data, hindering their ability to generalize.
We present InstructMotion, which incorporates the trial-and-error paradigm of reinforcement learning for generalizable human motion generation.
arXiv Detail & Related papers (2024-05-24T13:29:12Z)
- FG-MDM: Towards Zero-Shot Human Motion Generation via ChatGPT-Refined Descriptions [19.695991127631974]
We propose a new framework named Fine-Grained Human Motion Diffusion Model (FG-MDM) for zero-shot human motion generation. Specifically, we first parse previous vague textual annotations into fine-grained descriptions of different body parts. FG-MDM can generate human motions beyond the scope of original datasets owing to descriptions that are closer to motion essence.
arXiv Detail & Related papers (2023-12-05T14:01:43Z)
- Universal Humanoid Motion Representations for Physics-Based Control [71.46142106079292]
We present a universal motion representation that encompasses a comprehensive range of motor skills for physics-based humanoid control.
We first learn a motion imitator that can imitate all of human motion from a large, unstructured motion dataset.
We then create our motion representation by distilling skills directly from the imitator.
arXiv Detail & Related papers (2023-10-06T20:48:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.