Motion-X++: A Large-Scale Multimodal 3D Whole-body Human Motion Dataset
- URL: http://arxiv.org/abs/2501.05098v1
- Date: Thu, 09 Jan 2025 09:37:27 GMT
- Title: Motion-X++: A Large-Scale Multimodal 3D Whole-body Human Motion Dataset
- Authors: Yuhong Zhang, Jing Lin, Ailing Zeng, Guanlin Wu, Shunlin Lu, Yurong Fu, Yuanhao Cai, Ruimao Zhang, Haoqian Wang, Lei Zhang
- Abstract summary: Motion-X++ is a large-scale multimodal 3D expressive whole-body human motion dataset. Motion-X++ provides 19.5M 3D whole-body pose annotations covering 120.5K motion sequences from massive scenes.
- Score: 35.47253826828815
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we introduce Motion-X++, a large-scale multimodal 3D expressive whole-body human motion dataset. Existing motion datasets predominantly capture body-only poses, lacking facial expressions, hand gestures, and fine-grained pose descriptions, and are typically limited to lab settings with manually labeled text descriptions, thereby restricting their scalability. To address this issue, we develop a scalable annotation pipeline that can automatically capture 3D whole-body human motion and comprehensive textual labels from RGB videos, and build the Motion-X dataset comprising 81.1K text-motion pairs. Furthermore, we extend Motion-X into Motion-X++ by improving the annotation pipeline, introducing more data modalities, and scaling up the data quantities. Motion-X++ provides 19.5M 3D whole-body pose annotations covering 120.5K motion sequences from massive scenes, 80.8K RGB videos, 45.3K audios, 19.5M frame-level whole-body pose descriptions, and 120.5K sequence-level semantic labels. Comprehensive experiments validate the accuracy of our annotation pipeline and highlight Motion-X++'s significant benefits for generating expressive, precise, and natural motion with paired multimodal labels, supporting several downstream tasks, including text-driven whole-body motion generation, audio-driven motion generation, 3D whole-body human mesh recovery, and 2D whole-body keypoint estimation.
Related papers
- FineMotion: A Dataset and Benchmark with both Spatial and Temporal Annotation for Fine-grained Motion Generation and Editing [36.42160163142448]
We propose the FineMotion dataset, which contains over 442,000 human motion snippets. The dataset includes about 95K detailed paragraphs describing the movements of human body parts of entire motion sequences.
arXiv Detail & Related papers (2025-07-26T07:54:29Z) - GenM$^3$: Generative Pretrained Multi-path Motion Model for Text Conditional Human Motion Generation [19.2804620329011]
Generative Pretrained Multi-path Motion Model (GenM$^3$) is a framework designed to learn unified motion representations.
To enable large-scale training, we integrate and unify 11 high-quality motion datasets.
GenM$^3$ achieves a state-of-the-art FID of 0.035 on the HumanML3D benchmark, surpassing state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2025-03-19T05:56:52Z) - Motion-2-to-3: Leveraging 2D Motion Data to Boost 3D Motion Generation [43.915871360698546]
2D human videos offer a vast and accessible source of motion data, covering a wider range of styles and activities. We introduce a novel framework that disentangles local joint motion from global movements, enabling efficient learning of local motion priors from 2D data. Our method efficiently utilizes 2D data, supporting realistic 3D human motion generation and broadening the range of motion types it supports.
arXiv Detail & Related papers (2024-12-17T17:34:52Z) - Motion Prompting: Controlling Video Generation with Motion Trajectories [57.049252242807874]
We train a video generation model conditioned on sparse or dense video trajectories. We translate high-level user requests into detailed, semi-dense motion prompts. We demonstrate our approach through various applications, including camera and object motion control, "interacting" with an image, motion transfer, and image editing.
arXiv Detail & Related papers (2024-12-03T18:59:56Z) - MotionBank: A Large-scale Video Motion Benchmark with Disentangled Rule-based Annotations [85.85596165472663]
We build MotionBank, which comprises 13 video action datasets, 1.24M motion sequences, and 132.9M frames of natural and diverse human motions.
Our MotionBank is beneficial for general motion-related tasks of human motion generation, motion in-context generation, and motion understanding.
arXiv Detail & Related papers (2024-10-17T17:31:24Z) - Scaling Large Motion Models with Million-Level Human Motions [67.40066387326141]
We present MotionLib, the first million-level dataset for motion generation. We train a large motion model named projname, demonstrating robust performance across a wide range of human activities.
arXiv Detail & Related papers (2024-10-04T10:48:54Z) - Holistic-Motion2D: Scalable Whole-body Human Motion Generation in 2D Space [78.95579123031733]
We present $\textbf{Holistic-Motion2D}$, the first comprehensive and large-scale benchmark for 2D whole-body motion generation.
We also highlight the utility of 2D motion for various downstream applications and its potential for lifting to 3D motion.
arXiv Detail & Related papers (2024-06-17T06:31:19Z) - Motion Generation from Fine-grained Textual Descriptions [29.033358642532722]
We build a large-scale language-motion dataset specializing in fine-grained textual descriptions, FineHumanML3D.
We design a new text2motion model, FineMotionDiffuse, making full use of fine-grained textual information.
Our evaluation shows that FineMotionDiffuse trained on FineHumanML3D improves FID by a large margin of 0.38, compared with competitive baselines.
arXiv Detail & Related papers (2024-03-20T11:38:30Z) - MotionScript: Natural Language Descriptions for Expressive 3D Human Motions [8.050271017133076]
We introduce MotionScript, a novel framework for generating highly detailed, natural language descriptions of 3D human motions.
MotionScript provides fine-grained, structured descriptions that capture the full complexity of human movement.
MotionScript serves as both a descriptive tool and a training resource for text-to-motion models.
arXiv Detail & Related papers (2023-12-19T22:33:17Z) - Motion-X: A Large-scale 3D Expressive Whole-body Human Motion Dataset [40.54625833855793]
Motion-X is a large-scale 3D expressive whole-body motion dataset.
It comprises 15.6M precise 3D whole-body pose annotations (i.e., SMPL-X) covering 81.1K motion sequences from massive scenes.
Motion-X provides 15.6M frame-level whole-body pose descriptions and 81.1K sequence-level semantic labels.
arXiv Detail & Related papers (2023-07-03T07:57:29Z) - CIRCLE: Capture In Rich Contextual Environments [69.97976304918149]
We propose a novel motion acquisition system in which the actor perceives and operates in a highly contextual virtual world.
We present CIRCLE, a dataset containing 10 hours of full-body reaching motion from 5 subjects across nine scenes.
We use this dataset to train a model that generates human motion conditioned on scene information.
arXiv Detail & Related papers (2023-03-31T09:18:12Z) - SportsCap: Monocular 3D Human Motion Capture and Fine-grained Understanding in Challenging Sports Videos [40.19723456533343]
We propose SportsCap -- the first approach for simultaneously capturing 3D human motions and understanding fine-grained actions from monocular challenging sports video input.
Our approach utilizes the semantic and temporally structured sub-motion prior in the embedding space for motion capture and understanding.
Based on such hybrid motion information, we introduce a multi-stream spatial-temporal Graph Convolutional Network (ST-GCN) to predict the fine-grained semantic action attributes.
arXiv Detail & Related papers (2021-04-23T07:52:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.