Large Motion Model for Unified Multi-Modal Motion Generation
- URL: http://arxiv.org/abs/2404.01284v1
- Date: Mon, 1 Apr 2024 17:55:11 GMT
- Title: Large Motion Model for Unified Multi-Modal Motion Generation
- Authors: Mingyuan Zhang, Daisheng Jin, Chenyang Gu, Fangzhou Hong, Zhongang Cai, Jingfang Huang, Chongzhi Zhang, Xinying Guo, Lei Yang, Ying He, Ziwei Liu
- Abstract summary: Large Motion Model (LMM) is a motion-centric, multi-modal framework that unifies mainstream motion generation tasks into a generalist model.
LMM tackles these challenges from three principled aspects.
- Score: 50.56268006354396
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human motion generation, a cornerstone technique in animation and video production, has widespread applications in various tasks like text-to-motion and music-to-dance. Previous works focus on developing specialist models tailored for each task without scalability. In this work, we present Large Motion Model (LMM), a motion-centric, multi-modal framework that unifies mainstream motion generation tasks into a generalist model. A unified motion model is appealing since it can leverage a wide range of motion data to achieve broad generalization beyond a single task. However, it is also challenging due to the heterogeneous nature of substantially different motion data and tasks. LMM tackles these challenges from three principled aspects: 1) Data: We consolidate datasets with different modalities, formats and tasks into a comprehensive yet unified motion generation dataset, MotionVerse, comprising 10 tasks, 16 datasets, a total of 320k sequences, and 100 million frames. 2) Architecture: We design an articulated attention mechanism ArtAttention that incorporates body part-aware modeling into Diffusion Transformer backbone. 3) Pre-Training: We propose a novel pre-training strategy for LMM, which employs variable frame rates and masking forms, to better exploit knowledge from diverse training data. Extensive experiments demonstrate that our generalist LMM achieves competitive performance across various standard motion generation tasks over state-of-the-art specialist models. Notably, LMM exhibits strong generalization capabilities and emerging properties across many unseen tasks. Additionally, our ablation studies reveal valuable insights about training and scaling up large motion models for future research.
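The abstract describes ArtAttention only at a high level. As a rough illustration of what body part-aware attention inside a transformer block can look like, the sketch below groups motion tokens by body part and uses a part-level boolean matrix to restrict which groups may attend to one another. The part grouping, the masking rule, and every name in the code are illustrative assumptions, not the paper's actual ArtAttention implementation.
```python
# Minimal sketch (not the paper's code) of body part-aware attention:
# tokens are grouped by body part, and a part-level boolean matrix
# restricts which part groups may attend to each other. Part ids,
# group layout, and the masking rule are illustrative assumptions.
import torch
import torch.nn.functional as F

def part_aware_attention(x, part_ids, allow):
    """
    x        : (seq_len, dim) motion tokens
    part_ids : (seq_len,) integer body-part id per token (e.g. 0 = torso)
    allow    : (num_parts, num_parts) bool; allow[i, j] = True means tokens
               of part i may attend to tokens of part j
    """
    dim = x.shape[-1]
    q, k, v = x, x, x                      # single head, learned projections omitted
    scores = q @ k.T / dim ** 0.5          # (seq_len, seq_len) similarities
    mask = allow[part_ids][:, part_ids]    # expand part-level rule to token level
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# Example: 3 parts; each part attends to itself and to the torso (part 0).
allow = torch.eye(3, dtype=torch.bool)
allow[:, 0] = True
x = torch.randn(6, 16)
part_ids = torch.tensor([0, 0, 1, 1, 2, 2])
out = part_aware_attention(x, part_ids, allow)   # (6, 16)
```
Restricting cross-part attention in this way is one plausible reading of "body part-aware modeling"; in the paper's Diffusion Transformer backbone such blocks would additionally be conditioned on the diffusion timestep and the multi-modal inputs.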
Related papers
- ProMotion: Prototypes As Motion Learners [46.08051377180652]
We introduce ProMotion, a unified prototypical framework engineered to model fundamental motion tasks.
ProMotion offers a range of compelling attributes that set it apart from current task-specific paradigms.
We capitalize on a dual mechanism involving the feature denoiser and the prototypical learner to decipher the intricacies of motion.
arXiv Detail & Related papers (2024-06-07T15:10:33Z)
- M$^3$GPT: An Advanced Multimodal, Multitask Framework for Motion Comprehension and Generation [80.20191044840564]
M$^3$GPT is an advanced Multimodal, Multitask framework for motion comprehension and generation.
We employ discrete vector quantization for multimodal control and generation signals, such as text, music and motion/dance.
M$^3$GPT learns to model the connections and synergies among various motion-relevant tasks.
arXiv Detail & Related papers (2024-05-25T15:21:59Z)
- SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation [61.392147185793476]
We present a unified and versatile foundation model, namely, SEED-X.
SEED-X is able to model multi-granularity visual semantics for comprehension and generation tasks.
We hope that our work will inspire future research into what can be achieved by versatile multimodal foundation models in real-world applications.
arXiv Detail & Related papers (2024-04-22T17:56:09Z)
- DiverseMotion: Towards Diverse Human Motion Generation via Discrete Diffusion [70.33381660741861]
We present DiverseMotion, a new approach for synthesizing high-quality human motions conditioned on textual descriptions.
We show that DiverseMotion achieves state-of-the-art motion quality and competitive motion diversity.
arXiv Detail & Related papers (2023-09-04T05:43:48Z)
- An Efficient General-Purpose Modular Vision Model via Multi-Task Heterogeneous Training [79.78201886156513]
We present a model that can perform multiple vision tasks and can be adapted to other downstream tasks efficiently.
Our approach achieves comparable results to single-task state-of-the-art models and demonstrates strong generalization on downstream tasks.
arXiv Detail & Related papers (2023-06-29T17:59:57Z)
- Stochastic Multi-Person 3D Motion Forecasting [21.915057426589744]
We address real-world complexities that were overlooked in prior work on human motion forecasting.
Our framework is general; we instantiate it with different generative models.
Our approach produces diverse and accurate multi-person predictions, significantly outperforming the state of the art.
arXiv Detail & Related papers (2023-06-08T17:59:09Z)
- Example-based Motion Synthesis via Generative Motion Matching [44.20519633463265]
We present GenMM, a generative model that "mines" as many diverse motions as possible from a single or few example sequences.
GenMM inherits the training-free nature and the superior quality of the well-known Motion Matching method.
arXiv Detail & Related papers (2023-06-01T06:19:33Z)
- mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video [89.19867891570945]
mPLUG-2 is a new unified paradigm with modularized design for multi-modal pretraining.
It shares common universal modules for modality collaboration while disentangling modality-specific modules to handle modality entanglement.
Different modules can be flexibly selected for different understanding and generation tasks across all modalities, including text, image, and video.
arXiv Detail & Related papers (2023-02-01T12:40:03Z)
- Dynamic Future Net: Diversified Human Motion Generation [31.987602940970888]
Human motion modelling is crucial in many areas such as computer graphics, vision and virtual reality.
We present Dynamic Future Net, a new deep learning model that explicitly focuses on the intrinsic stochasticity of human motion dynamics.
Our model can generate a large number of high-quality motions with arbitrary duration and visually convincing variations in both space and time.
arXiv Detail & Related papers (2020-08-25T02:31:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.