GenM$^3$: Generative Pretrained Multi-path Motion Model for Text Conditional Human Motion Generation
- URL: http://arxiv.org/abs/2503.14919v1
- Date: Wed, 19 Mar 2025 05:56:52 GMT
- Title: GenM$^3$: Generative Pretrained Multi-path Motion Model for Text Conditional Human Motion Generation
- Authors: Junyu Shi, Lijiang Liu, Yong Sun, Zhiyuan Zhang, Jinni Zhou, Qiang Nie
- Abstract summary: Generative Pretrained Multi-path Motion Model (GenM$^3$) is a framework designed to learn unified motion representations. To enable large-scale training, we integrate and unify 11 high-quality motion datasets. GenM$^3$ achieves a state-of-the-art FID of 0.035 on the HumanML3D benchmark, surpassing state-of-the-art methods by a large margin.
- Score: 19.2804620329011
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scaling up motion datasets is crucial to enhance motion generation capabilities. However, training on large-scale multi-source datasets introduces data heterogeneity challenges due to variations in motion content. To address this, we propose the Generative Pretrained Multi-path Motion Model (GenM$^3$), a comprehensive framework designed to learn unified motion representations. GenM$^3$ comprises two components: 1) a Multi-Expert VQ-VAE (MEVQ-VAE) that adapts to different dataset distributions to learn a unified discrete motion representation, and 2) a Multi-path Motion Transformer (MMT) that improves intra-modal representations by using separate modality-specific pathways, each with densely activated experts to accommodate variations within that modality, and improves inter-modal alignment through a shared text-motion pathway. To enable large-scale training, we integrate and unify 11 high-quality motion datasets (approximately 220 hours of motion data) and augment the combined dataset with textual annotations (nearly 10,000 motion sequences labeled by a large language model and 300+ by human experts). After training on our integrated dataset, GenM$^3$ achieves a state-of-the-art FID of 0.035 on the HumanML3D benchmark, surpassing state-of-the-art methods by a large margin. It also demonstrates strong zero-shot generalization on the IDEA400 dataset, highlighting its effectiveness and adaptability across diverse motion scenarios.
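The abstract describes the two components only at a high level; the sketch below restates them as minimal PyTorch-style code. It is an illustration, not the authors' implementation: the class names, the single codebook shared across experts, the straight-through quantization, the use of stock TransformerEncoderLayer pathways, and all dimensions are assumptions made for clarity, and the densely activated experts inside each pathway are omitted for brevity.

```python
# Minimal sketch of the two GenM^3 components described in the abstract.
# NOT the authors' code: expert count, shared codebook, and sizes are assumptions.
import torch
import torch.nn as nn


class MultiExpertVQVAE(nn.Module):
    """MEVQ-VAE sketch: per-dataset encoder/decoder experts around one codebook,
    so heterogeneous datasets map into a single discrete motion vocabulary."""

    def __init__(self, num_experts=4, motion_dim=263, latent_dim=512, codebook_size=1024):
        super().__init__()
        self.encoders = nn.ModuleList(
            nn.Sequential(nn.Linear(motion_dim, latent_dim), nn.ReLU(),
                          nn.Linear(latent_dim, latent_dim))
            for _ in range(num_experts))
        self.decoders = nn.ModuleList(
            nn.Sequential(nn.Linear(latent_dim, latent_dim), nn.ReLU(),
                          nn.Linear(latent_dim, motion_dim))
            for _ in range(num_experts))
        self.codebook = nn.Embedding(codebook_size, latent_dim)  # unified discrete representation

    def forward(self, motion, dataset_id):
        z = self.encoders[dataset_id](motion)                        # (B, T, D) continuous latents
        flat = z.reshape(-1, z.size(-1))
        codes = torch.cdist(flat, self.codebook.weight).argmin(-1)   # nearest codebook entry
        codes = codes.reshape(z.shape[:-1])                          # (B, T) motion token ids
        z_q = self.codebook(codes)
        z_q = z + (z_q - z).detach()                                 # straight-through gradient
        return self.decoders[dataset_id](z_q), codes


class MultiPathBlock(nn.Module):
    """MMT sketch: a text path and a motion path for intra-modal modeling,
    plus a shared text-motion path for inter-modal alignment."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.text_path = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.motion_path = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.shared_path = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

    def forward(self, text_tokens, motion_tokens):
        t = self.text_path(text_tokens)                              # intra-modal text refinement
        m = self.motion_path(motion_tokens)                          # intra-modal motion refinement
        fused = self.shared_path(torch.cat([t, m], dim=1))           # joint text-motion alignment
        return fused[:, :t.size(1)], fused[:, t.size(1):]


if __name__ == "__main__":
    vqvae, block = MultiExpertVQVAE(), MultiPathBlock()
    motion = torch.randn(2, 64, 263)                 # two 64-frame motion clips
    recon, codes = vqvae(motion, dataset_id=0)       # tokenize with expert 0
    text = torch.randn(2, 16, 512)                   # stand-in text embeddings
    t_out, m_out = block(text, vqvae.codebook(codes))
    print(recon.shape, codes.shape, t_out.shape, m_out.shape)
```

The intended takeaway is structural: dataset-specific experts feed one discrete motion vocabulary, and per-modality pathways are combined with a shared pathway that handles text-motion alignment.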
Related papers
- MTGS: Multi-Traversal Gaussian Splatting [51.22657444433942]
Multi-traversal data provides multiple viewpoints for scene reconstruction within a road block.
We propose Multi-Traversal Gaussian Splatting (MTGS), a novel approach that reconstructs high-quality driving scenes from arbitrarily collected multi-traversal data.
Our results demonstrate that MTGS improves LPIPS by 23.5% and geometry accuracy by 46.3% compared to single-traversal baselines.
arXiv Detail & Related papers (2025-03-16T15:46:12Z) - LargeAD: Large-Scale Cross-Sensor Data Pretraining for Autonomous Driving [52.83707400688378]
LargeAD is a versatile and scalable framework designed for large-scale 3D pretraining across diverse real-world driving datasets. Our framework leverages VFMs to extract semantically rich superpixels from 2D images, which are aligned with LiDAR point clouds to generate high-quality contrastive samples. Our approach delivers significant performance improvements over state-of-the-art methods in both linear probing and fine-tuning for LiDAR-based segmentation and object detection.
arXiv Detail & Related papers (2025-01-07T18:59:59Z) - Multi-Transmotion: Pre-trained Model for Human Motion Prediction [68.87010221355223]
Multi-Transmotion is an innovative transformer-based model designed for cross-modality pre-training.
Our methodology demonstrates competitive performance across various datasets on several downstream tasks.
arXiv Detail & Related papers (2024-11-04T23:15:21Z) - MotionCraft: Crafting Whole-Body Motion with Plug-and-Play Multimodal Controls [30.487510829107908]
We propose MotionCraft, a unified diffusion transformer that crafts whole-body motion with plug-and-play multimodal control.
Our framework employs a coarse-to-fine training strategy, with text-to-motion semantic pre-training as its first stage.
We introduce MC-Bench, the first available multimodal whole-body motion generation benchmark based on the unified SMPL-X format.
arXiv Detail & Related papers (2024-07-30T18:57:06Z) - Towards Open Domain Text-Driven Synthesis of Multi-Person Motions [36.737740727883924]
We curate human pose and motion datasets by estimating pose information from large-scale image and video datasets.
Our method is the first to generate multi-subject motion sequences with high diversity and fidelity from a large variety of textual prompts.
arXiv Detail & Related papers (2024-05-28T18:00:06Z) - M$^3$GPT: An Advanced Multimodal, Multitask Framework for Motion Comprehension and Generation [78.77004913030285]
M$^3$GPT is an advanced $\textbf{M}$ultimodal, $\textbf{M}$ultitask framework for motion comprehension and generation.
We employ discrete vector quantization for multimodal conditional signals, such as text, music and motion/dance, enabling seamless integration into a large language model.
M$^3$GPT learns to model the connections and synergies among various motion-relevant tasks.
arXiv Detail & Related papers (2024-05-25T15:21:59Z) - Large Motion Model for Unified Multi-Modal Motion Generation [50.56268006354396]
Large Motion Model (LMM) is a motion-centric, multi-modal framework that unifies mainstream motion generation tasks into a generalist model.
LMM tackles these challenges from three principled aspects.
arXiv Detail & Related papers (2024-04-01T17:55:11Z) - Scaling Up Dynamic Human-Scene Interaction Modeling [58.032368564071895]
TRUMANS is the most comprehensive motion-captured HSI dataset currently available.
It intricately captures whole-body human motions and part-level object dynamics.
We devise a diffusion-based autoregressive model that efficiently generates HSI sequences of any length.
arXiv Detail & Related papers (2024-03-13T15:45:04Z) - DiverseMotion: Towards Diverse Human Motion Generation via Discrete Diffusion [70.33381660741861]
We present DiverseMotion, a new approach for synthesizing high-quality human motions conditioned on textual descriptions.
We show that our DiverseMotion achieves the state-of-the-art motion quality and competitive motion diversity.
arXiv Detail & Related papers (2023-09-04T05:43:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.