Related papers: M$^3$GPT: An Advanced Multimodal, Multitask Framework for Motion Comprehension and Generation

M$^3$GPT: An Advanced Multimodal, Multitask Framework for Motion Comprehension and Generation

URL: http://arxiv.org/abs/2405.16273v3
Date: Wed, 29 May 2024 11:46:57 GMT
Title: M$^3$GPT: An Advanced Multimodal, Multitask Framework for Motion Comprehension and Generation
Authors: Mingshuang Luo, Ruibing Hou, Hong Chang, Zimo Liu, Yaowei Wang, Shiguang Shan,
Abstract summary: M$3$GPT is an advanced $textbfM$ultimodal, $textbfM$ultitask framework for comprehension and generation. We employ discrete vector quantization for multimodal control and generation signals, such as text, music and motion/dance. M$3$GPT learns to model the connections and synergies among various motion-relevant tasks.
Score: 80.20191044840564
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This paper presents M$^3$GPT, an advanced $\textbf{M}$ultimodal, $\textbf{M}$ultitask framework for $\textbf{M}$otion comprehension and generation. M$^3$GPT operates on three fundamental principles. The first focuses on creating a unified representation space for various motion-relevant modalities. We employ discrete vector quantization for multimodal control and generation signals, such as text, music and motion/dance, enabling seamless integration into a large language model (LLM) with a single vocabulary. The second involves modeling model generation directly in the raw motion space. This strategy circumvents the information loss associated with discrete tokenizer, resulting in more detailed and comprehensive model generation. Third, M$^3$GPT learns to model the connections and synergies among various motion-relevant tasks. Text, the most familiar and well-understood modality for LLMs, is utilized as a bridge to establish connections between different motion tasks, facilitating mutual reinforcement. To our knowledge, M$^3$GPT is the first model capable of comprehending and generating motions based on multiple signals. Extensive experiments highlight M$^3$GPT's superior performance across various motion-relevant tasks and its powerful zero-shot generalization capabilities for extremely challenging tasks.

Related papers

EchoMimicV3: 1.3B Parameters are All You Need for Unified Multi-Modal and Multi-Task Human Animation [8.214084596349744]
EchoMimicV3 is an efficient framework that unifies multi-task and multi-modal human animation.<n>With a minimal model size of 1.3 billion parameters, EchoMimicV3 achieves competitive performance in both quantitative and qualitative evaluations.
arXiv Detail & Related papers (2025-07-05T05:36:26Z)
GenM$^3$: Generative Pretrained Multi-path Motion Model for Text Conditional Human Motion Generation [19.2804620329011]
Generative Pretrained Multi-path Motion Model (GenM$3$) is a framework designed to learn unified motion representations. To enable large-scale training, we integrate and unify 11 high-quality motion datasets. GenM$3$ achieves a state-of-the-art FID of 0.035 on the HumanML3D benchmark, surpassing state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2025-03-19T05:56:52Z)
SM3Det: A Unified Model for Multi-Modal Remote Sensing Object Detection [73.49799596304418]
This paper introduces a new task called Multi-Modal datasets and Multi-Task Object Detection (M2Det) for remote sensing. It is designed to accurately detect horizontal or oriented objects from any sensor modality. This task poses challenges due to 1) the trade-offs involved in managing multi-modal modelling and 2) the complexities of multi-task optimization.
arXiv Detail & Related papers (2024-12-30T02:47:51Z)
MotionLLaMA: A Unified Framework for Motion Synthesis and Comprehension [26.172040706657235]
MotionLLaMA is a unified framework for motion synthesis and comprehension. The HoMi Tokenizer is a novel full-body motion tokenizer. MotionLLaMA achieves state-of-the-art (SOTA) performance in motion completion, interaction dual-person text-to-motion, and all comprehension tasks.
arXiv Detail & Related papers (2024-11-26T11:28:01Z)
MotionGPT-2: A General-Purpose Motion-Language Model for Motion Generation and Understanding [76.30210465222218]
MotionGPT-2 is a unified Large Motion-Language Model (LMLMLM) It supports multimodal control conditions through pre-trained Large Language Models (LLMs) It is highly adaptable to the challenging 3D holistic motion generation task.
arXiv Detail & Related papers (2024-10-29T05:25:34Z)
MIO: A Foundation Model on Multimodal Tokens [74.85153216521945]
We introduce MIO, a novel foundation model built on multimodal tokens. MIO is capable of understanding and generating speech, text, images, and videos in an end-to-end, autoregressive manner.
arXiv Detail & Related papers (2024-09-26T09:57:16Z)
MotionCraft: Crafting Whole-Body Motion with Plug-and-Play Multimodal Controls [30.487510829107908]
We propose MotionCraft, a unified diffusion transformer that crafts whole-body motion with plug-and-play multimodal control. Our framework employs a coarse-to-fine training strategy, starting with the first stage of text-to-motion semantic pre-training. We introduce MC-Bench, the first available multimodal whole-body motion generation benchmark based on the unified SMPL-X format.
arXiv Detail & Related papers (2024-07-30T18:57:06Z)
Deciphering Movement: Unified Trajectory Generation Model for Multi-Agent [53.637837706712794]
We propose a Unified Trajectory Generation model, UniTraj, that processes arbitrary trajectories as masked inputs. Specifically, we introduce a Ghost Spatial Masking (GSM) module embedded within a Transformer encoder for spatial feature extraction. We benchmark three practical sports game datasets, Basketball-U, Football-U, and Soccer-U, for evaluation.
arXiv Detail & Related papers (2024-05-27T22:15:23Z)
Motion-Agent: A Conversational Framework for Human Motion Generation with LLMs [67.59291068131438]
Motion-Agent is a conversational framework designed for general human motion generation, editing, and understanding. Motion-Agent employs an open-source pre-trained language model to develop a generative agent, MotionLLM, that bridges the gap between motion and text.
arXiv Detail & Related papers (2024-05-27T09:57:51Z)
Large Motion Model for Unified Multi-Modal Motion Generation [50.56268006354396]
Large Motion Model (LMM) is a motion-centric, multi-modal framework that unifies mainstream motion generation tasks into a generalist model. LMM tackles these challenges from three principled aspects.
arXiv Detail & Related papers (2024-04-01T17:55:11Z)
MoConVQ: Unified Physics-Based Motion Control via Scalable Discrete Representations [25.630268570049708]
MoConVQ is a novel unified framework for physics-based motion control leveraging scalable discrete representations. Our approach effectively learns motion embeddings from a large, unstructured dataset spanning tens of hours of motion examples.
arXiv Detail & Related papers (2023-10-16T09:09:02Z)

This list is automatically generated from the titles and abstracts of the papers in this site.