MotionLLM: Understanding Human Behaviors from Human Motions and Videos
- URL: http://arxiv.org/abs/2405.20340v1
- Date: Thu, 30 May 2024 17:59:50 GMT
- Title: MotionLLM: Understanding Human Behaviors from Human Motions and Videos
- Authors: Ling-Hao Chen, Shunlin Lu, Ailing Zeng, Hao Zhang, Benyou Wang, Ruimao Zhang, Lei Zhang,
- Abstract summary: This study delves into the realm of multi-modality (i.e., video and motion modalities) human behavior understanding.
We present MotionLLM, a framework for human motion understanding, captioning, and reasoning.
- Score: 40.132643319573205
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This study delves into the realm of multi-modality (i.e., video and motion modalities) human behavior understanding by leveraging the powerful capabilities of Large Language Models (LLMs). Diverging from recent LLMs designed for video-only or motion-only understanding, we argue that understanding human behavior necessitates joint modeling from both videos and motion sequences (e.g., SMPL sequences) to capture nuanced body part dynamics and semantics effectively. In light of this, we present MotionLLM, a straightforward yet effective framework for human motion understanding, captioning, and reasoning. Specifically, MotionLLM adopts a unified video-motion training strategy that leverages the complementary advantages of existing coarse video-text data and fine-grained motion-text data to glean rich spatial-temporal insights. Furthermore, we collect a substantial dataset, MoVid, comprising diverse videos, motions, captions, and instructions. Additionally, we propose the MoVid-Bench, with carefully manual annotations, for better evaluation of human behavior understanding on video and motion. Extensive experiments show the superiority of MotionLLM in the caption, spatial-temporal comprehension, and reasoning ability.
Related papers
- MotionGPT-2: A General-Purpose Motion-Language Model for Motion Generation and Understanding [76.30210465222218]
MotionGPT-2 is a unified Large Motion-Language Model (LMLMLM)
It supports multimodal control conditions through pre-trained Large Language Models (LLMs)
It is highly adaptable to the challenging 3D holistic motion generation task.
arXiv Detail & Related papers (2024-10-29T05:25:34Z) - MotionBank: A Large-scale Video Motion Benchmark with Disentangled Rule-based Annotations [85.85596165472663]
We build MotionBank, which comprises 13 video action datasets, 1.24M motion sequences, and 132.9M frames of natural and diverse human motions.
Our MotionBank is beneficial for general motion-related tasks of human motion generation, motion in-context generation, and motion understanding.
arXiv Detail & Related papers (2024-10-17T17:31:24Z) - FreeMotion: MoCap-Free Human Motion Synthesis with Multimodal Large Language Models [19.09048969615117]
We explore open-set human motion synthesis using natural language instructions as user control signals based on MLLMs.
Our method can achieve general human motion synthesis for many downstream tasks.
arXiv Detail & Related papers (2024-06-15T21:10:37Z) - Universal Humanoid Motion Representations for Physics-Based Control [71.46142106079292]
We present a universal motion representation that encompasses a comprehensive range of motor skills for physics-based humanoid control.
We first learn a motion imitator that can imitate all of human motion from a large, unstructured motion dataset.
We then create our motion representation by distilling skills directly from the imitator.
arXiv Detail & Related papers (2023-10-06T20:48:43Z) - DiverseMotion: Towards Diverse Human Motion Generation via Discrete
Diffusion [70.33381660741861]
We present DiverseMotion, a new approach for synthesizing high-quality human motions conditioned on textual descriptions.
We show that our DiverseMotion achieves the state-of-the-art motion quality and competitive motion diversity.
arXiv Detail & Related papers (2023-09-04T05:43:48Z) - MotionGPT: Human Motion as a Foreign Language [47.21648303282788]
Human motion displays a semantic coupling akin to human language, often perceived as a form of body language.
By fusing language data with large-scale motion models, motion-language pre-training can enhance the performance of motion-related tasks.
We propose MotionGPT, a unified, versatile, and user-friendly motion-language model to handle multiple motion-relevant tasks.
arXiv Detail & Related papers (2023-06-26T15:53:02Z) - Self-supervised Motion Learning from Static Images [36.85209332144106]
Motion from Static Images (MoSI) learns to encode motion information.
MoSI can discover regions with large motion even without fine-tuning on the downstream datasets.
We demonstrate that MoSI can discover regions with large motion even without fine-tuning on the downstream datasets.
arXiv Detail & Related papers (2021-04-01T03:55:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.