Related papers: FreeMotion: MoCap-Free Human Motion Synthesis with Multimodal Large Language Models

FreeMotion: MoCap-Free Human Motion Synthesis with Multimodal Large Language Models

URL: http://arxiv.org/abs/2406.10740v3
Date: Mon, 11 Nov 2024 17:54:55 GMT
Title: FreeMotion: MoCap-Free Human Motion Synthesis with Multimodal Large Language Models
Authors: Zhikai Zhang, Yitang Li, Haofeng Huang, Mingxian Lin, Li Yi,
Abstract summary: We explore open-set human motion synthesis using natural language instructions as user control signals based on MLLMs. Our method can achieve general human motion synthesis for many downstream tasks.
Score: 19.09048969615117
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Human motion synthesis is a fundamental task in computer animation. Despite recent progress in this field utilizing deep learning and motion capture data, existing methods are always limited to specific motion categories, environments, and styles. This poor generalizability can be partially attributed to the difficulty and expense of collecting large-scale and high-quality motion data. At the same time, foundation models trained with internet-scale image and text data have demonstrated surprising world knowledge and reasoning ability for various downstream tasks. Utilizing these foundation models may help with human motion synthesis, which some recent works have superficially explored. However, these methods didn't fully unveil the foundation models' potential for this task and only support several simple actions and environments. In this paper, we for the first time, without any motion data, explore open-set human motion synthesis using natural language instructions as user control signals based on MLLMs across any motion task and environment. Our framework can be split into two stages: 1) sequential keyframe generation by utilizing MLLMs as a keyframe designer and animator; 2) motion filling between keyframes through interpolation and motion tracking. Our method can achieve general human motion synthesis for many downstream tasks. The promising results demonstrate the worth of mocap-free human motion synthesis aided by MLLMs and pave the way for future research.

Related papers

FUSION: Full-Body Unified Motion Prior for Body and Hands via Diffusion [49.026972478098266]
Hands are central to interacting with our surroundings and conveying gestures.<n>Existing human motion synthesis methods fall short.<n>Key obstacle is the lack of large-scale datasets that jointly capture diverse full-body motion.
arXiv Detail & Related papers (2026-01-07T14:18:59Z)
Human-Object Interaction via Automatically Designed VLM-Guided Motion Policy [30.43930233035367]
We introduce the first unified physics-based HO framework that leverages Vision-Language Models (VLMs)<n>We introduce VLM-Guided Relative Movement Dynamics (RMD), a fine-temporal bipartite motion representation that automatically constructs goal states and reward functions for reinforcement learning.<n>To support our methodology, we present Interplay, a novel dataset with thousands of long-horizon static and dynamic interaction plans.
arXiv Detail & Related papers (2025-03-24T05:18:04Z)
Scaling Large Motion Models with Million-Level Human Motions [67.40066387326141]
We present MotionLib, the first million-level dataset for motion generation.<n>We train a large motion model named projname, demonstrating robust performance across a wide range of human activities.
arXiv Detail & Related papers (2024-10-04T10:48:54Z)
MotionLLM: Understanding Human Behaviors from Human Motions and Videos [40.132643319573205]
This study delves into the realm of multi-modality (i.e., video and motion modalities) human behavior understanding. We present MotionLLM, a framework for human motion understanding, captioning, and reasoning.
arXiv Detail & Related papers (2024-05-30T17:59:50Z)
ReMoS: 3D Motion-Conditioned Reaction Synthesis for Two-Person Interactions [66.87211993793807]
We present ReMoS, a denoising diffusion based model that synthesizes full body motion of a person in two person interaction scenario. We demonstrate ReMoS across challenging two person scenarios such as pair dancing, Ninjutsu, kickboxing, and acrobatics. We also contribute the ReMoCap dataset for two person interactions containing full body and finger motions.
arXiv Detail & Related papers (2023-11-28T18:59:52Z)
Universal Humanoid Motion Representations for Physics-Based Control [71.46142106079292]
We present a universal motion representation that encompasses a comprehensive range of motor skills for physics-based humanoid control. We first learn a motion imitator that can imitate all of human motion from a large, unstructured motion dataset. We then create our motion representation by distilling skills directly from the imitator.
arXiv Detail & Related papers (2023-10-06T20:48:43Z)
Object Motion Guided Human Motion Synthesis [22.08240141115053]
We study the problem of full-body human motion synthesis for the manipulation of large-sized objects. We propose Object MOtion guided human MOtion synthesis (OMOMO), a conditional diffusion framework. We develop a novel system that captures full-body human manipulation motions by simply attaching a smartphone to the object being manipulated.
arXiv Detail & Related papers (2023-09-28T08:22:00Z)
Task-Oriented Human-Object Interactions Generation with Implicit Neural Representations [61.659439423703155]
TOHO: Task-Oriented Human-Object Interactions Generation with Implicit Neural Representations. Our method generates continuous motions that are parameterized only by the temporal coordinate. This work takes a step further toward general human-scene interaction simulation.
arXiv Detail & Related papers (2023-03-23T09:31:56Z)
IMoS: Intent-Driven Full-Body Motion Synthesis for Human-Object Interactions [69.95820880360345]
We present the first framework to synthesize the full-body motion of virtual human characters with 3D objects placed within their reach. Our system takes as input textual instructions specifying the objects and the associated intentions of the virtual characters. We show that our synthesized full-body motions appear more realistic to the participants in more than 80% of scenarios.
arXiv Detail & Related papers (2022-12-14T23:59:24Z)
MotionDiffuse: Text-Driven Human Motion Generation with Diffusion Model [35.32967411186489]
MotionDiffuse is a diffusion model-based text-driven motion generation framework. It excels at modeling complicated data distribution and generating vivid motion sequences. It responds to fine-grained instructions on body parts, and arbitrary-length motion synthesis with time-varied text prompts.
arXiv Detail & Related papers (2022-08-31T17:58:54Z)
MoDi: Unconditional Motion Synthesis from Diverse Data [51.676055380546494]
We present MoDi, an unconditional generative model that synthesizes diverse motions. Our model is trained in a completely unsupervised setting from a diverse, unstructured and unlabeled motion dataset. We show that despite the lack of any structure in the dataset, the latent space can be semantically clustered.
arXiv Detail & Related papers (2022-06-16T09:06:25Z)
Render In-between: Motion Guided Video Synthesis for Action Interpolation [53.43607872972194]
We propose a motion-guided frame-upsampling framework that is capable of producing realistic human motion and appearance. A novel motion model is trained to inference the non-linear skeletal motion between frames by leveraging a large-scale motion-capture dataset. Our pipeline only requires low-frame-rate videos and unpaired human motion data but does not require high-frame-rate videos for training.
arXiv Detail & Related papers (2021-11-01T15:32:51Z)
Task-Generic Hierarchical Human Motion Prior using VAEs [44.356707509079044]
A deep generative model that describes human motions can benefit a wide range of fundamental computer vision and graphics tasks. We present a method for learning complex human motions independent of specific tasks using a combined global and local latent space. We demonstrate the effectiveness of our hierarchical motion variational autoencoder in a variety of tasks including video-based human pose estimation.
arXiv Detail & Related papers (2021-06-07T23:11:42Z)

This list is automatically generated from the titles and abstracts of the papers in this site.