MotionCLIP: Exposing Human Motion Generation to CLIP Space
- URL: http://arxiv.org/abs/2203.08063v1
- Date: Tue, 15 Mar 2022 16:56:22 GMT
- Title: MotionCLIP: Exposing Human Motion Generation to CLIP Space
- Authors: Guy Tevet, Brian Gordon, Amir Hertz, Amit H. Bermano, Daniel Cohen-Or
- Abstract summary: We introduce MotionCLIP, a 3D human motion auto-encoder featuring a latent embedding that is disentangled, well behaved, and supports semantic descriptions.
MotionCLIP gains its unique power by aligning its latent space with that of the Contrastive Language-Image Pre-training (CLIP) model.
MotionCLIP offers unprecedented text-to-motion abilities, allowing out-of-domain actions, disentangled editing, and abstract language specification.
- Score: 40.77049019470539
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We introduce MotionCLIP, a 3D human motion auto-encoder featuring a latent
embedding that is disentangled, well behaved, and supports highly semantic
textual descriptions. MotionCLIP gains its unique power by aligning its latent
space with that of the Contrastive Language-Image Pre-training (CLIP) model.
Aligning the human motion manifold to CLIP space implicitly infuses the
extremely rich semantic knowledge of CLIP into the manifold. In particular, it
helps continuity by placing semantically similar motions close to one another,
and disentanglement, which is inherited from the CLIP-space structure.
MotionCLIP comprises a transformer-based motion auto-encoder, trained to
reconstruct motion while being aligned to its text label's position in
CLIP-space. We further leverage CLIP's unique visual understanding and inject
an even stronger signal through aligning motion to rendered frames in a
self-supervised manner. We show that although CLIP has never seen the motion
domain, MotionCLIP offers unprecedented text-to-motion abilities, allowing
out-of-domain actions, disentangled editing, and abstract language
specification. For example, the text prompt "couch" is decoded into a sitting
down motion, due to lingual similarity, and the prompt "Spiderman" results in a
web-swinging-like solution that is far from seen during training. In addition,
we show how the introduced latent space can be leveraged for motion
interpolation, editing and recognition.
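The training objective described in the abstract can be summarized in a short sketch. The following is a minimal, illustrative PyTorch sketch, not the authors' released implementation: it combines motion reconstruction with cosine alignment of the motion latent to the CLIP text embedding of the label and to the CLIP image embedding of a rendered frame. Class names, layer sizes, loss weights, and the rendering pipeline are assumptions for illustration; only the frozen CLIP calls (clip.load, encode_text, encode_image) follow the public OpenAI CLIP API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import clip  # OpenAI CLIP (https://github.com/openai/CLIP)


class MotionAutoEncoder(nn.Module):
    """Stand-in for the transformer-based motion auto-encoder (hypothetical sizes)."""

    def __init__(self, pose_dim=135, latent_dim=512, n_frames=60):
        super().__init__()
        self.n_frames = n_frames
        layer = nn.TransformerEncoderLayer(d_model=latent_dim, nhead=8, batch_first=True)
        self.in_proj = nn.Linear(pose_dim, latent_dim)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.decoder = nn.TransformerEncoder(layer, num_layers=4)  # simplified symmetric decoder
        self.out_proj = nn.Linear(latent_dim, pose_dim)

    def encode(self, motion):           # motion: (B, T, pose_dim)
        h = self.encoder(self.in_proj(motion))
        return h.mean(dim=1)            # one latent per sequence, matching CLIP's 512-d space

    def decode(self, z):                # z: (B, latent_dim)
        h = z[:, None, :].expand(-1, self.n_frames, -1)
        return self.out_proj(self.decoder(h))


def training_step(ae, clip_model, motion, text_labels, rendered_frames, device="cpu"):
    """Reconstruction loss + alignment to frozen CLIP text/image embeddings (illustrative weights)."""
    z = ae.encode(motion)
    loss_recon = F.mse_loss(ae.decode(z), motion)

    with torch.no_grad():  # CLIP stays frozen
        t = clip_model.encode_text(clip.tokenize(text_labels).to(device)).float()
        v = clip_model.encode_image(rendered_frames).float()  # frames already CLIP-preprocessed

    loss_text = 1.0 - F.cosine_similarity(z, t, dim=-1).mean()
    loss_image = 1.0 - F.cosine_similarity(z, v, dim=-1).mean()
    return loss_recon + 0.1 * loss_text + 0.1 * loss_image  # loss weights are an assumption


# Example wiring (ViT-B/32 yields 512-d embeddings, matching latent_dim above):
# clip_model, preprocess = clip.load("ViT-B/32", device="cpu")
# ae = MotionAutoEncoder()
# loss = training_step(ae, clip_model, motion_batch, ["a person walking"], frame_batch)
```

Under this view, the interpolation and editing uses mentioned in the abstract amount to decoding blends of encoded latents, e.g. ae.decode(0.5 * z1 + 0.5 * z2), and recognition can be done by nearest-neighbor matching of motion latents against CLIP text embeddings of class names.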
Related papers
- MoCLIP: Motion-Aware Fine-Tuning and Distillation of CLIP for Human Motion Generation [2.621434923709917]
This work introduces MoCLIP, a fine-tuned CLIP model with an additional motion encoding head, trained on motion sequences using contrastive learning and tethering loss.
Experiments demonstrate that MoCLIP improves Top-1, Top-2, and Top-3 accuracy while maintaining competitive FID, leading to improved text-to-motion alignment results.
arXiv Detail & Related papers (2025-05-16T03:11:00Z)
- LaMP: Language-Motion Pretraining for Motion Generation, Retrieval, and Captioning [19.801187860991117]
This work introduces LaMP, a novel Language-Motion Pretraining model.
LaMP generates motion-informative text embeddings, significantly enhancing the relevance and semantics of generated motion sequences.
For captioning, we fine-tune a large language model with the language-informative motion features to develop a strong motion captioning model.
arXiv Detail & Related papers (2024-10-09T17:33:03Z)
- OmniCLIP: Adapting CLIP for Video Recognition with Spatial-Temporal Omni-Scale Feature Learning [8.707819647492467]
We propose a framework that adapts CLIP for video recognition by focusing on learning comprehensive features encompassing spatial, temporal, and dynamic spatial-temporal scales.
We have conducted extensive experiments in supervised video recognition, few-shot video recognition, and zero-shot recognition tasks.
The results demonstrate the effectiveness of our method, especially with OmniCLIP achieving a top-1 accuracy of 74.30% on HMDB51 in a 16-shot setting.
arXiv Detail & Related papers (2024-08-12T13:55:46Z)
- MASA: Motion-aware Masked Autoencoder with Semantic Alignment for Sign Language Recognition [94.56755080185732]
We propose a Motion-Aware masked autoencoder with Semantic Alignment (MASA) that integrates rich motion cues and global semantic information.
Our framework can simultaneously learn local motion cues and global semantic features for comprehensive sign language representation.
arXiv Detail & Related papers (2024-05-31T08:06:05Z)
- Plan, Posture and Go: Towards Open-World Text-to-Motion Generation [43.392549755386135]
We present a divide-and-conquer framework named PRO-Motion.
It consists of three modules: a motion planner, a posture-diffuser, and a go-diffuser.
PRO-Motion can generate diverse and realistic motions from complex open-world prompts.
arXiv Detail & Related papers (2023-12-22T17:02:45Z)
- MotionCrafter: One-Shot Motion Customization of Diffusion Models [66.44642854791807]
We introduce MotionCrafter, a one-shot instance-guided motion customization method.
MotionCrafter employs a parallel spatial-temporal architecture that injects the reference motion into the temporal component of the base model.
During training, a frozen base model provides appearance normalization, effectively separating appearance from motion.
arXiv Detail & Related papers (2023-12-08T16:31:04Z)
- CALM: Conditional Adversarial Latent Models for Directable Virtual Characters [71.66218592749448]
We present Conditional Adversarial Latent Models (CALM), an approach for generating diverse and directable behaviors for user-controlled interactive virtual characters.
Using imitation learning, CALM learns a representation of movement that captures the complexity of human motion, and enables direct control over character movements.
arXiv Detail & Related papers (2023-05-02T09:01:44Z)
- GestureDiffuCLIP: Gesture Diffusion Model with CLIP Latents [3.229105662984031]
GestureDiffuCLIP is a neural network framework for synthesizing realistic, stylized co-speech gestures with flexible style control.
Our system learns a latent diffusion model to generate high-quality gestures and infuses the CLIP representations of style into the generator.
Our system can be extended to allow fine-grained style control of individual body parts.
arXiv Detail & Related papers (2023-03-26T03:35:46Z)
- CLIP2GAN: Towards Bridging Text with the Latent Space of GANs [128.47600914674985]
We propose a novel framework, CLIP2GAN, which leverages the CLIP model and StyleGAN.
The key idea of our CLIP2GAN is to bridge the output feature embedding space of CLIP and the input latent space of StyleGAN.
arXiv Detail & Related papers (2022-11-28T04:07:17Z)
- CLIP-Actor: Text-Driven Recommendation and Stylization for Animating Human Meshes [17.22112222736234]
We propose CLIP-Actor, a text-driven motion recommendation and neural mesh stylization system for human mesh animation.
It animates a 3D human mesh to conform to a text prompt by recommending a motion sequence and learning mesh style attributes.
We demonstrate that CLIP-Actor produces plausible, human-recognizable stylized 3D human meshes in motion, with detailed geometry and texture, from a natural language prompt.
arXiv Detail & Related papers (2022-06-09T09:50:39Z)
- MoCaNet: Motion Retargeting in-the-wild via Canonicalization Networks [77.56526918859345]
We present a novel framework that brings the 3D motion task from controlled environments to in-the-wild scenarios.
It can retarget body motion from a character in a 2D monocular video to a 3D character without using any motion capture system or 3D reconstruction procedure.
arXiv Detail & Related papers (2021-12-19T07:52:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.