Related papers: MoCLIP: Motion-Aware Fine-Tuning and Distillation of CLIP for Human Motion Generation

URL: http://arxiv.org/abs/2505.10810v1
Date: Fri, 16 May 2025 03:11:00 GMT
Title: MoCLIP: Motion-Aware Fine-Tuning and Distillation of CLIP for Human Motion Generation
Authors: Gabriel Maldonado, Armin Danesh Pazho, Ghazal Alinezhad Noghre, Vinit Katariya, Hamed Tabkhi
Abstract summary: This work introduces MoCLIP, a fine-tuned CLIP model with an additional motion encoding head, trained on motion sequences using contrastive learning and tethering loss. Experiments demonstrate that MoCLIP improves Top-1, Top-2, and Top-3 accuracy while maintaining competitive FID, leading to improved text-to-motion alignment results.
Score: 2.621434923709917
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Human motion generation is essential for fields such as animation, robotics, and virtual reality, requiring models that effectively capture motion dynamics from text descriptions. Existing approaches often rely on Contrastive Language-Image Pretraining (CLIP)-based text encoders, but their training on text-image pairs constrains their ability to understand temporal and kinematic structures inherent in motion and motion generation. This work introduces MoCLIP, a fine-tuned CLIP model with an additional motion encoding head, trained on motion sequences using contrastive learning and tethering loss. By explicitly incorporating motion-aware representations, MoCLIP enhances motion fidelity while remaining compatible with existing CLIP-based pipelines and seamlessly integrating into various CLIP-based methods. Experiments demonstrate that MoCLIP improves Top-1, Top-2, and Top-3 accuracy while maintaining competitive FID, leading to improved text-to-motion alignment results. These results highlight MoCLIP's versatility and effectiveness, establishing it as a robust framework for enhancing motion generation.
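
As a rough illustration of the training objective the abstract describes, the sketch below combines a symmetric contrastive (InfoNCE-style) loss over paired text and motion embeddings with a tethering term. The abstract does not define the tethering loss, so the form used here (mean squared distance between the fine-tuned text embeddings and those of the frozen original CLIP encoder) is an assumption, as are the function names, the temperature of 0.07, and the weight `lam`. It is not the authors' implementation.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    # Scale a vector to unit length so dot products are cosine similarities.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cross_entropy_diag(logits):
    # Mean cross-entropy where the correct class for row i is column i.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(text_emb, motion_emb, temperature=0.07):
    # Symmetric contrastive loss: text-to-motion and motion-to-text.
    t = [normalize(v) for v in text_emb]
    m = [normalize(v) for v in motion_emb]
    logits = [[dot(ti, mj) / temperature for mj in m] for ti in t]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


def tethering_loss(text_emb, frozen_text_emb):
    # ASSUMED form: mean squared distance to the frozen encoder's embeddings.
    per_sample = [
        sum((a - b) ** 2 for a, b in zip(u, v))
        for u, v in zip(text_emb, frozen_text_emb)
    ]
    return sum(per_sample) / len(per_sample)


def total_loss(text_emb, motion_emb, frozen_text_emb, lam=0.5):
    # lam is a hypothetical weight balancing alignment against drift.
    return contrastive_loss(text_emb, motion_emb) + lam * tethering_loss(
        text_emb, frozen_text_emb
    )
```

With toy embeddings such as `text = [[1.0, 0.0], [0.0, 1.0]]`, calling `total_loss(text, text, text)` gives a value near zero, while swapping the two motion rows raises the contrastive term sharply. The tethering term grows as the fine-tuned text embeddings move away from the frozen ones, which is one plausible way to keep the encoder compatible with existing CLIP-based pipelines.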

Related papers

SynMotion: Semantic-Visual Adaptation for Motion Customized Video Generation [56.90807453045657]
SynMotion is a motion-customized video generation model that jointly leverages semantic guidance and visual adaptation. At the semantic level, we introduce the dual-embedding semantic comprehension mechanism, which disentangles subject and motion representations. At the visual level, we integrate efficient motion adapters into a pre-trained video generation model to enhance motion fidelity and temporal coherence.
arXiv Detail & Related papers (2025-06-30T10:09:32Z)
ReCoM: Realistic Co-Speech Motion Generation with Recurrent Embedded Transformer [58.49950218437718]
We present ReCoM, an efficient framework for generating high-fidelity and generalizable human body motions synchronized with speech. The core innovation lies in the Recurrent Embedded Transformer (RET), which integrates Dynamic Embedding Regularization (DER) into a Vision Transformer (ViT) core architecture. To enhance model robustness, we incorporate the proposed DER strategy, which equips the model with dual capabilities of noise resistance and cross-domain generalization.
arXiv Detail & Related papers (2025-03-27T16:39:40Z)
Dynamic Motion Blending for Versatile Motion Editing [43.10279926787476]
We introduce MotionMixCut, an online data augmentation technique that generates training triplets by blending body part motions based on input text. We present MotionReFit, an auto-regressive diffusion model with a motion coordinator. Our method handles both spatial and temporal motion edits directly from high-level human instructions.
arXiv Detail & Related papers (2025-03-26T17:07:24Z)
MotionGPT-2: A General-Purpose Motion-Language Model for Motion Generation and Understanding [76.30210465222218]
MotionGPT-2 is a unified Large Motion-Language Model (LMLM). It supports multimodal control conditions through pre-trained Large Language Models (LLMs). It is highly adaptable to the challenging 3D holistic motion generation task.
arXiv Detail & Related papers (2024-10-29T05:25:34Z)
LaMP: Language-Motion Pretraining for Motion Generation, Retrieval, and Captioning [19.801187860991117]
This work introduces LaMP, a novel Language-Motion Pretraining model. LaMP generates motion-informative text embeddings, significantly enhancing the relevance and semantics of generated motion sequences. For captioning, we finetune a large language model with the language-informative motion features to develop a strong motion captioning model.
arXiv Detail & Related papers (2024-10-09T17:33:03Z)
MotionClone: Training-Free Motion Cloning for Controllable Video Generation [41.621147782128396]
MotionClone is a training-free framework that enables motion cloning from reference videos to versatile motion-controlled video generation. MotionClone exhibits proficiency in both global camera motion and local object motion, with notable superiority in terms of motion fidelity, textual alignment, and temporal consistency.
arXiv Detail & Related papers (2024-06-08T03:44:25Z)
Spectral Motion Alignment for Video Motion Transfer using Diffusion Models [54.32923808964701]
Spectral Motion Alignment (SMA) is a framework that refines and aligns motion vectors using Fourier and wavelet transforms. SMA learns motion patterns by incorporating frequency-domain regularization, facilitating the learning of whole-frame global motion dynamics. Extensive experiments demonstrate SMA's efficacy in improving motion transfer while maintaining computational efficiency and compatibility across various video customization frameworks.
arXiv Detail & Related papers (2024-03-22T14:47:18Z)
Seeing in Flowing: Adapting CLIP for Action Recognition with Motion Prompts Learning [14.292812802621707]
Contrastive Language-Image Pre-training (CLIP) has recently shown remarkable generalization on "zero-shot" training. We explore the adaptation of CLIP to achieve a more efficient and generalized action recognition method. Our method outperforms most existing state-of-the-art methods by a significant margin on "few-shot" and "zero-shot" training.
arXiv Detail & Related papers (2023-08-09T09:33:45Z)
CALM: Conditional Adversarial Latent Models for Directable Virtual Characters [71.66218592749448]
We present Conditional Adversarial Latent Models (CALM), an approach for generating diverse and directable behaviors for user-controlled interactive virtual characters. Using imitation learning, CALM learns a representation of movement that captures the complexity of human motion, and enables direct control over character movements.
arXiv Detail & Related papers (2023-05-02T09:01:44Z)
Being Comes from Not-being: Open-vocabulary Text-to-Motion Generation with Wordless Training [178.09150600453205]
In this paper, we investigate offline open-vocabulary text-to-motion generation in a zero-shot learning manner. Inspired by the prompt learning in NLP, we pretrain a motion generator that learns to reconstruct the full motion from the masked motion. Our method reformulates the input text into a masked motion as the prompt for the motion generator to "reconstruct" the motion.
arXiv Detail & Related papers (2022-10-28T06:20:55Z)
MotionCLIP: Exposing Human Motion Generation to CLIP Space [40.77049019470539]
We introduce MotionCLIP, a 3D human motion auto-encoder featuring a latent embedding that is disentangled, well behaved, and supports semantic descriptions. MotionCLIP gains its unique power by aligning its latent space with that of the Contrastive Language-Image Pre-training (CLIP) model. MotionCLIP offers unprecedented text-to-motion abilities, allowing out-of-domain actions, disentangled editing, and abstract language specification.
arXiv Detail & Related papers (2022-03-15T16:56:22Z)
AMP: Adversarial Motion Priors for Stylized Physics-Based Character Control [145.61135774698002]
We propose a fully automated approach to selecting motion for a character to track in a given scenario. High-level task objectives that the character should perform can be specified by relatively simple reward functions. Low-level style of the character's behaviors can be specified by a dataset of unstructured motion clips. Our system produces high-quality motions comparable to those achieved by state-of-the-art tracking-based techniques.
arXiv Detail & Related papers (2021-04-05T22:43:14Z)

This list is automatically generated from the titles and abstracts of the papers on this site.