LaMP: Language-Motion Pretraining for Motion Generation, Retrieval, and Captioning
- URL: http://arxiv.org/abs/2410.07093v1
- Date: Wed, 9 Oct 2024 17:33:03 GMT
- Title: LaMP: Language-Motion Pretraining for Motion Generation, Retrieval, and Captioning
- Authors: Zhe Li, Weihao Yuan, Yisheng He, Lingteng Qiu, Shenhao Zhu, Xiaodong Gu, Weichao Shen, Yuan Dong, Zilong Dong, Laurence T. Yang
- Abstract summary: This work introduces LaMP, a novel Language-Motion Pretraining model.
LaMP generates motion-informative text embeddings, significantly enhancing the relevance and semantics of generated motion sequences.
For captioning, we finetune a large language model with the language-informative motion features to develop a strong motion captioning model.
- Score: 19.801187860991117
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language plays a vital role in the realm of human motion. Existing methods have largely depended on CLIP text embeddings for motion generation, yet they fall short in effectively aligning language and motion due to CLIP's pretraining on static image-text pairs. This work introduces LaMP, a novel Language-Motion Pretraining model, which transitions from a language-vision to a more suitable language-motion latent space. It addresses key limitations by generating motion-informative text embeddings, significantly enhancing the relevance and semantics of generated motion sequences. With LaMP, we advance three key tasks: text-to-motion generation, motion-text retrieval, and motion captioning through aligned language-motion representation learning. For generation, we utilize LaMP to provide the text condition instead of CLIP, and an autoregressive masked prediction is designed to achieve mask modeling without rank collapse in transformers. For retrieval, motion features from LaMP's motion transformer interact with query tokens to retrieve text features from the text transformer, and vice versa. For captioning, we finetune a large language model with the language-informative motion features to develop a strong motion captioning model. In addition, we introduce the LaMP-BertScore metric to assess the alignment of generated motions with textual descriptions. Extensive experimental results on multiple datasets demonstrate substantial improvements over previous methods across all three tasks. The code of our method will be made public.
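The abstract gives no implementation details, so what follows is only a minimal, hypothetical sketch of the core idea of CLIP-style contrastive pretraining over a motion encoder and a text encoder in PyTorch. The class name `LanguageMotionPretraining`, the encoder interfaces, the embedding dimension, and the temperature value are illustrative assumptions, not the authors' released code.

```python
# Hypothetical sketch (not the authors' code): align a motion encoder and a
# text encoder in a shared latent space with a symmetric InfoNCE loss -- the
# general recipe behind replacing CLIP's image-text space with a
# language-motion one.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageMotionPretraining(nn.Module):
    def __init__(self, motion_encoder: nn.Module, text_encoder: nn.Module,
                 temperature: float = 0.07):
        super().__init__()
        self.motion_encoder = motion_encoder  # assumed to map motion -> (B, D)
        self.text_encoder = text_encoder      # assumed to map text   -> (B, D)
        self.logit_scale = nn.Parameter(torch.tensor(1.0 / temperature).log())

    def forward(self, motion: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # L2-normalize so dot products become cosine similarities.
        m = F.normalize(self.motion_encoder(motion), dim=-1)
        t = F.normalize(self.text_encoder(text), dim=-1)
        logits = self.logit_scale.exp() * m @ t.T  # (B, B) similarity matrix
        targets = torch.arange(m.size(0), device=m.device)
        # Matched motion-text pairs lie on the diagonal; pull them together and
        # push mismatched pairs apart, in both retrieval directions.
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.T, targets))

# Toy usage with stand-in linear encoders (real models would be transformers):
# model = LanguageMotionPretraining(nn.Linear(263, 256), nn.Linear(768, 256))
# loss = model(torch.randn(8, 263), torch.randn(8, 768))
```

After such pretraining, a text-to-motion generator would condition on the text encoder's motion-informative embeddings in place of CLIP features. The paper's autoregressive masked prediction, query-token retrieval, and LLM finetuning for captioning involve further components not reproduced here; a separate masked-modeling sketch follows the related-papers list below.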
Related papers
- An Efficient Sign Language Translation Using Spatial Configuration and Motion Dynamics with LLMs [7.630967411418269]
Gloss-free Sign Language Translation (SLT) converts sign videos directly into spoken language sentences without relying on glosses.
This paper emphasizes the importance of capturing the spatial configurations and motion dynamics inherent in sign language.
We introduce Spatial and Motion-based Sign Language Translation (SpaMo), a novel LLM-based SLT framework.
arXiv Detail & Related papers (2024-08-20T07:10:40Z) - EvSign: Sign Language Recognition and Translation with Streaming Events [59.51655336911345]
Event cameras can naturally perceive dynamic hand movements, providing rich manual cues for sign language tasks.
We propose an efficient transformer-based framework for event-based SLR and SLT tasks.
Our method performs favorably against existing state-of-the-art approaches with only 0.34% of the computational cost.
arXiv Detail & Related papers (2024-07-17T14:16:35Z) - Motion-Agent: A Conversational Framework for Human Motion Generation with LLMs [67.59291068131438]
Motion-Agent is a conversational framework designed for general human motion generation, editing, and understanding.
Motion-Agent employs an open-source pre-trained language model to develop a generative agent, MotionLLM, that bridges the gap between motion and text.
arXiv Detail & Related papers (2024-05-27T09:57:51Z) - Plan, Posture and Go: Towards Open-World Text-to-Motion Generation [43.392549755386135]
We present a divide-and-conquer framework named PRO-Motion.
It consists of three modules: a motion planner, a posture-diffuser, and a go-diffuser.
PRO-Motion can generate diverse and realistic motions from complex open-world prompts.
arXiv Detail & Related papers (2023-12-22T17:02:45Z) - LivePhoto: Real Image Animation with Text-guided Motion Control [51.31418077586208]
This work presents a practical system, named LivePhoto, which allows users to animate an image of their interest with text descriptions.
We first establish a strong baseline that helps a well-learned text-to-image generator (i.e., Stable Diffusion) take an image as a further input.
We then equip the improved generator with a motion module for temporal modeling and propose a carefully designed training pipeline to better link texts and motions.
arXiv Detail & Related papers (2023-12-05T17:59:52Z) - MotionGPT: Human Motion as a Foreign Language [47.21648303282788]
Human motion displays a semantic coupling akin to human language, often perceived as a form of body language.
By fusing language data with large-scale motion models, motion-language pre-training can enhance the performance of motion-related tasks.
We propose MotionGPT, a unified, versatile, and user-friendly motion-language model to handle multiple motion-relevant tasks.
arXiv Detail & Related papers (2023-06-26T15:53:02Z) - Text-to-Motion Retrieval: Towards Joint Understanding of Human Motion Data and Natural Language [4.86658723641864]
We propose a novel text-to-motion retrieval task, which aims at retrieving relevant motions based on a specified natural description.
Inspired by the recent progress in text-to-image/video matching, we experiment with two widely-adopted metric-learning loss functions.
arXiv Detail & Related papers (2023-05-25T08:32:41Z) - Think Before You Act: Unified Policy for Interleaving Language Reasoning with Actions [21.72567982148215]
We show how to train transformers with a similar next-step prediction objective on offline data.
We propose a novel method for unifying language reasoning with actions in a single policy.
Specifically, we augment a transformer policy with word outputs, so it can generate textual captions interleaved with actions.
arXiv Detail & Related papers (2023-04-18T16:12:38Z) - Being Comes from Not-being: Open-vocabulary Text-to-Motion Generation with Wordless Training [178.09150600453205]
In this paper, we investigate offline open-vocabulary text-to-motion generation in a zero-shot learning manner.
Inspired by the prompt learning in NLP, we pretrain a motion generator that learns to reconstruct the full motion from the masked motion.
Our method reformulates the input text into a masked motion as the prompt for the motion generator to "reconstruct" the motion (see the masked-modeling sketch after this list).
arXiv Detail & Related papers (2022-10-28T06:20:55Z) - Text-driven Video Prediction [83.04845684117835]
We propose a new task called Text-driven Video Prediction (TVP).
Taking the first frame and text caption as inputs, this task aims to synthesize the following frames.
To investigate the capability of text in causal inference for progressive motion information, our TVP framework contains a Text Inference Module (TIM).
arXiv Detail & Related papers (2022-10-06T12:43:07Z)
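Both LaMP's autoregressive masked prediction and the masked-motion reconstruction used for wordless training in the open-vocabulary entry above build on masked motion modeling. Below is a minimal, hypothetical PyTorch sketch of that generic idea; the module name, dimensions, masking ratio, and per-frame regression loss are illustrative assumptions and do not reproduce either paper's actual design.

```python
# Hypothetical sketch (not from any release): randomly mask motion frames and
# train a transformer to reconstruct them -- the generic masked motion
# modeling objective referenced by several works above.
import torch
import torch.nn as nn

class MaskedMotionModel(nn.Module):
    def __init__(self, motion_dim: int = 263, d_model: int = 256,
                 n_layers: int = 4, n_heads: int = 4):
        super().__init__()
        self.in_proj = nn.Linear(motion_dim, d_model)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.out_proj = nn.Linear(d_model, motion_dim)

    def forward(self, motion: torch.Tensor, mask_ratio: float = 0.5) -> torch.Tensor:
        # motion: (B, T, motion_dim). Replace a random subset of frames with a
        # learned mask token and reconstruct the full sequence.
        x = self.in_proj(motion)
        mask = torch.rand(x.shape[:2], device=x.device) < mask_ratio  # (B, T)
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        recon = self.out_proj(self.encoder(x))
        # Reconstruction loss only on the masked positions, as is standard
        # in masked modeling.
        return ((recon - motion) ** 2)[mask].mean()

# Toy usage:
# loss = MaskedMotionModel()(torch.randn(4, 64, 263))
```

LaMP's own variant is described as autoregressive masked prediction designed to avoid rank collapse in transformers; the bidirectional, per-frame regression above is only the simplest stand-in for that family of objectives.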
This list is automatically generated from the titles and abstracts of the papers in this site.