LocoMotion: Learning Motion-Focused Video-Language Representations
- URL: http://arxiv.org/abs/2410.12018v2
- Date: Wed, 23 Oct 2024 15:21:54 GMT
- Title: LocoMotion: Learning Motion-Focused Video-Language Representations
- Authors: Hazel Doughty, Fida Mohammad Thoker, Cees G. M. Snoek,
- Abstract summary: We propose LocoMotion to learn from motion-focused captions that describe the movement and temporal progression of local object motions.
We achieve this by adding synthetic motions to videos and using the parameters of these motions to generate corresponding captions.
- Score: 45.33444862034461
- License:
- Abstract: This paper strives for motion-focused video-language representations. Existing methods to learn video-language representations use spatial-focused data, where identifying the objects and scene is often enough to distinguish the relevant caption. We instead propose LocoMotion to learn from motion-focused captions that describe the movement and temporal progression of local object motions. We achieve this by adding synthetic motions to videos and using the parameters of these motions to generate corresponding captions. Furthermore, we propose verb-variation paraphrasing to increase the caption variety and learn the link between primitive motions and high-level verbs. With this, we are able to learn a motion-focused video-language representation. Experiments demonstrate our approach is effective for a variety of downstream tasks, particularly when limited data is available for fine-tuning. Code is available: https://hazeldoughty.github.io/Papers/LocoMotion/
Related papers
- Reenact Anything: Semantic Video Motion Transfer Using Motion-Textual Inversion [9.134743677331517]
We propose a pre-trained image-to-video model to disentangle appearance from motion.
Our method, called motion-textual inversion, leverages our observation that image-to-video models extract appearance mainly from the (latent) image input.
By operating on an inflated motion-text embedding containing multiple text/image embedding tokens per frame, we achieve a high temporal motion granularity.
Our approach does not require spatial alignment between the motion reference video and target image, generalizes across various domains, and can be applied to various tasks.
arXiv Detail & Related papers (2024-08-01T10:55:20Z) - Diving Deep into the Motion Representation of Video-Text Models [12.197093960700187]
GPT-4 generated motion descriptions capture fine-grained motion descriptions of activities.
We evaluate several video-text models on the task of retrieval of motion descriptions.
arXiv Detail & Related papers (2024-06-07T16:46:10Z) - Generating Human Interaction Motions in Scenes with Text Control [66.74298145999909]
We present TeSMo, a method for text-controlled scene-aware motion generation based on denoising diffusion models.
Our approach begins with pre-training a scene-agnostic text-to-motion diffusion model.
To facilitate training, we embed annotated navigation and interaction motions within scenes.
arXiv Detail & Related papers (2024-04-16T16:04:38Z) - Decoupling Static and Hierarchical Motion Perception for Referring Video Segmentation [32.11635464720755]
We propose to decouple video-level referring expression understanding into static and motion perception.
We employ contrastive learning to distinguish the motions of visually similar objects.
These contributions yield state-of-the-art performance across five datasets.
arXiv Detail & Related papers (2024-04-04T17:58:21Z) - OmniVid: A Generative Framework for Universal Video Understanding [133.73878582161387]
We seek to unify the output space of video understanding tasks by using languages as labels and additionally introducing time and box tokens.
This enables us to address various types of video tasks, including classification, captioning, and localization.
We demonstrate such a simple and straightforward idea is quite effective and can achieve state-of-the-art or competitive results.
arXiv Detail & Related papers (2024-03-26T17:59:24Z) - Animate Your Motion: Turning Still Images into Dynamic Videos [58.63109848837741]
We introduce Scene and Motion Conditional Diffusion (SMCD), a novel methodology for managing multimodal inputs.
SMCD incorporates a recognized motion conditioning module and investigates various approaches to integrate scene conditions.
Our design significantly enhances video quality, motion precision, and semantic coherence.
arXiv Detail & Related papers (2024-03-15T10:36:24Z) - LivePhoto: Real Image Animation with Text-guided Motion Control [51.31418077586208]
This work presents a practical system, named LivePhoto, which allows users to animate an image of their interest with text descriptions.
We first establish a strong baseline that helps a well-learned text-to-image generator (i.e., Stable Diffusion) take an image as a further input.
We then equip the improved generator with a motion module for temporal modeling and propose a carefully designed training pipeline to better link texts and motions.
arXiv Detail & Related papers (2023-12-05T17:59:52Z) - MotionGPT: Human Motion as a Foreign Language [47.21648303282788]
Human motion displays a semantic coupling akin to human language, often perceived as a form of body language.
By fusing language data with large-scale motion models, motion-language pre-training can enhance the performance of motion-related tasks.
We propose MotionGPT, a unified, versatile, and user-friendly motion-language model to handle multiple motion-relevant tasks.
arXiv Detail & Related papers (2023-06-26T15:53:02Z) - OVC-Net: Object-Oriented Video Captioning with Temporal Graph and Detail
Enhancement [44.228748086927375]
We introduce the video-based object-oriented video captioning network (OVC)-Net via temporal graph and detail enhancement.
To demonstrate the effectiveness, we conduct experiments on the new dataset and compare it with the state-of-the-art video captioning methods.
arXiv Detail & Related papers (2020-03-08T04:34:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.