Diving Deep into the Motion Representation of Video-Text Models
- URL: http://arxiv.org/abs/2406.05075v1
- Date: Fri, 7 Jun 2024 16:46:10 GMT
- Title: Diving Deep into the Motion Representation of Video-Text Models
- Authors: Chinmaya Devaraj, Cornelia Fermuller, Yiannis Aloimonos
- Abstract summary: GPT-4-generated motion descriptions capture the fine-grained motion of activities.
We evaluate several video-text models on the retrieval of these motion descriptions.
- Score: 12.197093960700187
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Videos are more informative than images because they capture the dynamics of the scene. By representing motion in videos, we can capture dynamic activities. In this work, we introduce GPT-4 generated motion descriptions that capture fine-grained motion descriptions of activities and apply them to three action datasets. We evaluated several video-text models on the task of retrieval of motion descriptions. We found that they fall far behind human expert performance on two action datasets, raising the question of whether video-text models understand motion in videos. To address it, we introduce a method of improving motion understanding in video-text models by utilizing motion descriptions. This method proves to be effective on two action datasets for the motion description retrieval task. The results draw attention to the need for quality captions involving fine-grained motion information in existing datasets and demonstrate the effectiveness of the proposed pipeline in understanding fine-grained motion during video-text retrieval.
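The retrieval task mentioned in the abstract can be illustrated with a small sketch. This is not the authors' pipeline: it assumes each video and each motion description has already been embedded into a shared vector space by some video-text model, and it scores retrieval with recall@k over cosine similarity. The function names and the toy embeddings are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recall_at_k(video_embs, text_embs, k):
    """Fraction of videos whose own description (same index) ranks in the top k.

    Assumes video_embs[i] and text_embs[i] form a matching pair.
    """
    hits = 0
    for i, video in enumerate(video_embs):
        sims = [cosine(video, text) for text in text_embs]
        # Rank description indices from most to least similar to this video.
        ranked = sorted(range(len(text_embs)), key=lambda j: sims[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(video_embs)


# Toy embeddings (hypothetical): the first two descriptions are easily
# confused with each other, the third is clearly separated.
video_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
text_embs = [[0.6, 0.8, 0.0], [0.8, 0.2, 0.0], [0.0, 0.1, 0.9]]

print("recall@1:", recall_at_k(video_embs, text_embs, 1))
print("recall@2:", recall_at_k(video_embs, text_embs, 2))
```

In this toy setup only the third video retrieves its own description first, so recall@1 is low while recall@2 is perfect. A gap of this kind between small and large k is one way a retrieval metric can expose confusion between similar motion descriptions.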
Related papers
- MotionAgent: Fine-grained Controllable Video Generation via Motion Field Agent [58.09607975296408]
We propose MotionAgent, enabling fine-grained motion control for text-guided image-to-video generation.
The key technique is the motion field agent that converts motion information in text prompts into explicit motion fields.
We construct a subset of VBench to evaluate the alignment of motion information between the text and the generated video; on it, our method outperforms other advanced models in motion generation accuracy.
arXiv Detail & Related papers (2025-02-05T14:26:07Z)
- Move-in-2D: 2D-Conditioned Human Motion Generation [54.067588636155115]
We propose Move-in-2D, a novel approach to generate human motion sequences conditioned on a scene image.
Our approach accepts both a scene image and text prompt as inputs, producing a motion sequence tailored to the scene.
arXiv Detail & Related papers (2024-12-17T18:58:07Z)
- Motion Prompting: Controlling Video Generation with Motion Trajectories [57.049252242807874]
We train a video generation model conditioned on sparse or dense video trajectories.
We translate high-level user requests into detailed, semi-dense motion prompts.
We demonstrate our approach through various applications, including camera and object motion control, "interacting" with an image, motion transfer, and image editing.
arXiv Detail & Related papers (2024-12-03T18:59:56Z)
- LocoMotion: Learning Motion-Focused Video-Language Representations [45.33444862034461]
We propose LocoMotion to learn from motion-focused captions that describe the movement and temporal progression of local object motions.
We achieve this by adding synthetic motions to videos and using the parameters of these motions to generate corresponding captions.
arXiv Detail & Related papers (2024-10-15T19:33:57Z)
- MotionFix: Text-Driven 3D Human Motion Editing [52.11745508960547]
Key challenges include the scarcity of training data and the need to design a model that accurately edits the source motion.
We propose a methodology to semi-automatically collect a dataset of triplets comprising (i) a source motion, (ii) a target motion, and (iii) an edit text.
Access to this data allows us to train a conditional diffusion model, TMED, that takes both the source motion and the edit text as input.
arXiv Detail & Related papers (2024-08-01T16:58:50Z)
- Reenact Anything: Semantic Video Motion Transfer Using Motion-Textual Inversion [9.134743677331517]
We propose using a pre-trained image-to-video model to disentangle appearance from motion.
Our method, called motion-textual inversion, leverages our observation that image-to-video models extract appearance mainly from the (latent) image input.
By operating on an inflated motion-text embedding containing multiple text/image embedding tokens per frame, we achieve a high temporal motion granularity.
Our approach does not require spatial alignment between the motion reference video and target image, generalizes across various domains, and can be applied to various tasks.
arXiv Detail & Related papers (2024-08-01T10:55:20Z)
- MotionLLM: Understanding Human Behaviors from Human Motions and Videos [40.132643319573205]
This study delves into the realm of multi-modality (i.e., video and motion modalities) human behavior understanding.
We present MotionLLM, a framework for human motion understanding, captioning, and reasoning.
arXiv Detail & Related papers (2024-05-30T17:59:50Z)
- Motion Generation from Fine-grained Textual Descriptions [29.033358642532722]
We build a large-scale language-motion dataset specializing in fine-grained textual descriptions, FineHumanML3D.
We design a new text2motion model, FineMotionDiffuse, making full use of fine-grained textual information.
Our evaluation shows that FineMotionDiffuse trained on FineHumanML3D improves FID by a large margin of 0.38, compared with competitive baselines.
arXiv Detail & Related papers (2024-03-20T11:38:30Z)
- Animate Your Motion: Turning Still Images into Dynamic Videos [58.63109848837741]
We introduce Scene and Motion Conditional Diffusion (SMCD), a novel methodology for managing multimodal inputs.
SMCD incorporates a recognized motion conditioning module and investigates various approaches to integrate scene conditions.
Our design significantly enhances video quality, motion precision, and semantic coherence.
arXiv Detail & Related papers (2024-03-15T10:36:24Z)
- LivePhoto: Real Image Animation with Text-guided Motion Control [51.31418077586208]
This work presents a practical system, named LivePhoto, which allows users to animate an image of their interest with text descriptions.
We first establish a strong baseline that helps a well-learned text-to-image generator (i.e., Stable Diffusion) take an image as a further input.
We then equip the improved generator with a motion module for temporal modeling and propose a carefully designed training pipeline to better link texts and motions.
arXiv Detail & Related papers (2023-12-05T17:59:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.