MotionSight: Boosting Fine-Grained Motion Understanding in Multimodal LLMs
- URL: http://arxiv.org/abs/2506.01674v1
- Date: Mon, 02 Jun 2025 13:44:56 GMT
- Title: MotionSight: Boosting Fine-Grained Motion Understanding in Multimodal LLMs
- Authors: Yipeng Du, Tiehan Fan, Kepan Nan, Rui Xie, Penghao Zhou, Xiang Li, Jian Yang, Zhenheng Yang, Ying Tai,
- Abstract summary: We introduce MotionSight, a novel zero-shot method pioneering object-centric visual spotlight and motion blur as visual prompts to improve fine-grained motion understanding without training.<n>We curated MotionVid-QA, the first large-scale dataset for fine-grained video motion understanding, with hierarchical annotations including SFT and preference data, Theta(40K) video clips and Theta(87K) QAs. Experiments show MotionSight achieves state-of-the-art open-source performance and competitiveness with commercial models.
- Score: 32.761738388461595
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite advancements in Multimodal Large Language Models (MLLMs), their proficiency in fine-grained video motion understanding remains critically limited. They often lack inter-frame differencing and tend to average or ignore subtle visual cues. Furthermore, while visual prompting has shown potential in static images, its application to video's temporal complexities, particularly for fine-grained motion understanding, remains largely unexplored. We investigate whether inherent capability can be unlocked and boost MLLMs' motion perception and enable distinct visual signatures tailored to decouple object and camera motion cues. In this study, we introduce MotionSight, a novel zero-shot method pioneering object-centric visual spotlight and motion blur as visual prompts to effectively improve fine-grained motion understanding without training. To convert this into valuable data assets, we curated MotionVid-QA, the first large-scale dataset for fine-grained video motion understanding, with hierarchical annotations including SFT and preference data, {\Theta}(40K) video clips and {\Theta}(87K) QAs. Experiments show MotionSight achieves state-of-the-art open-source performance and competitiveness with commercial models. In particular, for fine-grained motion understanding we present a novel zero-shot technique and a large-scale, high-quality dataset. All the code and annotations will be publicly available.
Related papers
- SynMotion: Semantic-Visual Adaptation for Motion Customized Video Generation [56.90807453045657]
SynMotion is a motion-customized video generation model that jointly leverages semantic guidance and visual adaptation.<n>At the semantic level, we introduce the dual-em semantic comprehension mechanism which disentangles subject and motion representations.<n>At the visual level, we integrate efficient motion adapters into a pre-trained video generation model to enhance motion fidelity and temporal coherence.
arXiv Detail & Related papers (2025-06-30T10:09:32Z) - Segment Any Motion in Videos [80.72424676419755]
We propose a novel approach for moving object segmentation that combines long-range trajectory motion cues with DINO-based semantic features.<n>Our model employs Spatio-Temporal Trajectory Attention and Motion-Semantic Decoupled Embedding to prioritize motion while integrating semantic support.
arXiv Detail & Related papers (2025-03-28T09:34:11Z) - MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models [30.139277087078764]
MotionBench is an evaluation benchmark designed to assess the fine-grained motion comprehension of video understanding models.<n>It includes data collected from diverse sources, ensuring a broad representation of real-world video content.<n>Our benchmark aims to guide and motivate the development of more capable video understanding models.
arXiv Detail & Related papers (2025-01-06T11:57:38Z) - Motion Prompting: Controlling Video Generation with Motion Trajectories [57.049252242807874]
We train a video generation model conditioned on sparse or dense video trajectories.<n>We translate high-level user requests into detailed, semi-dense motion prompts.<n>We demonstrate our approach through various applications, including camera and object motion control, "interacting" with an image, motion transfer, and image editing.
arXiv Detail & Related papers (2024-12-03T18:59:56Z) - Scaling Large Motion Models with Million-Level Human Motions [67.40066387326141]
We present MotionLib, the first million-level dataset for motion generation.<n>We train a large motion model named projname, demonstrating robust performance across a wide range of human activities.
arXiv Detail & Related papers (2024-10-04T10:48:54Z) - MotionLLM: Understanding Human Behaviors from Human Motions and Videos [40.132643319573205]
This study delves into the realm of multi-modality (i.e., video and motion modalities) human behavior understanding.
We present MotionLLM, a framework for human motion understanding, captioning, and reasoning.
arXiv Detail & Related papers (2024-05-30T17:59:50Z) - Hawk: Learning to Understand Open-World Video Anomalies [76.9631436818573]
Video Anomaly Detection (VAD) systems can autonomously monitor and identify disturbances, reducing the need for manual labor and associated costs.
We introduce Hawk, a novel framework that leverages interactive large Visual Language Models (VLM) to interpret video anomalies precisely.
We have annotated over 8,000 anomaly videos with language descriptions, enabling effective training across diverse open-world scenarios, and also created 8,000 question-answering pairs for users' open-world questions.
arXiv Detail & Related papers (2024-05-27T07:08:58Z) - VMC: Video Motion Customization using Temporal Attention Adaption for
Text-to-Video Diffusion Models [58.93124686141781]
Video Motion Customization (VMC) is a novel one-shot tuning approach crafted to adapt temporal attention layers within video diffusion models.
Our approach introduces a novel motion distillation objective using residual vectors between consecutive frames as a motion reference.
We validate our method against state-of-the-art video generative models across diverse real-world motions and contexts.
arXiv Detail & Related papers (2023-12-01T06:50:11Z) - Fine-Grained Spatiotemporal Motion Alignment for Contrastive Video Representation Learning [16.094271750354835]
Motion information is critical to a robust and generalized video representation.
Recent works have adopted frame difference as the source of motion information in video contrastive learning.
We present a framework capable of introducing well-aligned and significant motion information.
arXiv Detail & Related papers (2023-09-01T07:03:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.