PixFoundation 2.0: Do Video Multi-Modal LLMs Use Motion in Visual Grounding?
- URL: http://arxiv.org/abs/2509.02807v1
- Date: Tue, 02 Sep 2025 20:21:11 GMT
- Title: PixFoundation 2.0: Do Video Multi-Modal LLMs Use Motion in Visual Grounding?
- Authors: Mennatullah Siam
- Abstract summary: Multi-modal large language models (MLLMs) have shown impressive generalization across tasks using images and text modalities. We raise the question of whether motion is used in pixel-level visual grounding and whether video MLLMs can segment objects based on natural language expressions. We introduce four motion-centric probing techniques to study video MLLMs' ability to distinguish true motion from fake motion and to grasp motion order.
- Score: 9.059003409857775
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-modal large language models (MLLMs) have shown impressive generalization across tasks using images and text modalities. While their extension to video has enabled tasks such as video question answering and video captioning, their pixel-level visual grounding abilities are less studied. In this work, we raise the pertinent question of whether motion is used in pixel-level visual grounding and whether video MLLMs can segment objects based on natural language expressions describing their motion patterns. We identify shortcomings in current benchmarks, showing that a single frame can often suffice to capture the motion referring expression without any temporal reasoning. To address this, we introduce four motion-centric probing techniques, designed specifically for the visual grounding task, to study video MLLMs' ability to distinguish true motion from fake motion and to grasp motion order. Consequently, we provide a motion-centric benchmark, MoCentric-Bench, which evaluates whether video MLLMs leverage the interaction between motion and language rather than rely on the static appearance cues emphasized in existing visual grounding datasets. We further establish strong single-image baselines that are on par with or outperform prior methods. Finally, we explore simple motion-centric adaptation techniques that provide state-of-the-art performance on our MoCentric-Bench. Our motion-centric benchmark, evaluation, and findings challenge future models to improve dense spatiotemporal grounding and pixel-level understanding within videos. Code and datasets will be made publicly available at https://github.com/MSiam/PixFoundation-2.0.git.
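To make the probing idea concrete, the sketch below shows one plausible way such motion-centric probes could be built: given a clip, construct "fake motion" variants (reversed, frame-shuffled, and a static clip made from a single frame) and compare a grounding model's masks for the same motion referring expression across variants. This is an illustrative assumption, not the paper's actual four probing techniques; `segment_fn` is a hypothetical wrapper standing in for any video MLLM that returns per-frame binary masks.

```python
import numpy as np

def build_motion_probes(num_frames, rng=None):
    """Frame orderings for illustrative 'fake motion' probes."""
    rng = rng or np.random.default_rng(0)
    idx = np.arange(num_frames)
    shuffled = idx.copy()
    rng.shuffle(shuffled)
    return {
        "original": idx,                                 # true motion
        "reversed": idx[::-1],                           # motion order flipped
        "shuffled": shuffled,                            # temporal order destroyed
        "static": np.full(num_frames, num_frames // 2),  # no motion at all
    }

def mask_iou(a, b):
    """IoU between two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 1.0

def probe_motion_sensitivity(frames, expression, segment_fn):
    """Run the same motion referring expression on each probe variant.

    frames: list of HxWx3 uint8 arrays for one clip.
    segment_fn(frames, expression): hypothetical video-MLLM wrapper returning
        one boolean HxW mask per input frame.
    Returns the mean IoU of each probe's masks against the true-clip masks.
    """
    probes = build_motion_probes(len(frames))
    reference = segment_fn(frames, expression)  # masks on the true clip
    scores = {}
    for name, order in probes.items():
        clip = [frames[i] for i in order]
        preds = segment_fn(clip, expression)
        # compare each prediction with the reference mask of the same source frame
        ious = [mask_iou(p, reference[i]) for p, i in zip(preds, order)]
        scores[name] = float(np.mean(ious))
    return scores
```

Under this reading, a model whose masks barely change on the shuffled or static probes is likely grounding on static appearance rather than on the described motion.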
Related papers
- MeViS: A Multi-Modal Dataset for Referring Motion Expression Video Segmentation [126.77662882743168]
We introduce MeViS, a dataset containing 33,072 human-annotated motion expressions in both text and audio. We benchmark 15 existing methods across 4 tasks supported by MeViS. We propose LMPM++, an approach for RVOS/AVOS/RMOT that achieves new state-of-the-art results.
arXiv Detail & Related papers (2025-12-11T18:59:44Z)
- DisMo: Disentangled Motion Representations for Open-World Motion Transfer [21.557843791867906]
DisMo is a novel paradigm for learning abstract motion representations directly from raw video data. Our representation is generic and independent of static information such as appearance, object identity, or pose. We show that the learned representations are well-suited for downstream motion understanding tasks.
arXiv Detail & Related papers (2025-11-28T18:25:54Z)
- MotionSight: Boosting Fine-Grained Motion Understanding in Multimodal LLMs [32.761738388461595]
We introduce MotionSight, a novel zero-shot method pioneering object-centric visual spotlight and motion blur as visual prompts to improve fine-grained motion understanding without training. We curated MotionVid-QA, the first large-scale dataset for fine-grained video motion understanding, with hierarchical annotations including SFT and preference data, Θ(40K) video clips and Θ(87K) QAs. Experiments show MotionSight achieves state-of-the-art open-source performance and competitiveness with commercial models.
arXiv Detail & Related papers (2025-06-02T13:44:56Z)
- Towards Understanding Camera Motions in Any Video [89.97247162415158]
We introduce CameraBench, a large-scale dataset and benchmark designed to assess and improve camera motion understanding. CameraBench consists of 3,000 diverse internet videos annotated by experts through a rigorous quality control process. One of our contributions is a taxonomy of camera motion primitives, designed in collaboration with cinematographers.
arXiv Detail & Related papers (2025-04-21T18:34:57Z)
- SMILE: Infusing Spatial and Motion Semantics in Masked Video Learning [50.98341607245458]
Masked video modeling is an effective paradigm for video self-supervised learning (SSL). This paper introduces a novel SSL approach for video representation learning, dubbed SMILE, by infusing both spatial and motion semantics. We establish a new self-supervised video learning paradigm capable of learning strong video representations without requiring any natural video data.
arXiv Detail & Related papers (2025-04-01T08:20:55Z)
- Segment Any Motion in Videos [80.72424676419755]
We propose a novel approach for moving object segmentation that combines long-range trajectory motion cues with DINO-based semantic features. Our model employs Spatio-Temporal Trajectory Attention and Motion-Semantic Decoupled Embedding to prioritize motion while integrating semantic support.
arXiv Detail & Related papers (2025-03-28T09:34:11Z)
- Motion Prompting: Controlling Video Generation with Motion Trajectories [57.049252242807874]
We train a video generation model conditioned on sparse or dense video trajectories. We translate high-level user requests into detailed, semi-dense motion prompts. We demonstrate our approach through various applications, including camera and object motion control, "interacting" with an image, motion transfer, and image editing.
arXiv Detail & Related papers (2024-12-03T18:59:56Z)
- LocoMotion: Learning Motion-Focused Video-Language Representations [45.33444862034461]
We propose LocoMotion to learn from motion-focused captions that describe the movement and temporal progression of local object motions.
We achieve this by adding synthetic motions to videos and using the parameters of these motions to generate corresponding captions (a minimal sketch of this idea appears at the end of this list).
arXiv Detail & Related papers (2024-10-15T19:33:57Z)
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling video dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z)
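As a concrete illustration of the LocoMotion-style recipe referenced in the list above (synthetic motion plus a caption derived from the motion parameters), the sketch below pastes an object crop onto background frames along a parameterized linear trajectory and templates a caption from those parameters. This is a hypothetical minimal example, not LocoMotion's actual pipeline; the function names are made up for illustration.

```python
import numpy as np

def synthesize_linear_motion(background, crop, start_xy, velocity_xy, num_frames):
    """Paste `crop` onto copies of `background` along a linear trajectory.

    background: HxWx3 uint8 array, crop: hxwx3 uint8 array.
    start_xy, velocity_xy: (x, y) position in pixels and velocity in pixels/frame.
    Returns (frames, params), where params are the motion parameters used.
    """
    H, W, _ = background.shape
    h, w, _ = crop.shape
    frames = []
    for t in range(num_frames):
        x = int(round(start_xy[0] + velocity_xy[0] * t))
        y = int(round(start_xy[1] + velocity_xy[1] * t))
        x = max(0, min(W - w, x))  # clamp so the crop stays inside the frame
        y = max(0, min(H - h, y))
        frame = background.copy()
        frame[y:y + h, x:x + w] = crop
        frames.append(frame)
    return frames, {"velocity": velocity_xy, "num_frames": num_frames}

def caption_from_params(object_name, params):
    """Turn the synthetic motion parameters into a templated motion caption."""
    vx, vy = params["velocity"]
    parts = []
    if vx:
        parts.append("to the right" if vx > 0 else "to the left")
    if vy:
        parts.append("downwards" if vy > 0 else "upwards")
    if not parts:
        return f"the {object_name} stays still"
    speed = "quickly" if max(abs(vx), abs(vy)) > 5 else "slowly"
    return f"the {object_name} moves {speed} " + " and ".join(parts)
```

For example, `caption_from_params("ball", {"velocity": (8, 0), "num_frames": 16})` yields "the ball moves quickly to the right", pairing the synthesized clip with a motion-focused caption for training.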