MeViS: A Multi-Modal Dataset for Referring Motion Expression Video Segmentation
- URL: http://arxiv.org/abs/2512.10945v1
- Date: Thu, 11 Dec 2025 18:59:44 GMT
- Title: MeViS: A Multi-Modal Dataset for Referring Motion Expression Video Segmentation
- Authors: Henghui Ding, Chang Liu, Shuting He, Kaining Ying, Xudong Jiang, Chen Change Loy, Yu-Gang Jiang
- Abstract summary: We introduce MeViS, a dataset containing 33,072 human-annotated motion expressions in both text and audio. We benchmark 15 existing methods across 4 tasks supported by MeViS. We propose an approach LMPM++ for RVOS/AVOS/RMOT that achieves new state-of-the-art results.
- Score: 126.77662882743168
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes a large-scale multi-modal dataset for referring motion expression video segmentation, focusing on segmenting and tracking target objects in videos based on language descriptions of the objects' motions. Existing referring video segmentation datasets often focus on salient objects and use language expressions rich in static attributes, potentially allowing the target object to be identified in a single frame. Such datasets underemphasize the role of motion in both videos and language. To explore the feasibility of using motion expressions and motion reasoning clues for pixel-level video understanding, we introduce MeViS, a dataset containing 33,072 human-annotated motion expressions in both text and audio, covering 8,171 objects in 2,006 videos of complex scenarios. We benchmark 15 existing methods across 4 tasks supported by MeViS: 6 referring video object segmentation (RVOS) methods, 3 audio-guided video object segmentation (AVOS) methods, 2 referring multi-object tracking (RMOT) methods, and 4 video captioning methods for the newly introduced referring motion expression generation (RMEG) task. The results demonstrate the weaknesses and limitations of existing methods in addressing motion expression-guided video understanding. We further analyze the challenges and propose an approach, LMPM++, for RVOS/AVOS/RMOT that achieves new state-of-the-art results. Our dataset provides a platform that facilitates the development of motion expression-guided video understanding algorithms in complex video scenes. The proposed MeViS dataset and the method's source code are publicly available at https://henghuiding.com/MeViS/
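To make the dataset's organization concrete, below is a minimal sketch of how a MeViS-style annotation file might be traversed to produce RVOS samples. The file layout and every field name here (videos, expressions, exp, obj_id) are illustrative assumptions, not the official schema; the released format is documented at https://henghuiding.com/MeViS/.

```python
import json

# Hypothetical MeViS-style annotation record. All field names are
# assumptions for illustration; consult the official release for the
# real schema.
EXAMPLE = {
    "videos": {
        "video_0001": {
            "frames": ["00000", "00001", "00002"],
            "expressions": {
                "0": {"exp": "the bird flying away", "obj_id": [3]},
                "1": {"exp": "two pandas rolling over each other", "obj_id": [1, 2]},
            },
        }
    }
}

def iter_rvos_samples(anno_path):
    """Yield one (video_id, expression, object_ids) sample per expression.

    obj_id is kept as a list because a motion expression may refer to
    more than one target object.
    """
    with open(anno_path) as f:
        meta = json.load(f)
    for vid, video in meta["videos"].items():
        for exp in video["expressions"].values():
            yield vid, exp["exp"], exp["obj_id"]
```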
Related papers
- PixFoundation 2.0: Do Video Multi-Modal LLMs Use Motion in Visual Grounding? [9.059003409857775]
Multi-modal large language models (MLLMs) have shown impressive generalization across tasks using images and text modalities. We raise the question of whether motion is used in pixel-level visual grounding and whether video MLLMs can segment objects based on natural language expressions. We introduce four motion-centric probing techniques to study video MLLMs' ability to distinguish true motion from fake motion and to grasp motion order.
arXiv Detail & Related papers (2025-09-02T20:21:11Z)
- VoCap: Video Object Captioning and Segmentation from Any Prompt [78.90048335805047]
VoCap is a flexible model that consumes a video and a prompt of various modalities. It addresses promptable video object segmentation, referring expression segmentation, and object captioning. The model yields state-of-the-art results on referring expression video object segmentation.
arXiv Detail & Related papers (2025-08-29T17:43:58Z)
- MOVE: Motion-Guided Few-Shot Video Object Segmentation [25.624419551994354]
This work addresses motion-guided few-shot video object segmentation (FSVOS), which aims to segment dynamic objects in videos based on a few annotated examples with the same motion patterns. We introduce MOVE, a large-scale dataset specifically designed for motion-guided FSVOS.
arXiv Detail & Related papers (2025-07-29T17:59:35Z)
- 4th PVUW MeViS 3rd Place Report: Sa2VA [105.88675577642204]
We show that with a simple modification to the test-time inference method on stronger MLLMs, we can achieve stronger results on MeViS. In particular, we adopt the recent method Sa2VA, a unified model for dense grounded understanding of both images and videos.
arXiv Detail & Related papers (2025-04-01T07:06:47Z)
- Segment Any Motion in Videos [80.72424676419755]
We propose a novel approach for moving object segmentation that combines long-range trajectory motion cues with DINO-based semantic features. Our model employs Spatio-Temporal Trajectory Attention and Motion-Semantic Decoupled Embedding to prioritize motion while integrating semantic support.
arXiv Detail & Related papers (2025-03-28T09:34:11Z)
- 2nd Place Solution for MeViS Track in CVPR 2024 PVUW Workshop: Motion Expression guided Video Segmentation [8.20168024462357]
Motion Expression guided Video Segmentation is a challenging task that aims at segmenting objects in the video based on natural language expressions with motion descriptions.
We introduce mask information obtained from a video instance segmentation model as preliminary information for temporal enhancement and employ SAM for spatial refinement.
Our method achieved a score of 49.92 J&F in the validation phase and 54.20 J&F in the test phase, securing the final ranking of 2nd in the MeViS Track at the CVPR 2024 PVUW Challenge (a sketch of the J&F metric follows the related papers list below).
arXiv Detail & Related papers (2024-06-20T02:16:23Z)
- 3rd Place Solution for MeViS Track in CVPR 2024 PVUW workshop: Motion Expression guided Video Segmentation [13.622700558266658]
We propose using frozen pre-trained vision-language models (VLMs) as backbones, with a specific emphasis on enhancing cross-modal feature interaction.
First, we use a frozen convolutional CLIP backbone to generate feature-aligned vision and text features, alleviating the issue of domain gap.
Second, we add more cross-modal feature fusion in the pipeline to enhance the utilization of multi-modal information (a sketch of this setup follows the related papers list below).
arXiv Detail & Related papers (2024-06-07T11:15:03Z)
- MeViS: A Large-scale Benchmark for Video Segmentation with Motion Expressions [93.35942025232943]
We propose a large-scale dataset called MeViS, which contains numerous motion expressions to indicate target objects in complex environments.
The goal of our benchmark is to provide a platform that enables the development of effective language-guided video segmentation algorithms.
arXiv Detail & Related papers (2023-08-16T17:58:34Z)
- Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation [56.41614987789537]
Text-based video segmentation aims to segment the target object in a video based on a describing sentence.
We propose a method to fuse and align appearance, motion, and linguistic features to achieve accurate segmentation.
arXiv Detail & Related papers (2022-04-06T02:42:33Z)
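The J&F scores reported by the challenge entries above are the standard video object segmentation metric: the mean of region similarity J (mask IoU) and contour accuracy F (a boundary F-measure), averaged over frames. A minimal sketch of this metric, assuming binary numpy masks and a simplified pixel-tolerance boundary matching in place of the exact DAVIS implementation:

```python
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def region_similarity(pred, gt):
    """J: intersection-over-union between predicted and ground-truth masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 1.0

def boundary_f_measure(pred, gt, tol=2):
    """F: F-score between mask boundaries, with matches counted within a
    small pixel tolerance (a simplified stand-in for the DAVIS contour
    accuracy implementation)."""
    def boundary(mask):
        mask = np.asarray(mask, dtype=bool)
        return mask & ~binary_erosion(mask)  # 1-pixel-wide boundary
    pb, gb = boundary(pred), boundary(gt)
    precision = (pb & binary_dilation(gb, iterations=tol)).sum() / max(pb.sum(), 1)
    recall = (gb & binary_dilation(pb, iterations=tol)).sum() / max(gb.sum(), 1)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def j_and_f(pred_masks, gt_masks):
    """Mean of J and F, averaged over all frames of a video."""
    j = np.mean([region_similarity(p, g) for p, g in zip(pred_masks, gt_masks)])
    f = np.mean([boundary_f_measure(p, g) for p, g in zip(pred_masks, gt_masks)])
    return 100 * (j + f) / 2  # reported on a 0-100 scale, e.g. 49.92
```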
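The 3rd-place solution above pairs a frozen convolutional CLIP backbone with extra cross-modal feature fusion. The snippet below is a rough sketch only: it uses a standard ViT CLIP checkpoint from HuggingFace transformers as a stand-in for the convolutional backbone, and a single generic cross-attention layer rather than the authors' actual fusion design.

```python
import torch
import torch.nn as nn
from transformers import CLIPModel, CLIPTokenizer

# Standard ViT CLIP as a stand-in; the paper uses a convolutional CLIP.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
for p in clip.parameters():
    p.requires_grad_(False)  # the VLM backbone stays frozen

class CrossModalFusion(nn.Module):
    """Generic fusion step: vision tokens attend to text tokens."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vis, txt):
        fused, _ = self.attn(query=vis, key=txt, value=txt)
        return self.norm(vis + fused)  # residual keeps the frozen features

# Dummy frame stands in for a real video frame.
pixel_values = torch.randn(1, 3, 224, 224)
tokens = tokenizer(["the cat walking to the left"], return_tensors="pt", padding=True)

with torch.no_grad():
    # Project both token streams into CLIP's shared embedding space so
    # the vision and text features are dimension-aligned for fusion.
    vis = clip.visual_projection(clip.vision_model(pixel_values=pixel_values).last_hidden_state)
    txt = clip.text_projection(clip.text_model(**tokens).last_hidden_state)

fusion = CrossModalFusion(dim=clip.config.projection_dim)
vis_fused = fusion(vis, txt)  # language-conditioned vision tokens
```

Because the backbone is frozen, only the small fusion module is trained, which is why starting from feature-aligned vision and text spaces matters.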