MOSEv2: A More Challenging Dataset for Video Object Segmentation in Complex Scenes
- URL: http://arxiv.org/abs/2508.05630v2
- Date: Mon, 22 Sep 2025 13:44:53 GMT
- Title: MOSEv2: A More Challenging Dataset for Video Object Segmentation in Complex Scenes
- Authors: Henghui Ding, Kaining Ying, Chang Liu, Shuting He, Xudong Jiang, Yu-Gang Jiang, Philip H. S. Torr, Song Bai
- Abstract summary: Video object segmentation (VOS) aims to segment specified target objects throughout a video. To bridge the gap between existing benchmarks and real-world scenes, the coMplex video Object SEgmentation dataset was introduced to facilitate VOS research in complex scenes. We present MOSEv2, a significantly more challenging dataset designed to further advance VOS methods under real-world conditions.
- Score: 131.45528437023643
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video object segmentation (VOS) aims to segment specified target objects throughout a video. Although state-of-the-art methods have achieved impressive performance (e.g., 90+% J&F) on benchmarks such as DAVIS and YouTube-VOS, these datasets primarily contain salient, dominant, and isolated objects, limiting their generalization to real-world scenarios. To bridge this gap, the coMplex video Object SEgmentation (MOSEv1) dataset was introduced to facilitate VOS research in complex scenes. Building on the foundations and insights of MOSEv1, we present MOSEv2, a significantly more challenging dataset designed to further advance VOS methods under real-world conditions. MOSEv2 consists of 5,024 videos and 701,976 high-quality masks for 10,074 objects across 200 categories. Compared to its predecessor, MOSEv2 introduces much greater scene complexity, including more frequent object disappearance and reappearance, severe occlusions and crowding, smaller objects, as well as a range of new challenges such as adverse weather (e.g., rain, snow, fog), low-light scenes (e.g., nighttime, underwater), multi-shot sequences, camouflaged objects, non-physical targets (e.g., shadows, reflections), and scenarios requiring external knowledge. We benchmark 20 representative VOS methods under 5 different settings and observe consistent performance drops on MOSEv2. For example, SAM2 drops from 76.4% on MOSEv1 to only 50.9% on MOSEv2. We further evaluate 9 video object tracking methods and observe similar declines, demonstrating that MOSEv2 poses challenges across tasks. These results highlight that despite strong performance on existing datasets, current VOS methods still fall short under real-world complexities. Based on our analysis of the observed challenges, we further propose several practical tricks that enhance model performance. MOSEv2 is publicly available at https://MOSE.video.
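The J&F numbers quoted throughout (e.g., SAM2's drop from 76.4% to 50.9%) average region similarity J (mask IoU) with a boundary F-measure. As a minimal illustration of the J half only, here is a plain-Python sketch; the mask layout and function names are illustrative, and real benchmarks use the official DAVIS/MOSE evaluation code, including the boundary term omitted here:

```python
# Sketch of region similarity J (Jaccard index / IoU), the first half of the
# J&F metric used by DAVIS/MOSE-style VOS benchmarks. The boundary F-measure
# half is omitted for brevity. Masks are flat 0/1 sequences per frame.

def jaccard(pred, gt):
    """IoU between two binary masks given as flat 0/1 sequences."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter / union if union else 1.0  # both masks empty: perfect match

def mean_j(pred_frames, gt_frames):
    """Average Jaccard over a video's frames for one object."""
    scores = [jaccard(p, g) for p, g in zip(pred_frames, gt_frames)]
    return sum(scores) / len(scores)

if __name__ == "__main__":
    pred = [[1, 1, 0, 0], [0, 1, 1, 0]]  # predicted masks, 2 frames
    gt   = [[1, 0, 0, 0], [0, 1, 1, 0]]  # ground-truth masks
    print(round(mean_j(pred, gt), 3))    # frame IoUs 0.5 and 1.0 -> 0.75
```

In the full metric, this per-object J is averaged with the per-object boundary F, then over all objects and videos.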
Related papers
- LSVOS 2025 Challenge Report: Recent Advances in Complex Video Object Segmentation [186.14566815158506]
This report presents an overview of the 7th Large-scale Video Object Segmentation (LSVOS) Challenge, held in conjunction with ICCV 2025. The 2025 edition features a newly introduced track, Complex VOS (MOSEv2). We summarize datasets and protocols, highlight top-performing solutions, and distill emerging trends.
arXiv Detail & Related papers (2025-10-13T07:02:09Z) - The 1st Solution for MOSEv1 Challenge on LSVOS 2025: CGFSeg [19.13013862040698]
Video Object Segmentation (VOS) aims to track and segment specific objects across entire video sequences. In this paper, we present our improved method, Confidence-Guided Fusion Segmentation (CGFSeg), for the VOS task in the MOSEv1 Challenge. Our method achieves a J&F score of 86.37% on the test set, ranking 1st in the MOSEv1 Challenge at LSVOS 2025.
arXiv Detail & Related papers (2025-09-30T03:50:56Z) - 2nd Place Report of MOSEv2 Challenge 2025: Concept Guided Video Object Segmentation via SeC [46.76209037655681]
Semi-supervised Video Object Segmentation aims to segment a specified target throughout a video sequence, given a first-frame mask. The SeC framework establishes a deep semantic understanding of the object for more persistent segmentation. SeC achieved 39.7 J&F on the test set and ranked 2nd place in the Complex VOS track of the 7th Large-scale Video Object Segmentation Challenge.
arXiv Detail & Related papers (2025-09-28T12:26:03Z) - The 1st Solution for MOSEv2 Challenge 2025: Long-term and Concept-aware Video Segmentation via SeC [59.53390730730018]
Our solution achieves a J&F score of 39.89% on the test set, ranking 1st in the MOSEv2 track of the LSVOS Challenge.
arXiv Detail & Related papers (2025-09-23T15:58:13Z) - SAMSON: 3rd Place Solution of LSVOS 2025 VOS Challenge [9.131199997701282]
The Large-scale Video Object Segmentation (LSVOS) challenge addresses the problem of accurately tracking and segmenting objects in long video sequences. Our method achieved a final performance of 0.8427 in terms of J&F on the test-set leaderboard.
arXiv Detail & Related papers (2025-09-22T08:30:34Z) - DivScene: Benchmarking LVLMs for Object Navigation with Diverse Scenes and Objects [84.73092715537364]
In this paper, we study a new task of navigating to diverse target objects in a large number of scene types.
We build an end-to-end embodied agent, NatVLM, by fine-tuning a Large Vision Language Model (LVLM) through imitation learning.
Our agent achieves a success rate that surpasses GPT-4o by over 20%.
arXiv Detail & Related papers (2024-10-03T17:49:28Z) - Discriminative Spatial-Semantic VOS Solution: 1st Place Solution for 6th LSVOS [68.47681139026666]
Video object segmentation (VOS) is a crucial task in computer vision.
Current VOS methods struggle with complex scenes and prolonged object motions.
This report introduces a discriminative spatial-temporal VOS model.
arXiv Detail & Related papers (2024-08-29T10:47:17Z) - LSVOS Challenge 3rd Place Report: SAM2 and Cutie based VOS [25.894649323139987]
We combine the strengths of the state-of-the-art (SOTA) models SAM2 and Cutie to address these challenges.
Our approach achieves a J&F score of 0.7952 in the testing phase of LSVOS challenge VOS track, ranking third overall.
arXiv Detail & Related papers (2024-08-20T00:45:13Z) - Video Object Segmentation via SAM 2: The 4th Solution for LSVOS Challenge VOS Track [28.52754012142431]
Segment Anything Model 2 (SAM 2) is a foundation model towards solving promptable visual segmentation in images and videos.
SAM 2 builds a data engine, which improves model and data via user interaction, to collect the largest video segmentation dataset to date.
Without fine-tuning on the training set, SAM 2 achieved 75.79 J&F on the test set and ranked 4th place for 6th LSVOS Challenge VOS Track.
arXiv Detail & Related papers (2024-08-19T16:13:14Z) - 1st Place Solution for MOSE Track in CVPR 2024 PVUW Workshop: Complex Video Object Segmentation [72.54357831350762]
We propose a semantic embedding video object segmentation model and use the salient features of objects as query representations.
We trained our model on a large-scale video object segmentation dataset.
Our model achieves first place (84.45%) on the test set of the Complex Video Object Segmentation Challenge.
arXiv Detail & Related papers (2024-06-07T03:13:46Z) - 3rd Place Solution for MOSE Track in CVPR 2024 PVUW workshop: Complex Video Object Segmentation [63.199793919573295]
Video Object Segmentation (VOS) is a vital task in computer vision, focusing on distinguishing foreground objects from the background across video frames.
Our work draws inspiration from the Cutie model, and we investigate the effects of object memory, the total number of memory frames, and input resolution on segmentation performance.
arXiv Detail & Related papers (2024-06-06T00:56:25Z) - MF-MOS: A Motion-Focused Model for Moving Object Segmentation [10.533968185642415]
Moving object segmentation (MOS) provides a reliable solution for detecting traffic participants.
Previous methods capture motion features from the range images directly.
We propose MF-MOS, a novel motion-focused model with a dual-branch structure for LiDAR moving object segmentation.
arXiv Detail & Related papers (2024-01-30T13:55:56Z) - MOSE: A New Dataset for Video Object Segmentation in Complex Scenes [106.64327718262764]
Video object segmentation (VOS) aims at segmenting a particular object throughout the entire video clip sequence.
The state-of-the-art VOS methods have achieved excellent performance (e.g., 90+% J&F) on existing datasets.
We collect a new VOS dataset called coMplex video Object SEgmentation (MOSE) to study the tracking and segmentation of objects in complex environments.
arXiv Detail & Related papers (2023-02-03T17:20:03Z) - Breaking the "Object" in Video Object Segmentation [36.20167854011788]
We present a dataset for Video Object Segmentation under Transformations (VOST).
It consists of more than 700 high-resolution videos, captured in diverse environments, which are 21 seconds long on average and densely labeled with instance masks.
A careful, multi-step approach is adopted to ensure that these videos focus on complex object transformations, capturing their full temporal extent.
We show that existing methods struggle when applied to this novel task and that their main limitation lies in over-reliance on static appearance cues.
arXiv Detail & Related papers (2022-12-12T19:22:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.