MOSE: A New Dataset for Video Object Segmentation in Complex Scenes
- URL: http://arxiv.org/abs/2302.01872v1
- Date: Fri, 3 Feb 2023 17:20:03 GMT
- Title: MOSE: A New Dataset for Video Object Segmentation in Complex Scenes
- Authors: Henghui Ding, Chang Liu, Shuting He, Xudong Jiang, Philip H.S. Torr,
Song Bai
- Abstract summary: Video object segmentation (VOS) aims at segmenting a particular object throughout the entire video clip sequence.
The state-of-the-art VOS methods have achieved excellent performance (e.g., 90+% J&F) on existing datasets.
We collect a new VOS dataset called coMplex video Object SEgmentation (MOSE) to study the tracking and segmenting objects in complex environments.
- Score: 106.64327718262764
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video object segmentation (VOS) aims at segmenting a particular object
throughout the entire video clip sequence. The state-of-the-art VOS methods
have achieved excellent performance (e.g., 90+% J&F) on existing datasets.
However, since the target objects in these existing datasets are usually
relatively salient, dominant, and isolated, VOS under complex scenes has rarely
been studied. To revisit VOS and make it more applicable in the real world, we
collect a new VOS dataset called coMplex video Object SEgmentation (MOSE) to
study the tracking and segmenting objects in complex environments. MOSE
contains 2,149 video clips and 5,200 objects from 36 categories, with 431,725
high-quality object segmentation masks. The most notable feature of MOSE
dataset is complex scenes with crowded and occluded objects. The target objects
in the videos are commonly occluded by others and disappear in some frames. To
analyze the proposed MOSE dataset, we benchmark 18 existing VOS methods under 4
different settings on the proposed MOSE dataset and conduct comprehensive
comparisons. The experiments show that current VOS algorithms cannot well
perceive objects in complex scenes. For example, under the semi-supervised VOS
setting, the highest J&F by existing state-of-the-art VOS methods is only 59.4%
on MOSE, much lower than their ~90% J&F performance on DAVIS. The results
reveal that although excellent performance has been achieved on existing
benchmarks, there are unresolved challenges under complex scenes and more
efforts are desired to explore these challenges in the future. The proposed
MOSE dataset has been released at https://henghuiding.github.io/MOSE.
Related papers
- Video Object Segmentation via SAM 2: The 4th Solution for LSVOS Challenge VOS Track [28.52754012142431]
Segment Anything Model 2 (SAM 2) is a foundation model towards solving promptable visual segmentation in images and videos.
SAM 2 builds a data engine, which improves model and data via user interaction, to collect the largest video segmentation dataset to date.
Without fine-tuning on the training set, SAM 2 achieved 75.79 J&F on the test set and ranked 4th place for 6th LSVOS Challenge VOS Track.
arXiv Detail & Related papers (2024-08-19T16:13:14Z) - 1st Place Solution for MOSE Track in CVPR 2024 PVUW Workshop: Complex Video Object Segmentation [72.54357831350762]
We propose a semantic embedding video object segmentation model and use the salient features of objects as query representations.
We trained our model on a large-scale video object segmentation dataset.
Our model achieves first place (textbf84.45%) in the test set of Complex Video Object Challenge.
arXiv Detail & Related papers (2024-06-07T03:13:46Z) - 3rd Place Solution for MOSE Track in CVPR 2024 PVUW workshop: Complex Video Object Segmentation [63.199793919573295]
Video Object (VOS) is a vital task in computer vision, focusing on distinguishing foreground objects from the background across video frames.
Our work draws inspiration from the Cutie model, and we investigate the effects of object memory, the total number of memory frames, and input resolution on segmentation performance.
arXiv Detail & Related papers (2024-06-06T00:56:25Z) - Learning Cross-Modal Affinity for Referring Video Object Segmentation
Targeting Limited Samples [61.66967790884943]
Referring video object segmentation (RVOS) relies on sufficient data for a given scene.
In more realistic scenarios, only minimal annotations are available for a new scene.
We propose a model with a newly designed cross-modal affinity (CMA) module based on a Transformer architecture.
CMA module builds multimodal affinity with a few samples, thus quickly learning new semantic information, and enabling the model to adapt to different scenarios.
arXiv Detail & Related papers (2023-09-05T08:34:23Z) - MeViS: A Large-scale Benchmark for Video Segmentation with Motion
Expressions [93.35942025232943]
We propose a large-scale dataset called MeViS, which contains numerous motion expressions to indicate target objects in complex environments.
The goal of our benchmark is to provide a platform that enables the development of effective language-guided video segmentation algorithms.
arXiv Detail & Related papers (2023-08-16T17:58:34Z) - Region Aware Video Object Segmentation with Deep Motion Modeling [56.95836951559529]
Region Aware Video Object (RAVOS) is a method that predicts regions of interest for efficient object segmentation and memory storage.
For efficient segmentation, object features are extracted according to the ROIs, and an object decoder is designed for object-level segmentation.
For efficient memory storage, we propose motion path memory to filter out redundant context by memorizing the features within the motion path of objects between two frames.
arXiv Detail & Related papers (2022-07-21T01:44:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.