Video Object Segmentation via SAM 2: The 4th Solution for LSVOS Challenge VOS Track
- URL: http://arxiv.org/abs/2408.10125v2
- Date: Sat, 24 Aug 2024 13:07:51 GMT
- Title: Video Object Segmentation via SAM 2: The 4th Solution for LSVOS Challenge VOS Track
- Authors: Feiyu Pan, Hao Fang, Runmin Cong, Wei Zhang, Xiankai Lu
- Abstract summary: Segment Anything Model 2 (SAM 2) is a foundation model towards solving promptable visual segmentation in images and videos.
SAM 2 builds a data engine, which improves model and data via user interaction, to collect the largest video segmentation dataset to date.
Without fine-tuning on the training set, SAM 2 achieved 75.79 J&F on the test set and ranked 4th in the 6th LSVOS Challenge VOS Track.
- Score: 28.52754012142431
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Video Object Segmentation (VOS) task aims to segment a particular object instance throughout an entire video sequence given only the object mask of the first frame. The recently proposed Segment Anything Model 2 (SAM 2) is a foundation model towards solving promptable visual segmentation in images and videos. SAM 2 builds a data engine, which improves model and data via user interaction, to collect the largest video segmentation dataset to date. SAM 2 is a simple transformer architecture with streaming memory for real-time video processing; trained on this data, it provides strong performance across a wide range of tasks. In this work, we evaluate the zero-shot performance of SAM 2 on the more challenging VOS datasets MOSE and LVOS. Without fine-tuning on the training set, SAM 2 achieved 75.79 J&F on the test set and ranked 4th in the 6th LSVOS Challenge VOS Track.
Related papers
- SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree [79.26409013413003]
We introduce SAM2Long, an improved training-free video object segmentation strategy.
It considers the segmentation uncertainty within each frame and chooses the video-level optimal results from multiple segmentation pathways.
SAM2Long achieves an average improvement of 3.0 points across all 24 head-to-head comparisons.
arXiv Detail & Related papers (2024-10-21T17:59:19Z) - When SAM2 Meets Video Camouflaged Object Segmentation: A Comprehensive Evaluation and Adaptation [36.174458990817165]
This study investigates the application and performance of the Segment Anything Model 2 (SAM2) in the challenging task of video camouflaged object segmentation (VCOS).
VCOS involves detecting objects in videos that blend seamlessly into their surroundings due to similar colors and textures, poor lighting conditions, etc.
arXiv Detail & Related papers (2024-09-27T11:35:50Z) - From SAM to SAM 2: Exploring Improvements in Meta's Segment Anything Model [0.5639904484784127]
The Segment Anything Model (SAM) was introduced to the computer vision community by Meta in April 2023.
SAM excels in zero-shot performance, segmenting unseen objects without additional training, stimulated by a large dataset of over one billion image masks.
SAM 2 expands this functionality to video, leveraging memory from preceding and subsequent frames to generate accurate segmentation across entire videos.
arXiv Detail & Related papers (2024-08-12T17:17:35Z) - SAM 2: Segment Anything in Images and Videos [63.44869623822368]
We present Segment Anything Model 2 (SAM 2), a foundation model towards solving promptable visual segmentation in images and videos.
We build a data engine, which improves model and data via user interaction, to collect the largest video segmentation dataset to date.
Our model is a simple transformer architecture with streaming memory for real-time video processing.
arXiv Detail & Related papers (2024-08-01T17:00:08Z) - FocSAM: Delving Deeply into Focused Objects in Segmenting Anything [58.042354516491024]
The Segment Anything Model (SAM) marks a notable milestone in segmentation models.
We propose FocSAM with a pipeline redesigned on two pivotal aspects.
First, we propose Dynamic Window Multi-head Self-Attention (Dwin-MSA) to dynamically refocus SAM's image embeddings on the target object.
Second, we propose Pixel-wise Dynamic ReLU (P-DyReLU) to enable sufficient integration of interactive information from a few initial clicks.
arXiv Detail & Related papers (2024-05-29T02:34:13Z) - Moving Object Segmentation: All You Need Is SAM (and Flow) [82.78026782967959]
We investigate two models for combining SAM with optical flow that harness the segmentation power of SAM with the ability of flow to discover and group moving objects.
In the first model, we adapt SAM to take optical flow, rather than RGB, as an input. In the second, SAM takes RGB as an input, and flow is used as a segmentation prompt.
These surprisingly simple methods, without any further modifications, outperform all previous approaches by a considerable margin in both single and multi-object benchmarks.
arXiv Detail & Related papers (2024-04-18T17:59:53Z) - 1st Place Solution for 5th LSVOS Challenge: Referring Video Object Segmentation [65.45702890457046]
We integrate strengths of leading RVOS models to build up an effective paradigm.
To improve the consistency and quality of masks, we propose a Two-Stage Multi-Model Fusion strategy.
Our method achieves 75.7% J&F on the Ref-Youtube-VOS validation set and 70% J&F on the test set, ranking 1st in the 5th Large-scale Video Object Challenge (ICCV 2023) track 3.
arXiv Detail & Related papers (2024-01-01T04:24:48Z) - MOSE: A New Dataset for Video Object Segmentation in Complex Scenes [106.64327718262764]
Video object segmentation (VOS) aims at segmenting a particular object throughout the entire video clip sequence.
The state-of-the-art VOS methods have achieved excellent performance (e.g., 90+% J&F) on existing datasets.
We collect a new VOS dataset called coMplex video Object SEgmentation (MOSE) to study tracking and segmenting objects in complex environments.
arXiv Detail & Related papers (2023-02-03T17:20:03Z) - The Second Place Solution for The 4th Large-scale Video Object Segmentation Challenge--Track 3: Referring Video Object Segmentation [18.630453674396534]
ReferFormer aims to segment object instances in a given video referred by a language expression in all video frames.
This work proposes several tricks to further boost performance, including cyclical learning rates, a semi-supervised approach, and test-time augmentation inference.
The improved ReferFormer ranks 2nd in the CVPR2022 Referring Youtube-VOS Challenge.
arXiv Detail & Related papers (2022-06-24T02:15:06Z)