LSVOS Challenge 3rd Place Report: SAM2 and Cutie based VOS
- URL: http://arxiv.org/abs/2408.10469v2
- Date: Wed, 21 Aug 2024 00:39:38 GMT
- Title: LSVOS Challenge 3rd Place Report: SAM2 and Cutie based VOS
- Authors: Xinyu Liu, Jing Zhang, Kexin Zhang, Xu Liu, Lingling Li,
- Abstract summary: We combine the strengths of the state-of-the-art (SOTA) models SAM2 and Cutie to address these challenges.
Our approach achieves a J&F score of 0.7952 in the testing phase of LSVOS challenge VOS track, ranking third overall.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video Object Segmentation (VOS) presents several challenges, including object occlusion and fragmentation, the disappearance and re-appearance of objects, and tracking specific objects within crowded scenes. In this work, we combine the strengths of the state-of-the-art (SOTA) models SAM2 and Cutie to address these challenges. Additionally, we explore the impact of various hyperparameters on video instance segmentation performance. Our approach achieves a J&F score of 0.7952 in the testing phase of the LSVOS challenge VOS track, ranking third overall.
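The J&F score reported above is the standard VOS evaluation metric: J is region similarity (mask intersection-over-union) and F is contour accuracy (a boundary F-measure), averaged together. The following is a minimal pure-Python sketch of that computation; note that the official DAVIS/LSVOS toolkit matches boundaries with a small pixel tolerance (via dilation), which this simplified version omits.

```python
# Hedged sketch of the J&F metric: J = region IoU, F = boundary F-measure.
# Masks are lists of lists of booleans. Unlike the official evaluation
# toolkit, boundaries here are matched exactly (no pixel tolerance).

def region_similarity(pred, gt):
    """J: intersection-over-union of two boolean masks."""
    inter = sum(p and g for rp, rg in zip(pred, gt) for p, g in zip(rp, rg))
    union = sum(p or g for rp, rg in zip(pred, gt) for p, g in zip(rp, rg))
    return inter / union if union else 1.0

def boundary(mask):
    """Foreground pixels that touch the background or the image border."""
    h, w = len(mask), len(mask[0])
    edge = set()
    for y in range(h):
        for x in range(w):
            if not mask[y][x]:
                continue
            if (y in (0, h - 1) or x in (0, w - 1)
                    or not (mask[y - 1][x] and mask[y + 1][x]
                            and mask[y][x - 1] and mask[y][x + 1])):
                edge.add((y, x))
    return edge

def contour_accuracy(pred, gt):
    """F: F-measure between predicted and ground-truth boundary pixels."""
    bp, bg = boundary(pred), boundary(gt)
    if not bp or not bg:
        return float(bp == bg)
    precision = len(bp & bg) / len(bp)
    recall = len(bp & bg) / len(bg)
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

def j_and_f(pred, gt):
    """Challenge score for one mask pair: the mean of J and F.
    In the benchmark this is further averaged over frames and objects."""
    return 0.5 * (region_similarity(pred, gt) + contour_accuracy(pred, gt))
```

For identical masks both terms are 1.0, so `j_and_f` returns 1.0; a score of 0.7952 thus means predicted masks overlap and outline the ground truth well but not perfectly, averaged over the test set.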
Related papers
- MOSEv2: A More Challenging Dataset for Video Object Segmentation in Complex Scenes [137.1500445443403]
Video object segmentation (VOS) aims to segment specified target objects throughout a video. To advance VOS toward more realistic environments, coMplex video Object SEgmentation (MOSEv1) was introduced. We present MOSEv2, a significantly more challenging dataset designed to further advance VOS methods under real-world conditions.
arXiv Detail & Related papers (2025-08-07T17:59:27Z) - MOVE: Motion-Guided Few-Shot Video Object Segmentation [25.624419551994354]
This work addresses motion-guided few-shot video object segmentation (FSVOS). It aims to segment dynamic objects in videos based on a few annotated examples with the same motion patterns. We introduce MOVE, a large-scale dataset specifically designed for motion-guided FSVOS.
arXiv Detail & Related papers (2025-07-29T17:59:35Z) - FVOS for MOSE Track of 4th PVUW Challenge: 3rd Place Solution [2.9149767401557574]
Video Object Segmentation (VOS) is one of the most fundamental and challenging tasks in computer vision.
This paper aims to achieve accurate segmentation of video objects in challenging scenes.
arXiv Detail & Related papers (2025-04-13T10:14:19Z) - LSVOS Challenge Report: Large-scale Complex and Long Video Object Segmentation [124.50550604020684]
This paper introduces the 6th Large-scale Video Object Segmentation (LSVOS) challenge, held in conjunction with the ECCV 2024 workshop.
This year's challenge includes two tasks: Video Object Segmentation (VOS) and Referring Video Object Segmentation (RVOS).
This year's challenge attracted 129 registered teams from more than 20 institutes in over 8 countries.
arXiv Detail & Related papers (2024-09-09T17:45:45Z) - Discriminative Spatial-Semantic VOS Solution: 1st Place Solution for 6th LSVOS [68.47681139026666]
Video object segmentation (VOS) is a crucial task in computer vision.
Current VOS methods struggle with complex scenes and prolonged object motions.
This report introduces a discriminative spatial-temporal VOS model.
arXiv Detail & Related papers (2024-08-29T10:47:17Z) - Video Object Segmentation via SAM 2: The 4th Solution for LSVOS Challenge VOS Track [28.52754012142431]
Segment Anything Model 2 (SAM 2) is a foundation model towards solving promptable visual segmentation in images and videos.
SAM 2 builds a data engine, which improves model and data via user interaction, to collect the largest video segmentation dataset to date.
Without fine-tuning on the training set, SAM 2 achieved 75.79 J&F on the test set and ranked 4th place for 6th LSVOS Challenge VOS Track.
arXiv Detail & Related papers (2024-08-19T16:13:14Z) - 3D-Aware Instance Segmentation and Tracking in Egocentric Videos [107.10661490652822]
Egocentric videos present unique challenges for 3D scene understanding.
This paper introduces a novel approach to instance segmentation and tracking in first-person video.
By incorporating spatial and temporal cues, we achieve superior performance compared to state-of-the-art 2D approaches.
arXiv Detail & Related papers (2024-08-19T10:08:25Z) - 3rd Place Solution for MOSE Track in CVPR 2024 PVUW workshop: Complex Video Object Segmentation [63.199793919573295]
Video Object Segmentation (VOS) is a vital task in computer vision, focusing on distinguishing foreground objects from the background across video frames.
Our work draws inspiration from the Cutie model, and we investigate the effects of object memory, the total number of memory frames, and input resolution on segmentation performance.
arXiv Detail & Related papers (2024-06-06T00:56:25Z) - Tracking through Containers and Occluders in the Wild [32.86030395660071]
We introduce TCOW, a new benchmark and model for visual tracking through heavy occlusion and containment.
We create a mixture of synthetic and annotated real datasets to support both supervised learning and structured evaluation of model performance.
We evaluate two recent transformer-based video models and find that while they can be surprisingly capable of tracking targets under certain settings of task variation, there remains a considerable performance gap before we can claim a tracking model to have acquired a true notion of object permanence.
arXiv Detail & Related papers (2023-05-04T17:59:58Z) - MOSE: A New Dataset for Video Object Segmentation in Complex Scenes [106.64327718262764]
Video object segmentation (VOS) aims at segmenting a particular object throughout the entire video clip sequence.
The state-of-the-art VOS methods have achieved excellent performance (e.g., 90+% J&F) on existing datasets.
We collect a new VOS dataset called coMplex video Object SEgmentation (MOSE) to study the tracking and segmenting objects in complex environments.
arXiv Detail & Related papers (2023-02-03T17:20:03Z) - Scalable Video Object Segmentation with Identification Mechanism [125.4229430216776]
This paper explores the challenges of achieving scalable and effective multi-object modeling for semi-supervised Video Object Segmentation (VOS).
We present two innovative approaches, Associating Objects with Transformers (AOT) and Associating Objects with Scalable Transformers (AOST).
Our approaches surpass the state-of-the-art competitors and display exceptional efficiency and scalability consistently across all six benchmarks.
arXiv Detail & Related papers (2022-03-22T03:33:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.