Scalable Video Object Segmentation with Simplified Framework
- URL: http://arxiv.org/abs/2308.09903v1
- Date: Sat, 19 Aug 2023 04:30:48 GMT
- Title: Scalable Video Object Segmentation with Simplified Framework
- Authors: Qiangqiang Wu and Tianyu Yang and Wei Wu and Antoni Chan
- Abstract summary: This paper presents a scalable Simplified VOS (SimVOS) framework to perform joint feature extraction and matching.
SimVOS employs a scalable ViT backbone for simultaneous feature extraction and matching between query and reference features.
Experimentally, our SimVOS achieves state-of-the-art results on popular video object segmentation benchmarks.
- Score: 21.408446548059956
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The current popular methods for video object segmentation (VOS) implement
feature matching through several hand-crafted modules that separately perform
feature extraction and matching. However, these hand-crafted designs
empirically cause insufficient target interaction, which limits dynamic
target-aware feature learning in VOS. To tackle these limitations, this paper
presents a scalable Simplified VOS (SimVOS) framework to perform joint feature
extraction and matching by leveraging a single transformer backbone.
Specifically, SimVOS employs a scalable ViT backbone for simultaneous feature
extraction and matching between query and reference features. This design
enables SimVOS to learn better target-aware features for accurate mask
prediction. More importantly, SimVOS can directly apply well-pretrained ViT
backbones (e.g., MAE) for VOS, which bridges the gap between VOS and
large-scale self-supervised pre-training. To achieve a better performance-speed
trade-off, we further explore within-frame attention and propose a new token
refinement module to improve the running speed and save computational cost.
Experimentally, our SimVOS achieves state-of-the-art results on popular video
object segmentation benchmarks, i.e., DAVIS-2017 (88.0% J&F), DAVIS-2016 (92.9%
J&F) and YouTube-VOS 2019 (84.2% J&F), without applying any synthetic video or
BL30K pre-training used in previous VOS approaches.
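The core design in the abstract, a single ViT jointly processing query and reference tokens so that attention performs feature extraction and matching at once, can be pictured compactly in code. Below is a minimal PyTorch sketch, not the authors' implementation: the module name, dimensions, depth, and the toy per-patch mask head are illustrative assumptions, and only the token-concatenation idea follows the abstract.

```python
import torch
import torch.nn as nn

class JointViTMatcher(nn.Module):
    """Sketch: one ViT backbone for joint feature extraction and matching."""
    def __init__(self, img_size=224, patch=16, dim=384, depth=4, heads=6):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        n_tokens = (img_size // patch) ** 2
        # Separate positional embeddings tell query and reference tokens apart.
        self.pos_q = nn.Parameter(torch.zeros(1, n_tokens, dim))
        self.pos_r = nn.Parameter(torch.zeros(1, n_tokens, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        # Toy head: one mask logit per pixel of each query patch.
        # (A real VOS model would also encode the reference-frame mask and use
        # a finer decoder; both are omitted here.)
        self.mask_head = nn.Linear(dim, patch * patch)

    def forward(self, query_img, ref_img):
        tok_q = self.patch_embed(query_img).flatten(2).transpose(1, 2) + self.pos_q
        tok_r = self.patch_embed(ref_img).flatten(2).transpose(1, 2) + self.pos_r
        # Joint attention over the concatenated sequence: every layer both
        # extracts features and matches query tokens against reference tokens,
        # replacing separate hand-crafted extraction and matching modules.
        x = self.blocks(torch.cat([tok_q, tok_r], dim=1))
        q_feat = x[:, : tok_q.shape[1]]   # target-aware query features
        return self.mask_head(q_feat)     # (B, n_patches, patch*patch) logits

model = JointViTMatcher()
logits = model(torch.randn(1, 3, 224, 224), torch.randn(1, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 196, 256])
```

The within-frame attention and token refinement module described in the abstract would restrict or prune this joint attention to reduce cost; both are omitted from the sketch for brevity.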
Related papers
- OneVOS: Unifying Video Object Segmentation with All-in-One Transformer
Framework [24.947436083365925]
OneVOS is a novel framework that unifies the core components of VOS with an All-in-One Transformer.
OneVOS achieves state-of-the-art performance across 7 datasets, particularly excelling in the complex LVOS and MOSE datasets with 70.1% and 66.4% J&F, surpassing previous state-of-the-art methods by 4.2% and 7.0%, respectively.
arXiv Detail & Related papers (2024-03-13T16:38:26Z) - 1st Place Solution for 5th LSVOS Challenge: Referring Video Object
Segmentation [65.45702890457046]
We integrate the strengths of leading RVOS models to build an effective paradigm.
To improve the consistency and quality of masks, we propose a Two-Stage Multi-Model Fusion strategy.
Our method achieves 75.7% J&F on the Ref-Youtube-VOS validation set and 70% J&F on the test set, ranking 1st in track 3 of the 5th Large-scale Video Object Segmentation Challenge (ICCV 2023).
arXiv Detail & Related papers (2024-01-01T04:24:48Z) - DVIS++: Improved Decoupled Framework for Universal Video Segmentation [30.703276476607545]
By integrating CLIP with DVIS++, we present OV-DVIS++, the first open-vocabulary universal video segmentation framework.
arXiv Detail & Related papers (2023-12-20T03:01:33Z) - Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation [76.68301884987348]
We propose a simple yet effective approach for self-supervised video object segmentation (VOS).
Our key insight is that the inherent structural dependencies present in DINO-pretrained Transformers can be leveraged to establish robust spatio-temporal segmentation correspondences in videos.
Our method demonstrates state-of-the-art performance across multiple unsupervised VOS benchmarks and excels in complex real-world multi-object video segmentation tasks.
arXiv Detail & Related papers (2023-11-29T18:47:17Z) - Learning Cross-Modal Affinity for Referring Video Object Segmentation
Targeting Limited Samples [61.66967790884943]
Referring video object segmentation (RVOS) relies on sufficient data for a given scene.
In more realistic scenarios, only minimal annotations are available for a new scene.
We propose a model with a newly designed cross-modal affinity (CMA) module based on a Transformer architecture.
The CMA module builds multimodal affinity from a few samples, quickly learning new semantic information and enabling the model to adapt to different scenarios.
arXiv Detail & Related papers (2023-09-05T08:34:23Z) - Region Aware Video Object Segmentation with Deep Motion Modeling [56.95836951559529]
Region Aware Video Object Segmentation (RAVOS) is a method that predicts regions of interest (ROIs) for efficient object segmentation and memory storage.
For efficient segmentation, object features are extracted according to the ROIs, and an object decoder is designed for object-level segmentation.
For efficient memory storage, we propose motion path memory to filter out redundant context by memorizing the features within the motion path of objects between two frames.
arXiv Detail & Related papers (2022-07-21T01:44:40Z) - SSTVOS: Sparse Spatiotemporal Transformers for Video Object Segmentation [24.884078497381633]
We introduce a Transformer-based approach to video object segmentation (VOS).
Our attention-based approach allows a model to learn to attend over a history of features from multiple frames.
Our method achieves competitive results on YouTube-VOS and DAVIS 2017 with improved scalability and robustness compared with the state of the art.
arXiv Detail & Related papers (2021-01-21T20:06:12Z) - Learning Dynamic Network Using a Reuse Gate Function in Semi-supervised
Video Object Segmentation [27.559093073097483]
Current approaches for semi-supervised video object segmentation (Semi-VOS) propagate information from previous frames to generate a segmentation mask for the current frame.
We exploit the observation that consecutive frames often change little, using temporal information to quickly identify frames with minimal change.
We propose a novel dynamic network that estimates change across frames and decides which path to take: computing the full network or reusing the previous frame's features.
arXiv Detail & Related papers (2020-12-21T19:40:17Z) - Fast Video Object Segmentation With Temporal Aggregation Network and
Dynamic Template Matching [67.02962970820505]
We introduce "tracking-by-detection" into Video Object (VOS)
We propose a new temporal aggregation network and a novel dynamic time-evolving template matching mechanism to achieve significantly improved performance.
We achieve new state-of-the-art performance on the DAVIS benchmark in both speed and accuracy, without complicated bells and whistles, running at 0.14 seconds per frame with a J&F measure of 75.9%.
arXiv Detail & Related papers (2020-07-11T05:44:16Z) - Learning Video Object Segmentation from Unlabeled Videos [158.18207922363783]
We propose a new method for video object segmentation (VOS) that addresses object pattern learning from unlabeled videos.
We introduce a unified unsupervised/weakly supervised learning framework, called MuG, that comprehensively captures properties of VOS at multiple granularities.
arXiv Detail & Related papers (2020-03-10T22:12:15Z)