OneVOS: Unifying Video Object Segmentation with All-in-One Transformer Framework
- URL: http://arxiv.org/abs/2403.08682v1
- Date: Wed, 13 Mar 2024 16:38:26 GMT
- Title: OneVOS: Unifying Video Object Segmentation with All-in-One Transformer Framework
- Authors: Wanyun Li, Pinxue Guo, Xinyu Zhou, Lingyi Hong, Yangji He, Xiangyu Zheng, Wei Zhang and Wenqiang Zhang
- Abstract summary: OneVOS is a novel framework that unifies the core components of VOS with All-in-One Transformer.
OneVOS achieves state-of-the-art performance across 7 datasets, particularly excelling on the complex LVOS and MOSE datasets with 70.1% and 66.4% $J \& F$ scores, surpassing previous state-of-the-art methods by 4.2% and 7.0%, respectively.
- Score: 24.947436083365925
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contemporary Video Object Segmentation (VOS) approaches typically consist of stages of feature extraction, matching, memory management, and multi-object aggregation. Recent advanced models either employ discrete modeling of these components in a sequential manner, or optimize a combined pipeline through substructure aggregation. However, these explicit staged approaches prevent the VOS framework from being optimized as a unified whole, leading to limited capacity and suboptimal performance on complex videos. In this paper, we propose OneVOS, a novel framework that unifies the core components of VOS with an All-in-One Transformer. Specifically, to unify all the aforementioned modules into a single vision transformer, we model all the features of frames, masks, and memory for multiple objects as transformer tokens, and integrally accomplish feature extraction, matching, and memory management of multiple objects through a flexible attention mechanism. Furthermore, a Unidirectional Hybrid Attention is proposed through a double decoupling of the original attention operation, to rectify semantic errors and ambiguities of stored tokens in the OneVOS framework. Finally, to alleviate the storage burden and expedite inference, we propose the Dynamic Token Selector, which unveils the working mechanism of OneVOS and naturally leads to a more efficient version of OneVOS. Extensive experiments demonstrate the superiority of OneVOS, achieving state-of-the-art performance across 7 datasets, particularly excelling on the complex LVOS and MOSE datasets with 70.1% and 66.4% $J \& F$ scores, surpassing previous state-of-the-art methods by 4.2% and 7.0%, respectively. Our code will be made available for reproducibility and further research.
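As an illustration of the all-in-one idea described above, the sketch below treats frame, mask, and memory features as one token sequence processed by a single attention operation, applies a unidirectional mask so that stored memory tokens are not updated from (and thus not corrupted by) current-frame tokens, and prunes the memory with a simple top-k selector. This is a minimal sketch under stated assumptions: the names (`unified_attention`, `w_qkv`, `w_out`, `keep_k`), the particular masking direction, and the attention-based importance score are all illustrative choices, not the authors' implementation.

```python
# Hedged sketch, not the paper's code: one attention block over frame, mask,
# and memory tokens, with an assumed unidirectional mask and a top-k
# "dynamic token selector" that bounds the memory bank.
import torch


def unified_attention(frame_tok, mask_tok, mem_tok, w_qkv, w_out, keep_k=256):
    """frame_tok: (Nf, D), mask_tok: (Nm, D), mem_tok: (Ns, D)."""
    ns, nf = mem_tok.shape[0], frame_tok.shape[0]
    tokens = torch.cat([mem_tok, frame_tok, mask_tok], dim=0)      # (N, D)
    q, k, v = (tokens @ w_qkv).chunk(3, dim=-1)                    # each (N, D)
    scores = q @ k.T / k.shape[-1] ** 0.5                          # (N, N)

    # Assumed reading of "unidirectional": memory tokens attend only to other
    # memory tokens, while frame and mask tokens attend to everything, so the
    # stored tokens cannot absorb errors from the current frame.
    blocked = torch.zeros_like(scores, dtype=torch.bool)
    blocked[:ns, ns:] = True
    attn = scores.masked_fill(blocked, float("-inf")).softmax(dim=-1)
    out = tokens + (attn @ v) @ w_out                              # residual update

    # Dynamic token selection (assumption): keep only the memory tokens that
    # received the most attention from current-frame tokens.
    importance = attn[ns:ns + nf, :ns].sum(dim=0)                  # (Ns,)
    keep = importance.topk(min(keep_k, ns)).indices
    return out[ns:ns + nf], out[:ns][keep]   # updated frame tokens, pruned memory
```

In a full model such a block would be stacked per layer and the pruned memory carried forward frame to frame; multi-object aggregation then reduces each object's mask logits into one segmentation.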
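The $J \& F$ score reported above is the standard DAVIS-style VOS metric: $J$ is region similarity (mask IoU) and $F$ is a boundary F-measure, and the two are averaged. The sketch below is an approximation for intuition only; the official evaluation code derives the boundary-matching tolerance from the image diagonal rather than the fixed pixel tolerance assumed here.

```python
# Approximate per-frame, per-object J&F; hedged, not the official benchmark
# code. Masks are boolean numpy arrays of equal shape.
import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt


def j_score(pred, gt):
    # Region similarity J: intersection over union of the two masks.
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0


def f_score(pred, gt, tol=2):
    # Boundary F: F-measure between boundary pixels matched within `tol` px
    # (assumed fixed tolerance; the official metric scales it with image size).
    pb = pred & ~binary_erosion(pred)
    gb = gt & ~binary_erosion(gt)
    if pb.sum() == 0 or gb.sum() == 0:
        return float(pb.sum() == gb.sum())
    precision = (distance_transform_edt(~gb)[pb] <= tol).mean()
    recall = (distance_transform_edt(~pb)[gb] <= tol).mean()
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0


def jf_score(pred, gt):
    return (j_score(pred, gt) + f_score(pred, gt)) / 2
```

Dataset-level numbers such as 70.1% $J \& F$ are then means of these per-frame, per-object scores over a benchmark's sequences.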
Related papers
- Video Object Segmentation with Dynamic Query Modulation [23.811776213359625]
We propose a query modulation method, termed QMVOS, for object and multi-object segmentation.
Our method can bring significant improvements to the memory-based SVOS method and achieve competitive performance on standard SVOS benchmarks.
arXiv Detail & Related papers (2024-03-18T07:31:39Z)
- 1st Place Solution for 5th LSVOS Challenge: Referring Video Object Segmentation [65.45702890457046]
We integrate the strengths of leading RVOS models to build up an effective paradigm.
To improve the consistency and quality of masks, we propose a Two-Stage Multi-Model Fusion strategy.
Our method achieves 75.7% J&F on the Ref-Youtube-VOS validation set and 70% J&F on the test set, ranking 1st on Track 3 of the 5th Large-scale Video Object Segmentation Challenge (ICCV 2023).
arXiv Detail & Related papers (2024-01-01T04:24:48Z)
- Scalable Video Object Segmentation with Simplified Framework [21.408446548059956]
This paper presents a scalable VOS (SimVOS) framework to perform joint feature extraction and matching.
SimVOS employs a scalable ViT backbone for simultaneous feature extraction and matching between query and reference features.
Experimentally, our SimVOS achieves state-of-the-art results on popular video object segmentation benchmarks.
arXiv Detail & Related papers (2023-08-19T04:30:48Z)
- Look Before You Match: Instance Understanding Matters in Video Object Segmentation [114.57723592870097]
In this paper, we argue that instance understanding matters in video object segmentation (VOS).
We present a two-branch network for VOS, where the query-based instance segmentation (IS) branch delves into the instance details of the current frame and the VOS branch performs spatial-temporal matching with the memory bank.
We employ well-learned object queries from the IS branch to inject instance-specific information into the query key, with which instance-augmented matching is further performed.
arXiv Detail & Related papers (2022-12-13T18:59:59Z)
- Region Aware Video Object Segmentation with Deep Motion Modeling [56.95836951559529]
Region Aware Video Object Segmentation (RAVOS) is a method that predicts regions of interest (ROIs) for efficient object segmentation and memory storage.
For efficient segmentation, object features are extracted according to the ROIs, and an object decoder is designed for object-level segmentation.
For efficient memory storage, we propose motion path memory to filter out redundant context by memorizing the features within the motion path of objects between two frames.
arXiv Detail & Related papers (2022-07-21T01:44:40Z)
- Segmenting Moving Objects via an Object-Centric Layered Representation [100.26138772664811]
We introduce an object-centric segmentation model with a depth-ordered layer representation.
We introduce a scalable pipeline for generating synthetic training data with multiple objects.
We evaluate the model on standard video segmentation benchmarks.
arXiv Detail & Related papers (2022-07-05T17:59:43Z)
- Scalable Video Object Segmentation with Identification Mechanism [125.4229430216776]
This paper explores the challenges of achieving scalable and effective multi-object modeling for semi-supervised Video Object Segmentation (VOS).
We present two innovative approaches, Associating Objects with Transformers (AOT) and Associating Objects with Scalable Transformers (AOST).
Our approaches surpass the state-of-the-art competitors and display exceptional efficiency and scalability consistently across all six benchmarks.
arXiv Detail & Related papers (2022-03-22T03:33:27Z)
- TransVOS: Video Object Segmentation with Transformers [13.311777431243296]
We propose a vision transformer to fully exploit and model both the temporal and spatial relationships.
To slim the popular two-encoder pipeline, we design a single two-path feature extractor.
Experiments demonstrate the superiority of our TransVOS over state-of-the-art methods on both DAVIS and YouTube-VOS datasets.
arXiv Detail & Related papers (2021-06-01T15:56:10Z)
- Video Instance Segmentation with a Propose-Reduce Paradigm [68.59137660342326]
Video instance segmentation (VIS) aims to segment and associate all instances of predefined classes for each frame in videos.
Prior methods usually obtain segmentation for a frame or clip first, and then merge the incomplete results by tracking or matching.
We propose a new paradigm, Propose-Reduce, to generate complete sequences for input videos in a single step.
arXiv Detail & Related papers (2021-03-25T10:58:36Z)
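Since the Propose-Reduce entry above hinges on generating several candidate sequences and then discarding redundant ones, a hedged sketch of such a reduce step follows; the sequence-level NMS formulation, the names, and the threshold here are illustrative assumptions, not that paper's implementation.

```python
# Hedged sketch of a "reduce" step: sequence-level non-maximum suppression
# over candidate instance sequences (one mask per frame each), keeping the
# highest-scoring sequence from every group of overlapping candidates.
import numpy as np


def video_iou(seq_a, seq_b):
    """seq_*: lists of boolean masks, one per frame; IoU pooled over the video."""
    inter = sum(np.logical_and(a, b).sum() for a, b in zip(seq_a, seq_b))
    union = sum(np.logical_or(a, b).sum() for a, b in zip(seq_a, seq_b))
    return inter / union if union else 0.0


def reduce_sequences(sequences, scores, iou_thresh=0.5):
    order = np.argsort(scores)[::-1]          # best-scoring candidates first
    kept = []
    for i in order:
        if all(video_iou(sequences[i], sequences[j]) < iou_thresh for j in kept):
            kept.append(i)
    return [sequences[i] for i in kept]
```

The propose step would build `sequences` by propagating the instances of several key frames across the whole video, so each candidate is already a complete sequence rather than per-frame fragments to be stitched afterwards.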