Look Before You Match: Instance Understanding Matters in Video Object
Segmentation
- URL: http://arxiv.org/abs/2212.06826v1
- Date: Tue, 13 Dec 2022 18:59:59 GMT
- Title: Look Before You Match: Instance Understanding Matters in Video Object
Segmentation
- Authors: Junke Wang and Dongdong Chen and Zuxuan Wu and Chong Luo and Chuanxin
Tang and Xiyang Dai and Yucheng Zhao and Yujia Xie and Lu Yuan and Yu-Gang
Jiang
- Abstract summary: In this paper, we argue that instance understanding matters in video object segmentation (VOS).
We present a two-branch network for VOS, where the query-based instance segmentation (IS) branch delves into the instance details of the current frame and the VOS branch performs spatial-temporal matching with the memory bank.
We employ the well-learned object queries from the IS branch to inject instance-specific information into the query key, with which instance-augmented matching is further performed.
- Score: 114.57723592870097
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Exploring dense matching between the current frame and past frames for
long-range context modeling, memory-based methods have demonstrated impressive
results in video object segmentation (VOS) recently. Nevertheless, due to the
lack of instance understanding ability, the above approaches are oftentimes
brittle to large appearance variations or viewpoint changes resulting from the
movement of objects and cameras. In this paper, we argue that instance
understanding matters in VOS, and that integrating it with memory-based matching
yields a natural synergy, which is intuitively sensible given the definition of
the VOS task, i.e., identifying and segmenting object instances within the video.
Towards this goal, we present a two-branch network for VOS, where the
query-based instance segmentation (IS) branch delves into the instance details
of the current frame and the VOS branch performs spatial-temporal matching with
the memory bank. We employ the well-learned object queries from the IS branch to
inject instance-specific information into the query key, with which the
instance-augmented matching is further performed. In addition, we introduce a
multi-path fusion block to effectively combine the memory readout with
multi-scale features from the instance segmentation decoder, which incorporates
high-resolution instance-aware features to produce final segmentation results.
Our method achieves state-of-the-art performance on DAVIS 2016/2017 val (92.6%
and 87.1%), DAVIS 2017 test-dev (82.8%), and YouTube-VOS 2018/2019 val (86.3%
and 86.3%), outperforming alternative methods by clear margins.
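To make the matching mechanism described in the abstract more concrete, here is a minimal PyTorch-style sketch of how learned object queries from an instance segmentation branch might be injected into the query key before memory readout. All module names, tensor shapes, and the cross-attention-based injection are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class InstanceAugmentedMatching(nn.Module):
    """Sketch: inject IS-branch object queries into the VOS query key,
    then perform memory readout via dense spatio-temporal matching.
    Shapes and design choices are assumptions, not the paper's exact design."""

    def __init__(self, key_dim=64, value_dim=512, query_dim=256):
        super().__init__()
        # Project object queries into the key space so they can modulate the query key.
        self.query_proj = nn.Linear(query_dim, key_dim)
        # Cross-attention: pixel-wise query key attends to instance queries (assumption).
        self.inject = nn.MultiheadAttention(key_dim, num_heads=1, batch_first=True)

    def forward(self, query_key, object_queries, memory_key, memory_value):
        """
        query_key:      (B, C_k, H, W)    key of the current frame
        object_queries: (B, N, C_q)       learned queries from the IS branch
        memory_key:     (B, C_k, T*H*W)   keys stored in the memory bank
        memory_value:   (B, C_v, T*H*W)   values stored in the memory bank
        """
        B, Ck, H, W = query_key.shape
        qk = query_key.flatten(2).transpose(1, 2)        # (B, H*W, C_k)
        inst = self.query_proj(object_queries)           # (B, N, C_k)

        # Inject instance-specific information into the query key.
        qk_aug, _ = self.inject(qk, inst, inst)          # (B, H*W, C_k)
        qk_aug = qk_aug + qk                             # residual connection (assumption)

        # Dense matching against the memory bank (scaled dot-product affinity).
        affinity = torch.einsum("bqc,bck->bqk", qk_aug, memory_key) / Ck ** 0.5
        affinity = F.softmax(affinity, dim=-1)           # (B, H*W, T*H*W)

        # Memory readout: aggregate memory values with the affinity weights.
        readout = torch.einsum("bqk,bck->bcq", affinity, memory_value)
        return readout.view(B, -1, H, W)                 # (B, C_v, H, W)
```

For example, with a query key of shape (1, 64, 24, 24), 10 object queries of dimension 256, and a memory bank holding 4 frames, the readout has shape (1, 512, 24, 24); in the paper's pipeline this instance-aware readout would then be combined with multi-scale decoder features by the multi-path fusion block to produce the final masks.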
Related papers
- Context-Aware Video Instance Segmentation [12.71520768233772]
We introduce the Context-Aware Video Instance (CAVIS), a novel framework designed to enhance instance association.
We propose the Context-Aware Instance Tracker (CAIT), which merges contextual data surrounding the instances with the core instance features to improve tracking accuracy.
We also introduce the Prototypical Cross-frame Contrastive (PCC) loss, which ensures consistency in object-level features across frames.
arXiv Detail & Related papers (2024-07-03T11:11:16Z)
- Training-Free Robust Interactive Video Object Segmentation [82.05906654403684]
We propose a training-free prompt tracking framework for interactive video object segmentation (I-PT).
We jointly adopt sparse point and box tracking, filtering out unstable points and capturing object-wise information.
Our framework has demonstrated robust zero-shot video segmentation results on popular VOS datasets.
arXiv Detail & Related papers (2024-06-08T14:25:57Z)
- DeVOS: Flow-Guided Deformable Transformer for Video Object Segmentation [0.4487265603408873]
We present DeVOS (Deformable VOS), an architecture for Video Object Segmentation that combines memory-based matching with motion-guided propagation.
Our method achieves top-rank performance on DAVIS 2017 val and test-dev (88.1%, 83.0%), YouTube-VOS 2019 val (86.6%)
arXiv Detail & Related papers (2024-05-11T14:57:22Z)
- Video Object Segmentation with Dynamic Query Modulation [23.811776213359625]
We propose a query modulation method, termed QMVOS, for object and multi-object segmentation.
Our method can bring significant improvements to the memory-based SVOS method and achieve competitive performance on standard SVOS benchmarks.
arXiv Detail & Related papers (2024-03-18T07:31:39Z)
- VISAGE: Video Instance Segmentation with Appearance-Guided Enhancement [39.154059294954614]
Online Video Instance Segmentation (VIS) methods have shown remarkable advancement with their powerful query-based detectors.
However, our observations demonstrate that these methods heavily rely on location information, which often causes incorrect associations between objects.
This paper shows that a key axis of object matching in trackers is appearance information, which becomes highly instructive when positional cues are insufficient for distinguishing object identities.
arXiv Detail & Related papers (2023-12-08T07:48:03Z)
- Joint Modeling of Feature, Correspondence, and a Compressed Memory for Video Object Segmentation [52.11279360934703]
Current prevailing Video Object Segmentation (VOS) methods usually perform dense matching between the current and reference frames after extracting features.
We propose a unified VOS framework, coined as JointFormer, for joint modeling of the three elements of feature, correspondence, and a compressed memory.
arXiv Detail & Related papers (2023-08-25T17:30:08Z)
- Region Aware Video Object Segmentation with Deep Motion Modeling [56.95836951559529]
Region Aware Video Object Segmentation (RAVOS) is a method that predicts regions of interest for efficient object segmentation and memory storage.
For efficient segmentation, object features are extracted according to the ROIs, and an object decoder is designed for object-level segmentation.
For efficient memory storage, we propose motion path memory to filter out redundant context by memorizing the features within the motion path of objects between two frames.
arXiv Detail & Related papers (2022-07-21T01:44:40Z)
- Tag-Based Attention Guided Bottom-Up Approach for Video Instance Segmentation [83.13610762450703]
Video instance segmentation is a fundamental computer vision task that deals with segmenting and tracking object instances across a video sequence.
We introduce a simple end-to-end trainable bottom-up approach to achieve instance mask predictions at pixel-level granularity, instead of the typical region-proposal-based approach.
Our method provides competitive results on the YouTube-VIS and DAVIS-19 datasets, and has the lowest run-time among contemporary state-of-the-art methods.
arXiv Detail & Related papers (2022-04-22T15:32:46Z)
- End-to-end video instance segmentation via spatial-temporal graph neural networks [30.748756362692184]
Video instance segmentation is a challenging task that extends image instance segmentation to the video domain.
Existing methods either rely only on single-frame information for the detection and segmentation subproblems or handle tracking as a separate post-processing step.
We propose a novel graph-neural-network (GNN) based method to handle the aforementioned limitation.
arXiv Detail & Related papers (2022-03-07T05:38:08Z)
- CompFeat: Comprehensive Feature Aggregation for Video Instance Segmentation [67.17625278621134]
Video instance segmentation is a complex task in which we need to detect, segment, and track each object for any given video.
Previous approaches only utilize single-frame features for the detection, segmentation, and tracking of objects.
We propose a novel comprehensive feature aggregation approach (CompFeat) to refine features at both frame-level and object-level with temporal and spatial context information.
arXiv Detail & Related papers (2020-12-07T00:31:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.