DVIS-DAQ: Improving Video Segmentation via Dynamic Anchor Queries
- URL: http://arxiv.org/abs/2404.00086v5
- Date: Fri, 19 Jul 2024 09:07:20 GMT
- Title: DVIS-DAQ: Improving Video Segmentation via Dynamic Anchor Queries
- Authors: Yikang Zhou, Tao Zhang, Shunping Ji, Shuicheng Yan, Xiangtai Li,
- Abstract summary: We introduce Dynamic Anchor Queries (DAQ) to shorten the transition gap between the anchor and target queries.
We also introduce a query-level object Emergence and Disappearance Simulation (EDS) strategy, which unleashes DAQ's potential without any additional cost.
Experiments demonstrate that DVIS-DAQ achieves a new state-of-the-art (SOTA) performance on five mainstream video segmentation benchmarks.
- Score: 60.09774333024783
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern video segmentation methods adopt object queries to perform inter-frame association and demonstrate satisfactory performance in tracking continuously appearing objects despite large-scale motion and transient occlusion. However, they all underperform on newly emerging and disappearing objects that are common in the real world because they attempt to model object emergence and disappearance through feature transitions between background and foreground queries that have significant feature gaps. We introduce Dynamic Anchor Queries (DAQ) to shorten the transition gap between the anchor and target queries by dynamically generating anchor queries based on the features of potential candidates. Furthermore, we introduce a query-level object Emergence and Disappearance Simulation (EDS) strategy, which unleashes DAQ's potential without any additional cost. Finally, we combine our proposed DAQ and EDS with DVIS to obtain DVIS-DAQ. Extensive experiments demonstrate that DVIS-DAQ achieves a new state-of-the-art (SOTA) performance on five mainstream video segmentation benchmarks. Code and models are available at \url{https://github.com/SkyworkAI/DAQ-VS}.
Related papers
- DynamicER: Resolving Emerging Mentions to Dynamic Entities for RAG [40.40642344216866]
We introduce a novel task aimed at resolving emerging mentions to dynamic entities.
Our benchmark includes dynamic entity mention resolution and entity-centric knowledge-intensive QA task.
We propose a temporal segmented clustering method with continual adaptation, effectively managing the temporal dynamics of evolving entities and emerging mentions.
arXiv Detail & Related papers (2024-10-15T10:57:12Z) - Learning Spatial-Semantic Features for Robust Video Object Segmentation [108.045326229865]
We propose a robust video object segmentation framework equipped with spatial-semantic features and discriminative object queries.
We show that the proposed method set a new state-of-the-art performance on multiple datasets.
arXiv Detail & Related papers (2024-07-10T15:36:00Z) - Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z) - VISAGE: Video Instance Segmentation with Appearance-Guided Enhancement [39.154059294954614]
Online Video Instance (VIS) methods have shown remarkable advancement with their powerful query-based detectors.
However, our observations demonstrate that these methods heavily rely on location information, which often causes incorrect associations between objects.
This paper presents that a key axis of object matching in trackers is appearance information, which becomes greatly instructive under conditions where positional cues are insufficient for distinguishing their identities.
arXiv Detail & Related papers (2023-12-08T07:48:03Z) - Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation [76.68301884987348]
We propose a simple yet effective approach for self-supervised video object segmentation (VOS)
Our key insight is that the inherent structural dependencies present in DINO-pretrained Transformers can be leveraged to establish robust-temporal segmentation correspondences in videos.
Our method demonstrates state-of-the-art performance across multiple unsupervised VOS benchmarks and excels in complex real-world multi-object video segmentation tasks.
arXiv Detail & Related papers (2023-11-29T18:47:17Z) - FODVid: Flow-guided Object Discovery in Videos [12.792602427704395]
We focus on building a generalizable solution that avoids overfitting to the individual intricacies.
To solve Video Object (VOS) in an unsupervised setting, we propose a new pipeline (FODVid) based on the idea of guiding segmentation outputs.
arXiv Detail & Related papers (2023-07-10T07:55:42Z) - Semantic-aware Dynamic Retrospective-Prospective Reasoning for
Event-level Video Question Answering [14.659023742381777]
Event-Level Video Question Answering (EVQA) requires complex reasoning across video events to provide optimal answers.
We propose a semantic-aware dynamic retrospective-prospective reasoning approach for video-based question answering.
Our proposed approach achieves superior performance compared to previous state-of-the-art models.
arXiv Detail & Related papers (2023-05-14T03:57:11Z) - Scalable Video Object Segmentation with Identification Mechanism [125.4229430216776]
This paper explores the challenges of achieving scalable and effective multi-object modeling for semi-supervised Video Object (VOS)
We present two innovative approaches, Associating Objects with Transformers (AOT) and Associating Objects with Scalable Transformers (AOST)
Our approaches surpass the state-of-the-art competitors and display exceptional efficiency and scalability consistently across all six benchmarks.
arXiv Detail & Related papers (2022-03-22T03:33:27Z) - Fast Video Object Segmentation With Temporal Aggregation Network and
Dynamic Template Matching [67.02962970820505]
We introduce "tracking-by-detection" into Video Object (VOS)
We propose a new temporal aggregation network and a novel dynamic time-evolving template matching mechanism to achieve significantly improved performance.
We achieve new state-of-the-art performance on the DAVIS benchmark without complicated bells and whistles in both speed and accuracy, with a speed of 0.14 second per frame and J&F measure of 75.9% respectively.
arXiv Detail & Related papers (2020-07-11T05:44:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.