DeVOS: Flow-Guided Deformable Transformer for Video Object Segmentation
- URL: http://arxiv.org/abs/2405.08715v1
- Date: Sat, 11 May 2024 14:57:22 GMT
- Title: DeVOS: Flow-Guided Deformable Transformer for Video Object Segmentation
- Authors: Volodymyr Fedynyak, Yaroslav Romanus, Bohdan Hlovatskyi, Bohdan Sydor, Oles Dobosevych, Igor Babin, Roman Riazantsev
- Abstract summary: We present DeVOS (Deformable VOS), an architecture for Video Object Segmentation that combines memory-based matching with motion-guided propagation.
Our method achieves top-rank performance on DAVIS 2017 val and test-dev (88.1%, 83.0%) and YouTube-VOS 2019 val (86.6%).
- Score: 0.4487265603408873
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent works on Video Object Segmentation achieved remarkable results by matching dense semantic and instance-level features between the current and previous frames for long-term propagation. Nevertheless, global feature matching ignores scene motion context and fails to ensure temporal consistency. Even though some methods introduce a local matching branch to achieve smooth propagation, they fail to model complex appearance changes due to the constraints of the local window. In this paper, we present DeVOS (Deformable VOS), an architecture for Video Object Segmentation that combines memory-based matching with motion-guided propagation, resulting in stable long-term modeling and strong temporal consistency. For short-term local propagation, we propose a novel attention mechanism, ADVA (Adaptive Deformable Video Attention), which adapts the similarity search region to query-specific semantic features and thus ensures robust tracking of complex shape and scale changes. DeVOS employs optical flow to obtain scene motion features, which are further injected into the deformable attention as strong priors for the learnable offsets. Our method achieves top-rank performance on DAVIS 2017 val and test-dev (88.1%, 83.0%) and YouTube-VOS 2019 val (86.6%), while featuring consistent run-time speed and stable memory consumption.
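The abstract describes ADVA only at a high level. As a concrete illustration, here is a minimal single-scale, single-head PyTorch sketch of flow-guided deformable attention: optical flow biases the sampling locations, and query-conditioned offsets refine them. The class name, the tanh offset scale of 0.1, and the normalized-flow convention are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FlowGuidedDeformableAttention(nn.Module):
    """Single-scale, single-head sketch: optical flow serves as a strong prior
    for the sampling locations, and small learned offsets (predicted from the
    query) refine them, as in deformable attention."""

    def __init__(self, dim: int, num_points: int = 4):
        super().__init__()
        self.num_points = num_points
        self.offset_head = nn.Linear(dim, 2 * num_points)  # learned (dx, dy) refinements
        self.weight_head = nn.Linear(dim, num_points)      # per-point attention weights
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, query, ref_feat, flow):
        # query:    (B, H, W, C) current-frame features (the attention queries)
        # ref_feat: (B, C, H, W) previous-frame features to sample from
        # flow:     (B, H, W, 2) optical flow in normalized [-1, 1] coordinates
        B, H, W, C = query.shape
        P = self.num_points

        # Regular base grid in normalized (x, y) coordinates.
        ys = torch.linspace(-1, 1, H, device=query.device)
        xs = torch.linspace(-1, 1, W, device=query.device)
        base = torch.stack(torch.meshgrid(xs, ys, indexing="xy"), dim=-1)  # (H, W, 2)

        # Flow is injected as a prior; learned offsets adapt the search region
        # to the query's semantics (the 0.1 scale keeps refinements local).
        offsets = torch.tanh(self.offset_head(query)).view(B, H, W, P, 2) * 0.1
        loc = base[None, :, :, None, :] + flow[:, :, :, None, :] + offsets  # (B, H, W, P, 2)

        # Bilinearly sample the reference features at the deformed locations.
        sampled = F.grid_sample(ref_feat, loc.view(B, H, W * P, 2),
                                mode="bilinear", align_corners=True)
        sampled = sampled.view(B, C, H, W, P)

        # Aggregate the sampled points with query-dependent softmax weights.
        weights = self.weight_head(query).softmax(dim=-1)        # (B, H, W, P)
        out = (sampled * weights.unsqueeze(1)).sum(-1)           # (B, C, H, W)
        return self.out_proj(out.permute(0, 2, 3, 1))            # (B, H, W, C)
```

In the full method, multiple heads and scales plus the memory-based long-term branch would sit around such a module; the sketch keeps only the flow-as-prior idea.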
Related papers
- Global Motion Understanding in Large-Scale Video Object Segmentation [0.499320937849508]
We show that transferring knowledge from other domains of video understanding, combined with large-scale learning, can improve the robustness of Video Object Segmentation (VOS) under complex circumstances.
Namely, we focus on integrating scene global motion knowledge to improve large-scale semi-supervised Video Object Segmentation.
We present WarpFormer, an architecture for semi-supervised Video Object Segmentation that exploits existing knowledge in motion understanding to conduct smoother propagation and more accurate matching.
arXiv Detail & Related papers (2024-05-11T15:09:22Z)
- Temporally Consistent Referring Video Object Segmentation with Hybrid Memory [98.80249255577304]
We propose an end-to-end R-VOS paradigm that explicitly models temporal consistency alongside the referring segmentation.
Features of frames with automatically generated high-quality reference masks are propagated to segment remaining frames.
Extensive experiments demonstrate that our approach enhances temporal consistency by a significant margin.
arXiv Detail & Related papers (2024-03-28T13:32:49Z)
- Joint Modeling of Feature, Correspondence, and a Compressed Memory for Video Object Segmentation [52.11279360934703]
Current prevailing Video Object Segmentation (VOS) methods usually perform dense matching between the current and reference frames after extracting features.
We propose a unified VOS framework, coined JointFormer, for joint modeling of three elements: features, correspondence, and a compressed memory.
arXiv Detail & Related papers (2023-08-25T17:30:08Z)
- Look Before You Match: Instance Understanding Matters in Video Object Segmentation [114.57723592870097]
In this paper, we argue that instance understanding matters in video object segmentation (VOS).
We present a two-branch network for VOS, where the query-based instance segmentation (IS) branch delves into the instance details of the current frame and the VOS branch performs spatial-temporal matching with the memory bank.
We employ well-learned object queries from the IS branch to inject instance-specific information into the query key, with which instance-augmented matching is further performed.
arXiv Detail & Related papers (2022-12-13T18:59:59Z)
- Implicit Motion Handling for Video Camouflaged Object Detection [60.98467179649398]
We propose a new video camouflaged object detection (VCOD) framework.
It can exploit both short-term and long-term temporal consistency to detect camouflaged objects from video frames.
arXiv Detail & Related papers (2022-03-14T17:55:41Z)
- Augmented Transformer with Adaptive Graph for Temporal Action Proposal Generation [79.98992138865042]
We present an augmented transformer with adaptive graph network (ATAG) to exploit both long-range and local temporal contexts for temporal action proposal generation (TAPG).
Specifically, we enhance the vanilla transformer by equipping it with a snippet actionness loss and a front block, dubbed the augmented transformer.
An adaptive graph convolutional network (GCN) is proposed to build local temporal context by mining the position information and difference between adjacent features (a sketch follows this entry).
arXiv Detail & Related papers (2021-03-30T02:01:03Z)
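As a rough illustration of the adaptive-GCN idea above, the following PyTorch sketch connects each snippet to its temporal neighbors and derives edge weights from adjacent feature differences. The module name, the sigmoid gating, and the wrap-around boundary handling are assumptions made for brevity, not the paper's design.

```python
import torch
import torch.nn as nn

class AdjacentDifferenceGCN(nn.Module):
    """Hypothetical sketch: a local temporal graph where each snippet attends
    to its immediate neighbors, with edge weights adapted from the difference
    between adjacent features."""

    def __init__(self, dim: int):
        super().__init__()
        self.edge = nn.Linear(dim, 1)   # scores an edge from a feature difference
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (B, T, C) snippet features ordered along the temporal axis
        left = torch.roll(x, shifts=1, dims=1)    # neighbor t-1 (wraps at t=0)
        right = torch.roll(x, shifts=-1, dims=1)  # neighbor t+1 (wraps at t=T-1)
        w_left = torch.sigmoid(self.edge(x - left))    # (B, T, 1) adaptive edges
        w_right = torch.sigmoid(self.edge(x - right))
        # Aggregate neighbors weighted by the adaptive edges, then project.
        agg = x + w_left * left + w_right * right
        return torch.relu(self.proj(agg))
```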
- SSTVOS: Sparse Spatiotemporal Transformers for Video Object Segmentation [24.884078497381633]
We introduce a Transformer-based approach to video object segmentation (VOS).
Our attention-based approach allows a model to learn to attend over a history of features from multiple frames (a sketch follows this entry).
Our method achieves competitive results on YouTube-VOS and DAVIS 2017 with improved scalability and robustness compared with the state of the art.
arXiv Detail & Related papers (2021-01-21T20:06:12Z)
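Stripped of the sparsity that gives SSTVOS its name, attending over a multi-frame feature history reduces to cross-attention from current-frame tokens to stacked past-frame features. A dense simplification in PyTorch follows; the class name and tensor layout are assumptions.

```python
import torch.nn as nn

class HistoryAttention(nn.Module):
    """Dense simplification: current-frame tokens attend over the features of
    several past frames kept in a history buffer. SSTVOS additionally makes
    this attention sparse, which is omitted here."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, cur, history):
        # cur:     (B, H*W, C)   flattened features of the current frame
        # history: (B, T*H*W, C) flattened features of T past frames
        out, _ = self.attn(query=cur, key=history, value=history)
        return out
```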
- Learning Dynamic Network Using a Reuse Gate Function in Semi-supervised Video Object Segmentation [27.559093073097483]
Current approaches for semi-supervised Video Object Segmentation (Semi-VOS) propagate information from previous frames to generate a segmentation mask for the current frame.
We exploit the observation that consecutive frames often change little, using temporal information to quickly identify frames with minimal change.
We propose a novel dynamic network that estimates change across frames and decides which path -- computing the full network or reusing the previous frame's features -- to take (a sketch follows this entry).
arXiv Detail & Related papers (2020-12-21T19:40:17Z)
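The decision described above can be pictured as a cheap gate in front of the heavy backbone. A minimal sketch, assuming the gate scores the difference between shallow features of consecutive frames; the module name, pooled gate, and threshold are illustrative, not from the paper.

```python
import torch
import torch.nn as nn

class ReuseGate(nn.Module):
    """Sketch of a reuse gate: estimate how much the frame changed and either
    run the full network or reuse the previous frame's deep features."""

    def __init__(self, dim: int, threshold: float = 0.1):
        super().__init__()
        self.change_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(dim, 1), nn.Sigmoid(),
        )
        self.threshold = threshold  # assumed decision boundary

    def forward(self, cur_shallow, prev_shallow, prev_deep, full_network):
        # cur_shallow/prev_shallow: (B, C, H, W) cheap early-layer features
        # prev_deep: cached deep features of the previous frame
        change = self.change_head(cur_shallow - prev_shallow)  # (B, 1) change score
        if change.mean() < self.threshold:
            return prev_deep                 # reuse path: skip heavy compute
        return full_network(cur_shallow)     # full path: recompute features
```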
- Fast Video Object Segmentation With Temporal Aggregation Network and Dynamic Template Matching [67.02962970820505]
We introduce "tracking-by-detection" into Video Object (VOS)
We propose a new temporal aggregation network and a novel dynamic time-evolving template matching mechanism to achieve significantly improved performance.
We achieve new state-of-the-art performance on the DAVIS benchmark without complicated bells and whistles in both speed and accuracy, with a speed of 0.14 second per frame and J&F measure of 75.9% respectively.
arXiv Detail & Related papers (2020-07-11T05:44:16Z)
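One plausible reading of a "dynamic time-evolving template" is an exponential moving average of target features correlated against each new frame. A hedged sketch follows; the EMA update rule, momentum value, and function names are assumptions, not the paper's mechanism.

```python
import torch
import torch.nn.functional as F

def update_template(template, cur_target_feat, momentum=0.9):
    # Time-evolving template as an exponential moving average (assumed rule).
    return momentum * template + (1.0 - momentum) * cur_target_feat

def match_template(template, frame_feat):
    # template:   (C,)         pooled descriptor of the target object
    # frame_feat: (B, C, H, W) dense features of the current frame
    t = F.normalize(template, dim=0)
    f = F.normalize(frame_feat, dim=1)
    # Cosine-similarity response map; high values indicate the target.
    return torch.einsum("c,bchw->bhw", t, f)
```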