Joint Modeling of Feature, Correspondence, and a Compressed Memory for Video Object Segmentation
- URL: http://arxiv.org/abs/2308.13505v1
- Date: Fri, 25 Aug 2023 17:30:08 GMT
- Title: Joint Modeling of Feature, Correspondence, and a Compressed Memory for Video Object Segmentation
- Authors: Jiaming Zhang, Yutao Cui, Gangshan Wu, Limin Wang
- Abstract summary: Current prevailing Video Object Segmentation (VOS) methods usually perform dense matching between the current and reference frames after extracting features.
We propose a unified VOS framework, coined JointFormer, for joint modeling of the three elements of feature, correspondence, and a compressed memory.
- Score: 52.11279360934703
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current prevailing Video Object Segmentation (VOS) methods usually perform
dense matching between the current and reference frames after extracting their
features. On the one hand, this decoupled modeling restricts target information
propagation to the high-level feature space alone. On the other hand, the pixel-wise
matching leads to a lack of holistic understanding of the targets. To overcome
these issues, we propose a unified VOS framework, coined JointFormer, for
jointly modeling the three elements of feature, correspondence, and a compressed
memory. The core design is the Joint Block, which uses the flexibility of
attention to simultaneously extract features and propagate target information
to the current tokens and the compressed memory token. This scheme enables
extensive information propagation and discriminative feature learning. To
incorporate long-term temporal target information, we also devise a customized
online updating mechanism for the compressed memory token, which prompts
information flow along the temporal dimension and thus improves the global
modeling capability. Under this design, our method achieves new
state-of-the-art performance on the DAVIS 2017 val/test-dev (89.7% and 87.6%) and
YouTube-VOS 2018/2019 val (87.0% and 87.0%) benchmarks, outperforming existing
works by a large margin.
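For intuition, a minimal sketch of what such a Joint Block and its compressed
memory token could look like is given below, assuming a standard pre-norm
transformer layout. All names, shapes, and the EMA-style `update_memory` rule
are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class JointBlock(nn.Module):
    """Hypothetical block that jointly (i) refines current-frame tokens,
    (ii) propagates target information from reference-frame tokens, and
    (iii) reads and writes one compressed memory token, in a single
    attention pass. Shapes: cur_tokens (B, N_cur, C), ref_tokens
    (B, N_ref, C), mem_token (B, 1, C)."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, cur_tokens, ref_tokens, mem_token):
        # Queries: current tokens plus the memory token, so both get updated
        # in the same attention pass.
        q = torch.cat([cur_tokens, mem_token], dim=1)   # (B, N_cur + 1, C)
        # Keys/values additionally include reference tokens, letting target
        # information flow from past frames into current features and memory.
        kv = torch.cat([ref_tokens, q], dim=1)
        out, _ = self.attn(self.norm1(q), self.norm1(kv), self.norm1(kv))
        q = q + out
        q = q + self.mlp(self.norm2(q))
        n_cur = cur_tokens.shape[1]
        return q[:, :n_cur], q[:, n_cur:]               # tokens, updated memory


def update_memory(mem_token, new_mem, momentum: float = 0.9):
    # Stand-in for the paper's customized online update: a simple exponential
    # moving average over time (the momentum value here is arbitrary).
    return momentum * mem_token + (1.0 - momentum) * new_mem
```

Calling `update_memory` on the block's memory output after each frame is one
simple way to let target information flow along the temporal dimension, in the
spirit of the online updating mechanism the abstract describes.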
Related papers
- Learning Spatial-Semantic Features for Robust Video Object Segmentation [108.045326229865]
We propose a robust video object segmentation framework equipped with spatial-semantic features and discriminative object queries.
We show that the proposed method sets a new state-of-the-art performance on multiple datasets.
arXiv Detail & Related papers (2024-07-10T15:36:00Z)
- DeVOS: Flow-Guided Deformable Transformer for Video Object Segmentation [0.4487265603408873]
We present DeVOS (Deformable VOS), an architecture for Video Object Segmentation that combines memory-based matching with motion-guided propagation.
Our method achieves top-rank performance on DAVIS 2017 val and test-dev (88.1%, 83.0%) and YouTube-VOS 2019 val (86.6%).
arXiv Detail & Related papers (2024-05-11T14:57:22Z)
- Spatial-Temporal Multi-level Association for Video Object Segmentation [89.32226483171047]
This paper proposes spatial-temporal multi-level association, which jointly associates reference frame, test frame, and object features.
Specifically, we construct a spatial-temporal multi-level feature association module to learn better target-aware features.
arXiv Detail & Related papers (2024-04-09T12:44:34Z)
- Implicit Temporal Modeling with Learnable Alignment for Video Recognition [95.82093301212964]
We propose a novel Implicit Learnable Alignment (ILA) method, which minimizes the temporal modeling effort while still achieving very high performance.
ILA achieves a top-1 accuracy of 88.7% on Kinetics-400 with much fewer FLOPs compared with Swin-L and ViViT-H.
arXiv Detail & Related papers (2023-04-20T17:11:01Z)
- Look Before You Match: Instance Understanding Matters in Video Object Segmentation [114.57723592870097]
In this paper, we argue that instance understanding matters in video object segmentation (VOS).
We present a two-branch network for VOS, where the query-based instance segmentation (IS) branch delves into the instance details of the current frame and the VOS branch performs spatial-temporal matching with the memory bank.
We employ well-learned object queries from the IS branch to inject instance-specific information into the query key, with which instance-augmented matching is further performed.
arXiv Detail & Related papers (2022-12-13T18:59:59Z)
- MUNet: Motion Uncertainty-aware Semi-supervised Video Object Segmentation [31.100954335785026]
We advocate the return of motion information and propose a motion uncertainty-aware framework (MUNet) for semi-supervised video object segmentation.
We introduce a motion-aware spatial attention module to effectively fuse the motion feature with the semantic feature.
We achieve 76.5% $\mathcal{J}\&\mathcal{F}$ using only DAVIS17 for training, which significantly outperforms the SOTA methods under the low-data protocol.
arXiv Detail & Related papers (2021-11-29T16:01:28Z)
- Learning Dynamic Compact Memory Embedding for Deformable Visual Object Tracking [82.34356879078955]
We propose a compact memory embedding to enhance the discrimination of the segmentation-based deformable visual tracking method.
Our method outperforms excellent segmentation-based trackers, i.e., D3S and SiamMask, on the DAVIS 2017 benchmark.
arXiv Detail & Related papers (2021-11-23T03:07:12Z)
- PMVOS: Pixel-Level Matching-Based Video Object Segmentation [9.357153487612965]
Semi-supervised video object segmentation (VOS) aims to segment arbitrary target objects in video when the ground truth segmentation mask of the initial frame is provided.
Recent pixel-level matching (PM) has been widely used for feature matching because of its high performance.
We propose a novel method, PM-based video object segmentation (PMVOS), that constructs strong template features containing the information of all past frames (a generic sketch of such matching follows this list).
arXiv Detail & Related papers (2020-09-18T14:22:09Z)
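Several entries above, PMVOS in particular, build on pixel-level matching
between template features stored from past frames and the current frame. As a
generic illustration (the function name, shapes, and the cosine-similarity
choice are assumptions, not any specific paper's code), such matching can be
sketched as:

```python
import torch
import torch.nn.functional as F

def pixel_level_matching(template_feats: torch.Tensor,
                         current_feats: torch.Tensor) -> torch.Tensor:
    """template_feats: (N, C) pixel features collected from past frames.
    current_feats: (C, H, W) feature map of the current frame.
    Returns (N, H, W): one cosine-similarity map per template pixel."""
    C, H, W = current_feats.shape
    cur = F.normalize(current_feats.reshape(C, H * W), dim=0)  # unit norm per pixel
    tmp = F.normalize(template_feats, dim=1)                   # unit norm per template entry
    sim = tmp @ cur                                            # (N, HW) cosine similarities
    return sim.reshape(-1, H, W)
```

A segmentation head would then aggregate these similarity maps (for example, by
foreground/background max-pooling) into a mask prediction; the dense, per-pixel
nature of this matching is exactly what JointFormer's compressed memory token
is meant to complement with a holistic target representation.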