Joint Modeling of Feature, Correspondence, and a Compressed Memory for Video Object Segmentation
- URL: http://arxiv.org/abs/2308.13505v1
- Date: Fri, 25 Aug 2023 17:30:08 GMT
- Title: Joint Modeling of Feature, Correspondence, and a Compressed Memory for Video Object Segmentation
- Authors: Jiaming Zhang, Yutao Cui, Gangshan Wu, Limin Wang
- Abstract summary: Current prevailing Video Object Segmentation (VOS) methods usually perform dense matching between the current and reference frames after extracting features.
We propose a unified VOS framework, coined as JointFormer, for joint modeling of the three elements of feature, correspondence, and a compressed memory.
- Score: 52.11279360934703
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current prevailing Video Object Segmentation (VOS) methods usually perform dense matching between the current and reference frames after extracting their features. On the one hand, this decoupled modeling restricts target information propagation to the high-level feature space. On the other hand, the pixel-wise matching leads to a lack of holistic understanding of the targets. To overcome these issues, we propose a unified VOS framework, coined as JointFormer, for jointly modeling the three elements of feature, correspondence, and a compressed memory. The core design is the Joint Block, which utilizes the flexibility of attention to simultaneously extract features and propagate target information to the current tokens and the compressed memory token. This scheme enables extensive information propagation and discriminative feature learning. To incorporate long-term temporal target information, we also devise a customized online updating mechanism for the compressed memory token, which promotes information flow along the temporal dimension and thus improves the global modeling capability. Under this design, our method achieves new state-of-the-art performance on the DAVIS 2017 val/test-dev (89.7% and 87.6%) and YouTube-VOS 2018/2019 val (87.0% and 87.0%) benchmarks, outperforming existing works by a large margin.
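To make the joint-modeling idea concrete, below is a minimal PyTorch sketch of how a Joint Block and the online memory update could look. Everything here is an assumption for illustration (module layout, dimensions, and the EMA-style update rule); it is not the authors' implementation. The key point it shows: the current frame's tokens and a single compressed memory token are updated in one attention pass that also reads the reference-frame tokens, so feature extraction and correspondence happen together.

```python
import torch
import torch.nn as nn

class JointBlock(nn.Module):
    """Minimal sketch of joint feature/correspondence/memory modeling.

    Current-frame tokens and one compressed memory token attend jointly to
    [current tokens, reference tokens, memory token], so feature learning
    and target-information propagation share a single attention operation.
    (Hypothetical layout; not the paper's actual module.)
    """

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, cur_tokens, ref_tokens, mem_token):
        # cur_tokens: (B, N, C); ref_tokens: (B, M, C); mem_token: (B, 1, C)
        q = torch.cat([cur_tokens, mem_token], dim=1)
        # Keys/values include the reference tokens, so target information
        # flows into both the current tokens and the compressed memory.
        kv = torch.cat([cur_tokens, ref_tokens, mem_token], dim=1)
        x = q + self.attn(self.norm1(q), self.norm1(kv), self.norm1(kv))[0]
        x = x + self.mlp(self.norm2(x))
        return x[:, :-1], x[:, -1:]  # updated current tokens, memory token

def update_memory(mem_token, new_mem_token, momentum: float = 0.9):
    """EMA stand-in for the paper's customized online update (assumption)."""
    return momentum * mem_token + (1.0 - momentum) * new_mem_token
```

In a full model, a stack of such blocks would replace the usual extract-then-match pipeline, with `update_memory` applied once per segmented frame to carry long-term target information forward.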
Related papers
- Scoring, Remember, and Reference: Catching Camouflaged Objects in Videos [24.03405963900272]
Video Camouflaged Object Detection aims to segment objects whose appearances closely resemble their surroundings.
Existing vision models often struggle in such scenarios due to the indistinguishable appearance of camouflaged objects.
We propose an end-to-end framework inspired by human memory-recognition.
arXiv Detail & Related papers (2025-03-21T11:08:14Z)
- Spatio-temporal Graph Learning on Adaptive Mined Key Frames for High-performance Multi-Object Tracking [5.746443489229576]
The Key Frame Extraction (KFE) module leverages reinforcement learning to adaptively segment videos.
The Intra-Frame Feature Fusion (IFF) module uses a Graph Convolutional Network (GCN) to facilitate information exchange between the target and surrounding objects.
Our proposed tracker achieves impressive results on the MOT17 dataset.
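As a rough illustration of the IFF idea, the hypothetical sketch below performs one round of graph convolution over per-object node features in a frame, letting the target exchange information with surrounding objects. The fully connected adjacency and dimensions are assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class IntraFrameGCN(nn.Module):
    """One graph-convolution round over object nodes in a single frame.

    Hypothetical sketch of an IFF-style module: node features for the
    target and surrounding objects exchange information through a
    row-normalized adjacency matrix (assumed fully connected here).
    """

    def __init__(self, dim: int = 128):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, node_feats: torch.Tensor) -> torch.Tensor:
        # node_feats: (num_objects, dim); node 0 is the tracked target.
        n = node_feats.size(0)
        adj = torch.ones(n, n, device=node_feats.device)  # fully connected
        adj = adj / adj.sum(dim=1, keepdim=True)          # row-normalize
        # Aggregate neighbor features, then transform (A @ X @ W).
        return torch.relu(self.proj(adj @ node_feats))
```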
arXiv Detail & Related papers (2025-01-17T11:36:38Z)
- Learning Spatial-Semantic Features for Robust Video Object Segmentation [108.045326229865]
We propose a robust video object segmentation framework equipped with spatial-semantic features and discriminative object queries.
We show that the proposed method sets new state-of-the-art performance on multiple datasets.
arXiv Detail & Related papers (2024-07-10T15:36:00Z)
- DeVOS: Flow-Guided Deformable Transformer for Video Object Segmentation [0.4487265603408873]
We present DeVOS (Deformable VOS), an architecture for Video Object Segmentation that combines memory-based matching with motion-guided propagation.
Our method achieves top-rank performance on DAVIS 2017 val and test-dev (88.1% and 83.0%) and YouTube-VOS 2019 val (86.6%).
arXiv Detail & Related papers (2024-05-11T14:57:22Z)
- Spatial-Temporal Multi-level Association for Video Object Segmentation [89.32226483171047]
This paper proposes spatial-temporal multi-level association, which jointly associates reference frame, test frame, and object features.
Specifically, we construct a spatial-temporal multi-level feature association module to learn better target-aware features.
arXiv Detail & Related papers (2024-04-09T12:44:34Z)
- Temporally Consistent Referring Video Object Segmentation with Hybrid Memory [98.80249255577304]
We propose an end-to-end R-VOS paradigm that explicitly models temporal consistency alongside the referring segmentation.
Features of frames with automatically generated high-quality reference masks are propagated to segment remaining frames.
Extensive experiments demonstrate that our approach enhances temporal consistency by a significant margin.
arXiv Detail & Related papers (2024-03-28T13:32:49Z)
- Deep Common Feature Mining for Efficient Video Semantic Segmentation [25.851900402539467]
We present Deep Common Feature Mining (DCFM) for video semantic segmentation.
DCFM explicitly decomposes features into two complementary components.
We incorporate a self-supervised loss function to reinforce intra-class feature similarity and enhance temporal consistency.
arXiv Detail & Related papers (2024-03-05T06:17:59Z)
- Multi-grained Temporal Prototype Learning for Few-shot Video Object Segmentation [156.4142424784322]
Few-Shot Video Object Segmentation (FSVOS) aims to segment objects in a query video with the same category defined by a few annotated support images.
We propose to leverage multi-grained temporal guidance information for handling the temporal correlation nature of video data.
Our proposed video IPMT model significantly outperforms previous models on two benchmark datasets.
arXiv Detail & Related papers (2023-09-20T09:16:34Z)
- Implicit Temporal Modeling with Learnable Alignment for Video Recognition [95.82093301212964]
We propose a novel Implicit Learnable Alignment (ILA) method, which minimizes the temporal modeling effort while achieving incredibly high performance.
ILA achieves a top-1 accuracy of 88.7% on Kinetics-400 with much fewer FLOPs compared with Swin-L and ViViT-H.
arXiv Detail & Related papers (2023-04-20T17:11:01Z)
- Look Before You Match: Instance Understanding Matters in Video Object Segmentation [114.57723592870097]
In this paper, we argue that instance understanding matters in video object segmentation (VOS).
We present a two-branch network for VOS, where the query-based instance segmentation (IS) branch delves into the instance details of the current frame and the VOS branch performs spatial-temporal matching with the memory bank.
We employ well-learned object queries from the IS branch to inject instance-specific information into the query key, with which instance-augmented matching is further performed.
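A loose sketch of that injection step: pooled object queries from the IS branch are fused into the current frame's matching keys, so the memory affinity becomes instance-aware. The additive fusion, pooling, and shapes are assumptions for illustration only, not the paper's actual design.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: inject IS-branch object queries into matching keys.
def instance_augmented_affinity(query_key, memory_key, object_queries, proj):
    # query_key:      (HW, C) keys of the current frame
    # memory_key:     (T*HW, C) keys stored in the memory bank
    # object_queries: (num_inst, C) from the instance-segmentation branch
    # proj:           a learned nn.Linear(C, C) fusing instance info into keys
    inst = object_queries.mean(dim=0, keepdim=True)   # pool instance info
    aug_query = query_key + proj(inst)                # instance-aware keys
    # Dense affinity between current-frame and memory keys.
    return torch.softmax(
        aug_query @ memory_key.t() / query_key.size(1) ** 0.5, dim=1
    )
```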
arXiv Detail & Related papers (2022-12-13T18:59:59Z)
- MUNet: Motion Uncertainty-aware Semi-supervised Video Object Segmentation [31.100954335785026]
We advocate the return of motion information and propose a motion uncertainty-aware framework (MUNet) for semi-supervised video object segmentation.
We introduce a motion-aware spatial attention module to effectively fuse the motion feature with the semantic feature.
We achieve 76.5% $\mathcal{J}\&\mathcal{F}$ using only DAVIS17 for training, which significantly outperforms the SOTA methods under the low-data protocol.
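One plausible (assumed) form of a motion-aware spatial attention module: the motion feature is collapsed into a per-pixel gate that reweights the semantic feature. This is a sketch of the general idea, not MUNet's actual module.

```python
import torch
import torch.nn as nn

class MotionAwareSpatialAttention(nn.Module):
    """Minimal sketch: a motion feature gates the semantic feature spatially."""

    def __init__(self, motion_ch: int = 64, sem_ch: int = 256):
        super().__init__()
        # 1x1 conv collapses the motion feature into a per-pixel gate.
        self.gate = nn.Conv2d(motion_ch, 1, kernel_size=1)

    def forward(self, motion_feat, sem_feat):
        # motion_feat: (B, motion_ch, H, W); sem_feat: (B, sem_ch, H, W)
        attn = torch.sigmoid(self.gate(motion_feat))  # (B, 1, H, W)
        return sem_feat + sem_feat * attn             # residual gating
```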
arXiv Detail & Related papers (2021-11-29T16:01:28Z)
- Learning Dynamic Compact Memory Embedding for Deformable Visual Object Tracking [82.34356879078955]
We propose a compact memory embedding to enhance the discrimination of the segmentation-based deformable visual tracking method.
Our method outperforms strong segmentation-based trackers, i.e., D3S and SiamMask, on the DAVIS 2017 benchmark.
arXiv Detail & Related papers (2021-11-23T03:07:12Z)
- CompFeat: Comprehensive Feature Aggregation for Video Instance Segmentation [67.17625278621134]
Video instance segmentation is a complex task in which we need to detect, segment, and track each object for any given video.
Previous approaches only utilize single-frame features for the detection, segmentation, and tracking of objects.
We propose a novel comprehensive feature aggregation approach (CompFeat) to refine features at both frame-level and object-level with temporal and spatial context information.
arXiv Detail & Related papers (2020-12-07T00:31:42Z)
- PMVOS: Pixel-Level Matching-Based Video Object Segmentation [9.357153487612965]
Semi-supervised video object segmentation (VOS) aims to segment arbitrary target objects in video when the ground truth segmentation mask of the initial frame is provided.
Recent pixel-level matching (PM) has been widely used for feature matching because of its high performance.
We propose a novel method, PM-based video object segmentation (PMVOS), that constructs strong template features containing the information of all past frames.
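Pixel-level matching of this kind boils down to computing similarities between every query-frame pixel embedding and template pixel embeddings collected from past frames. A minimal, assumed version (not PMVOS itself):

```python
import torch
import torch.nn.functional as F

def pixel_level_matching(query_feat, template_feat):
    """Minimal sketch of PM-style matching (assumed form).

    query_feat:    (C, H, W) embedding of the current frame
    template_feat: (C, N) pixel embeddings collected from past frames
    Returns a (N, H, W) stack of cosine-similarity maps.
    """
    c, h, w = query_feat.shape
    q = F.normalize(query_feat.reshape(c, h * w), dim=0)  # (C, HW)
    t = F.normalize(template_feat, dim=0)                 # (C, N)
    sim = t.t() @ q                                       # (N, HW)
    return sim.reshape(-1, h, w)
```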
arXiv Detail & Related papers (2020-09-18T14:22:09Z)
- Fast Video Object Segmentation With Temporal Aggregation Network and Dynamic Template Matching [67.02962970820505]
We introduce "tracking-by-detection" into Video Object (VOS)
We propose a new temporal aggregation network and a novel dynamic time-evolving template matching mechanism to achieve significantly improved performance.
We achieve new state-of-the-art performance on the DAVIS benchmark in both speed and accuracy without complicated bells and whistles, running at 0.14 seconds per frame with a J&F measure of 75.9%.
arXiv Detail & Related papers (2020-07-11T05:44:16Z)