Boosting Video Object Segmentation via Space-time Correspondence
Learning
- URL: http://arxiv.org/abs/2304.06211v1
- Date: Thu, 13 Apr 2023 01:34:44 GMT
- Title: Boosting Video Object Segmentation via Space-time Correspondence
Learning
- Authors: Yurong Zhang, Liulei Li, Wenguan Wang, Rong Xie, Li Song, Wenjun Zhang
- Abstract summary: Current solutions for video object segmentation (VOS) typically follow a matching-based regime.
We devise a correspondence-aware training framework, which boosts matching-based VOS solutions by explicitly encouraging robust correspondence matching.
Our algorithm provides solid performance gains on four widely used benchmarks.
- Score: 48.8275459383339
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current top-leading solutions for video object segmentation (VOS) typically
follow a matching-based regime: for each query frame, the segmentation mask is
inferred according to its correspondence to previously processed frames and the
first annotated frame. They simply exploit the supervisory signals from the
ground-truth masks for learning mask prediction only, without posing any
constraint on the space-time correspondence matching, which, however, is the
fundamental building block of such regime. To alleviate this crucial yet
commonly ignored issue, we devise a correspondence-aware training framework,
which boosts matching-based VOS solutions by explicitly encouraging robust
correspondence matching during network learning. Through comprehensively
exploring the intrinsic coherence in videos on pixel and object levels, our
algorithm reinforces the standard, fully supervised training of mask
segmentation with label-free, contrastive correspondence learning. Without
requiring extra annotation cost during training, causing speed delays during
deployment, or incurring architectural modifications, our algorithm provides
solid performance gains on four widely used benchmarks, i.e., DAVIS2016&2017
and YouTube-VOS2018&2019, on top of famous matching-based VOS solutions.
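As a rough illustration of the training recipe described in the abstract (not the authors' released code), the sketch below augments a standard supervised mask loss with a label-free correspondence term computed between two frames of the same clip. The model interface, the cycle-consistency form of the auxiliary loss, and the weighting alpha are assumptions for illustration only.
```python
# Hypothetical sketch: the standard supervised mask loss is reinforced with a
# label-free correspondence term between two frames of one video. All names
# (vos_model, its outputs, alpha) are placeholders, not the paper's actual API.
import torch
import torch.nn.functional as F

def cycle_correspondence_loss(feat_a, feat_b, temperature=0.07):
    """feat_a, feat_b: (B, C, H, W) pixel embeddings of two frames of one video."""
    B, C, H, W = feat_a.shape
    a = F.normalize(feat_a.flatten(2), dim=1)                       # (B, C, HW)
    b = F.normalize(feat_b.flatten(2), dim=1)
    aff_ab = torch.softmax(torch.einsum('bci,bcj->bij', a, b) / temperature, dim=2)
    aff_ba = torch.softmax(torch.einsum('bci,bcj->bij', b, a) / temperature, dim=2)
    round_trip = torch.bmm(aff_ab, aff_ba)                          # a -> b -> a, (B, HW, HW)
    target = torch.arange(H * W, device=feat_a.device).expand(B, H * W)
    # Cycle consistency: each pixel should map back to itself after the round trip.
    return F.nll_loss(torch.log(round_trip + 1e-8).permute(0, 2, 1), target)

def training_step(vos_model, frames, gt_masks, alpha=0.1):
    """frames: (B, T, 3, H, W); gt_masks: (B, H, W) integer object ids."""
    pred_logits, pixel_feats = vos_model(frames)                    # hypothetical interface
    seg_loss = F.cross_entropy(pred_logits, gt_masks)               # fully supervised term
    corr_loss = cycle_correspondence_loss(pixel_feats[:, 0], pixel_feats[:, -1])
    return seg_loss + alpha * corr_loss                             # label-free auxiliary term
```
Since the correspondence term needs only two frames of the same unlabeled clip, it adds no annotation cost and leaves the inference path of the base VOS model untouched.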
Related papers
- One-shot Training for Video Object Segmentation [11.52321103793505]
Video Object Segmentation (VOS) aims to track objects across frames in a video and segment them based on the initial annotated frame of the target objects.
Previous VOS works typically rely on fully annotated videos for training.
We propose a general one-shot training framework for VOS, requiring only a single labeled frame per training video.
arXiv Detail & Related papers (2024-05-22T21:37:08Z)
- Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation [76.68301884987348]
We propose a simple yet effective approach for self-supervised video object segmentation (VOS).
Our key insight is that the inherent structural dependencies present in DINO-pretrained Transformers can be leveraged to establish robust spatio-temporal segmentation correspondences in videos.
Our method demonstrates state-of-the-art performance across multiple unsupervised VOS benchmarks and excels in complex real-world multi-object video segmentation tasks.
arXiv Detail & Related papers (2023-11-29T18:47:17Z)
- Temporal-aware Hierarchical Mask Classification for Video Semantic Segmentation [62.275143240798236]
Video semantic segmentation datasets have a limited number of categories per video, so less than 10% of queries can be matched to receive meaningful gradient updates during VSS training.
Our method achieves state-of-the-art performance on the latest challenging VSS benchmark VSPW without bells and whistles.
arXiv Detail & Related papers (2023-09-14T20:31:06Z)
- Unified Mask Embedding and Correspondence Learning for Self-Supervised Video Segmentation [76.40565872257709]
We develop a unified framework that models cross-frame dense correspondence for locally discriminative feature learning.
It is able to directly learn to perform mask-guided sequential segmentation from unlabeled videos.
Our algorithm sets the state of the art on two standard benchmarks (i.e., DAVIS17 and YouTube-VOS).
arXiv Detail & Related papers (2023-03-17T16:23:36Z)
- CenterCLIP: Token Clustering for Efficient Text-Video Retrieval [67.21528544724546]
In CLIP, the essential visual tokenization process, which produces discrete visual token sequences, generates many homogeneous tokens due to the redundant nature of consecutive frames in videos.
This significantly increases computation costs and hinders the deployment of video retrieval models in web applications.
In this paper, we design a multi-segment token clustering algorithm to find the most representative tokens and drop the non-essential ones.
arXiv Detail & Related papers (2022-05-02T12:02:09Z)
- Learning by Aligning Videos in Time [10.075645944474287]
We present a self-supervised approach for learning video representations using temporal video alignment as a pretext task.
We leverage a novel combination of temporal alignment loss and temporal regularization terms, which can be used as supervision signals for training an encoder network.
arXiv Detail & Related papers (2021-03-31T17:55:52Z)
- Learning Dynamic Network Using a Reuse Gate Function in Semi-supervised Video Object Segmentation [27.559093073097483]
Current approaches for semi-supervised video object segmentation (Semi-VOS) propagate information from previous frames to generate a segmentation mask for the current frame.
We use temporal information to quickly identify frames with minimal change.
We propose a novel dynamic network that estimates change across frames and decides which path to take: running the full network or reusing the previous frame's features (a minimal sketch of this gating idea appears after this list).
arXiv Detail & Related papers (2020-12-21T19:40:17Z)
- Spatiotemporal Graph Neural Network based Mask Reconstruction for Video Object Segmentation [70.97625552643493]
This paper addresses the task of segmenting class-agnostic objects in a semi-supervised setting.
We propose a novel graph neural network (TG-Net) which captures the local contexts by utilizing all proposals.
arXiv Detail & Related papers (2020-12-10T07:57:44Z)
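To make the reuse-gate idea from the dynamic-network entry above concrete, here is a minimal sketch of gated frame-wise inference: a cheap per-frame change estimate decides whether to run the full segmentation network or to reuse cached features. The change measure, the threshold, and the model interface (full_forward, light_update) are illustrative assumptions, not the cited paper's implementation.
```python
# Hypothetical sketch of gated inference: run the expensive path only when the
# estimated change versus the previous frame exceeds a threshold; otherwise
# reuse cached features. Interfaces and the change measure are assumptions.
import torch
import torch.nn.functional as F

@torch.no_grad()
def segment_video(model, frames, change_threshold=0.05):
    """frames: (T, 3, H, W); model exposes full_forward() and light_update() (assumed)."""
    masks, cached_feats, prev_frame = [], None, None
    for frame in frames:
        if prev_frame is None:
            change = float('inf')                            # always run the full path first
        else:
            # Cheap change estimate: mean absolute difference of downsampled frames.
            change = (F.avg_pool2d(frame.unsqueeze(0), 8)
                      - F.avg_pool2d(prev_frame.unsqueeze(0), 8)).abs().mean().item()
        if change > change_threshold:
            mask, cached_feats = model.full_forward(frame)   # expensive path, refresh cache
        else:
            mask = model.light_update(frame, cached_feats)   # reuse previous frame's features
        masks.append(mask)
        prev_frame = frame
    return masks
```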