Unified Mask Embedding and Correspondence Learning for Self-Supervised
Video Segmentation
- URL: http://arxiv.org/abs/2303.10100v1
- Date: Fri, 17 Mar 2023 16:23:36 GMT
- Title: Unified Mask Embedding and Correspondence Learning for Self-Supervised
Video Segmentation
- Authors: Liulei Li, Wenguan Wang, Tianfei Zhou, Jianwu Li, Yi Yang
- Abstract summary: We develop a unified framework which simultaneously models cross-frame dense correspondence for locally discriminative feature learning and embeds object-level context for target-mask decoding.
It is able to directly learn to perform mask-guided sequential segmentation from unlabeled videos.
Our algorithm sets the state of the art on two standard benchmarks (DAVIS17 and YouTube-VOS).
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The objective of this paper is self-supervised learning of video object
segmentation. We develop a unified framework which simultaneously models
cross-frame dense correspondence for locally discriminative feature learning
and embeds object-level context for target-mask decoding. As a result, it is
able to directly learn to perform mask-guided sequential segmentation from
unlabeled videos, in contrast to previous efforts usually relying on an oblique
solution - cheaply "copying" labels according to pixel-wise correlations.
Concretely, our algorithm alternates between i) clustering video pixels for
creating pseudo segmentation labels ex nihilo; and ii) utilizing the pseudo
labels to learn mask encoding and decoding for VOS. Unsupervised correspondence
learning is further incorporated into this self-taught, mask embedding scheme,
so as to ensure the generic nature of the learnt representation and avoid
cluster degeneracy. Our algorithm sets the state of the art on two standard
benchmarks (DAVIS17 and YouTube-VOS), narrowing the gap between self- and
fully-supervised VOS, in terms of both performance and network architecture
design.
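The alternating scheme described in the abstract (cluster video pixels into pseudo segments, then use those pseudo labels as training targets) can be sketched in a few lines. This is a minimal illustration only: the feature dimensions, the plain k-means clusterer, and the random `features` array are assumptions standing in for dense per-pixel features from a network, not the paper's actual implementation.

```python
import numpy as np

def kmeans(x, k, iters=20, seed=0):
    """Plain k-means over the rows of x; returns per-row cluster assignments."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center (squared Euclidean distance)
        d = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)
        # recompute centers, keeping the old center if a cluster empties
        for j in range(k):
            if (assign == j).any():
                centers[j] = x[assign == j].mean(0)
    return assign

# Step i): cluster per-pixel features of a frame into K pseudo segments.
H, W, C, K = 8, 8, 16, 3                       # tiny illustrative sizes
rng = np.random.default_rng(1)
features = rng.normal(size=(H, W, C))          # stand-in for dense CNN features
pseudo_mask = kmeans(features.reshape(-1, C), K).reshape(H, W)

# Step ii) would then treat `pseudo_mask` as the target for training a mask
# encoder/decoder on unlabeled frames (the network itself is omitted here).
print(pseudo_mask.shape)
```

In the paper's formulation, the clustering and the mask-decoding steps alternate, with correspondence learning added to keep the features generic and avoid cluster degeneracy; the sketch only shows the pseudo-label creation half of that loop.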
Related papers
- Towards Open-Vocabulary Semantic Segmentation Without Semantic Labels
We propose a novel method, PixelCLIP, to adapt the CLIP image encoder for pixel-level understanding.
To address the challenges of leveraging masks without semantic labels, we devise an online clustering algorithm.
PixelCLIP shows significant performance improvements over CLIP and competitive results compared to caption-supervised methods.
arXiv Detail & Related papers (2024-09-30T01:13:03Z)
- Pseudo Labelling for Enhanced Masked Autoencoders
We propose an enhanced approach that boosts Masked Autoencoders (MAE) performance by integrating pseudo labelling for both class and data tokens.
This strategy uses cluster assignments as pseudo labels to promote instance-level discrimination within the network.
Incorporating pseudo-labelling as an auxiliary task yields notable improvements on ImageNet-1K and other downstream tasks.
arXiv Detail & Related papers (2024-06-25T10:41:45Z)
- Boosting Video Object Segmentation via Space-time Correspondence Learning
Current solutions for video object segmentation (VOS) typically follow a matching-based regime.
We devise a correspondence-aware training framework, which boosts matching-based VOS solutions by explicitly encouraging robust correspondence matching.
Our algorithm provides solid performance gains on four widely used benchmarks.
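The matching-based regime these works build on is the same "copying labels according to pixel-wise correlations" idea mentioned in the abstract above: target-frame pixels take their labels from reference-frame pixels with similar features. A minimal sketch follows; the feature vectors, temperature, and hard argmax readout are illustrative assumptions, not any specific paper's method.

```python
import numpy as np

def propagate_labels(ref_feat, ref_labels, tgt_feat, num_classes, temp=0.07):
    """Propagate reference-frame labels to a target frame by soft
    pixel-wise feature matching (a generic matching-based VOS baseline)."""
    # l2-normalize features so dot products are cosine similarities
    r = ref_feat / np.linalg.norm(ref_feat, axis=1, keepdims=True)
    t = tgt_feat / np.linalg.norm(tgt_feat, axis=1, keepdims=True)
    affinity = t @ r.T / temp                    # (n_tgt, n_ref)
    affinity -= affinity.max(1, keepdims=True)   # numerical stability
    w = np.exp(affinity)
    w /= w.sum(1, keepdims=True)                 # softmax over reference pixels
    onehot = np.eye(num_classes)[ref_labels]     # (n_ref, num_classes)
    return (w @ onehot).argmax(1)                # "copied" hard labels

# toy example: 2 labeled reference pixels, 3 unlabeled target pixels
ref_feat = np.array([[1.0, 0.0], [0.0, 1.0]])
ref_labels = np.array([0, 1])
tgt_feat = np.array([[0.9, 0.1], [0.1, 0.9], [1.0, 0.0]])
print(propagate_labels(ref_feat, ref_labels, tgt_feat, num_classes=2))
# prints [0 1 0]
```

The unified-framework paper above argues this label-copying shortcut is oblique because nothing in it learns to encode or decode masks; its mask-embedding scheme trains that capability directly.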
arXiv Detail & Related papers (2023-04-13T01:34:44Z)
- Towards Robust Video Object Segmentation with Adaptive Object Calibration
Video object segmentation (VOS) aims at segmenting objects in all target frames of a video, given annotated object masks of reference frames.
We propose a new deep network, which can adaptively construct object representations and calibrate object masks to achieve stronger robustness.
Our model achieves the state-of-the-art performance among existing published works, and also exhibits superior robustness against perturbations.
arXiv Detail & Related papers (2022-07-02T17:51:29Z)
- Locality-Aware Inter- and Intra-Video Reconstruction for Self-Supervised Correspondence Learning
We develop LIIR, a locality-aware inter-and intra-video reconstruction framework.
We exploit cross video affinities as extra negative samples within a unified, inter-and intra-video reconstruction scheme.
arXiv Detail & Related papers (2022-03-27T15:46:42Z)
- Contrastive Transformation for Self-supervised Correspondence Learning
We study the self-supervised learning of visual correspondence using unlabeled videos in the wild.
Our method simultaneously considers intra- and inter-video representation associations for reliable correspondence estimation.
Our framework outperforms the recent self-supervised correspondence methods on a range of visual tasks.
arXiv Detail & Related papers (2020-12-09T14:05:06Z)
- Unsupervised Learning of Video Representations via Dense Trajectory Clustering
This paper addresses the task of unsupervised learning of representations for action recognition in videos.
We first propose to adapt two top performing objectives in this class - instance recognition and local aggregation.
We observe promising performance, but qualitative analysis shows that the learned representations fail to capture motion patterns.
arXiv Detail & Related papers (2020-06-28T22:23:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.