Unified Mask Embedding and Correspondence Learning for Self-Supervised
Video Segmentation
- URL: http://arxiv.org/abs/2303.10100v1
- Date: Fri, 17 Mar 2023 16:23:36 GMT
- Title: Unified Mask Embedding and Correspondence Learning for Self-Supervised
Video Segmentation
- Authors: Liulei Li, Wenguan Wang, Tianfei Zhou, Jianwu Li, Yi Yang
- Abstract summary: We develop a unified framework which simultaneously models cross-frame dense correspondence for locally discriminative feature learning.
It is able to directly learn to perform mask-guided sequential segmentation from unlabeled videos.
Our algorithm sets state-of-the-arts on two standard benchmarks (i.e., DAVIS17 and YouTube-VOS)
- Score: 76.40565872257709
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The objective of this paper is self-supervised learning of video object
segmentation. We develop a unified framework which simultaneously models
cross-frame dense correspondence for locally discriminative feature learning
and embeds object-level context for target-mask decoding. As a result, it is
able to directly learn to perform mask-guided sequential segmentation from
unlabeled videos, in contrast to previous efforts usually relying on an oblique
solution - cheaply "copying" labels according to pixel-wise correlations.
Concretely, our algorithm alternates between i) clustering video pixels for
creating pseudo segmentation labels ex nihilo; and ii) utilizing the pseudo
labels to learn mask encoding and decoding for VOS. Unsupervised correspondence
learning is further incorporated into this self-taught, mask embedding scheme,
so as to ensure the generic nature of the learnt representation and avoid
cluster degeneracy. Our algorithm sets state-of-the-arts on two standard
benchmarks (i.e., DAVIS17 and YouTube-VOS), narrowing the gap between self- and
fully-supervised VOS, in terms of both performance and network architecture
design.
Related papers
- Pseudo Labelling for Enhanced Masked Autoencoders [27.029542823306866]
We propose an enhanced approach that boosts Masked Autoencoders (MAE) performance by integrating pseudo labelling for both class and data tokens.
This strategy uses cluster assignments as pseudo labels to promote instance-level discrimination within the network.
We show that incorporating pseudo-labelling as an auxiliary task has demonstrated notable improvements in ImageNet-1K and other downstream tasks.
arXiv Detail & Related papers (2024-06-25T10:41:45Z) - Boosting Semantic Segmentation from the Perspective of Explicit Class
Embeddings [19.997929884477628]
We explore the mechanism of class embeddings and have an insight that more explicit and meaningful class embeddings can be generated based on class masks purposely.
We propose ECENet, a new segmentation paradigm, in which class embeddings are obtained and enhanced explicitly during interacting with multi-stage image features.
Our ECENet outperforms its counterparts on the ADE20K dataset with much less computational cost and achieves new state-of-the-art results on PASCAL-Context dataset.
arXiv Detail & Related papers (2023-08-24T16:16:10Z) - Boosting Video Object Segmentation via Space-time Correspondence
Learning [48.8275459383339]
Current solutions for video object segmentation (VOS) typically follow a matching-based regime.
We devise a correspondence-aware training framework, which boosts matching-based VOS solutions by explicitly encouraging robust correspondence matching.
Our algorithm provides solid performance gains on four widely used benchmarks.
arXiv Detail & Related papers (2023-04-13T01:34:44Z) - Towards Robust Video Object Segmentation with Adaptive Object
Calibration [18.094698623128146]
Video object segmentation (VOS) aims at segmenting objects in all target frames of a video, given annotated object masks of reference frames.
We propose a new deep network, which can adaptively construct object representations and calibrate object masks to achieve stronger robustness.
Our model achieves the state-of-the-art performance among existing published works, and also exhibits superior robustness against perturbations.
arXiv Detail & Related papers (2022-07-02T17:51:29Z) - CenterCLIP: Token Clustering for Efficient Text-Video Retrieval [67.21528544724546]
In CLIP, the essential visual tokenization process, which produces discrete visual token sequences, generates many homogeneous tokens due to the redundancy nature of consecutive frames in videos.
This significantly increases computation costs and hinders the deployment of video retrieval models in web applications.
In this paper, we design a multi-segment token clustering algorithm to find the most representative tokens and drop the non-essential ones.
arXiv Detail & Related papers (2022-05-02T12:02:09Z) - Locality-Aware Inter-and Intra-Video Reconstruction for Self-Supervised
Correspondence Learning [74.03651142051656]
We develop LIIR, a locality-aware inter-and intra-video reconstruction framework.
We exploit cross video affinities as extra negative samples within a unified, inter-and intra-video reconstruction scheme.
arXiv Detail & Related papers (2022-03-27T15:46:42Z) - Contrastive Transformation for Self-supervised Correspondence Learning [120.62547360463923]
We study the self-supervised learning of visual correspondence using unlabeled videos in the wild.
Our method simultaneously considers intra- and inter-video representation associations for reliable correspondence estimation.
Our framework outperforms the recent self-supervised correspondence methods on a range of visual tasks.
arXiv Detail & Related papers (2020-12-09T14:05:06Z) - Unsupervised Learning of Video Representations via Dense Trajectory
Clustering [86.45054867170795]
This paper addresses the task of unsupervised learning of representations for action recognition in videos.
We first propose to adapt two top performing objectives in this class - instance recognition and local aggregation.
We observe promising performance, but qualitative analysis shows that the learned representations fail to capture motion patterns.
arXiv Detail & Related papers (2020-06-28T22:23:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.