Related papers: Unified Mask Embedding and Correspondence Learning for Self-Supervised Video Segmentation

URL: http://arxiv.org/abs/2303.10100v1
Date: Fri, 17 Mar 2023 16:23:36 GMT
Title: Unified Mask Embedding and Correspondence Learning for Self-Supervised Video Segmentation
Authors: Liulei Li, Wenguan Wang, Tianfei Zhou, Jianwu Li, Yi Yang
Abstract summary: We develop a unified framework which simultaneously models cross-frame dense correspondence for locally discriminative feature learning and embeds object-level context for target-mask decoding. It learns to perform mask-guided sequential segmentation directly from unlabeled videos. The algorithm sets a new state of the art on two standard benchmarks (DAVIS17 and YouTube-VOS).
Score: 76.40565872257709
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The objective of this paper is self-supervised learning of video object segmentation (VOS). We develop a unified framework which simultaneously models cross-frame dense correspondence for locally discriminative feature learning and embeds object-level context for target-mask decoding. As a result, it is able to directly learn to perform mask-guided sequential segmentation from unlabeled videos, in contrast to previous efforts, which usually rely on an oblique solution: cheaply "copying" labels according to pixel-wise correlations. Concretely, our algorithm alternates between i) clustering video pixels to create pseudo segmentation labels ex nihilo; and ii) utilizing the pseudo labels to learn mask encoding and decoding for VOS. Unsupervised correspondence learning is further incorporated into this self-taught mask embedding scheme, so as to ensure the generic nature of the learnt representation and avoid cluster degeneracy. Our algorithm sets a new state of the art on two standard benchmarks (i.e., DAVIS17 and YouTube-VOS), narrowing the gap between self- and fully-supervised VOS in terms of both performance and network architecture design.
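
The two alternating steps named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the per-pixel features are synthetic and fixed, plain k-means stands in for the clustering step, a linear softmax classifier stands in for the mask decoder, and the correspondence-learning term is omitted. All function names (make_toy_frame, kmeans_pseudo_labels, fit_linear_decoder, run_demo) are hypothetical.

```python
import numpy as np


def make_toy_frame(height=8, width=8, seed=0):
    """Hypothetical per-pixel embeddings: left half 'background', right half 'object'."""
    rng = np.random.default_rng(seed)
    feats = rng.normal(0.0, 0.1, size=(height, width, 2))
    feats[:, width // 2:, :] += 5.0   # shift the right half to form a second blob
    return feats.reshape(-1, 2)       # flatten to (num_pixels, feat_dim)


def assign(feats, centers):
    """Index of the nearest center for every pixel feature."""
    d2 = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)


def kmeans_pseudo_labels(feats, k, iters=10, seed=0):
    """Step i (stand-in): cluster pixels to create pseudo segmentation labels."""
    rng = np.random.default_rng(seed)
    centers = feats[rng.choice(len(feats), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = assign(feats, centers)
        for c in range(k):
            members = feats[labels == c]
            if len(members):          # keep the old center if a cluster is empty
                centers[c] = members.mean(axis=0)
    return assign(feats, centers)


def fit_linear_decoder(feats, labels, k, lr=0.1, steps=100):
    """Step ii (stand-in): train a softmax classifier on the pseudo labels."""
    n, d = feats.shape
    W, b = np.zeros((d, k)), np.zeros(k)
    onehot = np.eye(k)[labels]
    for _ in range(steps):
        logits = feats @ W + b
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n                   # cross-entropy gradient w.r.t. logits
        W -= lr * feats.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b


def run_demo():
    feats = make_toy_frame()
    feats = feats - feats.mean(axis=0)                # center features for stable training
    pseudo = kmeans_pseudo_labels(feats, k=2)         # step i: pseudo labels
    W, b = fit_linear_decoder(feats, pseudo, k=2)     # step ii: learn to decode masks
    pred = (feats @ W + b).argmax(axis=1)
    return pseudo.reshape(8, 8), pred.reshape(8, 8)
```

Calling run_demo() returns the pseudo-label mask and the decoder's predicted mask for the toy frame. Because the features are fixed here, a single round is all that is meaningful; in the framework described above the representation itself is learned, so clustering and decoder training would be repeated as the features change.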

Related papers

Incorporating Feature Pyramid Tokenization and Open Vocabulary Semantic Segmentation [8.659766913542938]
We study united perceptual and semantic token compression for understanding at all granularities. We propose Feature Pyramid Tokenization (PAT) to cluster and represent multi-resolution features with learnable codebooks. Our experiments show that PAT enhances the semantic intuition of the VLM feature pyramid.
arXiv Detail & Related papers (2024-12-18T18:43:21Z)
Towards Open-Vocabulary Semantic Segmentation Without Semantic Labels [53.8817160001038]
We propose a novel method, PixelCLIP, to adapt the CLIP image encoder for pixel-level understanding. To address the challenges of leveraging masks without semantic labels, we devise an online clustering algorithm. PixelCLIP shows significant performance improvements over CLIP and competitive results compared to caption-supervised methods.
arXiv Detail & Related papers (2024-09-30T01:13:03Z)
Pseudo Labelling for Enhanced Masked Autoencoders [27.029542823306866]
We propose an enhanced approach that boosts Masked Autoencoders (MAE) performance by integrating pseudo labelling for both class and data tokens. This strategy uses cluster assignments as pseudo labels to promote instance-level discrimination within the network. We show that incorporating pseudo labelling as an auxiliary task yields notable improvements on ImageNet-1K and other downstream tasks.
arXiv Detail & Related papers (2024-06-25T10:41:45Z)
Boosting Video Object Segmentation via Space-time Correspondence Learning [48.8275459383339]
Current solutions for video object segmentation (VOS) typically follow a matching-based regime. We devise a correspondence-aware training framework, which boosts matching-based VOS solutions by explicitly encouraging robust correspondence matching. Our algorithm provides solid performance gains on four widely used benchmarks.
arXiv Detail & Related papers (2023-04-13T01:34:44Z)
Towards Robust Video Object Segmentation with Adaptive Object Calibration [18.094698623128146]
Video object segmentation (VOS) aims at segmenting objects in all target frames of a video, given annotated object masks of reference frames. We propose a new deep network, which can adaptively construct object representations and calibrate object masks to achieve stronger robustness. Our model achieves state-of-the-art performance among existing published works and also exhibits superior robustness against perturbations.
arXiv Detail & Related papers (2022-07-02T17:51:29Z)
Locality-Aware Inter-and Intra-Video Reconstruction for Self-Supervised Correspondence Learning [74.03651142051656]
We develop LIIR, a locality-aware inter- and intra-video reconstruction framework. We exploit cross-video affinities as extra negative samples within a unified inter- and intra-video reconstruction scheme.
arXiv Detail & Related papers (2022-03-27T15:46:42Z)
Contrastive Transformation for Self-supervised Correspondence Learning [120.62547360463923]
We study the self-supervised learning of visual correspondence using unlabeled videos in the wild. Our method simultaneously considers intra- and inter-video representation associations for reliable correspondence estimation. Our framework outperforms the recent self-supervised correspondence methods on a range of visual tasks.
arXiv Detail & Related papers (2020-12-09T14:05:06Z)
Unsupervised Learning of Video Representations via Dense Trajectory Clustering [86.45054867170795]
This paper addresses the task of unsupervised learning of representations for action recognition in videos. We first propose to adapt two top performing objectives in this class - instance recognition and local aggregation. We observe promising performance, but qualitative analysis shows that the learned representations fail to capture motion patterns.
arXiv Detail & Related papers (2020-06-28T22:23:03Z)

This list is automatically generated from the titles and abstracts of the papers in this site.