In-N-Out Generative Learning for Dense Unsupervised Video Segmentation
- URL: http://arxiv.org/abs/2203.15312v1
- Date: Tue, 29 Mar 2022 07:56:21 GMT
- Title: In-N-Out Generative Learning for Dense Unsupervised Video Segmentation
- Authors: Xiao Pan, Peike Li, Zongxin Yang, Huiling Zhou, Chang Zhou, Hongxia
Yang, Jingren Zhou, Yi Yang
- Abstract summary: In this paper, we focus on the unsupervised Video Object Segmentation (VOS) task, which learns visual correspondence from unlabeled videos.
We propose the In-aNd-Out (INO) generative learning from a purely generative perspective, which captures both high-level and fine-grained semantics.
Our INO outperforms previous state-of-the-art methods by significant margins.
- Score: 89.21483504654282
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we focus on the unsupervised Video Object Segmentation (VOS)
task, which learns visual correspondence from unlabeled videos. Previous methods
are mainly based on the contrastive learning paradigm, which optimizes at either
the pixel level or the image level and shows unsatisfactory scalability. Image-level
optimization learns pixel-wise information only implicitly and is therefore sub-optimal
for such a dense prediction task, while pixel-level optimization ignores the
high-level semantic scope needed to capture object deformation. To learn these two
levels of information complementarily in a unified framework, we propose the
In-aNd-Out (INO) generative learning from a purely generative perspective,
which captures both high-level and fine-grained semantics by leveraging the
structural superiority of Vision Transformer (ViT) and achieves better
scalability. Specifically, the in-generative learning recovers the corrupted
parts of an image via inferring its fine-grained semantic structure, while the
out-generative learning captures high-level semantics by imagining the global
information of an image given only random fragments. To better exploit
temporal information, we additionally enforce inter-frame consistency at
both the feature level and the affinity-matrix level. Extensive experiments on
DAVIS-2017 val and YouTube-VOS 2018 val show that our INO outperforms previous
state-of-the-art methods by significant margins.
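The inter-frame affinity mentioned in the abstract is the core object in correspondence-based VOS: patch features from two frames define a row-stochastic affinity matrix, through which segmentation labels are propagated from a reference frame to a target frame. Below is a minimal pure-Python sketch of that general mechanism; the function names, the temperature value, and the list-based feature representation are illustrative assumptions, not the paper's actual implementation.

```python
import math

def affinity(feats_a, feats_b, temperature=0.07):
    """Row-softmax of scaled dot products between two frames' patch features.

    feats_a, feats_b: lists of equal-length feature vectors, one per patch.
    Returns one row per patch of frame A; each row sums to 1 over frame B's patches.
    """
    rows = []
    for fa in feats_a:
        logits = [sum(x * y for x, y in zip(fa, fb)) / temperature for fb in feats_b]
        m = max(logits)                       # subtract max for numerical stability
        exps = [math.exp(l - m) for l in logits]
        z = sum(exps)
        rows.append([e / z for e in exps])
    return rows

def propagate_labels(affinity_rows, labels_b):
    """Soft label propagation: each patch of frame A receives the affinity-weighted
    average of frame B's per-patch label distributions."""
    num_classes = len(labels_b[0])
    return [
        [sum(w * lb[k] for w, lb in zip(row, labels_b)) for k in range(num_classes)]
        for row in affinity_rows
    ]
```

Under this view, enforcing consistency at the affinity-matrix level amounts to requiring that affinities computed between temporally adjacent frames agree with each other, so that label propagation remains stable across the video.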
Related papers
- IPSeg: Image Posterior Mitigates Semantic Drift in Class-Incremental Segmentation [77.06177202334398]
We identify two critical challenges in CISS that contribute to semantic drift and degrade performance.
First, we highlight the issue of separate optimization, where different parts of the model are optimized in distinct incremental stages.
Second, we identify noisy semantics arising from inappropriate pseudo-labeling, which results in sub-optimal results.
arXiv Detail & Related papers (2025-02-07T12:19:37Z)
- Incorporating Feature Pyramid Tokenization and Open Vocabulary Semantic Segmentation [8.659766913542938]
We study a united perceptual and semantic token compression for all granular understanding.
We propose Feature Pyramid Tokenization (PAT) to cluster and represent multi-resolution features with learnable codebooks.
Our experiments show that PAT enhances the semantic intuition of VLM feature pyramid.
arXiv Detail & Related papers (2024-12-18T18:43:21Z)
- Locality Alignment Improves Vision-Language Models [55.275235524659905]
Vision language models (VLMs) have seen growing adoption in recent years, but many still struggle with basic spatial reasoning errors.
We propose a new efficient post-training stage for ViTs called locality alignment.
We show that locality-aligned backbones improve performance across a range of benchmarks.
arXiv Detail & Related papers (2024-10-14T21:01:01Z)
- Hierarchical Semantic Contrast for Scene-aware Video Anomaly Detection [14.721615285883423]
We propose a hierarchical semantic contrast (HSC) method to learn a scene-aware VAD model from normal videos.
This hierarchical semantic contrast strategy helps to deal with the diversity of normal patterns and also increases their discrimination ability.
arXiv Detail & Related papers (2023-03-23T05:53:34Z)
- Generative Negative Text Replay for Continual Vision-Language Pretraining [95.2784858069843]
Vision-language pre-training has attracted increasing attention recently.
Massive data are usually collected in a streaming fashion.
We propose a multi-modal knowledge distillation between images and texts to align the instance-wise prediction between old and new models.
arXiv Detail & Related papers (2022-10-31T13:42:21Z)
- Exploring Intra- and Inter-Video Relation for Surgical Semantic Scene Segmentation [58.74791043631219]
We propose a novel framework STswinCL that explores the complementary intra- and inter-video relations to boost segmentation performance.
We extensively validate our approach on two public surgical video benchmarks, including EndoVis18 Challenge and CaDIS dataset.
Experimental results demonstrate the promising performance of our method, which consistently exceeds previous state-of-the-art approaches.
arXiv Detail & Related papers (2022-03-29T05:52:23Z)
- A Pixel-Level Meta-Learner for Weakly Supervised Few-Shot Semantic Segmentation [40.27705176115985]
Few-shot semantic segmentation addresses the learning task in which only few images with ground truth pixel-level labels are available for the novel classes of interest.
We propose a novel meta-learning framework, which predicts pseudo pixel-level segmentation masks from a limited amount of data and their semantic labels.
Our proposed learning model can be viewed as a pixel-level meta-learner.
arXiv Detail & Related papers (2021-11-02T08:28:11Z)
- InfoSeg: Unsupervised Semantic Image Segmentation with Mutual Information Maximization [0.0]
We propose a novel method for unsupervised semantic image segmentation based on mutual information between local and global high-level image features.
In the first step, we segment images based on local and global features.
In the second step, we maximize the mutual information between local features and high-level features of their respective class.
arXiv Detail & Related papers (2021-10-07T14:01:42Z)
- Learning Video Object Segmentation from Unlabeled Videos [158.18207922363783]
We propose a new method for video object segmentation (VOS) that addresses object pattern learning from unlabeled videos.
We introduce a unified unsupervised/weakly supervised learning framework, called MuG, that comprehensively captures properties of VOS at multiple granularities.
arXiv Detail & Related papers (2020-03-10T22:12:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.