In-N-Out Generative Learning for Dense Unsupervised Video Segmentation
- URL: http://arxiv.org/abs/2203.15312v1
- Date: Tue, 29 Mar 2022 07:56:21 GMT
- Title: In-N-Out Generative Learning for Dense Unsupervised Video Segmentation
- Authors: Xiao Pan, Peike Li, Zongxin Yang, Huiling Zhou, Chang Zhou, Hongxia
Yang, Jingren Zhou, Yi Yang
- Abstract summary: In this paper, we focus on the unsupervised Video Object Segmentation (VOS) task, which learns visual correspondence from unlabeled videos.
We propose the In-aNd-Out (INO) generative learning from a purely generative perspective, which captures both high-level and fine-grained semantics.
Our INO outperforms previous state-of-the-art methods by significant margins.
- Score: 89.21483504654282
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we focus on the unsupervised Video Object Segmentation
(VOS) task, which learns visual correspondence from unlabeled videos. Previous
methods are mainly based on the contrastive learning paradigm, optimizing at
either the pixel level or the image level, and show unsatisfactory scalability.
Image-level optimization learns pixel-wise information only implicitly and is
therefore sub-optimal for such a dense prediction task, while pixel-level
optimization ignores the high-level semantic scope needed to capture object
deformation. To learn these two levels of information complementarily in a
unified framework, we propose In-aNd-Out (INO) generative learning from a
purely generative perspective, which captures both high-level and fine-grained
semantics by leveraging the structural superiority of the Vision Transformer
(ViT) and achieves better scalability. Specifically, in-generative learning
recovers the corrupted parts of an image by inferring its fine-grained semantic
structure, while out-generative learning captures high-level semantics by
imagining the global information of an image given only random fragments. To
better exploit temporal information, we additionally enforce inter-frame
consistency at both the feature level and the affinity-matrix level. Extensive
experiments on DAVIS-2017 val and YouTube-VOS 2018 val show that our INO
outperforms previous state-of-the-art methods by significant margins.
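The abstract describes two generative objectives (in-generative reconstruction of corrupted patches, out-generative prediction of global semantics from random fragments) plus inter-frame consistency at the feature and affinity-matrix levels. The sketch below is a minimal, hypothetical PyTorch illustration of how such losses could be instantiated; the function names, tensor shapes, and the MSE/cosine loss forms are my assumptions for illustration only, not the paper's actual implementation.

```python
# Hypothetical sketch of INO-style objectives; not the authors' code.
import torch
import torch.nn.functional as F


def in_generative_loss(pred_patches, target_patches, mask):
    """Reconstruct corrupted (masked) patches: fine-grained semantics.

    pred_patches, target_patches: (B, N, D) patch features; mask: (B, N) bool,
    True where a patch was corrupted. Loss is averaged over corrupted patches.
    """
    diff = (pred_patches - target_patches) ** 2
    return (diff.mean(-1) * mask).sum() / mask.sum().clamp(min=1)


def out_generative_loss(global_pred, global_target):
    """Predict a global image representation from random fragments
    (high-level semantics); a simple cosine distance is assumed here."""
    return 1 - F.cosine_similarity(global_pred, global_target, dim=-1).mean()


def affinity(feat_a, feat_b, tau=0.07):
    """Row-normalised affinity matrix between two frames' patch features."""
    feat_a = F.normalize(feat_a, dim=-1)
    feat_b = F.normalize(feat_b, dim=-1)
    return torch.softmax(feat_a @ feat_b.transpose(-1, -2) / tau, dim=-1)


def inter_frame_consistency(feat_t, feat_t1, feat_t_aug, feat_t1_aug):
    """Consistency between two views of a frame pair, enforced at both the
    feature level and the affinity-matrix level (one possible reading of the
    two levels named in the abstract)."""
    feat_loss = F.mse_loss(feat_t, feat_t_aug) + F.mse_loss(feat_t1, feat_t1_aug)
    aff_loss = F.mse_loss(affinity(feat_t, feat_t1),
                          affinity(feat_t_aug, feat_t1_aug))
    return feat_loss + aff_loss


if __name__ == "__main__":
    B, N, D = 2, 196, 256
    pred, target = torch.randn(B, N, D), torch.randn(B, N, D)
    mask = torch.rand(B, N) < 0.6  # assume ~60% of patches corrupted
    print(in_generative_loss(pred, target, mask).item())
    print(out_generative_loss(torch.randn(B, D), torch.randn(B, D)).item())
    print(inter_frame_consistency(*[torch.randn(B, N, D) for _ in range(4)]).item())
```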
Related papers
- Locality Alignment Improves Vision-Language Models [55.275235524659905]
Vision language models (VLMs) have seen growing adoption in recent years, but many still struggle with basic spatial reasoning errors.
We propose a new efficient post-training stage for ViTs called locality alignment.
We show that locality-aligned backbones improve performance across a range of benchmarks.
arXiv Detail & Related papers (2024-10-14T21:01:01Z)
- Hierarchical Semantic Contrast for Scene-aware Video Anomaly Detection [14.721615285883423]
We propose a hierarchical semantic contrast (HSC) method to learn a scene-aware VAD model from normal videos.
This hierarchical semantic contrast strategy helps to deal with the diversity of normal patterns and also increases their discrimination ability.
arXiv Detail & Related papers (2023-03-23T05:53:34Z)
- Generative Negative Text Replay for Continual Vision-Language Pretraining [95.2784858069843]
Vision-language pre-training has attracted increasing attention recently.
Massive data are usually collected in a streaming fashion.
We propose a multi-modal knowledge distillation between images and texts to align the instance-wise prediction between old and new models.
arXiv Detail & Related papers (2022-10-31T13:42:21Z)
- Exploring Intra- and Inter-Video Relation for Surgical Semantic Scene Segmentation [58.74791043631219]
We propose a novel framework STswinCL that explores the complementary intra- and inter-video relations to boost segmentation performance.
We extensively validate our approach on two public surgical video benchmarks, including EndoVis18 Challenge and CaDIS dataset.
Experimental results demonstrate the promising performance of our method, which consistently exceeds previous state-of-the-art approaches.
arXiv Detail & Related papers (2022-03-29T05:52:23Z)
- A Pixel-Level Meta-Learner for Weakly Supervised Few-Shot Semantic Segmentation [40.27705176115985]
Few-shot semantic segmentation addresses the learning task in which only a few images with ground-truth pixel-level labels are available for the novel classes of interest.
We propose a novel meta-learning framework, which predicts pseudo pixel-level segmentation masks from a limited amount of data and their semantic labels.
Our proposed learning model can be viewed as a pixel-level meta-learner.
arXiv Detail & Related papers (2021-11-02T08:28:11Z)
- Maximize the Exploration of Congeneric Semantics for Weakly Supervised Semantic Segmentation [27.155133686127474]
We construct a graph neural network (P-GNN) based on the self-detected patches from different images that contain the same class labels.
We conduct experiments on the popular PASCAL VOC 2012 benchmarks, and our model yields state-of-the-art performance.
arXiv Detail & Related papers (2021-10-08T08:59:16Z)
- InfoSeg: Unsupervised Semantic Image Segmentation with Mutual Information Maximization [0.0]
We propose a novel method for unsupervised semantic image segmentation based on mutual information between local and global high-level image features.
In the first step, we segment images based on local and global features.
In the second step, we maximize the mutual information between local features and high-level features of their respective class.
arXiv Detail & Related papers (2021-10-07T14:01:42Z)
- Group-Wise Semantic Mining for Weakly Supervised Semantic Segmentation [49.90178055521207]
This work addresses weakly supervised semantic segmentation (WSSS), with the goal of bridging the gap between image-level annotations and pixel-level segmentation.
We formulate WSSS as a novel group-wise learning task that explicitly models semantic dependencies in a group of images to estimate more reliable pseudo ground-truths.
In particular, we devise a graph neural network (GNN) for group-wise semantic mining, wherein input images are represented as graph nodes.
arXiv Detail & Related papers (2020-12-09T12:40:13Z)
- Learning Video Object Segmentation from Unlabeled Videos [158.18207922363783]
We propose a new method for video object segmentation (VOS) that addresses object pattern learning from unlabeled videos.
We introduce a unified unsupervised/weakly supervised learning framework, called MuG, that comprehensively captures properties of VOS at multiple granularities.
arXiv Detail & Related papers (2020-03-10T22:12:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.