In-N-Out Generative Learning for Dense Unsupervised Video Segmentation
- URL: http://arxiv.org/abs/2203.15312v1
- Date: Tue, 29 Mar 2022 07:56:21 GMT
- Title: In-N-Out Generative Learning for Dense Unsupervised Video Segmentation
- Authors: Xiao Pan, Peike Li, Zongxin Yang, Huiling Zhou, Chang Zhou, Hongxia
Yang, Jingren Zhou, Yi Yang
- Abstract summary: In this paper, we focus on the unsupervised Video Object Segmentation (VOS) task, which learns visual correspondence from unlabeled videos.
We propose the In-aNd-Out (INO) generative learning from a purely generative perspective, which captures both high-level and fine-grained semantics.
Our INO outperforms previous state-of-the-art methods by significant margins.
- Score: 89.21483504654282
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we focus on the unsupervised Video Object Segmentation (VOS)
task, which learns visual correspondence from unlabeled videos. Previous methods
are mainly based on the contrastive learning paradigm, which optimizes at either
the pixel level or the image level and shows unsatisfactory scalability. Image-level
optimization learns pixel-wise information only implicitly and is therefore sub-optimal
for such a dense prediction task, while pixel-level optimization ignores the
high-level semantic scope needed to capture object deformation. To learn these two
levels of information complementarily in a unified framework, we propose the
In-aNd-Out (INO) generative learning from a purely generative perspective,
which captures both high-level and fine-grained semantics by leveraging the
structural superiority of Vision Transformer (ViT) and achieves better
scalability. Specifically, the in-generative learning recovers the corrupted
parts of an image via inferring its fine-grained semantic structure, while the
out-generative learning captures high-level semantics by imagining the global
information of an image given only random fragments. To better exploit
temporal information, we additionally enforce inter-frame consistency at
both the feature level and the affinity-matrix level. Extensive experiments on
DAVIS-2017 val and YouTube-VOS 2018 val show that our INO outperforms previous
state-of-the-art methods by significant margins.
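The inter-frame affinity mentioned in the abstract is the core object in correspondence-based VOS: patch features from two frames define a row-stochastic affinity matrix, through which segmentation labels are propagated from a reference frame to a target frame. Below is a minimal pure-Python sketch of that general mechanism; the function names, the temperature value, and the list-based feature representation are illustrative assumptions, not the paper's actual implementation.

```python
import math

def affinity(feats_a, feats_b, temperature=0.07):
    """Row-softmax of scaled dot products between two frames' patch features.

    feats_a, feats_b: lists of equal-length feature vectors, one per patch.
    Returns one row per patch of frame A; each row sums to 1 over frame B's patches.
    """
    rows = []
    for fa in feats_a:
        logits = [sum(x * y for x, y in zip(fa, fb)) / temperature for fb in feats_b]
        m = max(logits)                       # subtract max for numerical stability
        exps = [math.exp(l - m) for l in logits]
        z = sum(exps)
        rows.append([e / z for e in exps])
    return rows

def propagate_labels(affinity_rows, labels_b):
    """Soft label propagation: each patch of frame A receives the affinity-weighted
    average of frame B's per-patch label distributions."""
    num_classes = len(labels_b[0])
    return [
        [sum(w * lb[k] for w, lb in zip(row, labels_b)) for k in range(num_classes)]
        for row in affinity_rows
    ]
```

Under this view, enforcing consistency at the affinity-matrix level amounts to requiring that affinities computed between temporally adjacent frames agree with each other, so that label propagation remains stable across the video.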
Related papers
- IPSeg: Image Posterior Mitigates Semantic Drift in Class-Incremental Segmentation [77.06177202334398]
We identify two critical challenges in CISS that contribute to semantic drift and degrade performance.
First, we highlight the issue of separate optimization, where different parts of the model are optimized in distinct incremental stages.
Second, we identify noisy semantics arising from inappropriate pseudo-labeling, which results in sub-optimal results.
arXiv Detail & Related papers (2025-02-07T12:19:37Z)
- Incorporating Feature Pyramid Tokenization and Open Vocabulary Semantic Segmentation [8.659766913542938]
We study a united perceptual and semantic token compression for all granular understanding.
We propose Feature Pyramid Tokenization (PAT) to cluster and represent multi-resolution features with learnable codebooks.
Our experiments show that PAT enhances the semantic intuition of VLM feature pyramid.
arXiv Detail & Related papers (2024-12-18T18:43:21Z)
- Locality Alignment Improves Vision-Language Models [55.275235524659905]
Vision language models (VLMs) have seen growing adoption in recent years, but many still struggle with basic spatial reasoning errors.
We propose a new efficient post-training stage for ViTs called locality alignment.
We show that locality-aligned backbones improve performance across a range of benchmarks.
arXiv Detail & Related papers (2024-10-14T21:01:01Z)
- Hierarchical Semantic Contrast for Scene-aware Video Anomaly Detection [14.721615285883423]
We propose a hierarchical semantic contrast (HSC) method to learn a scene-aware VAD model from normal videos.
This hierarchical semantic contrast strategy helps to deal with the diversity of normal patterns and also increases their discrimination ability.
arXiv Detail & Related papers (2023-03-23T05:53:34Z)
- Generative Negative Text Replay for Continual Vision-Language Pretraining [95.2784858069843]
Vision-language pre-training has attracted increasing attention recently.
Massive data are usually collected in a streaming fashion.
We propose a multi-modal knowledge distillation between images and texts to align the instance-wise prediction between old and new models.
arXiv Detail & Related papers (2022-10-31T13:42:21Z)
- Exploring Intra- and Inter-Video Relation for Surgical Semantic Scene Segmentation [58.74791043631219]
We propose a novel framework STswinCL that explores the complementary intra- and inter-video relations to boost segmentation performance.
We extensively validate our approach on two public surgical video benchmarks, including EndoVis18 Challenge and CaDIS dataset.
Experimental results demonstrate the promising performance of our method, which consistently exceeds previous state-of-the-art approaches.
arXiv Detail & Related papers (2022-03-29T05:52:23Z)
- A Pixel-Level Meta-Learner for Weakly Supervised Few-Shot Semantic Segmentation [40.27705176115985]
Few-shot semantic segmentation addresses the learning task in which only few images with ground truth pixel-level labels are available for the novel classes of interest.
We propose a novel meta-learning framework, which predicts pseudo pixel-level segmentation masks from a limited amount of data and their semantic labels.
Our proposed learning model can be viewed as a pixel-level meta-learner.
arXiv Detail & Related papers (2021-11-02T08:28:11Z)
- InfoSeg: Unsupervised Semantic Image Segmentation with Mutual Information Maximization [0.0]
We propose a novel method for unsupervised semantic image segmentation based on mutual information between local and global high-level image features.
In the first step, we segment images based on local and global features.
In the second step, we maximize the mutual information between local features and high-level features of their respective class.
arXiv Detail & Related papers (2021-10-07T14:01:42Z)
- Learning Video Object Segmentation from Unlabeled Videos [158.18207922363783]
We propose a new method for video object segmentation (VOS) that addresses object pattern learning from unlabeled videos.
We introduce a unified unsupervised/weakly supervised learning framework, called MuG, that comprehensively captures properties of VOS at multiple granularities.
arXiv Detail & Related papers (2020-03-10T22:12:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.