Self-supervision through Random Segments with Autoregressive Coding
(RandSAC)
- URL: http://arxiv.org/abs/2203.12054v1
- Date: Tue, 22 Mar 2022 21:28:55 GMT
- Title: Self-supervision through Random Segments with Autoregressive Coding
(RandSAC)
- Authors: Tianyu Hua, Yonglong Tian, Sucheng Ren, Hang Zhao, Leonid Sigal
- Abstract summary: We explore the effects various design choices have on the success of applying such training strategies for visual feature learning.
Specifically, we introduce a novel strategy that we call Random Segments with Autoregressive Coding (RandSAC)
In RandSAC, we group patch representations (image tokens) into hierarchically arranged segments; within each segment, tokens are predicted in parallel, similar to BERT, while across-segment predictions are sequential, similar to GPT.
We illustrate that randomized serialization of the segments significantly improves the performance and results in a distribution over spatially long (across-segment) and short (within-segment) predictions, which are effective for feature learning.
- Score: 46.519302668058025
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Inspired by the success of self-supervised autoregressive representation
learning in natural language (GPT and its variants), and advances in recent
visual architecture design with Vision Transformers (ViTs), in this paper, we
explore the effects various design choices have on the success of applying such
training strategies for visual feature learning. Specifically, we introduce a
novel strategy that we call Random Segments with Autoregressive Coding
(RandSAC). In RandSAC, we group patch representations (image tokens) into
hierarchically arranged segments; within each segment, tokens are predicted in
parallel, similar to BERT, while across-segment predictions are sequential,
similar to GPT. We illustrate that randomized serialization of the segments
significantly improves the performance and results in a distribution over
spatially long (across-segment) and short (within-segment) predictions, which
are effective for feature learning. We illustrate the pertinence of these
design choices and explore alternatives on a number of datasets (e.g., CIFAR10,
ImageNet). While our pre-training strategy works with a vanilla Transformer, we
also propose a conceptually simple, but highly effective, addition to the
decoder that allows learnable skip-connections to encoder feature layers, which
further improves the performance. Our final model, trained on ImageNet,
achieves a new state-of-the-art linear-probing performance of 68.3% among
comparable predictive self-supervised learning approaches.
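To make the prediction scheme concrete, the sketch below shows one plausible way to realize random segment serialization and the corresponding attention mask: tokens within a segment are handled in parallel, while segments are ordered causally under a freshly drawn random order. This is a minimal illustration under stated assumptions (a 14x14 ViT patch grid, flat 2x2 spatial segments rather than the paper's hierarchical grouping), not the authors' implementation; the function names are hypothetical.

```python
# Minimal RandSAC-style sketch (PyTorch). Assumptions: 14x14 patch grid,
# flat 2x2 spatial segments, no hierarchy; names are illustrative only.
import torch


def random_segments(grid=14, seg=2, generator=None):
    """Partition a grid x grid token grid into seg x seg spatial segments and
    return them (as 1-D tensors of token indices) in a random serialization
    order, which would be re-drawn for every training step."""
    ids = torch.arange(grid * grid).reshape(grid, grid)
    segments = [
        ids[r:r + seg, c:c + seg].reshape(-1)
        for r in range(0, grid, seg)
        for c in range(0, grid, seg)
    ]
    order = torch.randperm(len(segments), generator=generator)
    return [segments[i] for i in order.tolist()]


def randsac_attention_mask(segments):
    """Boolean mask M where M[q, k] = True lets query token q attend to key
    token k. Tokens of a segment are predicted together and here also see one
    another (an assumption, BERT-style parallel prediction), while each segment
    additionally sees every previously serialized segment (GPT-style,
    sequential across segments)."""
    n = sum(len(s) for s in segments)
    mask = torch.zeros(n, n, dtype=torch.bool)
    seen = []  # token indices of segments serialized so far
    for seg_ids in segments:
        ctx = torch.cat(seen + [seg_ids])  # earlier segments + the segment itself
        mask[seg_ids.unsqueeze(1), ctx.unsqueeze(0)] = True
        seen.append(seg_ids)
    return mask


if __name__ == "__main__":
    segs = random_segments()
    mask = randsac_attention_mask(segs)
    print(mask.shape, mask.sum().item())  # torch.Size([196, 196]) and allowed pairs
```

A mask like this could be supplied, for instance, as the attn_mask argument of torch.nn.functional.scaled_dot_product_attention inside a ViT block; the hierarchical grouping of segments and the learnable decoder-to-encoder skip connections described in the abstract are not modeled by this sketch.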
Related papers
- Intra-task Mutual Attention based Vision Transformer for Few-Shot Learning [12.5354658533836]
Humans possess a remarkable ability to accurately classify new, unseen images after being exposed to only a few examples.
For artificial neural network models, determining the most relevant features for distinguishing between two images with limited samples presents a challenge.
We propose an intra-task mutual attention method for few-shot learning that involves splitting the support and query samples into patches.
arXiv Detail & Related papers (2024-05-06T02:02:57Z)
- Unveiling Backbone Effects in CLIP: Exploring Representational Synergies and Variances [49.631908848868505]
Contrastive Language-Image Pretraining (CLIP) stands out as a prominent method for image representation learning.
We investigate the differences in CLIP performance among various neural architectures.
We propose a simple, yet effective approach to combine predictions from multiple backbones, leading to a notable performance boost of up to 6.34%.
arXiv Detail & Related papers (2023-12-22T03:01:41Z)
- RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation [53.4319652364256]
This paper presents the RefSAM model, which explores the potential of SAM for referring video object segmentation.
Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight cross-modal module.
We employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively.
arXiv Detail & Related papers (2023-07-03T13:21:58Z)
- DenseDINO: Boosting Dense Self-Supervised Learning with Token-Based Point-Level Consistency [12.881617910150688]
We propose a transformer framework for self-supervised learning called DenseDINO to learn dense visual representations.
Specifically, DenseDINO introduces some extra input tokens called reference tokens to match the point-level features with the position prior.
Compared with the vanilla DINO, our approach obtains competitive performance when evaluated on ImageNet classification.
arXiv Detail & Related papers (2023-06-06T15:04:45Z)
- An EM Framework for Online Incremental Learning of Semantic Segmentation [37.94734474090863]
We propose an incremental learning strategy that can adapt deep segmentation models without catastrophic forgetting, using streaming input data with pixel annotations on the novel classes only.
We validate our approach on the PASCAL VOC 2012 and ADE20K datasets, and the results demonstrate its superior performance over the existing incremental methods.
arXiv Detail & Related papers (2021-08-08T11:30:09Z)
- Segmenter: Transformer for Semantic Segmentation [79.9887988699159]
We introduce Segmenter, a transformer model for semantic segmentation.
We build on the recent Vision Transformer (ViT) and extend it to semantic segmentation.
It outperforms the state of the art on the challenging ADE20K dataset and performs on-par on Pascal Context and Cityscapes.
arXiv Detail & Related papers (2021-05-12T13:01:44Z)
- Reviving Iterative Training with Mask Guidance for Interactive Segmentation [8.271859911016719]
Recent works on click-based interactive segmentation have demonstrated state-of-the-art results by using various inference-time optimization schemes.
We propose a simple feedforward model for click-based interactive segmentation that employs the segmentation masks from previous steps.
We find that the models trained on a combination of COCO and LVIS with diverse and high-quality annotations show performance superior to all existing models.
arXiv Detail & Related papers (2021-02-12T15:44:31Z)
- Group-Wise Semantic Mining for Weakly Supervised Semantic Segmentation [49.90178055521207]
This work addresses weakly supervised semantic segmentation (WSSS), with the goal of bridging the gap between image-level annotations and pixel-level segmentation.
We formulate WSSS as a novel group-wise learning task that explicitly models semantic dependencies in a group of images to estimate more reliable pseudo ground-truths.
In particular, we devise a graph neural network (GNN) for group-wise semantic mining, wherein input images are represented as graph nodes.
arXiv Detail & Related papers (2020-12-09T12:40:13Z)
- Self-Supervised Tuning for Few-Shot Segmentation [82.32143982269892]
Few-shot segmentation aims at assigning a category label to each image pixel with few annotated samples.
Existing meta-learning methods tend to fail to generate category-specific discriminative descriptors when the visual features extracted from support images are marginalized in embedding space.
This paper presents an adaptive tuning framework in which the distribution of latent features across different episodes is dynamically adjusted based on a self-segmentation scheme.
arXiv Detail & Related papers (2020-04-12T03:53:53Z)