Self-supervision through Random Segments with Autoregressive Coding
(RandSAC)
- URL: http://arxiv.org/abs/2203.12054v1
- Date: Tue, 22 Mar 2022 21:28:55 GMT
- Title: Self-supervision through Random Segments with Autoregressive Coding
(RandSAC)
- Authors: Tianyu Hua, Yonglong Tian, Sucheng Ren, Hang Zhao, Leonid Sigal
- Abstract summary: We explore the effects various design choices have on the success of applying such training strategies for visual feature learning.
Specifically, we introduce a novel strategy that we call Random Segments with Autoregressive Coding (RandSAC).
In RandSAC, we group patch representations (image tokens) into hierarchically arranged segments; within each segment, tokens are predicted in parallel, similar to BERT, while across-segment predictions are sequential, similar to GPT.
We illustrate that randomized serialization of the segments significantly improves the performance and results in a distribution over spatially-long (across-segments) and -short (within-segment) predictions, which are effective for feature learning.
- Score: 46.519302668058025
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Inspired by the success of self-supervised autoregressive representation
learning in natural language (GPT and its variants), and advances in recent
visual architecture design with Vision Transformers (ViTs), in this paper, we
explore the effects various design choices have on the success of applying such
training strategies for visual feature learning. Specifically, we introduce a
novel strategy that we call Random Segments with Autoregressive Coding
(RandSAC). In RandSAC, we group patch representations (image tokens) into
hierarchically arranged segments; within each segment, tokens are predicted in
parallel, similar to BERT, while across-segment predictions are sequential,
similar to GPT. We illustrate that randomized serialization of the segments
significantly improves the performance and results in a distribution over
spatially-long (across-segments) and -short (within-segment) predictions, which
are effective for feature learning. We illustrate the pertinence of these
design choices and explore alternatives on a number of datasets (e.g., CIFAR10,
ImageNet). While our pre-training strategy works with vanilla Transformer, we
also propose a conceptually simple, but highly effective, addition to the
decoder that allows learnable skip-connections to encoder feature layers, which
further improves the performance. Our final model, trained on ImageNet,
achieves a new state-of-the-art linear probing performance of 68.3% among
comparative predictive self-supervised learning approaches.
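To make the prediction structure described in the abstract concrete, the sketch below builds the kind of block-wise attention mask such a scheme implies: segments are serialized in a random order, and a token may only attend to tokens from segments that appear earlier in that order, so predictions are parallel within a segment and sequential across segments. This is a minimal illustration under stated assumptions, not the authors' implementation: the fixed-size contiguous segments and the function name randsac_attention_mask are made up for the example, whereas the paper groups image patches into hierarchically arranged segments.
```python
# Minimal sketch (not the authors' code) of a RandSAC-style attention mask:
# tokens are grouped into segments, the segment order is randomly permuted,
# and a token may attend only to tokens whose segment comes earlier in that
# order -- parallel prediction within a segment, sequential across segments.
import torch


def randsac_attention_mask(num_tokens: int, segment_size: int) -> torch.Tensor:
    """Boolean mask where mask[i, j] = True means token i may attend to token j."""
    # Assign each token to a segment (contiguous, fixed-size segments are an
    # assumption for this sketch; the paper arranges segments hierarchically).
    num_segments = (num_tokens + segment_size - 1) // segment_size
    segment_of_token = torch.arange(num_tokens) // segment_size

    # Randomly serialize the segments and record each segment's position.
    segment_order = torch.randperm(num_segments)
    rank_of_segment = torch.empty_like(segment_order)
    rank_of_segment[segment_order] = torch.arange(num_segments)
    token_rank = rank_of_segment[segment_of_token]

    # Token i may attend to token j iff j's segment is strictly earlier in the
    # serialization; whether tokens may also attend within their own segment is
    # an implementation choice (here they may not).
    return token_rank.unsqueeze(1) > token_rank.unsqueeze(0)


if __name__ == "__main__":
    # 16 tokens (e.g. a 4x4 patch grid) grouped into segments of 4 tokens each.
    print(randsac_attention_mask(num_tokens=16, segment_size=4).int())
```
A mask of this form can be supplied to the self-attention of a vanilla Transformer (for instance, as the attn_mask of torch.nn.MultiheadAttention, after inverting it, since PyTorch marks disallowed positions with True), which matches the abstract's note that the pre-training strategy works with a vanilla Transformer.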
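The abstract also mentions a conceptually simple addition to the decoder: learnable skip-connections to encoder feature layers. The block below is a rough sketch of one way such a connection could be wired, assuming a softmax-weighted mixture over the encoder's per-layer outputs; the class name, the gating scheme, and the hyperparameters are illustrative assumptions rather than the paper's exact design.
```python
# Rough sketch (assumptions, not the paper's exact design) of a decoder block
# with learnable skip-connections to encoder feature layers: the block learns
# softmax weights over the stack of encoder layer outputs and adds the weighted
# mixture to its input before standard self-attention and MLP sub-layers.
import torch
import torch.nn as nn


class SkipConnectedDecoderBlock(nn.Module):
    def __init__(self, dim: int, num_encoder_layers: int, num_heads: int = 8):
        super().__init__()
        # One learnable logit per encoder layer; softmax turns them into weights.
        self.layer_logits = nn.Parameter(torch.zeros(num_encoder_layers))
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, encoder_features, attn_mask=None):
        # encoder_features: list of per-layer encoder outputs, each (batch, tokens, dim).
        weights = torch.softmax(self.layer_logits, dim=0)
        skip = sum(w * feats for w, feats in zip(weights, encoder_features))
        x = x + skip  # learnable skip-connection into the decoder
        h = self.norm1(x)
        x = x + self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x
```
Under this reading, the learned weights let each decoder block choose which encoder depth it draws features from; cross-attention over the selected encoder features would be an equally plausible variant.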
Related papers
- Freestyle Sketch-in-the-Loop Image Segmentation [116.1810651297801]
We introduce a "sketch-in-the-loop" image segmentation framework, enabling the segmentation of visual concepts partially, completely, or in groupings.
This framework capitalises on the synergy between sketch-based image retrieval models and large-scale pre-trained models.
Our purpose-made augmentation strategy enhances the versatility of our sketch-guided mask generation, allowing segmentation at multiple levels.
arXiv Detail & Related papers (2025-01-27T13:07:51Z)
- Intra-task Mutual Attention based Vision Transformer for Few-Shot Learning [12.5354658533836]
Humans possess remarkable ability to accurately classify new, unseen images after being exposed to only a few examples.
For artificial neural network models, determining the most relevant features for distinguishing between two images with limited samples presents a challenge.
We propose an intra-task mutual attention method for few-shot learning that involves splitting the support and query samples into patches.
arXiv Detail & Related papers (2023-12-22T03:01:41Z)
- Unveiling Backbone Effects in CLIP: Exploring Representational Synergies and Variances [49.631908848868505]
Contrastive Language-Image Pretraining (CLIP) stands out as a prominent method for image representation learning.
We investigate the differences in CLIP performance among various neural architectures.
We propose a simple, yet effective approach to combine predictions from multiple backbones, leading to a notable performance boost of up to 6.34%.
arXiv Detail & Related papers (2023-07-03T13:21:58Z)
- RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation [53.4319652364256]
This paper presents the RefSAM model, which explores the potential of SAM for referring video object segmentation.
Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight Cross-Modal MLP.
We employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively.
arXiv Detail & Related papers (2023-06-06T15:04:45Z)
- DenseDINO: Boosting Dense Self-Supervised Learning with Token-Based Point-Level Consistency [12.881617910150688]
We propose a transformer framework for self-supervised learning called DenseDINO to learn dense visual representations.
Specifically, DenseDINO introduces some extra input tokens called reference tokens to match the point-level features with the position prior.
Compared with the vanilla DINO, our approach obtains competitive performance when evaluated on classification in ImageNet.
arXiv Detail & Related papers (2021-05-12T13:01:44Z)
- Segmenter: Transformer for Semantic Segmentation [79.9887988699159]
We introduce Segmenter, a transformer model for semantic segmentation.
We build on the recent Vision Transformer (ViT) and extend it to semantic segmentation.
It outperforms the state of the art on the challenging ADE20K dataset and performs on-par on Pascal Context and Cityscapes.
arXiv Detail & Related papers (2021-02-12T15:44:31Z)
- Reviving Iterative Training with Mask Guidance for Interactive Segmentation [8.271859911016719]
Recent works on click-based interactive segmentation have demonstrated state-of-the-art results by using various inference-time optimization schemes.
We propose a simple feedforward model for click-based interactive segmentation that employs the segmentation masks from previous steps.
We find that the models trained on a combination of COCO and LVIS with diverse and high-quality annotations show performance superior to all existing models.
arXiv Detail & Related papers (2020-04-12T03:53:53Z)
- Self-Supervised Tuning for Few-Shot Segmentation [82.32143982269892]
Few-shot segmentation aims at assigning a category label to each image pixel with few annotated samples.
Existing meta-learning methods tend to fail to generate category-specific discriminative descriptors when the visual features extracted from support images are marginalized in the embedding space.
This paper presents an adaptive tuning framework, in which the distribution of latent features across different episodes is dynamically adjusted based on a self-segmentation scheme.
arXiv Detail & Related papers (2020-04-12T03:53:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.