Learning Explicit Object-Centric Representations with Vision
Transformers
- URL: http://arxiv.org/abs/2210.14139v1
- Date: Tue, 25 Oct 2022 16:39:49 GMT
- Title: Learning Explicit Object-Centric Representations with Vision
Transformers
- Authors: Oscar Vikstr\"om, Alexander Ilin
- Abstract summary: We build on the self-supervision task of masked autoencoding and explore its effectiveness for learning object-centric representations with transformers.
We show that the model efficiently learns to decompose simple scenes as measured by segmentation metrics on several multi-object benchmarks.
- Score: 81.38804205212425
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the recent successful adaptation of transformers to the vision domain,
particularly when trained in a self-supervised fashion, it has been shown that
vision transformers can learn impressive object-reasoning-like behaviour and
features expressive for the task of object segmentation in images. In this
paper, we build on the self-supervision task of masked autoencoding and explore
its effectiveness for explicitly learning object-centric representations with
transformers. To this end, we design an object-centric autoencoder using
transformers only and train it end-to-end to reconstruct full images from
unmasked patches. We show that the model efficiently learns to decompose simple
scenes as measured by segmentation metrics on several multi-object benchmarks.
Related papers
- Vision Transformers Need Registers [26.63912173005165]
We identify and characterize artifacts in feature maps of both supervised and self-supervised ViT networks.
We show that this solution fixes that problem entirely for both supervised and self-supervised models.
arXiv Detail & Related papers (2023-09-28T16:45:46Z) - Holistically Explainable Vision Transformers [136.27303006772294]
We propose B-cos transformers, which inherently provide holistic explanations for their decisions.
Specifically, we formulate each model component - such as the multi-layer perceptrons, attention layers, and the tokenisation module - to be dynamic linear.
We apply our proposed design to Vision Transformers (ViTs) and show that the resulting models, dubbed Bcos-ViTs, are highly interpretable and perform competitively to baseline ViTs.
arXiv Detail & Related papers (2023-01-20T16:45:34Z) - Vision Transformers: State of the Art and Research Challenges [26.462994554165697]
This paper presents a comprehensive overview of the literature on different architecture designs and training tricks for vision transformers.
Our goal is to provide a systematic review with the open research opportunities.
arXiv Detail & Related papers (2022-07-07T02:01:56Z) - HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling [126.89573619301953]
We propose a new design of hierarchical vision transformers named HiViT (short for Hierarchical ViT)
HiViT enjoys both high efficiency and good performance in MIM.
In running MAE on ImageNet-1K, HiViT-B reports a +0.6% accuracy gain over ViT-B and a 1.9$times$ speed-up over Swin-B.
arXiv Detail & Related papers (2022-05-30T09:34:44Z) - An Empirical Study Of Self-supervised Learning Approaches For Object
Detection With Transformers [0.0]
We explore self-supervised methods based on image reconstruction, masked image modeling and jigsaw.
Preliminary experiments in the iSAID dataset demonstrate faster convergence of DETR in the initial epochs in both pretraining and multi-task learning settings.
arXiv Detail & Related papers (2022-05-11T14:39:27Z) - A Survey of Visual Transformers [30.082304742571598]
Transformer, an attention-based encoder-decoder architecture, has revolutionized the field of natural language processing.
Some pioneering works have recently been done on adapting Transformer architectures to Computer Vision (CV) fields.
We have provided a comprehensive review of over one hundred different visual Transformers for three fundamental CV tasks.
arXiv Detail & Related papers (2021-11-11T07:56:04Z) - ViDT: An Efficient and Effective Fully Transformer-based Object Detector [97.71746903042968]
Detection transformers are the first fully end-to-end learning systems for object detection.
vision transformers are the first fully transformer-based architecture for image classification.
In this paper, we integrate Vision and Detection Transformers (ViDT) to build an effective and efficient object detector.
arXiv Detail & Related papers (2021-10-08T06:32:05Z) - Self-Supervised Representation Learning from Flow Equivariance [97.13056332559526]
We present a new self-supervised learning representation framework that can be directly deployed on a video stream of complex scenes.
Our representations, learned from high-resolution raw video, can be readily used for downstream tasks on static images.
arXiv Detail & Related papers (2021-01-16T23:44:09Z) - Transformers in Vision: A Survey [101.07348618962111]
Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequence.
Transformers require minimal inductive biases for their design and are naturally suited as set-functions.
This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline.
arXiv Detail & Related papers (2021-01-04T18:57:24Z) - Toward Transformer-Based Object Detection [12.704056181392415]
Vision Transformers can be used as a backbone by a common detection task head to produce competitive COCO results.
ViT-FRCNN demonstrates several known properties associated with transformers, including large pretraining capacity and fast fine-tuning performance.
We view ViT-FRCNN as an important stepping stone toward a pure-transformer solution of complex vision tasks such as object detection.
arXiv Detail & Related papers (2020-12-17T22:33:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.