ViT-AE++: Improving Vision Transformer Autoencoder for Self-supervised
  Medical Image Representations
- URL: http://arxiv.org/abs/2301.07382v2
- Date: Mon, 15 May 2023 20:40:07 GMT
- Title: ViT-AE++: Improving Vision Transformer Autoencoder for Self-supervised
  Medical Image Representations
- Authors: Chinmay Prabhakar, Hongwei Bran Li, Jiancheng Yang, Suprosanna Shit,
  Benedikt Wiestler, and Bjoern Menze
- Abstract summary: Vision transformer-based autoencoder (ViT-AE) is a self-supervised learning technique that employs a patch-masking strategy to learn a meaningful latent space.
We propose two new loss functions to enhance the representation during training.
We extensively evaluate ViT-AE++ on both natural images and medical images, demonstrating consistent improvement over vanilla ViT-AE.
- Score: 3.6284577335311554
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract:   Self-supervised learning has attracted increasing attention because it
learns data-driven representations without annotations. The vision
transformer-based autoencoder (ViT-AE) by He et al. (2021) is a recent
self-supervised learning technique that employs a patch-masking strategy to
learn a meaningful latent space. In this paper, we focus on improving ViT-AE
(nicknamed ViT-AE++) for a more effective representation of 2D and 3D medical
images. We propose two new loss functions to enhance the representation during
training. The first loss term aims to improve self-reconstruction by
considering structured dependencies, thereby indirectly improving the
representation. The second loss term uses a contrastive loss to directly
optimize the representation from two randomly masked views. As an independent
contribution, we extend ViT-AE++ to 3D for volumetric medical images. We
extensively evaluate ViT-AE++ on both natural images and medical images,
demonstrating consistent improvement over vanilla ViT-AE and its superiority
over other contrastive learning approaches. Code is available at
https://github.com/chinmay5/vit_ae_plus_plus.git.
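
The abstract describes the second loss term only at a high level: a contrastive loss computed on two randomly masked views of the same input. The following is a minimal Python sketch of that idea, not the authors' implementation (the linked repository is the reference for that). It assumes an InfoNCE-style loss, a 75% mask ratio, and mean pooling over visible patches as a stand-in for the ViT encoder. The names `random_patch_mask`, `toy_encode`, and `info_nce` are hypothetical and chosen for illustration only.

```python
import math
import random


def random_patch_mask(num_patches, mask_ratio, rng):
    """Return sorted indices of the patches left visible after random masking."""
    num_keep = max(1, int(round(num_patches * (1.0 - mask_ratio))))
    return sorted(rng.sample(range(num_patches), num_keep))


def toy_encode(patches, visible):
    """Hypothetical stand-in for a ViT encoder: mean-pool the visible patches."""
    dim = len(patches[0])
    return [sum(patches[i][d] for i in visible) / len(visible) for d in range(dim)]


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def info_nce(view_a, view_b, temperature=0.1):
    """InfoNCE-style loss (an assumption): embedding i in view_a should be
    closest to embedding i in view_b among all embeddings in view_b."""
    a = [l2_normalize(v) for v in view_a]
    b = [l2_normalize(v) for v in view_b]
    total = 0.0
    for i, anchor in enumerate(a):
        logits = [sum(x * y for x, y in zip(anchor, cand)) / temperature
                  for cand in b]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_denom = max_logit + math.log(
            sum(math.exp(l - max_logit) for l in logits))
        total += log_denom - logits[i]  # negative log-softmax of the positive
    return total / len(a)


rng = random.Random(0)
# Two toy "images", each split into 16 patches with 4 features per patch.
images = [[[rng.gauss(mu, 1.0) for _ in range(4)] for _ in range(16)]
          for mu in (-2.0, 2.0)]
# Each image is masked twice, independently, to produce the two views.
view_a = [toy_encode(img, random_patch_mask(16, 0.75, rng)) for img in images]
view_b = [toy_encode(img, random_patch_mask(16, 0.75, rng)) for img in images]
print(f"contrastive loss: {info_nce(view_a, view_b):.4f}")
```

In this sketch the loss is low when the two masked views of the same image map to nearby embeddings and high when they do not, which is the behaviour a contrastive term is meant to encourage. A real setup would replace `toy_encode` with the encoder and add the reconstruction losses.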
 
      
Related papers
- Seeing the Pose in the Pixels: Learning Pose-Aware Representations in
  Vision Transformers [1.8047694351309207]
 We introduce two strategies for learning pose-aware representations in Vision Transformers (ViTs).
The first method, called Pose-aware Attention Block (PAAB), is a plug-and-play ViT block that performs localized attention on pose regions within videos.
The second method, dubbed Pose-Aware Auxiliary Task (PAAT), presents an auxiliary pose prediction task optimized jointly with the primary ViT task.
 arXiv (2023-06-15T17:58:39Z)
- RangeViT: Towards Vision Transformers for 3D Semantic Segmentation in
  Autonomous Driving [80.14669385741202]
 Vision transformers (ViTs) have achieved state-of-the-art results in many image-based benchmarks.
ViTs are notoriously hard to train and require a lot of training data to learn powerful representations.
We show that our method, called RangeViT, outperforms existing projection-based methods on nuScenes and SemanticKITTI.
 arXiv (2023-01-24T18:50:48Z)
- What do Vision Transformers Learn? A Visual Exploration [68.50771218442776]
 Vision transformers (ViTs) are quickly becoming the de-facto architecture for computer vision.
This paper addresses the obstacles to performing visualizations on ViTs and explores the underlying differences between ViTs and CNNs.
We also conduct large-scale visualizations on a range of ViT variants, including DeiT, CoaT, ConViT, PiT, Swin, and Twin.
 arXiv (2022-12-13T16:55:12Z)
- Patch-level Representation Learning for Self-supervised Vision
  Transformers [68.8862419248863]
 Vision Transformers (ViTs) have gained much attention recently as a better architectural choice, often outperforming convolutional networks for various visual tasks.
Inspired by this, we design a simple yet effective visual pretext task, coined SelfPatch, for learning better patch-level representations.
We demonstrate that SelfPatch can significantly improve the performance of existing SSL methods for various visual tasks.
 arXiv (2022-06-16T08:01:19Z)
- SERE: Exploring Feature Self-relation for Self-supervised Transformer [79.5769147071757]
 Vision transformers (ViTs) have strong representation ability with spatial self-attention and channel-level feedforward networks.
Recent works reveal that self-supervised learning helps unleash the great potential of ViT.
We observe that relational modeling on spatial and channel dimensions distinguishes ViT from other networks.
 arXiv (2022-06-10T15:25:00Z)
- Position Labels for Self-Supervised Vision Transformer [1.3406858660972554]
 Position encoding is important for vision transformers (ViTs) to capture the spatial structure of the input image.
We propose two position labels dedicated to 2D images: absolute position and relative position.
Our position labels can be easily plugged into transformers and combined with various current ViT variants.
 arXiv (2022-06-10T10:29:20Z)
- Emerging Properties in Self-Supervised Vision Transformers [57.36837447500544]
 We show that self-supervised learning provides Vision Transformers (ViTs) with new properties that stand out compared to convolutional networks (convnets).
We implement our findings into a simple self-supervised method, called DINO, which we interpret as a form of self-distillation with no labels.
We show the synergy between DINO and ViTs by achieving 80.1% top-1 on ImageNet in linear evaluation with ViT-Base.
 arXiv (2021-04-29T12:28:51Z)
- DeepViT: Towards Deeper Vision Transformer [92.04063170357426]
 Vision transformers (ViTs) have been successfully applied in image classification tasks recently.
We show that, unlike convolutional neural networks (CNNs), which can be improved by stacking more convolutional layers, the performance of ViTs saturates quickly when they are scaled to be deeper.
We propose a simple yet effective method, named Re-attention, to re-generate the attention maps to increase their diversity.
 arXiv (2021-03-22T14:32:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
       
     