ViT-AE++: Improving Vision Transformer Autoencoder for Self-supervised
Medical Image Representations
- URL: http://arxiv.org/abs/2301.07382v2
- Date: Mon, 15 May 2023 20:40:07 GMT
- Title: ViT-AE++: Improving Vision Transformer Autoencoder for Self-supervised
Medical Image Representations
- Authors: Chinmay Prabhakar, Hongwei Bran Li, Jiancheng Yang, Suprosanna Shit,
Benedikt Wiestler, and Bjoern Menze
- Abstract summary: Vision transformer-based autoencoder (ViT-AE) is a self-supervised learning technique that employs a patch-masking strategy to learn a meaningful latent space.
We propose two new loss functions to enhance the representation during training.
We extensively evaluate ViT-AE++ on both natural images and medical images, demonstrating consistent improvement over vanilla ViT-AE.
- Score: 3.6284577335311554
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Self-supervised learning has attracted increasing attention as it learns
data-driven representation from data without annotations. Vision
transformer-based autoencoder (ViT-AE) by He et al. (2021) is a recent
self-supervised learning technique that employs a patch-masking strategy to
learn a meaningful latent space. In this paper, we focus on improving ViT-AE
(nicknamed ViT-AE++) for a more effective representation of 2D and 3D medical
images. We propose two new loss functions to enhance the representation during
training. The first loss term aims to improve self-reconstruction by
considering the structured dependencies and indirectly improving the
representation. The second loss term leverages contrastive loss to optimize the
representation from two randomly masked views directly. As an independent
contribution, we extend ViT-AE++ to 3D for volumetric medical images. We
extensively evaluate ViT-AE++ on both natural images and medical images,
demonstrating consistent improvement over vanilla ViT-AE and its superiority
over other contrastive learning approaches. Codes are here:
https://github.com/chinmay5/vit_ae_plus_plus.git.
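To make the two additions concrete, below is a minimal PyTorch-style sketch of how the objective described above could be assembled: an MAE-style pixel reconstruction term, a structure-aware term (assumed here to be an SSIM-like callable), and an InfoNCE-style contrastive term between the pooled latents of two independently masked views. The names `encoder`, `decoder`, `mask_fn`, `structured_loss`, and the loss weights are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of the ViT-AE++ objective described in the abstract; the exact
# formulation in the paper and repository may differ.
import torch
import torch.nn.functional as F


def contrastive_loss(z1, z2, temperature=0.1):
    """InfoNCE-style loss between latents of two masked views (assumed form)."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature                 # (B, B) similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def vit_ae_pp_loss(encoder, decoder, images, mask_fn, structured_loss,
                   w_struct=1.0, w_con=1.0):
    """Pixel reconstruction + structure-aware term + contrastive term
    over two independently masked views of the same batch."""
    view1 = mask_fn(images)                            # random patch masking, view 1
    view2 = mask_fn(images)                            # random patch masking, view 2
    z1, z2 = encoder(view1), encoder(view2)            # latent tokens, (B, N, D)
    rec1, rec2 = decoder(z1), decoder(z2)              # reconstructed images
    l_rec = F.mse_loss(rec1, images) + F.mse_loss(rec2, images)
    l_struct = structured_loss(rec1, images) + structured_loss(rec2, images)
    l_con = contrastive_loss(z1.mean(dim=1), z2.mean(dim=1))   # pool tokens
    return l_rec + w_struct * l_struct + w_con * l_con
```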
Related papers
- Seeing the Pose in the Pixels: Learning Pose-Aware Representations in
Vision Transformers [1.8047694351309207]
We introduce two strategies for learning pose-aware representations in Vision Transformers (ViTs).
The first method, called Pose-aware Attention Block (PAAB), is a plug-and-play ViT block that performs localized attention on pose regions within videos.
The second method, dubbed Pose-Aware Auxiliary Task (PAAT), presents an auxiliary pose prediction task optimized jointly with the primary ViT task.
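One plausible reading of the PAAT formulation is a simple weighted joint objective; the regression form of the pose loss and the weight `alpha` below are assumptions for illustration only, not the authors' code.

```python
# Hedged sketch of jointly optimizing a primary ViT task with an auxiliary
# pose-prediction loss, as PAAT is summarized above.
import torch.nn.functional as F


def paat_objective(task_logits, task_labels, pose_pred, pose_targets, alpha=0.5):
    l_task = F.cross_entropy(task_logits, task_labels)    # primary ViT task
    l_pose = F.mse_loss(pose_pred, pose_targets)           # auxiliary pose regression
    return l_task + alpha * l_pose
```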
arXiv Detail & Related papers (2023-06-15T17:58:39Z)
- RangeViT: Towards Vision Transformers for 3D Semantic Segmentation in Autonomous Driving [80.14669385741202]
Vision transformers (ViTs) have achieved state-of-the-art results in many image-based benchmarks.
ViTs are notoriously hard to train and require a lot of training data to learn powerful representations.
We show that our method, called RangeViT, outperforms existing projection-based methods on nuScenes and Semantic KITTI.
arXiv Detail & Related papers (2023-01-24T18:50:48Z)
- What do Vision Transformers Learn? A Visual Exploration [68.50771218442776]
Vision transformers (ViTs) are quickly becoming the de-facto architecture for computer vision.
This paper addresses the obstacles to performing visualizations on ViTs and explores the underlying differences between ViTs and CNNs.
We also conduct large-scale visualizations on a range of ViT variants, including DeiT, CoaT, ConViT, PiT, Swin, and Twin.
arXiv Detail & Related papers (2022-12-13T16:55:12Z)
- Patch-level Representation Learning for Self-supervised Vision Transformers [68.8862419248863]
Vision Transformers (ViTs) have gained much attention recently as a better architectural choice, often outperforming convolutional networks for various visual tasks.
Inspired by this, we design a simple yet effective visual pretext task, coined SelfPatch, for learning better patch-level representations.
We demonstrate that SelfPatch can significantly improve the performance of existing SSL methods for various visual tasks.
arXiv Detail & Related papers (2022-06-16T08:01:19Z)
- SERE: Exploring Feature Self-relation for Self-supervised Transformer [79.5769147071757]
Vision transformers (ViTs) have strong representation ability with spatial self-attention and channel-level feedforward networks.
Recent works reveal that self-supervised learning helps unleash the great potential of ViT.
We observe that relational modeling on spatial and channel dimensions distinguishes ViT from other networks.
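As a rough illustration of what relational modeling on the spatial and channel dimensions could look like, the sketch below computes token-to-token and channel-to-channel affinity matrices from ViT tokens; the softmax normalization and temperature are assumptions, not SERE's exact formulation.

```python
# Hedged sketch of feature self-relation on the spatial and channel dimensions.
import torch.nn.functional as F


def spatial_self_relation(tokens, temperature=0.5):
    # tokens: (B, N, C) -> (B, N, N) token-to-token affinities
    t = F.normalize(tokens, dim=-1)
    return F.softmax(t @ t.transpose(1, 2) / temperature, dim=-1)


def channel_self_relation(tokens, temperature=0.5):
    # tokens: (B, N, C) -> (B, C, C) channel-to-channel affinities
    t = F.normalize(tokens, dim=1)
    return F.softmax(t.transpose(1, 2) @ t / temperature, dim=-1)
```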
arXiv Detail & Related papers (2022-06-10T15:25:00Z)
- Position Labels for Self-Supervised Vision Transformer [1.3406858660972554]
Position encoding is important for vision transformer (ViT) to capture the spatial structure of the input image.
We propose two position labels dedicated to 2D images: absolute position and relative position.
Our position labels can be easily plugged into transformers and combined with various current ViT variants.
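One way to picture the absolute-position variant is a per-token classification head that predicts each patch's grid index from its embedding; the head and loss below are illustrative assumptions, not the paper's exact design.

```python
# Hedged sketch of an absolute-position pretext head for ViT patch tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AbsolutePositionHead(nn.Module):
    def __init__(self, dim, num_patches):
        super().__init__()
        self.classifier = nn.Linear(dim, num_patches)  # one class per grid cell

    def forward(self, patch_tokens):
        # patch_tokens: (B, N, dim) -> (B, N, num_patches) position logits
        return self.classifier(patch_tokens)


def absolute_position_loss(logits):
    B, N, _ = logits.shape
    targets = torch.arange(N, device=logits.device).expand(B, N)
    return F.cross_entropy(logits.reshape(B * N, N), targets.reshape(B * N))
```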
arXiv Detail & Related papers (2022-06-10T10:29:20Z)
- Emerging Properties in Self-Supervised Vision Transformers [57.36837447500544]
We show that self-supervised learning provides Vision Transformers (ViTs) with new properties that stand out compared to convolutional networks (convnets).
We implement our findings into a simple self-supervised method, called DINO, which we interpret as a form of self-distillation with no labels.
We show the synergy between DINO and ViTs by achieving 80.1% top-1 on ImageNet in linear evaluation with ViT-Base.
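For context, the "self-distillation with no labels" idea can be sketched as a student matching a momentum teacher's centered, sharpened output distribution; the temperatures and EMA momentum below are typical values assumed for illustration, not the reference implementation.

```python
# Hedged sketch of DINO-style self-distillation, simplified from the idea
# summarized above.
import torch
import torch.nn.functional as F


def dino_loss(student_out, teacher_out, center, t_student=0.1, t_teacher=0.04):
    """Cross-entropy between the teacher's centered/sharpened distribution
    and the student's distribution over the same output space."""
    teacher_probs = F.softmax((teacher_out - center) / t_teacher, dim=-1).detach()
    student_logp = F.log_softmax(student_out / t_student, dim=-1)
    return -(teacher_probs * student_logp).sum(dim=-1).mean()


@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    """The teacher tracks an exponential moving average of the student."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)
```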
arXiv Detail & Related papers (2021-04-29T12:28:51Z)
- DeepViT: Towards Deeper Vision Transformer [92.04063170357426]
Vision transformers (ViTs) have been successfully applied in image classification tasks recently.
We show that, unlike convolutional neural networks (CNNs), which can be improved by stacking more convolutional layers, the performance of ViTs saturates quickly when they are made deeper.
We propose a simple yet effective method, named Re-attention, to re-generate the attention maps to increase their diversity.
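The Re-attention idea can be sketched as mixing the per-head attention maps with a small learnable head-to-head matrix before applying them to the values; the normalization choice and initialization below are assumptions, not the paper's exact module.

```python
# Hedged sketch of a Re-attention block that regenerates attention maps by
# mixing them across heads.
import torch
import torch.nn as nn


class ReAttention(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.theta = nn.Parameter(torch.eye(num_heads))   # learnable head-mixing matrix
        self.norm = nn.BatchNorm2d(num_heads)             # normalize the mixed maps
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)              # each: (B, H, N, d)
        attn = (q @ k.transpose(-2, -1)) * self.head_dim ** -0.5
        attn = attn.softmax(dim=-1)
        attn = torch.einsum('hg,bgnm->bhnm', self.theta, attn)  # re-attention step
        attn = self.norm(attn)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```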
arXiv Detail & Related papers (2021-03-22T14:32:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.