MoViT: Memorizing Vision Transformers for Medical Image Analysis
- URL: http://arxiv.org/abs/2303.15553v3
- Date: Fri, 29 Sep 2023 20:14:37 GMT
- Title: MoViT: Memorizing Vision Transformers for Medical Image Analysis
- Authors: Yiqing Shen, Pengfei Guo, Jingpu Wu, Qianqi Huang, Nhat Le, Jinyuan Zhou, Shanshan Jiang, Mathias Unberath
- Abstract summary: We propose a Memorizing Vision Transformer (MoViT) to alleviate the need for large-scale datasets to successfully train and deploy transformer-based architectures.
MoViT can reach performance competitive with ViT using only 3.0% of the training data.
- Score: 13.541165687193581
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The synergy of long-range dependencies from transformers and local
representations of image content from convolutional neural networks (CNNs) has
led to advanced architectures and increased performance for various medical
image analysis tasks due to their complementary benefits. However, compared
with CNNs, transformers require considerably more training data, due to a
larger number of parameters and an absence of inductive bias. The need for
increasingly large datasets continues to be problematic, particularly in the
context of medical imaging, where both annotation efforts and data protection
result in limited data availability. In this work, inspired by the human
decision-making process of correlating new evidence with previously memorized
experience, we propose a Memorizing Vision Transformer (MoViT) to alleviate the
need for large-scale datasets to successfully train and deploy
transformer-based architectures. MoViT leverages an external memory structure
to cache history attention snapshots during the training stage. To prevent
overfitting, we incorporate an innovative memory update scheme, attention
temporal moving average, to update the stored external memories with the
historical moving average. For inference speedup, we design a prototypical
attention learning method to distill the external memory into smaller
representative subsets. We evaluate our method on a public histology image
dataset and an in-house MRI dataset, demonstrating that MoViT, applied to
varied medical image analysis tasks, can outperform vanilla transformer
models across varied data regimes, especially in cases where only a small
amount of annotated data is available. More importantly, MoViT can reach
performance competitive with ViT using only 3.0% of the training data.
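To make the mechanism described in the abstract concrete, below is a minimal, hypothetical PyTorch sketch of a memorizing attention layer: it caches (key, value) snapshots in an external memory during training, retrieves the top-k most similar cached entries at attention time, and refreshes overwritten slots with an exponential (temporal) moving average. All names, shapes, and hyper-parameters (memory_size, top_k, momentum) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a "memorizing" attention layer with an external
# key/value memory and an attention temporal moving average (TMA) update.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MemorizingAttention(nn.Module):
    def __init__(self, dim, memory_size=2048, top_k=32, momentum=0.9):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5
        self.top_k = top_k
        self.momentum = momentum
        # External memory of cached key/value snapshots (not trained by SGD).
        self.register_buffer("mem_k", torch.zeros(memory_size, dim))
        self.register_buffer("mem_v", torch.zeros(memory_size, dim))
        self.register_buffer("mem_ptr", torch.zeros(1, dtype=torch.long))

    def forward(self, x):                                # x: (B, N, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        # Standard local self-attention over the current tokens.
        local = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1) @ v

        # Retrieve the top-k most similar cached keys for every query and
        # attend over the retrieved snapshots (approximate kNN lookup).
        sim = q @ self.mem_k.t() * self.scale            # (B, N, M)
        top_sim, idx = sim.topk(self.top_k, dim=-1)      # (B, N, k)
        mem_v = self.mem_v[idx]                          # (B, N, k, dim)
        memory = (F.softmax(top_sim, dim=-1).unsqueeze(-2) @ mem_v).squeeze(-2)

        if self.training:
            self._update_memory(k.detach(), v.detach())

        # Fuse local and memory-derived context (a simple sum here).
        return self.proj(local + memory)

    @torch.no_grad()
    def _update_memory(self, k, v):
        """Attention temporal moving average: blend new snapshots into the
        slots they overwrite instead of replacing them outright."""
        k, v = k.reshape(-1, k.shape[-1]), v.reshape(-1, v.shape[-1])
        n, m = k.shape[0], self.mem_k.shape[0]
        slots = (self.mem_ptr + torch.arange(n, device=k.device)) % m
        self.mem_k[slots] = self.momentum * self.mem_k[slots] + (1 - self.momentum) * k
        self.mem_v[slots] = self.momentum * self.mem_v[slots] + (1 - self.momentum) * v
        self.mem_ptr[0] = (self.mem_ptr[0] + n) % m
```

The prototypical attention learning step described in the abstract would then compress mem_k/mem_v into a much smaller set of representative entries (for example, by clustering the cached keys) before deployment, so that the memory lookup is cheaper at inference time; that step is omitted from this sketch.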
Related papers
- Optimizing Vision Transformers with Data-Free Knowledge Transfer [8.323741354066474]
Vision transformers (ViTs) have excelled in various computer vision tasks due to their superior ability to capture long-distance dependencies.
We propose compressing large ViT models using Knowledge Distillation (KD), implemented in a data-free manner to circumvent limitations related to data availability.
arXiv Detail & Related papers (2024-08-12T07:03:35Z)
- Efficient Visual State Space Model for Image Deblurring [83.57239834238035]
Convolutional neural networks (CNNs) and Vision Transformers (ViTs) have achieved excellent performance in image restoration.
We propose a simple yet effective visual state space model (EVSSM) for image deblurring.
arXiv Detail & Related papers (2024-05-23T09:13:36Z)
- SeUNet-Trans: A Simple yet Effective UNet-Transformer Model for Medical Image Segmentation [0.0]
We propose a simple yet effective UNet-Transformer (seUNet-Trans) model for medical image segmentation.
In our approach, the UNet model is designed as a feature extractor to generate multiple feature maps from the input images.
By leveraging the UNet architecture and the self-attention mechanism, our model not only preserves both local and global context information but also captures long-range dependencies between input elements.
arXiv Detail & Related papers (2023-10-16T01:13:38Z)
- Efficiently Training Vision Transformers on Structural MRI Scans for Alzheimer's Disease Detection [2.359557447960552]
Vision transformers (ViT) have emerged in recent years as an alternative to CNNs for several computer vision applications.
We tested variants of the ViT architecture on a range of neuroimaging downstream tasks of varying difficulty.
We achieved a performance boost of 5% and 9-10% upon fine-tuning vision transformer models pre-trained on synthetic and real MRI scans.
arXiv Detail & Related papers (2023-03-14T20:18:12Z)
- AMIGO: Sparse Multi-Modal Graph Transformer with Shared-Context Processing for Representation Learning of Giga-pixel Images [53.29794593104923]
We present a novel concept of shared-context processing for whole slide histopathology images.
AMIGO uses the cellular graph within the tissue to provide a single representation for a patient.
We show that our model is strongly robust to missing information to an extent that it can achieve the same performance with as low as 20% of the data.
arXiv Detail & Related papers (2023-03-01T23:37:45Z)
- MultiCrossViT: Multimodal Vision Transformer for Schizophrenia Prediction using Structural MRI and Functional Network Connectivity Data [0.0]
Vision Transformer (ViT) is a pioneering deep learning framework that can address real-world computer vision issues.
ViTs have been shown to outperform traditional deep learning models, such as convolutional neural networks (CNNs).
arXiv Detail & Related papers (2022-11-12T19:07:25Z)
- How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers [74.06040005144382]
Vision Transformers (ViT) have been shown to attain highly competitive performance for a wide range of vision applications.
We conduct a systematic empirical study in order to better understand the interplay between the amount of training data, AugReg, model size and compute budget.
On the public ImageNet-21k dataset, we train ViT models of various sizes that either match or outperform their counterparts trained on the larger, but not publicly available, JFT-300M dataset.
arXiv Detail & Related papers (2021-06-18T17:58:20Z)
- Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)
- Visformer: The Vision-friendly Transformer [105.52122194322592]
We propose a new architecture named Visformer, abbreviated from 'Vision-friendly Transformer'.
With the same computational complexity, Visformer outperforms both the Transformer-based and convolution-based models in terms of ImageNet classification accuracy.
arXiv Detail & Related papers (2021-04-26T13:13:03Z)
- Medical Transformer: Gated Axial-Attention for Medical Image Segmentation [73.98974074534497]
We study the feasibility of using Transformer-based network architectures for medical image segmentation tasks.
We propose a Gated Axial-Attention model which extends the existing architectures by introducing an additional control mechanism in the self-attention module.
To train the model effectively on medical images, we propose a Local-Global training strategy (LoGo) which further improves the performance.
arXiv Detail & Related papers (2021-02-21T18:35:14Z)
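As a rough illustration of the gated axial-attention idea mentioned in the Medical Transformer entry above: self-attention is applied along one spatial axis at a time, and a learnable gate scales the relative positional term so the model can down-weight positional cues when they are unreliable (e.g., on small medical datasets). The sketch below is a simplified, hypothetical single-head variant; class and parameter names are assumptions, not the authors' code.

```python
# Hypothetical single-head gated axial attention along the width axis.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedAxialAttention1D(nn.Module):
    """Attention along the last spatial axis of a (B, C, H, W) feature map."""

    def __init__(self, dim, axis_len):
        super().__init__()
        self.qkv = nn.Conv1d(dim, dim * 3, kernel_size=1, bias=False)
        self.scale = dim ** -0.5
        # Relative positional logits (axis_len must equal W at runtime) and a
        # scalar gate controlling how much they contribute to the attention.
        self.rel_pos = nn.Parameter(torch.randn(axis_len, axis_len) * 0.02)
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, x):                                   # x: (B, C, H, W)
        B, C, H, W = x.shape
        flat = x.permute(0, 2, 1, 3).reshape(B * H, C, W)   # one row per sequence
        q, k, v = self.qkv(flat).chunk(3, dim=1)            # each (B*H, C, W)

        logits = q.transpose(1, 2) @ k * self.scale         # (B*H, W, W)
        logits = logits + torch.sigmoid(self.gate) * self.rel_pos
        attn = F.softmax(logits, dim=-1)

        out = (attn @ v.transpose(1, 2)).transpose(1, 2)    # (B*H, C, W)
        return out.reshape(B, H, C, W).permute(0, 2, 1, 3)  # back to (B, C, H, W)
```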