Transfer Learning for Microstructure Segmentation with CS-UNet: A Hybrid
Algorithm with Transformer and CNN Encoders
- URL: http://arxiv.org/abs/2308.13917v1
- Date: Sat, 26 Aug 2023 16:56:15 GMT
- Title: Transfer Learning for Microstructure Segmentation with CS-UNet: A Hybrid
Algorithm with Transformer and CNN Encoders
- Authors: Khaled Alrfou, Tian Zhao, Amir Kordijazi
- Abstract summary: We compare the segmentation performance of Transformer and CNN models pre-trained on microscopy images with those pre-trained on natural images.
We also find that for image segmentation, the combination of pre-trained Transformers and CNN encoders is consistently better than pre-trained CNN encoders alone.
- Score: 0.2353157426758003
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transfer learning improves the performance of deep learning models by
initializing them with parameters pre-trained on larger datasets. Intuitively,
transfer learning is more effective when pre-training is on the in-domain
datasets. A recent study by NASA has demonstrated that the microstructure
segmentation with encoder-decoder algorithms benefits more from CNN encoders
pre-trained on microscopy images than from those pre-trained on natural images.
However, CNN models only capture the local spatial relations in images. In
recent years, attention networks such as Transformers are increasingly used in
image analysis to capture the long-range relations between pixels. In this
study, we compare the segmentation performance of Transformer and CNN models
pre-trained on microscopy images with those pre-trained on natural images. Our
result partially confirms the NASA study that the segmentation performance of
out-of-distribution images (taken under different imaging and sample
conditions) is significantly improved when pre-training on microscopy images.
However, the performance gain for one-shot and few-shot learning is more modest
with Transformers. We also find that for image segmentation, the combination of
pre-trained Transformers and CNN encoders is consistently better than
pre-trained CNN encoders alone. Our dataset (of about 50,000 images) combines
the public portion of the NASA dataset with additional images we collected.
Even with much less training data, our pre-trained models achieve significantly
better performance for image segmentation. This result suggests that
Transformers and CNNs complement each other, and that pre-training them on
microscopy images makes them more beneficial to downstream tasks.
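As a rough illustration of the hybrid idea described in the abstract, the PyTorch sketch below combines a small CNN encoder (local spatial relations) with a Transformer encoder (long-range relations between patches) and fuses their features in a lightweight decoder. The module names, channel sizes, patch size, and fusion by concatenation are assumptions for illustration only, not the exact CS-UNet architecture.

```python
# Minimal sketch of a hybrid CNN + Transformer encoder for segmentation.
# Channel sizes, patch size, and fusion by concatenation are illustrative
# assumptions, not the exact CS-UNet configuration from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HybridSegNet(nn.Module):
    def __init__(self, in_ch=1, num_classes=2, patch=16, dim=128):
        super().__init__()
        self.patch = patch
        # CNN branch: captures local spatial relations.
        self.cnn = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Transformer branch: captures long-range relations between patches.
        self.patch_embed = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        self.transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        # Decoder head: fuse both feature maps and predict per-pixel classes.
        self.decoder = nn.Sequential(
            nn.Conv2d(128 + dim, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, num_classes, 1),
        )

    def forward(self, x):
        b, _, h, w = x.shape
        f_cnn = self.cnn(x)                                      # (B, 128, H/4, W/4)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        tokens = self.transformer(tokens)
        f_trf = tokens.transpose(1, 2).reshape(b, -1, h // self.patch, w // self.patch)
        # Bring the Transformer features to the CNN feature resolution and fuse.
        f_trf = F.interpolate(f_trf, size=f_cnn.shape[-2:], mode="bilinear",
                              align_corners=False)
        logits = self.decoder(torch.cat([f_cnn, f_trf], dim=1))
        return F.interpolate(logits, size=(h, w), mode="bilinear",
                             align_corners=False)


if __name__ == "__main__":
    model = HybridSegNet()
    print(model(torch.randn(2, 1, 128, 128)).shape)  # torch.Size([2, 2, 128, 128])
```

In the paper's setting, both encoders would additionally be initialized from weights pre-trained on microscopy (or natural) images before fine-tuning on the segmentation task.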
Related papers
- ConvTransSeg: A Multi-resolution Convolution-Transformer Network for
Medical Image Segmentation [14.485482467748113]
We propose a hybrid encoder-decoder segmentation model (ConvTransSeg).
It consists of a multi-layer CNN as the encoder for feature learning and the corresponding multi-level Transformer as the decoder for segmentation prediction.
Our method achieves the best performance in terms of Dice coefficient and average symmetric surface distance measures with low model complexity and memory consumption.
arXiv Detail & Related papers (2022-10-13T14:59:23Z) - An Empirical Study of Remote Sensing Pretraining [117.90699699469639]
We conduct an empirical study of remote sensing pretraining (RSP) on aerial images.
RSP can help deliver distinctive performance in scene recognition tasks.
RSP mitigates the data discrepancies of traditional ImageNet pretraining on RS images, but it may still suffer from task discrepancies.
arXiv Detail & Related papers (2022-04-06T13:38:11Z) - Training Vision Transformers with Only 2040 Images [35.86457465241119]
Vision Transformers (ViTs) are emerging as an alternative to convolutional neural networks (CNNs) for visual recognition.
We give theoretical analyses showing that our method is superior to other methods in that it can capture both feature alignment and instance similarities.
We achieve state-of-the-art results when training from scratch on 7 small datasets under various ViT backbones.
arXiv Detail & Related papers (2022-01-26T03:22:08Z) - Semi-Supervised Medical Image Segmentation via Cross Teaching between
CNN and Transformer [11.381487613753004]
We present a framework for semi-supervised medical image segmentation by introducing the cross teaching between CNN and Transformer.
Notably, this work may be the first attempt to combine CNN and transformer for semi-supervised medical image segmentation and achieve promising results on a public benchmark.
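A minimal sketch of the cross-teaching idea on an unlabeled batch, assuming two arbitrary segmentation networks that output per-pixel class logits; the loss weight is a placeholder, not the paper's exact formulation.

```python
# Sketch of one cross-teaching step: each network is supervised by the
# other's pseudo-labels on unlabeled images. The 0.1 weight is arbitrary.
import torch
import torch.nn.functional as F


def cross_teaching_step(cnn_model, transformer_model, unlabeled_images, weight=0.1):
    logits_cnn = cnn_model(unlabeled_images)
    logits_trf = transformer_model(unlabeled_images)

    # Hard pseudo-labels, detached so gradients only flow through the
    # network being taught, not the one providing the labels.
    pseudo_cnn = logits_cnn.argmax(dim=1).detach()
    pseudo_trf = logits_trf.argmax(dim=1).detach()

    # Cross supervision: CNN learns from the Transformer and vice versa.
    loss_cnn = F.cross_entropy(logits_cnn, pseudo_trf)
    loss_trf = F.cross_entropy(logits_trf, pseudo_cnn)
    return weight * (loss_cnn + loss_trf)
```

In practice this unsupervised term would be added to the ordinary supervised loss computed on the labeled portion of the batch.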
arXiv Detail & Related papers (2021-12-09T13:22:38Z) - Vision Pair Learning: An Efficient Training Framework for Image
Classification [0.8223798883838329]
Transformer and CNN are complementary in representation learning and convergence speed.
Vision Pair Learning (VPL) builds up a network composed of a transformer branch, a CNN branch and pair learning module.
VPL promotes the top-1 accuracy of ViT-Base and ResNet-50 on the ImageNet-1k validation set to 83.47% and 79.61% respectively.
arXiv Detail & Related papers (2021-12-02T03:45:16Z) - Semi-Supervised Vision Transformers [76.83020291497895]
We study the training of Vision Transformers for semi-supervised image classification.
We find Vision Transformers perform poorly on a semi-supervised ImageNet setting.
CNNs achieve superior results in the small labeled data regime.
arXiv Detail & Related papers (2021-11-22T09:28:13Z) - Investigating Transfer Learning Capabilities of Vision Transformers and
CNNs by Fine-Tuning a Single Trainable Block [0.0]
Transformer-based architectures are surpassing the state of the art set by CNN architectures in accuracy but are computationally very expensive to train from scratch.
We study their transfer learning capabilities and compare them with CNNs to understand which architecture is better when applied to real-world problems with small data.
We find that Transformer-based architectures not only achieve higher accuracy than CNNs, but some Transformers even do so with around 4 times fewer parameters.
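A hedged sketch of fine-tuning a single trainable block: freeze a pre-trained ViT, unfreeze only its last Transformer block, and attach a fresh task head. The attribute names (`blocks`, `head`) follow common ViT implementations and are assumptions here, not a specific library's guaranteed API.

```python
# Sketch: freeze a pre-trained backbone and fine-tune only its last
# Transformer block plus a new classification head. The `blocks` and
# `head` attribute names are assumptions about the ViT implementation.
import torch.nn as nn


def prepare_single_block_finetune(vit, num_classes):
    # Freeze every pre-trained parameter first.
    for p in vit.parameters():
        p.requires_grad = False
    # Unfreeze only the last Transformer block.
    for p in vit.blocks[-1].parameters():
        p.requires_grad = True
    # Replace the head for the downstream task; the new layer is trainable.
    vit.head = nn.Linear(vit.head.in_features, num_classes)
    return [p for p in vit.parameters() if p.requires_grad]
```

An optimizer would then be constructed over only the returned parameters, keeping the cost of fine-tuning low.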
arXiv Detail & Related papers (2021-10-11T13:43:03Z) - How to train your ViT? Data, Augmentation, and Regularization in Vision
Transformers [74.06040005144382]
Vision Transformers (ViT) have been shown to attain highly competitive performance for a wide range of vision applications.
We conduct a systematic empirical study in order to better understand the interplay between the amount of training data, AugReg, model size and compute budget.
We train ViT models of various sizes on the public ImageNet-21k dataset which either match or outperform their counterparts trained on the larger, but not publicly available JFT-300M dataset.
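The augmentation-plus-regularization ("AugReg") interplay can be sketched as a training configuration; the specific augmentations and hyperparameter values below are illustrative placeholders, not the study's recipe.

```python
# Illustrative AugReg-style setup: stronger data augmentation paired with
# explicit regularization. All values are placeholders.
import torch
from torchvision import transforms

# Augmentation side: random crops, flips, and RandAugment.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandAugment(num_ops=2, magnitude=9),
    transforms.ToTensor(),
])


def make_optimizer_and_loss(model):
    # Regularization side: weight decay and label smoothing.
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)
    criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)
    return optimizer, criterion
```

The study's point is that tuning this combination jointly with model size and data volume can substitute for an order of magnitude more pre-training data.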
arXiv Detail & Related papers (2021-06-18T17:58:20Z) - Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation [63.46694853953092]
Swin-Unet is an Unet-like pure Transformer for medical image segmentation.
Tokenized image patches are fed into a Transformer-based U-shaped Encoder-Decoder architecture.
arXiv Detail & Related papers (2021-05-12T09:30:26Z) - CNNs for JPEGs: A Study in Computational Cost [49.97673761305336]
Convolutional neural networks (CNNs) have achieved astonishing advances over the past decade.
CNNs are capable of learning robust representations of the data directly from the RGB pixels.
Deep learning methods capable of learning directly from the compressed domain have been gaining attention in recent years.
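A hedged sketch of compressed-domain input: per-block DCT coefficients (a JPEG-like representation) are fed to a small CNN instead of RGB pixels. The block size and the toy network are assumptions, not the architectures surveyed in the paper.

```python
# Sketch: feed 8x8 block-DCT coefficients to a CNN instead of RGB pixels.
# Block size and the toy network are illustrative assumptions.
import numpy as np
import torch
import torch.nn as nn
from scipy.fft import dctn


def block_dct(gray_image, block=8):
    """Return a (block*block, H/block, W/block) tensor of DCT coefficients."""
    h, w = gray_image.shape
    h, w = h - h % block, w - w % block
    coeffs = np.zeros((block * block, h // block, w // block), dtype=np.float32)
    for i in range(0, h, block):
        for j in range(0, w, block):
            c = dctn(gray_image[i:i + block, j:j + block], norm="ortho")
            coeffs[:, i // block, j // block] = c.reshape(-1)
    return torch.from_numpy(coeffs)


# A small CNN that consumes the 64 DCT channels directly.
dct_cnn = nn.Sequential(
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, 10),
)

features = block_dct(np.random.rand(64, 64).astype(np.float32))
print(dct_cnn(features.unsqueeze(0)).shape)  # torch.Size([1, 10])
```

Skipping JPEG decoding and working on coefficients like these is what reduces the computational cost the paper studies.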
arXiv Detail & Related papers (2020-12-26T15:00:10Z) - Curriculum By Smoothing [52.08553521577014]
Convolutional Neural Networks (CNNs) have shown impressive performance in computer vision tasks such as image classification, detection, and segmentation.
We propose an elegant curriculum based scheme that smoothes the feature embedding of a CNN using anti-aliasing or low-pass filters.
As the amount of information in the feature maps increases during training, the network is able to progressively learn better representations of the data.
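A minimal sketch of the smoothing curriculum: blur each convolutional block's feature maps with a Gaussian low-pass filter and anneal its strength toward zero as training progresses. The kernel size and linear annealing schedule are assumptions.

```python
# Sketch of Curriculum By Smoothing: low-pass filter a block's feature
# maps, with the blur strength annealed by the training loop. Kernel size
# and schedule are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


def gaussian_kernel(size=5, sigma=1.0):
    ax = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    g = torch.exp(-(ax ** 2) / (2 * sigma ** 2))
    k = torch.outer(g, g)
    return k / k.sum()


def smooth_features(feat, sigma):
    """Depthwise Gaussian blur of a (B, C, H, W) feature map."""
    if sigma <= 0:
        return feat
    c = feat.shape[1]
    k = gaussian_kernel(5, sigma).to(feat).view(1, 1, 5, 5).repeat(c, 1, 1, 1)
    return F.conv2d(feat, k, padding=2, groups=c)


class SmoothedConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.sigma = 1.0  # annealed toward 0 by the training loop

    def forward(self, x):
        return F.relu(smooth_features(self.conv(x), self.sigma))


# The training loop would gradually reduce the blur, e.g.:
# block.sigma = max(0.0, 1.0 * (1 - epoch / total_epochs))
```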
arXiv Detail & Related papers (2020-03-03T07:27:44Z)