Isolated Channel Vision Transformers: From Single-Channel Pretraining to Multi-Channel Finetuning
- URL: http://arxiv.org/abs/2503.09826v1
- Date: Wed, 12 Mar 2025 20:45:02 GMT
- Title: Isolated Channel Vision Transformers: From Single-Channel Pretraining to Multi-Channel Finetuning
- Authors: Wenyi Lian, Joakim Lindblad, Patrick Micke, Nataša Sladoje
- Abstract summary: We introduce a simple yet effective pretraining framework for large-scale MCI datasets. Our method, named Isolated Channel ViT (IC-ViT), patchifies image channels individually and thereby enables pretraining for multimodal multi-channel tasks. Experiments on various tasks and benchmarks, including JUMP-CP and CHAMMI for cell microscopy imaging, and So2Sat-LCZ42 for satellite imaging, show that the proposed IC-ViT delivers 4-14 percentage points of performance improvement.
- Score: 3.4170567485926373
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformers (ViTs) have achieved remarkable success in standard RGB image processing tasks. However, applying ViTs to multi-channel imaging (MCI) data, e.g., for medical and remote sensing applications, remains a challenge. In particular, MCI data often consist of layers acquired from different modalities. Directly training ViTs on such data can obscure complementary information and impair the performance. In this paper, we introduce a simple yet effective pretraining framework for large-scale MCI datasets. Our method, named Isolated Channel ViT (IC-ViT), patchifies image channels individually and thereby enables pretraining for multimodal multi-channel tasks. We show that this channel-wise patchifying is a key technique for MCI processing. More importantly, one can pretrain the IC-ViT on single channels and finetune it on downstream multi-channel datasets. This pretraining framework captures dependencies between patches as well as channels and produces robust feature representation. Experiments on various tasks and benchmarks, including JUMP-CP and CHAMMI for cell microscopy imaging, and So2Sat-LCZ42 for satellite imaging, show that the proposed IC-ViT delivers 4-14 percentage points of performance improvement over existing channel-adaptive approaches. Further, its efficient training makes it a suitable candidate for large-scale pretraining of foundation models on heterogeneous data.
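The channel-wise patchification described in the abstract can be illustrated with a short sketch. Below is a minimal PyTorch example of embedding each channel in isolation so that a model pretrained on single-channel inputs can later be finetuned on multi-channel data; the class name `IsolatedChannelPatchEmbed` and its parameters are illustrative assumptions, not taken from the authors' code release.

```python
# Minimal sketch of isolated-channel (channel-wise) patchification for a ViT.
# Names and hyperparameters are illustrative, not the authors' implementation.
import torch
import torch.nn as nn

class IsolatedChannelPatchEmbed(nn.Module):
    """Turn an image of shape (B, C, H, W) into C * (H/p) * (W/p) tokens.

    Each channel is patchified on its own, so the same projection can be
    pretrained on single-channel inputs (C = 1) and reused when finetuning
    on multi-channel inputs (C > 1): only the token count changes.
    """

    def __init__(self, patch_size: int = 16, embed_dim: int = 768):
        super().__init__()
        self.patch_size = patch_size
        # One shared projection applied to every channel independently
        # (in_channels=1, so channels are never mixed at the embedding stage).
        self.proj = nn.Conv2d(1, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # Fold channels into the batch dimension so each is embedded in isolation.
        x = x.reshape(b * c, 1, h, w)
        x = self.proj(x)                         # (B*C, D, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)         # (B*C, N, D) with N = (H/p)*(W/p)
        return x.reshape(b, c * x.shape[1], -1)  # (B, C*N, D)

if __name__ == "__main__":
    embed = IsolatedChannelPatchEmbed(patch_size=16, embed_dim=768)
    single = embed(torch.randn(2, 1, 224, 224))  # single-channel pretraining: 196 tokens
    multi = embed(torch.randn(2, 5, 224, 224))   # multi-channel finetuning: 980 tokens
    print(single.shape, multi.shape)             # [2, 196, 768], [2, 980, 768]
```

Because the projection never mixes channels, the token sequence simply grows with the number of channels at finetuning time, and the transformer's self-attention can then capture dependencies across both patches and channels, as the abstract describes.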
Related papers
- ChA-MAEViT: Unifying Channel-Aware Masked Autoencoders and Multi-Channel Vision Transformers for Improved Cross-Channel Learning [17.04905100460915]
ChA-MAEViT enhances feature learning across Multi-Channel Imaging (MCI) channels via four key strategies.
ChA-MAEViT significantly outperforms state-of-the-art MCI-ViTs by 3.0-21.5%.
arXiv Detail & Related papers (2025-03-25T03:45:59Z) - Your ViT is Secretly an Image Segmentation Model [50.71238842539735]
Vision Transformers (ViTs) have shown remarkable performance and scalability across various computer vision tasks.
We show that inductive biases introduced by task-specific components can instead be learned by the ViT itself.
We introduce the Encoder-only Mask Transformer (EoMT), which repurposes the plain ViT architecture to conduct image segmentation.
arXiv Detail & Related papers (2025-03-24T19:56:02Z) - Scalable Transformer for High Dimensional Multivariate Time Series Forecasting [10.17270031004674]
This study investigates the reasons behind the suboptimal performance of channel-dependent models on high-dimensional MTS data.
We propose STHD, the Scalable Transformer for High-Dimensional Multivariate Time Series Forecasting.
Experiments show STHD's considerable improvement on three high-dimensional datasets: Crime-Chicago, Wiki-People, and Traffic.
arXiv Detail & Related papers (2024-08-08T06:17:13Z) - Enhancing Feature Diversity Boosts Channel-Adaptive Vision Transformers [18.731717752379232]
Multi-Channel Imaging (MCI) models must support a variety of channel configurations at test time.
Recent work has extended traditional visual encoders for MCI, such as Vision Transformers (ViT), by supplementing pixel information with an encoding representing the channel configuration.
We propose DiChaViT, which aims to enhance the diversity in the learned features of MCI-ViT models.
arXiv Detail & Related papers (2024-05-26T03:41:40Z) - Frequency-Aware Transformer for Learned Image Compression [64.28698450919647]
We propose a frequency-aware transformer (FAT) block that, for the first time, achieves multiscale directional analysis for Learned Image Compression (LIC). The FAT block comprises frequency-decomposition window attention (FDWA) modules to capture multiscale and directional frequency components of natural images. We also introduce a frequency-modulation feed-forward network (FMFFN) to adaptively modulate different frequency components, improving rate-distortion performance.
arXiv Detail & Related papers (2023-10-25T05:59:25Z) - Channel Vision Transformers: An Image Is Worth 1 x 16 x 16 Words [7.210982964205077]
Vision Transformer (ViT) has emerged as a powerful architecture in modern computer vision.
However, its application in certain imaging fields, such as microscopy and satellite imaging, presents unique challenges.
We propose a modification to the ViT architecture that enhances reasoning across the input channels.
arXiv Detail & Related papers (2023-09-28T02:20:59Z) - CAIT: Triple-Win Compression towards High Accuracy, Fast Inference, and Favorable Transferability For ViTs [79.54107547233625]
Vision Transformers (ViTs) have emerged as state-of-the-art models for various vision tasks.
We propose a joint compression method for ViTs that offers both high accuracy and fast inference speed.
Our proposed method can achieve state-of-the-art performance across various ViTs.
arXiv Detail & Related papers (2023-09-27T16:12:07Z) - Bridging the Gap Between Vision Transformers and Convolutional Neural Networks on Small Datasets [91.25055890980084]
There still remains an extreme performance gap between Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) when training from scratch on small datasets.
We propose Dynamic Hybrid Vision Transformer (DHVT) as the solution to enhance the two inductive biases.
Our DHVT achieves state-of-the-art performance with lightweight models: 85.68% on CIFAR-100 with 22.8M parameters and 82.3% on ImageNet-1K with 24.0M parameters.
arXiv Detail & Related papers (2022-10-12T06:54:39Z) - Channel-Level Variable Quantization Network for Deep Image Compression [50.3174629451739]
We propose a channel-level variable quantization network that dynamically allocates more convolutions to significant channels and withdraws them from negligible ones.
Our method achieves superior performance and can produce much better visual reconstructions.
arXiv Detail & Related papers (2020-07-15T07:20:39Z) - Channel Interaction Networks for Fine-Grained Image Categorization [61.095320862647476]
Fine-grained image categorization is challenging due to the subtle inter-class differences.
We propose a channel interaction network (CIN), which models the channel-wise interplay both within an image and across images.
Our model can be trained efficiently in an end-to-end fashion without the need of multi-stage training and testing.
arXiv Detail & Related papers (2020-03-11T11:51:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.