Channel Vision Transformers: An Image Is Worth 1 x 16 x 16 Words
- URL: http://arxiv.org/abs/2309.16108v4
- Date: Fri, 19 Apr 2024 02:05:02 GMT
- Title: Channel Vision Transformers: An Image Is Worth 1 x 16 x 16 Words
- Authors: Yujia Bao, Srinivasan Sivanandan, Theofanis Karaletsos
- Abstract summary: Vision Transformer (ViT) has emerged as a powerful architecture in modern computer vision.
However, its application in certain imaging fields, such as microscopy and satellite imaging, presents unique challenges.
We propose a modification to the ViT architecture that enhances reasoning across the input channels.
- Score: 7.210982964205077
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision Transformer (ViT) has emerged as a powerful architecture in the realm of modern computer vision. However, its application in certain imaging fields, such as microscopy and satellite imaging, presents unique challenges. In these domains, images often contain multiple channels, each carrying semantically distinct and independent information. Furthermore, the model must demonstrate robustness to sparsity in input channels, as they may not be densely available during training or testing. In this paper, we propose a modification to the ViT architecture that enhances reasoning across the input channels and introduce Hierarchical Channel Sampling (HCS) as an additional regularization technique to ensure robustness when only partial channels are presented during test time. Our proposed model, ChannelViT, constructs patch tokens independently from each input channel and utilizes a learnable channel embedding that is added to the patch tokens, similar to positional embeddings. We evaluate the performance of ChannelViT on ImageNet, JUMP-CP (microscopy cell imaging), and So2Sat (satellite imaging). Our results show that ChannelViT outperforms ViT on classification tasks and generalizes well, even when a subset of input channels is used during testing. Across our experiments, HCS proves to be a powerful regularizer, independent of the architecture employed, suggesting itself as a straightforward technique for robust ViT training. Lastly, we find that ChannelViT generalizes effectively even when there is limited access to all channels during training, highlighting its potential for multi-channel imaging under real-world conditions with sparse sensors. Our code is available at https://github.com/insitro/ChannelViT.
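A short sketch may help make the two mechanisms described in the abstract concrete: patch tokens built independently per channel with a learnable channel embedding, and Hierarchical Channel Sampling (HCS). This is a minimal illustration based on the abstract rather than the authors' reference implementation (see the linked repository for that); the names ChannelViTEmbed and hierarchical_channel_sampling, the shared single-channel projection, and the two-stage sampling scheme are assumptions made here for clarity.

```python
# Minimal sketch of ChannelViT-style tokenization and HCS, assuming square
# images and a single patch size. Not the authors' reference implementation.
import torch
import torch.nn as nn

class ChannelViTEmbed(nn.Module):
    """Builds patch tokens per input channel and adds a learnable channel embedding."""
    def __init__(self, img_size=224, patch_size=16, num_channels=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # One shared projection applied to each channel independently (1 x 16 x 16 patches).
        self.proj = nn.Conv2d(1, dim, kernel_size=patch_size, stride=patch_size)
        self.channel_embed = nn.Parameter(torch.zeros(num_channels, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, dim))

    def forward(self, x, channel_ids):
        # x: (B, C_present, H, W); channel_ids: indices of the channels actually present.
        tokens = []
        for i, c in enumerate(channel_ids):
            t = self.proj(x[:, i : i + 1])         # (B, dim, H/ps, W/ps)
            t = t.flatten(2).transpose(1, 2)       # (B, num_patches, dim)
            t = t + self.pos_embed + self.channel_embed[c]
            tokens.append(t)
        # Sequence length grows with the number of channels: C_present * num_patches tokens.
        return torch.cat(tokens, dim=1)

def hierarchical_channel_sampling(num_channels):
    """Two-stage channel dropout: first sample how many channels to keep,
    then sample which ones, so every subset size is seen during training."""
    k = torch.randint(1, num_channels + 1, (1,)).item()
    keep = torch.randperm(num_channels)[:k]
    return torch.sort(keep).values

# Example: tokenize a 5-channel image when only channels 0, 2, and 4 are present.
embed = ChannelViTEmbed(img_size=224, patch_size=16, num_channels=5, dim=768)
x = torch.randn(2, 3, 224, 224)            # batch with 3 of the 5 channels available
tokens = embed(x, channel_ids=[0, 2, 4])   # (2, 3 * 196, 768)
```

In this reading, the indices returned by hierarchical_channel_sampling would be used to sub-select channels before tokenization during training, so the model is exposed to every subset size rather than only the subsets produced by independent per-channel dropout.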
Related papers
- Your ViT is Secretly an Image Segmentation Model [50.71238842539735]
Vision Transformers (ViTs) have shown remarkable performance and scalability across various computer vision tasks.
We show that inductive biases introduced by task-specific components can instead be learned by the ViT itself.
We introduce the Encoder-only Mask Transformer (EoMT), which repurposes the plain ViT architecture to conduct image segmentation.
arXiv Detail & Related papers (2025-03-24T19:56:02Z) - Isolated Channel Vision Transformers: From Single-Channel Pretraining to Multi-Channel Finetuning [3.4170567485926373]
We introduce a simple yet effective pretraining framework for large-scale MCI datasets.
Our method, named Isolated Channel ViT (IC-ViT), patchifies image channels individually and thereby enables pretraining for multimodal multi-channel tasks.
Experiments on various tasks and benchmarks, including JUMP-CP and CHAMMI for cell microscopy imaging, and So2Sat-LCZ42 for satellite imaging, show that the proposed IC-ViT delivers 4-14 percentage points of performance improvement.
arXiv Detail & Related papers (2025-03-12T20:45:02Z) - Enhancing Feature Diversity Boosts Channel-Adaptive Vision Transformers [18.731717752379232]
Multi-Channel Imaging (MCI) models must support a variety of channel configurations at test time.
Recent work has extended traditional visual encoders for MCI, such as Vision Transformers (ViT), by supplementing pixel information with an encoding representing the channel configuration.
We propose DiChaViT, which aims to enhance the diversity in the learned features of MCI-ViT models.
arXiv Detail & Related papers (2024-05-26T03:41:40Z) - CAIT: Triple-Win Compression towards High Accuracy, Fast Inference, and Favorable Transferability For ViTs [79.54107547233625]
Vision Transformers (ViTs) have emerged as state-of-the-art models for various vision tasks.
We propose a joint compression method for ViTs that offers both high accuracy and fast inference speed.
Our proposed method can achieve state-of-the-art performance across various ViTs.
arXiv Detail & Related papers (2023-09-27T16:12:07Z) - Patch Is Not All You Need [57.290256181083016]
We propose a novel Pattern Transformer to adaptively convert images to pattern sequences for Transformer input.
We employ the Convolutional Neural Network to extract various patterns from the input image.
We have accomplished state-of-the-art performance on CIFAR-10 and CIFAR-100, and have achieved competitive results on ImageNet.
arXiv Detail & Related papers (2023-08-21T13:54:00Z) - WITT: A Wireless Image Transmission Transformer for Semantic Communications [11.480385893433802]
We redesign the Vision Transformer (ViT) as a new backbone to realize a wireless image transmission transformer (WITT).
WITT is highly optimized for image transmission while accounting for the effect of the wireless channel.
Our experiments verify that WITT attains better performance across different image resolutions, distortion metrics, and channel conditions.
arXiv Detail & Related papers (2022-11-02T07:50:27Z) - Bridging the Gap Between Vision Transformers and Convolutional Neural Networks on Small Datasets [91.25055890980084]
There still remains an extreme performance gap between Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) when training from scratch on small datasets.
We propose the Dynamic Hybrid Vision Transformer (DHVT) to strengthen the two inductive biases that ViTs lack on small datasets.
DHVT achieves state-of-the-art performance with lightweight models: 85.68% on CIFAR-100 with 22.8M parameters and 82.3% on ImageNet-1K with 24.0M parameters.
arXiv Detail & Related papers (2022-10-12T06:54:39Z) - Image Captioning In the Transformer Age [71.06437715212911]
Image Captioning (IC) has achieved remarkable progress by incorporating various techniques into the CNN-RNN encoder-decoder architecture.
This paper analyzes the connections between IC and some popular self-supervised learning paradigms.
arXiv Detail & Related papers (2022-04-15T08:13:39Z) - DeepJSCC-Q: Channel Input Constrained Deep Joint Source-Channel Coding [5.046831208137847]
DeepJSCC-Q is an end-to-end optimized joint source-channel coding scheme for wireless image transmission.
It preserves the graceful degradation of image quality observed in prior work when channel conditions worsen.
arXiv Detail & Related papers (2021-11-25T11:59:17Z) - Intriguing Properties of Vision Transformers [114.28522466830374]
Vision transformers (ViT) have demonstrated impressive performance across various machine vision problems.
We systematically study these models via an extensive set of experiments and comparisons with a high-performing convolutional neural network (CNN).
We show that the effective features of ViTs are due to the flexible and dynamic receptive fields made possible by the self-attention mechanism.
arXiv Detail & Related papers (2021-05-21T17:59:18Z) - Wireless Image Retrieval at the Edge [20.45405359815043]
We study the image retrieval problem at the wireless edge, where an edge device captures an image, which is then used to retrieve similar images from an edge server.
Our goal is to maximize the accuracy of the retrieval task under power and bandwidth constraints over the wireless link.
We propose two alternative schemes based on digital and analog communications, respectively.
arXiv Detail & Related papers (2020-07-21T16:15:40Z) - Channel Interaction Networks for Fine-Grained Image Categorization [61.095320862647476]
Fine-grained image categorization is challenging due to the subtle inter-class differences.
We propose a channel interaction network (CIN), which models the channel-wise interplay both within an image and across images.
Our model can be trained efficiently in an end-to-end fashion without the need for multi-stage training and testing.
arXiv Detail & Related papers (2020-03-11T11:51:51Z)