Enhancing Feature Diversity Boosts Channel-Adaptive Vision Transformers
- URL: http://arxiv.org/abs/2405.16419v2
- Date: Mon, 28 Oct 2024 13:07:20 GMT
- Title: Enhancing Feature Diversity Boosts Channel-Adaptive Vision Transformers
- Authors: Chau Pham, Bryan A. Plummer
- Abstract summary: Multi-Channel Imaging (MCI) models must support a variety of channel configurations at test time.
Recent work has extended traditional visual encoders for MCI, such as Vision Transformers (ViT), by supplementing pixel information with an encoding representing the channel configuration.
We propose DiChaViT, which aims to enhance the diversity in the learned features of MCI-ViT models.
- Score: 18.731717752379232
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-Channel Imaging (MCI) contains an array of challenges for encoding useful feature representations not present in traditional images. For example, images from two different satellites may both contain RGB channels, but the remaining channels can be different for each imaging source. Thus, MCI models must support a variety of channel configurations at test time. Recent work has extended traditional visual encoders for MCI, such as Vision Transformers (ViT), by supplementing pixel information with an encoding representing the channel configuration. However, these methods treat each channel equally, i.e., they do not consider the unique properties of each channel type, which can result in needless and potentially harmful redundancies in the learned features. For example, if RGB channels are always present, the other channels can focus on extracting information that cannot be captured by the RGB channels. To this end, we propose DiChaViT, which aims to enhance the diversity in the learned features of MCI-ViT models. This is achieved through a novel channel sampling strategy that encourages the selection of more distinct channel sets for training. Additionally, we employ regularization and initialization techniques to increase the likelihood that new information is learned from each channel. Many of our improvements are architecture agnostic and can be incorporated into new architectures as they are developed. Experiments on satellite and cell microscopy datasets (CHAMMI, JUMP-CP, and So2Sat) show that DiChaViT yields a 1.5-5.0% gain over the state of the art. Our code is publicly available at https://github.com/chaudatascience/diverse_channel_vit.
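As a concrete illustration, here is a minimal PyTorch sketch of one way a diversity-promoting channel sampler could work: each draw is biased toward channels whose learned embeddings differ most from the channels already selected. The function name and the temperature-based weighting are illustrative assumptions, not DiChaViT's actual implementation.

```python
import torch
import torch.nn.functional as F

def sample_diverse_channels(channel_embeds, k, temperature=0.1):
    """Sample k distinct channel indices (k <= C), favoring channels whose
    embeddings are dissimilar to the already-selected set (hypothetical sketch)."""
    C = channel_embeds.size(0)
    embeds = F.normalize(channel_embeds, dim=-1)        # cosine geometry
    selected = [torch.randint(C, (1,)).item()]          # first channel: uniform
    while len(selected) < k:
        remaining = [c for c in range(C) if c not in selected]
        sims = embeds[remaining] @ embeds[selected].T   # (|remaining|, |selected|)
        closeness = sims.max(dim=1).values              # similarity to chosen set
        probs = F.softmax(-closeness / temperature, dim=0)  # dissimilar -> likely
        selected.append(remaining[torch.multinomial(probs, 1).item()])
    return selected
```

Lower temperatures push the sampler toward near-deterministic farthest-point selection; higher ones approach uniform sampling.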
Related papers
- ChA-MAEViT: Unifying Channel-Aware Masked Autoencoders and Multi-Channel Vision Transformers for Improved Cross-Channel Learning [17.04905100460915]
ChA-MAEViT enhances feature learning across Multi-Channel Imaging (MCI) channels via four key strategies.
ChA-MAEViT significantly outperforms state-of-the-art MCI-ViTs by 3.0-21.5%.
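The summary does not spell out the four strategies, but the core channel-aware MAE idea can be sketched as masking whole channels in addition to random patches during pretraining. Everything below (shapes, ratios, names) is an assumption for illustration, not ChA-MAEViT's actual scheme.

```python
import torch

def channel_aware_mask(tokens, patch_ratio=0.5, channel_ratio=0.25):
    """Mask entire channels plus random patches for MAE-style pretraining
    (hypothetical sketch).

    tokens: (B, C, N, D) per-channel patch tokens.
    Returns zeroed-out tokens and the boolean reconstruction mask.
    """
    B, C, N, _ = tokens.shape
    chan_mask = torch.rand(B, C, 1) < channel_ratio   # drop whole channels
    patch_mask = torch.rand(B, C, N) < patch_ratio    # drop patches elsewhere
    mask = chan_mask | patch_mask                     # (B, C, N)
    return tokens.masked_fill(mask.unsqueeze(-1), 0.0), mask
```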
arXiv Detail & Related papers (2025-03-25T03:45:59Z) - Isolated Channel Vision Transformers: From Single-Channel Pretraining to Multi-Channel Finetuning [3.4170567485926373]
We introduce a simple yet effective pretraining framework for large-scale MCI datasets.
Our method, named Isolated Channel ViT (IC-ViT), patchifies image channels individually and thereby enables pretraining for multimodal multi-channel tasks.
Experiments on various tasks and benchmarks, including JUMP-CP and CHAMMI for cell microscopy imaging, and So2Sat-LCZ42 for satellite imaging, show that the proposed IC-ViT delivers 4-14 percentage points of performance improvement.
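A minimal sketch of the per-channel patchification the summary describes: each channel is embedded as if it were its own single-channel image, so the token count scales with the number of channels present. The module below illustrates the idea and is not the IC-ViT code.

```python
import torch
import torch.nn as nn

class PerChannelPatchEmbed(nn.Module):
    """Patchify every image channel independently (illustrative sketch)."""

    def __init__(self, patch_size=16, dim=768):
        super().__init__()
        # One single-channel projection shared by all channels, so any
        # channel count works at test time. H and W must be divisible
        # by patch_size.
        self.proj = nn.Conv2d(1, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                            # x: (B, C, H, W)
        B, C, H, W = x.shape
        x = x.reshape(B * C, 1, H, W)                # each channel as its own image
        t = self.proj(x).flatten(2).transpose(1, 2)  # (B*C, N, dim)
        return t.reshape(B, C * t.size(1), -1)       # (B, C*N, dim) tokens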
arXiv Detail & Related papers (2025-03-12T20:45:02Z) - ChAda-ViT : Channel Adaptive Attention for Joint Representation Learning of Heterogeneous Microscopy Images [2.954116522244175]
We propose ChAda-ViT, a novel Channel Adaptive Vision Transformer architecture.
We also introduce IDRCell100k, a bioimage dataset with a rich set of 79 experiments covering 7 microscope modalities.
Our architecture, trained in a self-supervised manner, outperforms existing approaches in several biologically relevant downstream tasks.
arXiv Detail & Related papers (2023-11-26T10:38:47Z) - Recaptured Raw Screen Image and Video Demoiréing via Channel and Spatial Modulations [16.122531943812465]
We propose an image and video demoiréing network tailored for raw inputs.
We introduce a color-separated feature branch, which is fused with the traditional feature-mixed branch via channel and spatial modulations.
Experiments demonstrate that our method achieves state-of-the-art performance for both image and video demoiréing.
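The fusion step can be sketched generically: gates computed from the color-separated branch rescale the feature-mixed branch channel-wise and spatially. This SE-style design is an assumption for illustration, not the paper's exact modules.

```python
import torch
import torch.nn as nn

class ChannelSpatialFusion(nn.Module):
    """Fuse a color-separated branch into a feature-mixed branch via
    channel-wise and spatial gating (hypothetical sketch)."""

    def __init__(self, channels):
        super().__init__()
        self.channel_gate = nn.Sequential(   # one scalar gate per channel
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.spatial_gate = nn.Sequential(   # one gate per spatial location
            nn.Conv2d(channels, 1, 7, padding=3), nn.Sigmoid())

    def forward(self, mixed, separated):              # both: (B, C, H, W)
        out = mixed * self.channel_gate(separated)    # channel modulation
        out = out * self.spatial_gate(separated)      # spatial modulation
        return out + mixed                            # residual connection
```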
arXiv Detail & Related papers (2023-10-31T10:19:28Z) - Channel Vision Transformers: An Image Is Worth 1 x 16 x 16 Words [7.210982964205077]
Vision Transformer (ViT) has emerged as a powerful architecture in modern computer vision.
However, its application in certain imaging fields, such as microscopy and satellite imaging, presents unique challenges.
We propose a modification to the ViT architecture that enhances reasoning across the input channels.
arXiv Detail & Related papers (2023-09-28T02:20:59Z) - Bridging the Gap Between Vision Transformers and Convolutional Neural Networks on Small Datasets [91.25055890980084]
There still remains an extreme performance gap between Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) when training from scratch on small datasets.
We propose Dynamic Hybrid Vision Transformer (DHVT) as the solution to enhance the two inductive biases.
Our DHVT achieves state-of-the-art performance with lightweight models: 85.68% on CIFAR-100 with 22.8M parameters and 82.3% on ImageNet-1K with 24.0M parameters.
arXiv Detail & Related papers (2022-10-12T06:54:39Z) - Channel-wise Knowledge Distillation for Dense Prediction [73.99057249472735]
We propose to align features channel-wise between the student and teacher networks.
We consistently achieve superior performance on three benchmarks with various network structures.
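Channel-wise alignment is commonly implemented by converting each channel's activation map into a spatial probability distribution and matching student to teacher with a KL term; the snippet below sketches that general recipe with illustrative hyperparameters.

```python
import torch
import torch.nn.functional as F

def channelwise_kd_loss(student, teacher, tau=4.0):
    """KL between per-channel spatial distributions of student and teacher
    feature maps, both shaped (B, C, H, W) (generic sketch)."""
    B, C, _, _ = student.shape
    s = F.log_softmax(student.reshape(B, C, -1) / tau, dim=-1)  # over H*W
    t = F.softmax(teacher.reshape(B, C, -1) / tau, dim=-1)
    # tau**2 compensates for the softened gradients, as in standard KD.
    return F.kl_div(s, t, reduction="batchmean") * tau ** 2
```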
arXiv Detail & Related papers (2020-11-26T12:00:38Z) - Volumetric Transformer Networks [88.85542905676712]
We introduce a learnable module, the volumetric transformer network (VTN).
VTN predicts channel-wise warping fields to reconfigure intermediate CNN features both spatially and channel-wise.
Our experiments show that VTN consistently boosts the features' representation power and consequently the networks' accuracy on fine-grained image recognition and instance-level image retrieval.
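One way to realize channel-wise warping fields is to predict a 2-D offset grid per channel and resample each channel independently with grid_sample; the module below is a simplified sketch of that idea, not the authors' VTN.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelwiseWarp(nn.Module):
    """Predict one 2-D offset field per channel and warp each channel with
    it (simplified sketch of volumetric warping)."""

    def __init__(self, channels):
        super().__init__()
        self.offset = nn.Conv2d(channels, 2 * channels, 3, padding=1)

    def forward(self, x):                              # x: (B, C, H, W)
        B, C, H, W = x.shape
        off = self.offset(x).view(B, C, 2, H, W)       # per-channel (dx, dy)
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, H, device=x.device),
            torch.linspace(-1, 1, W, device=x.device), indexing="ij")
        base = torch.stack((xs, ys), dim=-1)           # (H, W, 2), x-then-y order
        grid = base + off.permute(0, 1, 3, 4, 2).reshape(B * C, H, W, 2)
        # Treat every channel as its own single-channel image when sampling.
        warped = F.grid_sample(x.reshape(B * C, 1, H, W), grid,
                               align_corners=True)
        return warped.view(B, C, H, W)
```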
arXiv Detail & Related papers (2020-07-18T14:00:12Z) - Bi-directional Cross-Modality Feature Propagation with Separation-and-Aggregation Gate for RGB-D Semantic Segmentation [59.94819184452694]
Depth information has proven to be a useful cue in the semantic segmentation of RGB-D images, providing a geometric counterpart to the RGB representation.
Most existing works simply assume that depth measurements are accurate and well-aligned with the RGB pixels and model the problem as a cross-modal feature fusion.
In this paper, we propose a unified and efficient Cross-modality Guided Encoder that not only effectively recalibrates RGB feature responses, but also distills accurate depth information via multiple stages and aggregates the two recalibrated representations alternately.
arXiv Detail & Related papers (2020-07-17T18:35:24Z) - Channel-Level Variable Quantization Network for Deep Image Compression [50.3174629451739]
We propose a channel-level variable quantization network that dynamically allocates more convolutions to significant channels and withdraws them from negligible ones.
Our method achieves superior performance and can produce much better visual reconstructions.
arXiv Detail & Related papers (2020-07-15T07:20:39Z) - Channel Interaction Networks for Fine-Grained Image Categorization [61.095320862647476]
Fine-grained image categorization is challenging due to the subtle inter-class differences.
We propose a channel interaction network (CIN), which models the channel-wise interplay both within an image and across images.
Our model can be trained efficiently in an end-to-end fashion without the need of multi-stage training and testing.
arXiv Detail & Related papers (2020-03-11T11:51:51Z)