ChA-MAEViT: Unifying Channel-Aware Masked Autoencoders and Multi-Channel Vision Transformers for Improved Cross-Channel Learning
- URL: http://arxiv.org/abs/2503.19331v1
- Date: Tue, 25 Mar 2025 03:45:59 GMT
- Title: ChA-MAEViT: Unifying Channel-Aware Masked Autoencoders and Multi-Channel Vision Transformers for Improved Cross-Channel Learning
- Authors: Chau Pham, Juan C. Caicedo, Bryan A. Plummer
- Abstract summary: ChA-MAEViT enhances feature learning across Multi-Channel Imaging (MCI) channels via four key strategies. ChA-MAEViT significantly outperforms state-of-the-art MCI-ViTs by 3.0-21.5%.
- Score: 17.04905100460915
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Prior work using Masked Autoencoders (MAEs) typically relies on random patch masking based on the assumption that images have significant redundancies across different channels, allowing masked content to be reconstructed from cross-channel correlations. However, this assumption does not hold in Multi-Channel Imaging (MCI), where channels may provide complementary information with minimal feature overlap. As a result, these MAEs primarily learn local structures within individual channels from patch reconstruction, failing to fully leverage cross-channel interactions and limiting their effectiveness on MCI. In this paper, we present ChA-MAEViT, an MAE-based method that enhances feature learning across MCI channels via four key strategies: (1) dynamic channel-patch masking, which compels the model to reconstruct missing channels in addition to masked patches, thereby enhancing cross-channel dependencies and improving robustness to varying channel configurations; (2) memory tokens, which serve as long-term memory aids to promote information sharing across channels, addressing the challenge of reconstructing structurally diverse channels; (3) a hybrid token fusion module, which merges fine-grained patch tokens with a global class token to capture richer representations; and (4) a Channel-Aware Decoder, a lightweight decoder that utilizes channel tokens to effectively reconstruct image patches. Experiments on satellite and microscopy datasets (CHAMMI, JUMP-CP, and So2Sat) show that ChA-MAEViT significantly outperforms state-of-the-art MCI-ViTs by 3.0-21.5%, highlighting the importance of cross-channel interactions in MCI.
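To make strategy (1) concrete, here is a minimal sketch of how a dynamic channel-patch mask could be generated, assuming an MAE-style pipeline that tokenizes each channel into N patch tokens. The function name and the mask ratios are illustrative assumptions, not the authors' implementation.

```python
# Sketch of dynamic channel-patch masking: hide whole channels in addition to
# random patches. Names and ratios are illustrative, not the paper's code.
import torch


def dynamic_channel_patch_mask(batch: int, channels: int, num_patches: int,
                               channel_mask_ratio: float = 0.25,
                               patch_mask_ratio: float = 0.5,
                               device: str = "cpu") -> torch.Tensor:
    """Boolean mask of shape (batch, channels, num_patches); True = reconstruct."""
    mask = torch.zeros(batch, channels, num_patches, dtype=torch.bool, device=device)

    # (a) Channel masking: drop whole channels so the model must infer them
    # from the remaining ones, strengthening cross-channel dependencies.
    n_masked_ch = max(1, int(channels * channel_mask_ratio))
    for b in range(batch):
        ch_idx = torch.randperm(channels, device=device)[:n_masked_ch]
        mask[b, ch_idx, :] = True

    # (b) Patch masking: additionally hide random patches; channels that are
    # already fully masked are unaffected (their entries are already True).
    mask |= torch.rand(batch, channels, num_patches, device=device) < patch_mask_ratio
    return mask


if __name__ == "__main__":
    m = dynamic_channel_patch_mask(batch=2, channels=8, num_patches=196)
    print(m.shape, m.float().mean().item())  # overall masked fraction
```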
Related papers
- Isolated Channel Vision Transformers: From Single-Channel Pretraining to Multi-Channel Finetuning [3.4170567485926373]
We introduce a simple yet effective pretraining framework for large-scale MCI datasets.
Our method, named Isolated Channel ViT (IC-ViT), patchifies image channels individually and thereby enables pretraining for multimodal multi-channel tasks (see the sketch after this entry).
Experiments on various tasks and benchmarks, including JUMP-CP and CHAMMI for cell microscopy imaging and So2Sat-LCZ42 for satellite imaging, show that the proposed IC-ViT delivers 4-14 percentage points of performance improvement.
arXiv Detail & Related papers (2025-03-12T20:45:02Z)
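The entry above describes per-channel patchification; the following is a hedged sketch of that idea, where a shared single-channel projection embeds each channel independently so any channel count yields a valid token sequence. The class and argument names are assumptions, not the paper's code.

```python
# Sketch of channel-wise patchification in the spirit of IC-ViT: one shared
# single-channel patch embedding reused across an arbitrary number of channels.
import torch
import torch.nn as nn


class IsolatedChannelPatchEmbed(nn.Module):
    def __init__(self, patch_size: int = 16, embed_dim: int = 384):
        super().__init__()
        # in_channels=1: the same projection handles every channel.
        self.proj = nn.Conv2d(1, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) with an arbitrary channel count C
        B, C, H, W = x.shape
        x = x.reshape(B * C, 1, H, W)               # treat each channel as an image
        tokens = self.proj(x)                        # (B*C, D, H/p, W/p)
        tokens = tokens.flatten(2).transpose(1, 2)   # (B*C, N, D)
        return tokens.reshape(B, C * tokens.shape[1], -1)  # (B, C*N, D)


if __name__ == "__main__":
    embed = IsolatedChannelPatchEmbed()
    for c in (3, 5, 8):  # different channel configurations, same module
        print(embed(torch.randn(2, c, 224, 224)).shape)
```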
- Enhancing Feature Diversity Boosts Channel-Adaptive Vision Transformers [18.731717752379232]
Multi-Channel Imaging (MCI) models must support a variety of channel configurations at test time.
Recent work has extended traditional visual encoders for MCI, such as Vision Transformers (ViT), by supplementing pixel information with an encoding representing the channel configuration.
We propose DiChaViT, which aims to enhance the diversity in the learned features of MCI-ViT models.
arXiv Detail & Related papers (2024-05-26T03:41:40Z)
- SCTransNet: Spatial-channel Cross Transformer Network for Infrared Small Target Detection [46.049401912285134]
Infrared small target detection (IRSTD) has recently benefitted greatly from U-shaped neural models.
Existing techniques struggle when the target has high similarities with the background.
We present a Spatial-channel Cross Transformer Network (SCTransNet) that leverages spatial-channel cross transformer blocks.
arXiv Detail & Related papers (2024-01-28T06:41:15Z)
- Joint Channel Estimation and Feedback with Masked Token Transformers in Massive MIMO Systems [74.52117784544758]
This paper proposes an encoder-decoder based network that unveils the intrinsic frequency-domain correlation within the CSI matrix.
The entire encoder-decoder network is utilized for channel compression.
Our method outperforms state-of-the-art channel estimation and feedback techniques in joint tasks.
arXiv Detail & Related papers (2023-06-08T06:15:17Z)
- Efficient Multi-Scale Attention Module with Cross-Spatial Learning [4.046170185945849]
A novel efficient multi-scale attention (EMA) module is proposed.
We focus on retaining per-channel information while decreasing the computational overhead.
We conduct extensive ablation studies and experiments on image classification and object detection tasks.
arXiv Detail & Related papers (2023-05-23T00:35:47Z)
- Learning Comprehensive Motion Representation for Action Recognition [124.65403098534266]
2D CNN-based methods are efficient but may yield redundant features due to applying the same 2D convolution kernel to each frame.
Recent efforts attempt to capture motion information by establishing inter-frame connections, but still suffer from a limited temporal receptive field or high latency.
We propose a Channel-wise Motion Enhancement (CME) module that adaptively emphasizes channels related to dynamic information using a channel-wise gate vector (see the sketch after this entry).
We also propose a Spatial-wise Motion Enhancement (SME) module that focuses on regions containing critical moving targets, based on the point-to-point similarity between adjacent feature maps.
arXiv Detail & Related papers (2021-03-23T03:06:26Z)
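A minimal sketch of a channel-wise motion gate in the spirit of the CME module described above, assuming clip features of shape (B, T, C, H, W). The layer sizes and the use of frame-difference statistics as the gate input are illustrative assumptions, not the paper's exact design.

```python
# Sketch of a channel-wise motion gate: frame-difference statistics drive a
# per-channel sigmoid gate that re-weights features toward motion channels.
import torch
import torch.nn as nn


class ChannelMotionGate(nn.Module):
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, C, H, W) clip features
        diff = x[:, 1:] - x[:, :-1]             # adjacent-frame differences
        stats = diff.abs().mean(dim=(1, 3, 4))  # (B, C) motion energy per channel
        gate = self.fc(stats)                   # (B, C) channel-wise gate vector
        return x * gate[:, None, :, None, None] # emphasize dynamic channels


if __name__ == "__main__":
    cme = ChannelMotionGate(channels=64)
    print(cme(torch.randn(2, 8, 64, 14, 14)).shape)
```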
- Channel-wise Knowledge Distillation for Dense Prediction [73.99057249472735]
We propose to align features channel-wise between the student and teacher networks (a sketch of one such loss follows this entry).
We consistently achieve superior performance on three benchmarks with various network structures.
arXiv Detail & Related papers (2020-11-26T12:00:38Z)
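The entry above aligns student and teacher features channel by channel. Below is a hedged sketch of one common channel-wise distillation loss, where each channel's activation map is softmax-normalized over spatial locations and matched with a KL divergence; the temperature value is an illustrative assumption.

```python
# Sketch of a channel-wise distillation loss: per-channel spatial softmax,
# then KL divergence between teacher and student distributions.
import torch
import torch.nn.functional as F


def channel_wise_kd_loss(student: torch.Tensor, teacher: torch.Tensor,
                         tau: float = 4.0) -> torch.Tensor:
    """student, teacher: (B, C, H, W) feature maps of matching shape."""
    B, C, H, W = student.shape
    s = student.reshape(B * C, H * W)  # one spatial distribution per channel
    t = teacher.reshape(B * C, H * W)
    log_p_s = F.log_softmax(s / tau, dim=1)
    p_t = F.softmax(t / tau, dim=1)
    # KL(teacher || student), averaged over channels; tau**2 rescales gradients
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * (tau ** 2)


if __name__ == "__main__":
    s, t = torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32)
    print(channel_wise_kd_loss(s, t).item())
```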
- Dual Attention GANs for Semantic Image Synthesis [101.36015877815537]
We propose a novel Dual Attention GAN (DAGAN) to synthesize photo-realistic and semantically-consistent images.
We also propose two novel modules, i.e., a position-wise Spatial Attention Module (SAM) and a scale-wise Channel Attention Module (CAM); a generic channel-attention sketch follows this entry.
DAGAN achieves remarkably better results than state-of-the-art methods, while using fewer model parameters.
arXiv Detail & Related papers (2020-08-29T17:49:01Z)
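Channel attention modules such as DAGAN's CAM re-weight or re-mix feature channels. The following is a generic, DANet-style channel attention sketch for illustration only; it is not DAGAN's exact scale-wise CAM.

```python
# Generic channel attention sketch: a channel-by-channel affinity matrix
# re-mixes the feature channels, with a learned residual weight.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1))  # learned residual weight

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, C, H, W = x.shape
        flat = x.reshape(B, C, H * W)                    # (B, C, HW)
        attn = torch.bmm(flat, flat.transpose(1, 2))     # (B, C, C) channel affinity
        attn = torch.softmax(attn, dim=-1)
        out = torch.bmm(attn, flat).reshape(B, C, H, W)  # re-mix channels
        return self.gamma * out + x                      # residual connection


if __name__ == "__main__":
    cam = ChannelAttention()
    print(cam(torch.randn(2, 64, 32, 32)).shape)
```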
- Single Image Super-Resolution via a Holistic Attention Network [87.42409213909269]
We propose a new holistic attention network (HAN) to model the holistic interdependencies among layers, channels, and positions.
The proposed HAN adaptively emphasizes hierarchical features by considering correlations among layers.
Experiments demonstrate that the proposed HAN performs favorably against the state-of-the-art single image super-resolution approaches.
arXiv Detail & Related papers (2020-08-20T04:13:15Z)
- Channel-Level Variable Quantization Network for Deep Image Compression [50.3174629451739]
We propose a channel-level variable quantization network that dynamically allocates more convolutions to significant channels and withdraws them from negligible channels.
Our method achieves superior performance and can produce much better visual reconstructions.
arXiv Detail & Related papers (2020-07-15T07:20:39Z)