Related papers: MedViT V2: Medical Image Classification with KAN-Integrated Transformers and Dilated Neighborhood Attention

MedViT V2: Medical Image Classification with KAN-Integrated Transformers and Dilated Neighborhood Attention

URL: http://arxiv.org/abs/2502.13693v2
Date: Tue, 29 Jul 2025 00:25:29 GMT
Title: MedViT V2: Medical Image Classification with KAN-Integrated Transformers and Dilated Neighborhood Attention
Authors: Omid Nejati Manzari, Hojat Asgariandehkordi, Taha Koleilat, Yiming Xiao, Hassan Rivaz,
Abstract summary: We introduce the Medical Vision Transformer (MedViTV2) for generalized medical image classification.<n>MedViTV2 is 44% more computationally efficient than the previous version.<n>It significantly enhances accuracy, achieving improvements of 4.6% on MedMNIST, 5.8% on NonMNIST, and 13.4% on the MedMNIST-C benchmark.
Score: 2.13145300583399
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Convolutional networks, transformers, hybrid models, and Mamba-based architectures have demonstrated strong performance across various medical image classification tasks. However, these methods were primarily designed to classify clean images using labeled data. In contrast, real-world clinical data often involve image corruptions that are unique to multi-center studies and stem from variations in imaging equipment across manufacturers. In this paper, we introduce the Medical Vision Transformer (MedViTV2), a novel architecture incorporating Kolmogorov-Arnold Network (KAN) layers into the transformer architecture for the first time, aiming for generalized medical image classification. We have developed an efficient KAN block to reduce computational load while enhancing the accuracy of the original MedViT. Additionally, to counteract the fragility of our MedViT when scaled up, we propose an enhanced Dilated Neighborhood Attention (DiNA), an adaptation of the efficient fused dot-product attention kernel capable of capturing global context and expanding receptive fields to scale the model effectively and addressing feature collapse issues. Moreover, a hierarchical hybrid strategy is introduced to stack our Local Feature Perception and Global Feature Perception blocks in an efficient manner, which balances local and global feature perceptions to boost performance. Extensive experiments on 17 medical image classification datasets and 12 corrupted medical image datasets demonstrate that MedViTV2 achieved state-of-the-art results in 27 out of 29 experiments with reduced computational complexity. MedViTV2 is 44\% more computationally efficient than the previous version and significantly enhances accuracy, achieving improvements of 4.6\% on MedMNIST, 5.8\% on NonMNIST, and 13.4\% on the MedMNIST-C benchmark.

Related papers

Mobile U-ViT: Revisiting large kernel and U-shaped ViT for efficient medical image segmentation [22.045663130551446]
We propose a mobile model called Mobile U-shaped Vision Transformer (Mobile U-ViT) tailored for medical image segmentation.<n>This design exhibits transformer-like representation learning capacity while being lighter and faster.<n>Despite its reduced computational demands, our architecture achieves state-of-the-art performance across eight public 2D and 3D datasets.
arXiv Detail & Related papers (2025-08-01T20:45:42Z)
HYATT-Net is Grand: A Hybrid Attention Network for Performant Anatomical Landmark Detection [17.290208035331734]
Anatomical landmark detection (ALD) from a medical image is crucial for a wide array of clinical applications.<n>We propose a novel hybrid architecture that integrates CNNs and Transformers.<n> Experiments on five diverse datasets demonstrate state-of-the-art performance, surpassing existing methods in accuracy, robustness, and efficiency.
arXiv Detail & Related papers (2024-12-09T13:58:00Z)
MedVisionLlama: Leveraging Pre-Trained Large Language Model Layers to Enhance Medical Image Segmentation [0.8437187555622164]
This study explores enhancing Vision Transformers (ViTs) for medical image segmentation by integrating pre-trained LLM transformer blocks. Our approach, which incorporates a frozen LLM transformer block into the encoder of a ViT-based model, leads to substantial improvements in segmentation performance. The enhanced model shows significant performance gains, including an average Dice score increase from 0.74 to 0.79 and improvements in accuracy, precision, and the Jaccard Index.
arXiv Detail & Related papers (2024-10-03T14:50:33Z)
Boosting Medical Image Segmentation Performance with Adaptive Convolution Layer [6.887244952811574]
We propose an adaptive layer placed ahead of leading deep-learning models such as UCTransNet. Our approach enhances the network's ability to handle diverse anatomical structures and subtle image details. It consistently outperforms traditional CNNs with fixed kernel sizes with a similar number of parameters.
arXiv Detail & Related papers (2024-04-17T13:18:39Z)
Dual-scale Enhanced and Cross-generative Consistency Learning for Semi-supervised Medical Image Segmentation [49.57907601086494]
Medical image segmentation plays a crucial role in computer-aided diagnosis. We propose a novel Dual-scale Enhanced and Cross-generative consistency learning framework for semi-supervised medical image (DEC-Seg)
arXiv Detail & Related papers (2023-12-26T12:56:31Z)
Affine-Consistent Transformer for Multi-Class Cell Nuclei Detection [76.11864242047074]
We propose a novel Affine-Consistent Transformer (AC-Former), which directly yields a sequence of nucleus positions. We introduce an Adaptive Affine Transformer (AAT) module, which can automatically learn the key spatial transformations to warp original images for local network training. Experimental results demonstrate that the proposed method significantly outperforms existing state-of-the-art algorithms on various benchmarks.
arXiv Detail & Related papers (2023-10-22T02:27:02Z)
Breast Ultrasound Tumor Classification Using a Hybrid Multitask CNN-Transformer Network [63.845552349914186]
Capturing global contextual information plays a critical role in breast ultrasound (BUS) image classification. Vision Transformers have an improved capability of capturing global contextual information but may distort the local image patterns due to the tokenization operations. In this study, we proposed a hybrid multitask deep neural network called Hybrid-MT-ESTAN, designed to perform BUS tumor classification and segmentation.
arXiv Detail & Related papers (2023-08-04T01:19:32Z)
AMIGO: Sparse Multi-Modal Graph Transformer with Shared-Context Processing for Representation Learning of Giga-pixel Images [53.29794593104923]
We present a novel concept of shared-context processing for whole slide histopathology images. AMIGO uses the celluar graph within the tissue to provide a single representation for a patient. We show that our model is strongly robust to missing information to an extent that it can achieve the same performance with as low as 20% of the data.
arXiv Detail & Related papers (2023-03-01T23:37:45Z)
MedSegDiff-V2: Diffusion based Medical Image Segmentation with Transformer [53.575573940055335]
We propose a novel Transformer-based Diffusion framework, called MedSegDiff-V2. We verify its effectiveness on 20 medical image segmentation tasks with different image modalities.
arXiv Detail & Related papers (2023-01-19T03:42:36Z)
Data-Efficient Vision Transformers for Multi-Label Disease Classification on Chest Radiographs [55.78588835407174]
Vision Transformers (ViTs) have not been applied to this task despite their high classification performance on generic images. ViTs do not rely on convolutions but on patch-based self-attention and in contrast to CNNs, no prior knowledge of local connectivity is present. Our results show that while the performance between ViTs and CNNs is on par with a small benefit for ViTs, DeiTs outperform the former if a reasonably large data set is available for training.
arXiv Detail & Related papers (2022-08-17T09:07:45Z)
AlignTransformer: Hierarchical Alignment of Visual Regions and Disease Tags for Medical Report Generation [50.21065317817769]
We propose an AlignTransformer framework, which includes the Align Hierarchical Attention (AHA) and the Multi-Grained Transformer (MGT) modules. Experiments on the public IU-Xray and MIMIC-CXR datasets show that the AlignTransformer can achieve results competitive with state-of-the-art methods on the two datasets.
arXiv Detail & Related papers (2022-03-18T13:43:53Z)
PHTrans: Parallelly Aggregating Global and Local Representations for Medical Image Segmentation [7.140322699310487]
We propose a novel hybrid architecture for medical image segmentation called PHTrans. PHTrans parallelly hybridizes Transformer and CNN in main building blocks to produce hierarchical representations from global and local features.
arXiv Detail & Related papers (2022-03-09T08:06:56Z)
Medical Transformer: Gated Axial-Attention for Medical Image Segmentation [73.98974074534497]
We study the feasibility of using Transformer-based network architectures for medical image segmentation tasks. We propose a Gated Axial-Attention model which extends the existing architectures by introducing an additional control mechanism in the self-attention module. To train the model effectively on medical images, we propose a Local-Global training strategy (LoGo) which further improves the performance.
arXiv Detail & Related papers (2021-02-21T18:35:14Z)
Contrastive Cross-site Learning with Redesigned Net for COVID-19 CT Classification [20.66003113364796]
The pandemic of coronavirus disease 2019 (COVID-19) has lead to a global public health crisis spreading hundreds of countries. To assist the clinical diagnosis and reduce the tedious workload of image interpretation, developing automated tools for COVID-19 identification with CT image is highly desired. This paper proposes a novel joint learning framework to perform accurate COVID-19 identification by effectively learning with heterogeneous datasets.
arXiv Detail & Related papers (2020-09-15T11:09:04Z)

This list is automatically generated from the titles and abstracts of the papers in this site.