EVM-Fusion: An Explainable Vision Mamba Architecture with Neural Algorithmic Fusion
- URL: http://arxiv.org/abs/2505.17367v3
- Date: Tue, 27 May 2025 16:57:06 GMT
- Title: EVM-Fusion: An Explainable Vision Mamba Architecture with Neural Algorithmic Fusion
- Authors: Zichuan Yang
- Abstract summary: EVM-Fusion is an Explainable Vision Mamba architecture featuring a novel Neural Algorithmic Fusion (NAF) mechanism for medical image classification. Experiments on a diverse 9-class multi-organ medical image dataset demonstrate EVM-Fusion's strong classification performance, achieving 99.75% test accuracy.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Medical image classification is critical for clinical decision-making, yet meeting demands for accuracy, interpretability, and generalizability remains challenging. This paper introduces EVM-Fusion, an Explainable Vision Mamba architecture featuring a novel Neural Algorithmic Fusion (NAF) mechanism for multi-organ medical image classification. EVM-Fusion leverages a multipath design, where DenseNet- and U-Net-based pathways, enhanced by Vision Mamba (Vim) modules, operate in parallel with a traditional feature pathway. These diverse features are dynamically integrated via a two-stage fusion process: cross-modal attention followed by the iterative NAF block, which learns an adaptive fusion algorithm. Intrinsic explainability is embedded through path-specific spatial attention, Vim Δ-value maps, traditional-feature SE-attention, and cross-modal attention weights. Experiments on a diverse 9-class multi-organ medical image dataset demonstrate EVM-Fusion's strong classification performance, achieving 99.75% test accuracy, and provide multi-faceted insights into its decision-making process, highlighting its potential for trustworthy AI in medical diagnostics.
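As a rough illustration of the two-stage fusion the abstract describes, the sketch below applies cross-modal attention over per-pathway feature vectors and then iteratively refines a fused state. The module names, dimensions, GRU-based iteration rule, and step count are placeholder assumptions standing in for the paper's actual NAF block, not its published definition.

```python
# Minimal sketch of the two-stage fusion outlined in the abstract:
# cross-modal attention across pathways, then an iterative refinement
# loop standing in for the Neural Algorithmic Fusion (NAF) block.
import torch
import torch.nn as nn

class TwoStageFusion(nn.Module):
    def __init__(self, dim=256, num_paths=3, num_heads=4, naf_steps=3):
        super().__init__()
        # Stage 1: attention across the path axis (DenseNet, U-Net, traditional).
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Stage 2: a GRU cell iteratively refines a fused state (NAF stand-in).
        self.naf_cell = nn.GRUCell(dim, dim)
        self.naf_steps = naf_steps

    def forward(self, path_feats):
        # path_feats: (batch, num_paths, dim), one feature vector per pathway.
        attended, attn_weights = self.cross_attn(path_feats, path_feats, path_feats)
        ctx = attended.mean(dim=1)        # pooled cross-path context
        fused = ctx                       # initial fused state
        for _ in range(self.naf_steps):   # learned iterative refinement
            fused = self.naf_cell(ctx, fused)
        # attn_weights doubles as an explanation of inter-path influence.
        return fused, attn_weights

feats = torch.randn(2, 3, 256)            # 2 images, 3 pathways
fused, w = TwoStageFusion()(feats)
print(fused.shape, w.shape)               # torch.Size([2, 256]) torch.Size([2, 3, 3])
```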
Related papers
- Multimodal Causal-Driven Representation Learning for Generalizable Medical Image Segmentation [56.52520416420957]
We propose Multimodal Causal-Driven Representation Learning (MCDRL) to tackle domain generalization in medical image segmentation. MCDRL consistently outperforms competing methods, yielding superior segmentation accuracy and exhibiting robust generalizability.
arXiv Detail & Related papers (2025-08-07T03:41:41Z) - NEARL-CLIP: Interacted Query Adaptation with Orthogonal Regularization for Medical Vision-Language Understanding [51.63264715941068]
NEARL-CLIP (iNteracted quEry Adaptation with oRthogonaL Regularization) is a novel cross-modality interaction VLM-based framework.
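The framework's name highlights orthogonal regularization on learnable queries. The textbook soft-orthogonality penalty below conveys the general idea; it is an illustrative assumption, not NEARL-CLIP's exact loss.

```python
# Generic soft-orthogonality penalty for a learnable query set:
# push the Gram matrix Q Qᵀ toward the identity so queries stay
# decorrelated. Illustrative only; not the paper's exact formulation.
import torch

def orthogonal_penalty(queries: torch.Tensor) -> torch.Tensor:
    # queries: (num_queries, dim); rows should be mutually orthonormal.
    q = torch.nn.functional.normalize(queries, dim=-1)
    gram = q @ q.t()                                  # (num_queries, num_queries)
    eye = torch.eye(q.size(0), device=q.device)
    return ((gram - eye) ** 2).sum()                  # squared Frobenius norm

queries = torch.randn(16, 512, requires_grad=True)
loss = orthogonal_penalty(queries)   # add to the task loss with a small weight
loss.backward()
```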
arXiv Detail & Related papers (2025-08-06T05:44:01Z) - Unleashing Vision Foundation Models for Coronary Artery Segmentation: Parallel ViT-CNN Encoding and Variational Fusion [12.839049648094893]
Coronary artery segmentation is critical for computer-aided diagnosis of coronary artery disease (CAD). We propose a novel framework that leverages the power of vision foundation models (VFMs) through a parallel encoding architecture. The proposed framework significantly outperforms state-of-the-art methods, achieving superior performance in accurate coronary artery segmentation.
arXiv Detail & Related papers (2025-07-17T09:25:00Z) - Mamba Based Feature Extraction And Adaptive Multilevel Feature Fusion For 3D Tumor Segmentation From Multi-modal Medical Image [8.999013226631893]
Multi-modal 3D medical image segmentation aims to accurately identify tumor regions across different modalities. Traditional convolutional neural network (CNN)-based methods struggle with capturing global features. Transformer-based methods, despite effectively capturing global context, encounter high computational costs in 3D medical image segmentation.
arXiv Detail & Related papers (2025-04-30T03:29:55Z) - MicarVLMoE: A Modern Gated Cross-Aligned Vision-Language Mixture of Experts Model for Medical Image Captioning and Report Generation [4.760537994346813]
Medical image reporting aims to generate structured clinical descriptions from radiological images. We propose MicarVLMoE, a vision-language mixture-of-experts model with gated cross-aligned fusion. We extend MIR to CT scans, retinal imaging, MRI scans, and gross pathology images, reporting state-of-the-art results.
arXiv Detail & Related papers (2025-04-29T01:26:02Z) - Edge-Enhanced Dilated Residual Attention Network for Multimodal Medical Image Fusion [13.029564509505676]
Multimodal medical image fusion is a crucial task that combines complementary information from different imaging modalities into a unified representation.
While deep learning methods have significantly advanced fusion performance, some of the existing CNN-based methods fall short in capturing fine-grained multiscale and edge features.
We propose a novel CNN-based architecture that addresses these limitations by introducing a Dilated Residual Attention Network Module for effective multiscale feature extraction.
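For illustration, a generic dilated residual block with channel attention captures the multiscale-extraction pattern the summary names; the kernel sizes, dilation rates, and attention form below are assumptions, not the paper's exact module.

```python
# Generic dilated residual block with channel attention: parallel 3x3
# convolutions at growing dilation rates capture multiple scales, then a
# squeeze-and-excitation gate reweights channels before the residual add.
import torch
import torch.nn as nn

class DilatedResidualAttention(nn.Module):
    def __init__(self, channels=64, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d)
            for d in dilations
        )
        self.merge = nn.Conv2d(channels * len(dilations), channels, 1)
        self.attn = nn.Sequential(                     # SE-style channel attention
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        multiscale = torch.cat([b(x) for b in self.branches], dim=1)
        y = self.merge(multiscale)
        return x + y * self.attn(y)                    # residual connection

x = torch.randn(1, 64, 32, 32)
print(DilatedResidualAttention()(x).shape)             # torch.Size([1, 64, 32, 32])
```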
arXiv Detail & Related papers (2024-11-18T18:11:53Z) - Counterfactual Explanations for Medical Image Classification and Regression using Diffusion Autoencoder [38.81441978142279]
We propose a novel method that operates directly on the latent space of a generative model, specifically a Diffusion Autoencoder (DAE).
This approach offers inherent interpretability by enabling the generation of counterfactual explanations (CEs).
We show that these latent representations are helpful for medical condition classification and the ordinal regression of pathologies, such as vertebral compression fractures (VCF) and diabetic retinopathy (DR).
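The underlying idea can be sketched as follows: encode an image into a semantic latent, step the latent against a predictor's decision direction, and decode the result. The linear encoder, decoder, and classifier here are random stand-ins for illustration; the paper uses a trained Diffusion Autoencoder.

```python
# Latent-space counterfactual sketch with random stand-in modules.
# In the actual method, encoder/decoder would be a trained DAE.
import torch
import torch.nn as nn

latent_dim = 128
encoder = nn.Linear(64 * 64, latent_dim)       # stand-in for the DAE encoder
decoder = nn.Linear(latent_dim, 64 * 64)       # stand-in for the DAE decoder
classifier = nn.Linear(latent_dim, 1)          # linear head on the latent

def counterfactual(image: torch.Tensor, step: float = 1.0) -> torch.Tensor:
    z = encoder(image.flatten(1))
    # Move against the classifier's weight direction to flip the prediction.
    direction = classifier.weight[0] / classifier.weight[0].norm()
    z_cf = z - step * direction
    return decoder(z_cf).view(-1, 64, 64)

img = torch.rand(1, 64, 64)
print(counterfactual(img).shape)               # torch.Size([1, 64, 64])
```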
arXiv Detail & Related papers (2024-08-02T21:01:30Z) - MindFormer: Semantic Alignment of Multi-Subject fMRI for Brain Decoding [50.55024115943266]
We introduce MindFormer, a novel method for semantic alignment of multi-subject fMRI signals.
This model is specifically designed to generate fMRI-conditioned feature vectors that can be used to condition a Stable Diffusion model for fMRI-to-image generation or a large language model (LLM) for fMRI-to-text generation.
Our experimental results demonstrate that MindFormer generates semantically consistent images and text across different subjects.
arXiv Detail & Related papers (2024-05-28T00:36:25Z) - An Interpretable Cross-Attentive Multi-modal MRI Fusion Framework for Schizophrenia Diagnosis [46.58592655409785]
We propose a novel Cross-Attentive Multi-modal Fusion framework (CAMF) to capture both intra-modal and inter-modal relationships between fMRI and sMRI.
Our approach significantly improves classification accuracy, as demonstrated by our evaluations on two extensive multi-modal brain imaging datasets.
The gradient-guided Score-CAM is applied to interpret critical functional networks and brain regions involved in schizophrenia.
arXiv Detail & Related papers (2024-03-29T20:32:30Z) - Semi-Mamba-UNet: Pixel-Level Contrastive and Pixel-Level Cross-Supervised Visual Mamba-based UNet for Semi-Supervised Medical Image Segmentation [11.637738540262797]
This study introduces Semi-Mamba-UNet, which integrates a purely visual Mamba-based encoder-decoder architecture with a conventional CNN-based UNet into a semi-supervised learning framework.
This innovative SSL approach leverages both networks to generate pseudo-labels and cross-supervise one another at the pixel level simultaneously.
We introduce a self-supervised pixel-level contrastive learning strategy that employs a pair of projectors to enhance the feature learning capabilities further.
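The cross-supervision idea can be sketched as follows, with tiny convnets standing in for the Mamba-based and CNN-based UNets: each network's hard pixel-level predictions on unlabeled data serve as pseudo-labels for the other. This is a minimal sketch of the general cross-pseudo-supervision pattern, not the paper's full training recipe.

```python
# Minimal cross pseudo-supervision loss for the unlabeled branch.
import torch
import torch.nn as nn
import torch.nn.functional as F

net_a = nn.Conv2d(1, 2, 3, padding=1)   # stand-in for the Mamba encoder-decoder
net_b = nn.Conv2d(1, 2, 3, padding=1)   # stand-in for the CNN-based UNet

def cross_supervision_loss(unlabeled: torch.Tensor) -> torch.Tensor:
    logits_a, logits_b = net_a(unlabeled), net_b(unlabeled)
    pseudo_a = logits_a.argmax(dim=1).detach()   # hard pixel-level pseudo-labels
    pseudo_b = logits_b.argmax(dim=1).detach()
    # Each network learns from the other's pseudo-labels.
    return F.cross_entropy(logits_a, pseudo_b) + F.cross_entropy(logits_b, pseudo_a)

batch = torch.rand(4, 1, 32, 32)
print(cross_supervision_loss(batch).item())
```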
arXiv Detail & Related papers (2024-02-11T17:09:21Z) - LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching [59.01894976615714]
We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets.
We have collected approximately 1.3 million medical images from 55 publicly available datasets.
LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
arXiv Detail & Related papers (2023-06-20T22:21:34Z) - Diff-UNet: A Diffusion Embedded Network for Volumetric Segmentation [41.608617301275935]
We propose a novel end-to-end framework, called Diff-UNet, for medical volumetric segmentation.
Our approach integrates the diffusion model into a standard U-shaped architecture to extract semantic information from the input volume effectively.
We evaluate our method on three datasets, including multimodal brain tumors in MRI, liver tumors, and multi-organ CT volumes.
arXiv Detail & Related papers (2023-03-18T04:06:18Z) - MedSegDiff-V2: Diffusion based Medical Image Segmentation with Transformer [53.575573940055335]
We propose a novel Transformer-based Diffusion framework, called MedSegDiff-V2.
We verify its effectiveness on 20 medical image segmentation tasks with different image modalities.
arXiv Detail & Related papers (2023-01-19T03:42:36Z) - TransFusion: Multi-view Divergent Fusion for Medical Image Segmentation with Transformers [8.139069987207494]
We present TransFusion, a Transformer-based architecture to merge divergent multi-view imaging information using convolutional layers and powerful attention mechanisms.
In particular, the Divergent Fusion Attention (DiFA) module is proposed for rich cross-view context modeling and semantic dependency mining.
arXiv Detail & Related papers (2022-03-21T04:02:54Z) - MEDUSA: Multi-scale Encoder-Decoder Self-Attention Deep Neural Network Architecture for Medical Image Analysis [71.2022403915147]
We introduce MEDUSA, a multi-scale encoder-decoder self-attention mechanism tailored for medical image analysis.
We obtain state-of-the-art performance on challenging medical image analysis benchmarks including COVIDx, RSNA RICORD, and RSNA Pneumonia Challenge.
arXiv Detail & Related papers (2021-10-12T15:05:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.