Hybrid Vision Transformer-Mamba Framework for Autism Diagnosis via Eye-Tracking Analysis
- URL: http://arxiv.org/abs/2506.06886v1
- Date: Sat, 07 Jun 2025 18:27:24 GMT
- Title: Hybrid Vision Transformer-Mamba Framework for Autism Diagnosis via Eye-Tracking Analysis
- Authors: Wafaa Kasri, Yassine Himeur, Abigail Copiaco, Wathiq Mansoor, Ammar Albanna, Valsamma Eapen
- Abstract summary: This study presents a hybrid deep learning framework combining Vision Transformers (ViT) and Vision Mamba to detect ASD. The model uses attention-based fusion to integrate visual, speech, and facial cues, capturing both spatial and temporal dynamics. Tested on the Saliency4ASD dataset, the proposed ViT-Mamba model outperformed existing methods, achieving 0.96 accuracy, 0.95 F1-score, 0.97 sensitivity, and 0.94 specificity.
- Score: 2.481802259298367
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Accurate Autism Spectrum Disorder (ASD) diagnosis is vital for early intervention. This study presents a hybrid deep learning framework combining Vision Transformers (ViT) and Vision Mamba to detect ASD using eye-tracking data. The model uses attention-based fusion to integrate visual, speech, and facial cues, capturing both spatial and temporal dynamics. Unlike traditional handcrafted methods, it applies state-of-the-art deep learning and explainable AI techniques to enhance diagnostic accuracy and transparency. Tested on the Saliency4ASD dataset, the proposed ViT-Mamba model outperformed existing methods, achieving 0.96 accuracy, 0.95 F1-score, 0.97 sensitivity, and 0.94 specificity. These findings show the model's promise for scalable, interpretable ASD screening, especially in resource-constrained or remote clinical settings where access to expert diagnosis is limited.
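The four figures quoted in the abstract (accuracy, F1-score, sensitivity, specificity) are all derived from the counts of a binary confusion matrix. As a minimal sketch of how they relate, the snippet below computes them from true/false positive and negative counts. The function name `binary_metrics` and the counts are hypothetical, chosen only for illustration; they are not taken from the paper or from the Saliency4ASD dataset.

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute standard binary-classification metrics from confusion counts.

    tp/fn count positive (e.g. ASD) cases predicted correctly/incorrectly;
    tn/fp count negative (e.g. TD) cases predicted correctly/incorrectly.
    """
    sensitivity = tp / (tp + fn)          # recall on the positive class
    specificity = tn / (tn + fp)          # recall on the negative class
    precision = tp / (tp + fp)            # fraction of positive calls that are right
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "f1": f1,
        "sensitivity": sensitivity,
        "specificity": specificity,
    }


# Hypothetical counts for illustration only (100 positive, 100 negative cases).
metrics = binary_metrics(tp=97, fp=6, tn=94, fn=3)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Sensitivity and specificity are reported separately because accuracy alone can hide an imbalance between missed positive cases and false alarms, which matters in a screening setting.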
Related papers
- NEURO-GUARD: Neuro-Symbolic Generalization and Unbiased Adaptive Routing for Diagnostics -- Explainable Medical AI [0.6345042809319409]
We present NEURO-GUARD, a knowledge-guided vision framework that integrates Vision Transformers (ViTs) with language-driven reasoning to improve performance. NEURO-GUARD employs a retrieval-augmented generation (RAG) mechanism for self-verification, in which a large language model (LLM) iteratively generates, evaluates, and refines feature-extraction code for medical images. Experiments on diabetic retinopathy classification across four benchmark datasets demonstrate that NEURO-GUARD improves accuracy by 6.2% over a ViT-only baseline and achieves a 5% gain in domain generalization.
arXiv Detail & Related papers (2025-12-20T02:32:15Z)
- An Explainable Hybrid AI Framework for Enhanced Tuberculosis and Symptom Detection [55.35661671061754]
Tuberculosis remains a critical global health issue, particularly in resource-limited and remote areas. We propose a framework which enhances disease and symptom detection on chest X-rays by integrating two supervised heads and a self-supervised head. Our model achieves an accuracy of 98.85% for distinguishing between COVID-19, tuberculosis, and normal cases, and a macro-F1 score of 90.09% for multilabel symptom detection.
arXiv Detail & Related papers (2025-10-21T17:18:55Z)
- RAD: Towards Trustworthy Retrieval-Augmented Multi-modal Clinical Diagnosis [56.373297358647655]
Retrieval-Augmented Diagnosis (RAD) is a novel framework that injects external knowledge into multimodal models directly on downstream tasks. RAD operates through three key mechanisms: retrieval and refinement of disease-centered knowledge from multiple medical sources, a guideline-enhanced contrastive loss transformer, and a dual decoder.
arXiv Detail & Related papers (2025-09-24T10:36:14Z)
- Latent Diffusion Autoencoders: Toward Efficient and Meaningful Unsupervised Representation Learning in Medical Imaging [41.446379453352534]
Latent Diffusion Autoencoder (LDAE) is a novel encoder-decoder diffusion-based framework for efficient and meaningful unsupervised learning in medical imaging. This study focuses on Alzheimer disease (AD) using brain MR from the ADNI database as a case study.
arXiv Detail & Related papers (2025-04-11T15:37:46Z)
- Retuve: Automated Multi-Modality Analysis of Hip Dysplasia with Open Source AI [35.088124182314075]
Developmental dysplasia of the hip (DDH) poses significant diagnostic challenges, hindering timely intervention. To address these limitations, we introduce Retuve, an open-source framework for multi-modality DDH analysis. By adhering to open-source principles, Retuve promotes transparency, collaboration, and accessibility in DDH research.
arXiv Detail & Related papers (2025-04-08T20:41:21Z)
- GS-TransUNet: Integrated 2D Gaussian Splatting and Transformer UNet for Accurate Skin Lesion Analysis [44.99833362998488]
We present a novel approach that combines 2D Gaussian splatting with the Transformer UNet architecture for automated skin cancer diagnosis. Our findings illustrate significant advancements in the precision of segmentation and classification. This integration sets new benchmarks in the field and highlights the potential for further research into multi-task medical image analysis methodologies.
arXiv Detail & Related papers (2025-02-23T23:28:47Z)
- Efficient and Comprehensive Feature Extraction in Large Vision-Language Model for Pathology Analysis [37.11302829771659]
Large vision-language models (LVLMs) are limited by input resolution constraints, hindering their efficiency and accuracy in pathology image analysis. We propose two innovative strategies: the mixed task-guided feature enhancement, and the prompt-guided detail feature completion. We trained the pathology-specialized LVLM, OmniPath, which significantly outperforms existing methods in diagnostic accuracy and efficiency.
arXiv Detail & Related papers (2024-12-12T18:07:23Z)
- Understanding and Improving Training-Free AI-Generated Image Detections with Vision Foundation Models [68.90917438865078]
Deepfake techniques for facial synthesis and editing pose serious risks for generative models. In this paper, we investigate how detection performance varies across model backbones, types, and datasets. We introduce Contrastive Blur, which enhances performance on facial images, and MINDER, which addresses noise type bias, balancing performance across domains.
arXiv Detail & Related papers (2024-11-28T13:04:45Z)
- Advanced Gesture Recognition for Autism Spectrum Disorder Detection: Integrating YOLOv7, Video Augmentation, and VideoMAE for Naturalistic Video Analysis [10.298059998417104]
Repetitive motor behaviors such as spinning, head banging, and arm flapping are key indicators for diagnosis of autism spectrum disorder (ASD). This study focuses on distinguishing between children with ASD and typically developed (TD) peers by analyzing videos captured in natural, uncontrolled environments. We adopt a pipeline integrating YOLOv7-based detection, extensive video augmentations, and the VideoMAE framework, which efficiently captures both spatial and temporal features through a high-ratio masking and reconstruction strategy.
arXiv Detail & Related papers (2024-10-12T02:55:37Z)
- Analyzing the Effect of $k$-Space Features in MRI Classification Models [0.0]
We have developed an explainable AI methodology tailored for medical imaging.
We employ a Convolutional Neural Network (CNN) that analyzes MRI scans across both image and frequency domains.
This approach not only enhances early training efficiency but also deepens our understanding of how additional features impact the model predictions.
arXiv Detail & Related papers (2024-09-20T15:43:26Z)
- Explainable AI for Autism Diagnosis: Identifying Critical Brain Regions Using fMRI Data [0.29687381456163997]
Early diagnosis and intervention for Autism Spectrum Disorder (ASD) has been shown to significantly improve the quality of life of autistic individuals. There is a need for objective biomarkers of ASD which can help improve diagnostic accuracy. Deep learning (DL) has achieved outstanding performance in diagnosing diseases and conditions from medical imaging data. This research aims to improve the accuracy and interpretability of ASD diagnosis by creating a DL model that can not only accurately classify ASD but also provide explainable insights into its working.
arXiv Detail & Related papers (2024-09-19T23:08:09Z)
- Ensemble Modeling of Multiple Physical Indicators to Dynamically Phenotype Autism Spectrum Disorder [3.6630139570443996]
We provide a dataset for training computer vision models to detect Autism Spectrum Disorder (ASD)-related phenotypic markers.
We trained individual LSTM-based models using eye gaze, head positions, and facial landmarks as input features, achieving test AUCs of 86%, 67%, and 78%.
arXiv Detail & Related papers (2024-08-23T17:55:58Z)
- Shifting Focus: From Global Semantics to Local Prominent Features in Swin-Transformer for Knee Osteoarthritis Severity Assessment [42.09313885494969]
We harness the Swin Transformer's capacity to discern extended spatial dependencies within images through the hierarchical framework.
Our novel contribution lies in refining local feature representations, orienting them specifically toward the final distribution of the classifier.
Our model demonstrates significant robustness and precision, as evidenced by extensive validation of two established benchmarks for Knee OsteoArthritis (KOA) grade classification.
arXiv Detail & Related papers (2024-03-15T01:09:58Z)
- Involution Fused ConvNet for Classifying Eye-Tracking Patterns of Children with Autism Spectrum Disorder [1.225920962851304]
Autism Spectrum Disorder (ASD) is a complicated neurological condition which is challenging to diagnose. Numerous studies demonstrate that children diagnosed with ASD struggle with maintaining attention spans and have less focused vision.
Eye-tracking technology has drawn special attention in the context of ASD since anomalies in gaze have long been acknowledged as a defining feature of autism in general.
arXiv Detail & Related papers (2024-01-07T20:08:17Z)
- DDxT: Deep Generative Transformer Models for Differential Diagnosis [51.25660111437394]
We show that a generative approach trained with simpler supervised and self-supervised learning signals can achieve superior results on the current benchmark.
The proposed Transformer-based generative network, named DDxT, autoregressively produces a set of possible pathologies, i.e., DDx, and predicts the actual pathology using a neural network.
arXiv Detail & Related papers (2023-12-02T22:57:25Z)
- Self-supervised Feature Learning via Exploiting Multi-modal Data for Retinal Disease Diagnosis [28.428216831922228]
This paper presents a novel self-supervised feature learning method by effectively exploiting multi-modal data for retinal disease diagnosis.
Our objective learns both modality-invariant features and patient-similarity features.
We evaluate our method on two public benchmark datasets for retinal disease diagnosis.
arXiv Detail & Related papers (2020-07-21T19:49:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.