Explainable Parkinson's Disease Gait Recognition Using Multimodal RGB-D Fusion and Large Language Models
- URL: http://arxiv.org/abs/2512.04425v1
- Date: Thu, 04 Dec 2025 03:43:43 GMT
- Title: Explainable Parkinson's Disease Gait Recognition Using Multimodal RGB-D Fusion and Large Language Models
- Authors: Manar Alnaasan, Md Selim Sarowar, Sungho Kim
- Abstract summary: This paper presents an explainable multimodal framework that integrates RGB and Depth (RGB-D) data to recognize Parkinsonian gait patterns. By combining multimodal feature learning with language-based interpretability, this study bridges the gap between visual recognition and clinical understanding.
- Score: 6.2676602262188625
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accurate and interpretable gait analysis plays a crucial role in the early detection of Parkinson's disease (PD), yet most existing approaches remain limited by single-modality inputs, low robustness, and a lack of clinical transparency. This paper presents an explainable multimodal framework that integrates RGB and Depth (RGB-D) data to recognize Parkinsonian gait patterns under realistic conditions. The proposed system employs dual YOLOv11-based encoders for modality-specific feature extraction, followed by a Multi-Scale Local-Global Extraction (MLGE) module and a Cross-Spatial Neck Fusion mechanism to enhance spatial-temporal representation. This design captures both fine-grained limb motion (e.g., reduced arm swing) and overall gait dynamics (e.g., short stride or turning difficulty), even in challenging scenarios such as low lighting or occlusion caused by clothing. To ensure interpretability, a frozen Large Language Model (LLM) is incorporated to translate fused visual embeddings and structured metadata into clinically meaningful textual explanations. Experimental evaluations on multimodal gait datasets demonstrate that the proposed RGB-D fusion framework achieves higher recognition accuracy, improved robustness to environmental variations, and clear visual-linguistic reasoning compared with single-input baselines. By combining multimodal feature learning with language-based interpretability, this study bridges the gap between visual recognition and clinical understanding, offering a novel vision-language paradigm for reliable and explainable Parkinson's disease gait analysis. Code: https://github.com/manaralnaasan/RGB-D_parkinson-LLM
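As a rough, non-authoritative illustration of the pipeline the abstract describes, the PyTorch sketch below wires two per-modality encoders into an MLGE-style multi-scale block and a cross-attention fusion neck. The YOLOv11 backbones are stood in for by a toy CNN, and the internals of MLGE and the Cross-Spatial Neck Fusion are assumptions rather than the authors' design; the linked repository holds the real implementation.

```python
# Hypothetical sketch of the RGB-D fusion pipeline from the abstract.
# Module internals (MLGE, Cross-Spatial Neck Fusion) are assumed, not the paper's code.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Toy stand-in for a YOLOv11-style backbone (one instance per modality)."""
    def __init__(self, in_ch: int, dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
    def forward(self, x):
        return self.net(x)  # (B, dim, H/8, W/8)

class MLGE(nn.Module):
    """Assumed Multi-Scale Local-Global Extraction: parallel small/large
    receptive fields, concatenated and projected back to `dim` channels."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.local = nn.Conv2d(dim, dim, 3, padding=1)    # fine-grained limb motion
        self.global_ = nn.Conv2d(dim, dim, 7, padding=3)  # overall gait dynamics
        self.proj = nn.Conv2d(2 * dim, dim, 1)
    def forward(self, x):
        return self.proj(torch.cat([self.local(x), self.global_(x)], dim=1))

class CrossSpatialNeckFusion(nn.Module):
    """Assumed fusion neck: RGB features attend over depth features."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
    def forward(self, rgb, depth):
        q = rgb.flatten(2).transpose(1, 2)     # (B, HW, C) queries from RGB
        kv = depth.flatten(2).transpose(1, 2)  # keys/values from depth
        fused, _ = self.attn(q, kv, kv)
        return self.norm(fused + q).mean(dim=1)  # (B, C) pooled gait embedding

class GaitRecognizer(nn.Module):
    def __init__(self, dim: int = 256, n_classes: int = 2):
        super().__init__()
        self.rgb_enc, self.depth_enc = Encoder(3, dim), Encoder(1, dim)
        self.mlge = MLGE(dim)
        self.fusion = CrossSpatialNeckFusion(dim)
        self.head = nn.Linear(dim, n_classes)  # Parkinsonian vs. normal gait
    def forward(self, rgb, depth):
        f_rgb = self.mlge(self.rgb_enc(rgb))
        f_depth = self.mlge(self.depth_enc(depth))
        emb = self.fusion(f_rgb, f_depth)
        return self.head(emb), emb  # logits + embedding for the LLM stage

model = GaitRecognizer()
logits, emb = model(torch.randn(1, 3, 224, 224), torch.randn(1, 1, 224, 224))
print(logits.shape, emb.shape)  # torch.Size([1, 2]) torch.Size([1, 256])
```

In the paper, the pooled embedding and the structured metadata would additionally be fed to a frozen LLM to generate the clinical textual explanation; that stage is omitted from this sketch.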
Related papers
- Concept-Enhanced Multimodal RAG: Towards Interpretable and Accurate Radiology Report Generation [12.226029763256962]
Radiology Report Generation through Vision-Language Models (VLMs) promises to reduce documentation burden, improve reporting consistency, and accelerate clinical adoption. Existing research treats interpretability and accuracy as separate objectives, with concept-based explainability techniques focusing primarily on transparency. We present Concept-Enhanced Multimodal RAG (CEMRAG), a unified framework that decomposes visual representations into interpretable clinical concepts.
arXiv Detail & Related papers (2026-02-17T15:18:07Z)
- Uncertainty-Aware Vision-Language Segmentation for Medical Imaging [12.545486211087791]
We introduce a novel uncertainty-aware multimodal segmentation framework for medical diagnosis. We propose a Modality Decoding Attention Block (MoDAB) with a lightweight State Space Mixer (SSMix) to enable efficient cross-modal fusion. Our results highlight the importance of incorporating uncertainty modelling and structured modality alignment in vision-language medical segmentation tasks.
arXiv Detail & Related papers (2026-02-16T06:27:51Z)
- Fusing Pixels and Genes: Spatially-Aware Learning in Computational Pathology [46.83014413674925]
STAMP is a spatial transcriptomics-augmented multimodal pathology representation learning framework. Our study shows that self-supervised, gene-guided training provides a robust and task-agnostic signal for learning pathology image representations. We validate STAMP across six datasets and four downstream tasks, where it consistently achieves strong performance.
arXiv Detail & Related papers (2026-02-15T00:59:13Z)
- Cross-Linguistic Persona-Driven Data Synthesis for Robust Multimodal Cognitive Decline Detection [20.599682298329213]
We introduce SynCog, a novel framework integrating controllable zero-shot multimodal data synthesis with Chain-of-Thought deduction fine-tuning. This generative paradigm enables the rapid, zero-shot expansion of clinical corpora across diverse languages. Experiments on the ADReSS and ADReSSo benchmarks demonstrate that augmenting limited clinical data with synthetic phenotypes yields competitive diagnostic performance.
arXiv Detail & Related papers (2026-02-08T14:10:05Z)
- A Semantically Enhanced Generative Foundation Model Improves Pathological Image Synthesis [82.01597026329158]
We introduce a Correlation-Regulated Alignment Framework for Tissue Synthesis (CRAFTS) for pathology-specific text-to-image synthesis. CRAFTS incorporates a novel alignment mechanism that suppresses semantic drift to ensure biological accuracy. This model generates diverse pathological images spanning 30 cancer types, with quality rigorously validated by objective metrics and pathologist evaluations.
arXiv Detail & Related papers (2025-12-15T10:22:43Z)
- Towards Stable Cross-Domain Depression Recognition under Missing Modalities [46.292478012586066]
Depression poses serious public health risks, including suicide, underscoring the urgency of timely and scalable screening. We propose a unified framework for Stable Cross-Domain Depression Recognition based on a Multimodal Large Language Model (SCD-MLLM). The framework supports the integration and processing of heterogeneous depression-related data collected from varied sources.
arXiv Detail & Related papers (2025-12-06T14:19:57Z)
- Self-Supervised Cross-Encoder for Neurodegenerative Disease Diagnosis [6.226851122403944]
We propose a novel self-supervised cross-encoder framework that leverages the temporal continuity in longitudinal MRI scans for supervision. This framework disentangles learned representations into two components: a static representation, constrained by contrastive learning, which captures stable anatomical features; and a dynamic representation, guided by input-gradient regularization, which reflects temporal changes. Experimental results on the Alzheimer's Disease Neuroimaging Initiative dataset demonstrate that our method achieves superior classification accuracy and improved interpretability.
arXiv Detail & Related papers (2025-09-09T11:52:24Z)
- Foundation Model for Skeleton-Based Human Action Understanding [56.89025287217221]
This paper presents a Unified Skeleton-based Dense Representation Learning framework. USDRL consists of a Transformer-based Dense Spatio-Temporal Encoder (DSTE), Multi-Grained Feature Decorrelation (MG-FD), and Multi-Perspective Consistency Training (MPCT).
arXiv Detail & Related papers (2025-08-18T02:42:16Z)
- PhysLLM: Harnessing Large Language Models for Cross-Modal Remote Physiological Sensing [49.243031514520794]
Large Language Models (LLMs) excel at capturing long-range signals due to their text-centric design. PhysLLM achieves state-of-the-art accuracy and robustness, demonstrating superior generalization across lighting variations and motion scenarios.
arXiv Detail & Related papers (2025-05-06T15:18:38Z)
- MMLNB: Multi-Modal Learning for Neuroblastoma Subtyping Classification Assisted with Textual Description Generation [1.8947479010393964]
We introduce MMLNB, a multi-modal learning model that integrates pathological images with generated textual descriptions to improve classification accuracy and interpretability. This research creates a scalable AI-driven framework for digital pathology, enhancing reliability and interpretability in Neuroblastoma subtyping classification.
arXiv Detail & Related papers (2025-03-17T08:38:46Z)
- Robust Multimodal Learning for Ophthalmic Disease Grading via Disentangled Representation [30.697291934309206]
Multimodal data is rare in real-world applications due to a lack of medical equipment and concerns about data privacy. Traditional deep learning methods typically address these issues by learning representations in latent space. The authors propose the Essence-Point and Disentangle Representation Learning (EDRL) strategy, which integrates a self-distillation mechanism into an end-to-end framework.
arXiv Detail & Related papers (2025-03-07T10:58:38Z)
- ViKL: A Mammography Interpretation Framework via Multimodal Aggregation of Visual-knowledge-linguistic Features [54.37042005469384]
We announce MVKL, the first multimodal mammography dataset encompassing multi-view images, detailed manifestations and reports.
Based on this dataset, we focus on the challenging task of unsupervised pretraining.
We propose ViKL, a framework that synergizes Visual, Knowledge, and Linguistic features.
arXiv Detail & Related papers (2024-09-24T05:01:23Z)
- Cross-Modal Causal Intervention for Medical Report Generation [107.76649943399168]
Radiology Report Generation (RRG) is essential for computer-aided diagnosis and medication guidance. Generating accurate lesion descriptions remains challenging due to spurious correlations from visual-linguistic biases. We propose a two-stage framework named CrossModal Causal Representation Learning (CMCRL). Experiments on IU-Xray and MIMIC-CXR show that our CMCRL pipeline significantly outperforms state-of-the-art methods.
arXiv Detail & Related papers (2023-03-16T07:23:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.