MedVQA-TREE: A Multimodal Reasoning and Retrieval Framework for Sarcopenia Prediction
- URL: http://arxiv.org/abs/2508.19319v1
- Date: Tue, 26 Aug 2025 13:31:01 GMT
- Title: MedVQA-TREE: A Multimodal Reasoning and Retrieval Framework for Sarcopenia Prediction
- Authors: Pardis Moradbeiki, Nasser Ghadiri, Sayed Jalal Zahabi, Uffe Kock Wiil, Kristoffer Kittelmann Brockhattingen, Ali Ebrahimi,
- Abstract summary: We propose MedVQA-TREE, a framework that integrates a hierarchical image interpretation module, a gated feature-level fusion mechanism, and a novel multi-hop, multi-query retrieval strategy.<n>A gated fusion mechanism selectively integrates visual features with textual queries, while clinical knowledge is retrieved through a UMLS-guided pipeline accessing PubMed and a sarcopenia-specific external knowledge base.<n>The model achieved up to 99% diagnostic accuracy and outperformed previous state-of-the-art methods by over 10%.
- Score: 1.7775777785480917
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accurate sarcopenia diagnosis via ultrasound remains challenging due to subtle imaging cues, limited labeled data, and the absence of clinical context in most models. We propose MedVQA-TREE, a multimodal framework that integrates a hierarchical image interpretation module, a gated feature-level fusion mechanism, and a novel multi-hop, multi-query retrieval strategy. The vision module includes anatomical classification, region segmentation, and graph-based spatial reasoning to capture coarse, mid-level, and fine-grained structures. A gated fusion mechanism selectively integrates visual features with textual queries, while clinical knowledge is retrieved through a UMLS-guided pipeline accessing PubMed and a sarcopenia-specific external knowledge base. MedVQA-TREE was trained and evaluated on two public MedVQA datasets (VQA-RAD and PathVQA) and a custom sarcopenia ultrasound dataset. The model achieved up to 99% diagnostic accuracy and outperformed previous state-of-the-art methods by over 10%. These results underscore the benefit of combining structured visual understanding with guided knowledge retrieval for effective AI-assisted diagnosis in sarcopenia.
Related papers
- Sim4Seg: Boosting Multimodal Multi-disease Medical Diagnosis Segmentation with Region-Aware Vision-Language Similarity Masks [54.00822479127598]
We introduce a medical vision-language task named Medical Diagnosis (MDS)<n>MDS aims to understand clinical queries for medical images and generate the corresponding segmentation masks as well as diagnostic results.<n>We propose Sim4Seg, a novel framework that improves the performance of diagnosis segmentation.
arXiv Detail & Related papers (2025-11-10T03:22:42Z) - DuPLUS: Dual-Prompt Vision-Language Framework for Universal Medical Image Segmentation and Prognosis [5.494301428436596]
We introduce DuPLUS, a deep learning framework for efficient multi-modal medical image analysis.<n>DuPLUS introduces a novel vision-language framework that leverages hierarchical semantic prompts for fine-grained control over the analysis task.<n>For segmentation, DuPLUS is able to generalize across three imaging modalities, ten different various medical datasets, encompassing more than 30 organs and tumor types.
arXiv Detail & Related papers (2025-10-03T20:01:00Z) - RAD: Towards Trustworthy Retrieval-Augmented Multi-modal Clinical Diagnosis [56.373297358647655]
Retrieval-Augmented Diagnosis (RAD) is a novel framework that injects external knowledge into multimodal models directly on downstream tasks.<n>RAD operates through three key mechanisms: retrieval and refinement of disease-centered knowledge from multiple medical sources, a guideline-enhanced contrastive loss transformer, and a dual decoder.
arXiv Detail & Related papers (2025-09-24T10:36:14Z) - A Fully Open and Generalizable Foundation Model for Ultrasound Clinical Applications [77.3888788549565]
We present EchoCare, a novel ultrasound foundation model for generalist clinical use.<n>We developed EchoCare via self-supervised learning on our curated, publicly available, large-scale dataset EchoCareData.<n>With minimal training, EchoCare outperforms state-of-the-art comparison models across 10 representative ultrasound benchmarks.
arXiv Detail & Related papers (2025-09-15T10:05:31Z) - PathMR: Multimodal Visual Reasoning for Interpretable Pathology Diagnosis [9.728322291979564]
We propose PathMR, a cell-level Multimodal visual Reasoning framework for Pathological image analysis.<n>We show that PathMR consistently outperforms state-of-the-art visual reasoning methods in text generation quality, segmentation accuracy, and cross-modal alignment.
arXiv Detail & Related papers (2025-08-28T14:46:24Z) - MEDMKG: Benchmarking Medical Knowledge Exploitation with Multimodal Knowledge Graph [28.79000907242469]
We propose MEDMKG, a Medical Multimodal Knowledge Graph that unifies visual and textual medical information through a multi-stage construction pipeline.<n>We evaluate MEDMKG across three tasks under two experimental settings, benchmarking twenty-four baseline methods and four state-of-the-art vision-language backbones on six datasets.<n>Results show that MEDMKG not only improves performance in downstream medical tasks but also offers a strong foundation for developing adaptive and robust strategies for multimodal knowledge integration in medical artificial intelligence.
arXiv Detail & Related papers (2025-05-22T18:41:46Z) - Hierarchical Modeling for Medical Visual Question Answering with Cross-Attention Fusion [4.821565717653691]
Medical Visual Question Answering (Med-VQA) answers clinical questions using medical images, aiding diagnosis.<n>This study proposes a HiCA-VQA method, including two modules: Hierarchical Prompting for fine-grained medical questions and Hierarchical Answer Decoders.<n> Experiments on the Rad-Restruct benchmark demonstrate that the HiCA-VQA framework better outperforms existing state-of-the-art methods in answering hierarchical fine-grained questions.
arXiv Detail & Related papers (2025-04-04T03:03:12Z) - RadIR: A Scalable Framework for Multi-Grained Medical Image Retrieval via Radiology Report Mining [64.66825253356869]
We propose a novel methodology that leverages dense radiology reports to define image-wise similarity ordering at multiple granularities.<n>We construct two comprehensive medical imaging retrieval datasets: MIMIC-IR for Chest X-rays and CTRATE-IR for CT scans.<n>We develop two retrieval systems, RadIR-CXR and model-ChestCT, which demonstrate superior performance in traditional image-image and image-report retrieval tasks.
arXiv Detail & Related papers (2025-03-06T17:43:03Z) - MedGrad E-CLIP: Enhancing Trust and Transparency in AI-Driven Skin Lesion Diagnosis [2.9540164442363976]
This study leverages the CLIP (Contrastive Language-Image Pretraining) model, trained on different skin lesion datasets, to capture meaningful relationships between visual features and diagnostic criteria terms.<n>We propose a method called MedGrad E-CLIP, which builds on gradient-based E-CLIP by incorporating a weighted entropy mechanism designed for complex medical imaging like skin lesions.<n>By visually explaining how different features in an image relates to diagnostic criteria, this approach demonstrates the potential of advanced vision-language models in medical image analysis.
arXiv Detail & Related papers (2025-01-12T17:50:47Z) - Large-scale Long-tailed Disease Diagnosis on Radiology Images [51.453990034460304]
RadDiag is a foundational model supporting 2D and 3D inputs across various modalities and anatomies.
Our dataset, RP3D-DiagDS, contains 40,936 cases with 195,010 scans covering 5,568 disorders.
arXiv Detail & Related papers (2023-12-26T18:20:48Z) - Contextual Information Enhanced Convolutional Neural Networks for
Retinal Vessel Segmentation in Color Fundus Images [0.0]
An automatic retinal vessel segmentation system can effectively facilitate clinical diagnosis and ophthalmological research.
A deep learning based method has been proposed and several customized modules have been integrated into the well-known encoder-decoder architecture U-net.
As a result, the proposed method outperforms the work of predecessors and achieves state-of-the-art performance in Sensitivity/Recall, F1-score and MCC.
arXiv Detail & Related papers (2021-03-25T06:10:47Z) - G-MIND: An End-to-End Multimodal Imaging-Genetics Framework for
Biomarker Identification and Disease Classification [49.53651166356737]
We propose a novel deep neural network architecture to integrate imaging and genetics data, as guided by diagnosis, that provides interpretable biomarkers.
We have evaluated our model on a population study of schizophrenia that includes two functional MRI (fMRI) paradigms and Single Nucleotide Polymorphism (SNP) data.
arXiv Detail & Related papers (2021-01-27T19:28:04Z) - Inheritance-guided Hierarchical Assignment for Clinical Automatic
Diagnosis [50.15205065710629]
Clinical diagnosis, which aims to assign diagnosis codes for a patient based on the clinical note, plays an essential role in clinical decision-making.
We propose a novel framework to combine the inheritance-guided hierarchical assignment and co-occurrence graph propagation for clinical automatic diagnosis.
arXiv Detail & Related papers (2021-01-27T13:16:51Z) - Weakly supervised multiple instance learning histopathological tumor
segmentation [51.085268272912415]
We propose a weakly supervised framework for whole slide imaging segmentation.
We exploit a multiple instance learning scheme for training models.
The proposed framework has been evaluated on multi-locations and multi-centric public data from The Cancer Genome Atlas and the PatchCamelyon dataset.
arXiv Detail & Related papers (2020-04-10T13:12:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.