A Medical Multimodal Diagnostic Framework Integrating Vision-Language Models and Logic Tree Reasoning
- URL: http://arxiv.org/abs/2512.21583v1
- Date: Thu, 25 Dec 2025 09:01:06 GMT
- Title: A Medical Multimodal Diagnostic Framework Integrating Vision-Language Models and Logic Tree Reasoning
- Authors: Zelin Zang, Wenyi Gu, Siqi Ma, Dan Yang, Yue Shen, Zhu Zhang, Guohui Fan, Wing-Kuen Ling, Fuji Yang,
- Abstract summary: We propose a diagnostic framework built upon LLaVA that combines vision-language alignment with logic-regularized reasoning. We show that our method improves diagnostic accuracy and yields more interpretable reasoning traces on multimodal tasks, while remaining competitive in text-only settings.
- Score: 24.842846823884557
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the rapid growth of large language models (LLMs) and vision-language models (VLMs) in medicine, simply integrating clinical text and medical imaging does not guarantee reliable reasoning. Existing multimodal models often produce hallucinations or inconsistent chains of thought, limiting clinical trust. We propose a diagnostic framework built upon LLaVA that combines vision-language alignment with logic-regularized reasoning. The system includes an input encoder for text and images, a projection module for cross-modal alignment, a reasoning controller that decomposes diagnostic tasks into steps, and a logic tree generator that assembles stepwise premises into verifiable conclusions. Evaluations on MedXpertQA and other benchmarks show that our method improves diagnostic accuracy and yields more interpretable reasoning traces on multimodal tasks, while remaining competitive in text-only settings. These results suggest a promising step toward trustworthy multimodal medical AI.
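The abstract describes a four-stage pipeline: input encoder, cross-modal projection module, reasoning controller, and logic tree generator. The sketch below illustrates how such stages could compose; every name, signature, and data shape here is an illustrative assumption, not the authors' implementation.

```python
# Hypothetical sketch of the four-stage pipeline named in the abstract:
# encode -> project (cross-modal alignment) -> decompose (reasoning
# controller) -> build_logic_tree. All components are toy stand-ins.
from dataclasses import dataclass, field

@dataclass
class LogicNode:
    """One premise in the logic tree; children support the parent's conclusion."""
    premise: str
    children: list = field(default_factory=list)

def encode(text: str, image_features: str) -> dict:
    # Input-encoder stand-in: pair clinical text with pre-extracted image features.
    return {"text": text, "image": image_features}

def project(encoded: dict) -> str:
    # Projection-module stand-in: fuse both modalities into one shared context.
    return f"{encoded['text']} | image finding: {encoded['image']}"

def decompose(context: str) -> list:
    # Reasoning-controller stand-in: split the fused context into stepwise premises.
    return [f"step {i + 1}: assess {part.strip()}"
            for i, part in enumerate(context.split("|"))]

def build_logic_tree(steps: list) -> LogicNode:
    # Logic-tree-generator stand-in: attach each premise under one conclusion node.
    root = LogicNode("conclusion")
    for step in steps:
        root.children.append(LogicNode(step))
    return root

encoded = encode("cough and fever", "right-lobe opacity")
tree = build_logic_tree(decompose(project(encoded)))
print(len(tree.children))  # one child per stepwise premise
```

In a real system each stand-in would be a learned component (e.g. a vision encoder and an LLM-driven controller); the point here is only the data flow from raw inputs to a verifiable premise tree.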
Related papers
- MedVerse: Efficient and Reliable Medical Reasoning via DAG-Structured Parallel Execution [63.128360383691295]
We propose MedVerse, a reasoning framework for complex medical inference. For data creation, we introduce the MedVerse Curator, which synthesizes knowledge-grounded medical reasoning paths. We develop a customized inference engine that supports parallel execution without additional overhead.
arXiv Detail & Related papers (2026-02-07T12:54:01Z) - M3CoTBench: Benchmark Chain-of-Thought of MLLMs in Medical Image Understanding [66.78251988482222]
Chain-of-Thought (CoT) reasoning has proven effective in enhancing large language models by encouraging step-by-step intermediate reasoning. Current benchmarks for medical image understanding generally focus on the final answer while ignoring the reasoning path. M3CoTBench aims to foster the development of transparent, trustworthy, and diagnostically accurate AI systems for healthcare.
arXiv Detail & Related papers (2026-01-13T17:42:27Z) - MedAlign: A Synergistic Framework of Multimodal Preference Optimization and Federated Meta-Cognitive Reasoning [52.064286116035134]
We develop MedAlign, a framework to ensure visually accurate LVLM responses for Medical Visual Question Answering (Med-VQA). We first propose a multimodal Direct Preference Optimization (mDPO) objective to align preference learning with visual context. We then design a Retrieval-Aware Mixture-of-Experts (RA-MoE) architecture that utilizes image and text similarity to route queries to a specialized and context-augmented LVLM.
arXiv Detail & Related papers (2025-10-24T02:11:05Z) - MedLA: A Logic-Driven Multi-Agent Framework for Complex Medical Reasoning with Large Language Models [26.152027922514957]
MedLA is a logic-driven multi-agent framework built on large language models. Agents engage in a graph-guided discussion to compare and iteratively refine their logic trees. We demonstrate that MedLA consistently outperforms both static role-based systems and single-agent baselines.
arXiv Detail & Related papers (2025-09-28T08:06:39Z) - Citrus-V: Advancing Medical Foundation Models with Unified Medical Image Grounding for Clinical Reasoning [13.783146290218738]
We introduce Citrus-V, a multimodal medical foundation model that combines image analysis with textual reasoning. The model integrates detection, segmentation, and multimodal chain-of-thought reasoning. It supports pixel-level lesion localization, structured report generation, and physician-like diagnostic inference.
arXiv Detail & Related papers (2025-09-23T14:42:31Z) - MedChat: A Multi-Agent Framework for Multimodal Diagnosis with Large Language Models [9.411749481805355]
Integrating glaucoma detection with large language models (LLMs) presents an automated strategy to mitigate ophthalmologist shortages. Applying general LLMs to medical imaging remains challenging due to hallucinations, limited interpretability, and insufficient domain-specific medical knowledge. We propose MedChat, a multi-agent diagnostic framework and platform that combines specialized vision models with multiple role-specific LLM agents.
arXiv Detail & Related papers (2025-06-09T03:51:18Z) - Advancing AI Research Assistants with Expert-Involved Learning [84.30323604785646]
Large language models (LLMs) and large multimodal models (LMMs) promise to accelerate biomedical discovery, yet their reliability remains unclear. We introduce ARIEL (AI Research Assistant for Expert-in-the-Loop Learning), an open-source evaluation and optimization framework. We find that state-of-the-art models generate fluent but incomplete summaries, whereas LMMs struggle with detailed visual reasoning.
arXiv Detail & Related papers (2025-05-03T14:21:48Z) - Proof-of-TBI -- Fine-Tuned Vision Language Model Consortium and OpenAI-o3 Reasoning LLM-Based Medical Diagnosis Support System for Mild Traumatic Brain Injury (TBI) Prediction [1.1488411226515398]
We propose Proof-of-TBI, a medical diagnosis support system that integrates vision-language models with the OpenAI-o3 reasoning large language model (LLM). Our approach fine-tunes multiple vision-language models using a labeled dataset of TBI MRI scans, training them to diagnose TBI symptoms effectively. The system evaluates the predictions from all fine-tuned vision-language models using the OpenAI-o3 reasoning LLM, a model that has demonstrated remarkable reasoning performance.
arXiv Detail & Related papers (2025-04-25T19:49:30Z) - Uncertainty-aware Medical Diagnostic Phrase Identification and Grounding [72.18719355481052]
We introduce a novel task called Medical Report Grounding (MRG). MRG aims to directly identify diagnostic phrases and their corresponding grounding boxes from medical reports in an end-to-end manner. We propose uMedGround, a robust and reliable framework that leverages a multimodal large language model to predict diagnostic phrases.
arXiv Detail & Related papers (2024-04-10T07:41:35Z) - AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator [69.51568871044454]
We introduce AI Hospital, a framework simulating dynamic medical interactions between a Doctor, as the player, and NPCs.
This setup allows for realistic assessments of LLMs in clinical scenarios.
We develop the Multi-View Medical Evaluation benchmark, utilizing high-quality Chinese medical records and NPCs.
arXiv Detail & Related papers (2024-02-15T06:46:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.