Synthetic Vasculature and Pathology Enhance Vision-Language Model Reasoning
- URL: http://arxiv.org/abs/2512.11060v1
- Date: Thu, 11 Dec 2025 19:19:39 GMT
- Title: Synthetic Vasculature and Pathology Enhance Vision-Language Model Reasoning
- Authors: Chenjun Li, Cheng Wan, Laurin Lux, Alexander Berger, Richard B. Rosen, Martin J. Menten, Johannes C. Paetzold
- Abstract summary: We introduce Synthetic Vasculature Reasoning (SVR), a framework that controllably synthesizes images and corresponding text. Based on this we curate OCTA-100K-SVR, an OCTA image-reasoning dataset with 100,000 pairs. Our experiments show that a general-purpose VLM trained on the dataset achieves a zero-shot balanced classification accuracy of 89.67% on real OCTA images.
- Score: 39.96133625333846
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language Models (VLMs) offer a promising path toward interpretable medical diagnosis by letting users request clinical explanations alongside predictions, across different imaging modalities. However, training VLMs for detailed reasoning requires large-scale image-text datasets. In many specialized domains, such as reading Optical Coherence Tomography Angiography (OCTA) images, precise text with grounded descriptions of pathologies is scarce or even non-existent. To overcome this bottleneck, we introduce Synthetic Vasculature Reasoning (SVR), a framework that controllably synthesizes images and corresponding text: realistic retinal vasculature with Diabetic Retinopathy (DR) features (capillary dropout, microaneurysms, neovascularization, and tortuosity), paired with automatically generated, granular reasoning texts. On this basis we curate OCTA-100K-SVR, an OCTA image-reasoning dataset with 100,000 pairs. Our experiments show that a general-purpose VLM (Qwen3-VL-8b) trained on the dataset achieves a zero-shot balanced classification accuracy of 89.67% on real OCTA images, outperforming supervised baselines. Through human expert evaluation, we also demonstrate that it significantly improves explanation quality and pathology localization on clinical data.
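The abstract reports results as *balanced* classification accuracy, the mean of per-class recalls, which matters here because DR screening sets are typically imbalanced and plain accuracy would reward always predicting "healthy". A minimal illustrative sketch (not code from the paper; the toy labels are invented):

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean per-class recall over the classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)

# Toy imbalanced test set: 9 healthy scans, 1 DR scan.
y_true = ["healthy"] * 9 + ["DR"]
y_pred = ["healthy"] * 10          # degenerate majority-class predictor
print(balanced_accuracy(y_true, y_pred))  # 0.5, although plain accuracy is 0.9
```

The degenerate predictor scores 0.5 rather than 0.9, so a reported 89.67% balanced accuracy cannot come from exploiting class imbalance.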
Related papers
- Beyond CLIP: Knowledge-Enhanced Multimodal Transformers for Cross-Modal Alignment in Diabetic Retinopathy Diagnosis [7.945705180020063]
We propose a knowledge-enhanced joint embedding framework that integrates retinal fundus images, clinical text, and structured patient data. Our framework achieves near-perfect text-to-image retrieval performance with Recall@1 of 99.94% compared to fine-tuned CLIP's 1.29%.
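The Recall@1 figure quoted above measures, for each text query, whether its paired image is ranked first by similarity. A hypothetical sketch (the similarity values are invented for illustration):

```python
def recall_at_1(sim):
    """sim[i][j] = similarity of text query i to image j; (i, i) is the true pair."""
    hits = sum(1 for i, row in enumerate(sim) if row.index(max(row)) == i)
    return hits / len(sim)

# Toy 3x3 text-to-image similarity matrix: queries 0 and 2 rank their own
# image first, query 1 mistakenly ranks image 2 highest.
sim = [[0.9, 0.1, 0.2],
       [0.3, 0.4, 0.8],
       [0.1, 0.2, 0.7]]
print(round(recall_at_1(sim), 3))  # 0.667
```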
arXiv Detail & Related papers (2025-12-22T18:41:45Z) - A Semantically Enhanced Generative Foundation Model Improves Pathological Image Synthesis [82.01597026329158]
We introduce a Correlation-Regulated Alignment Framework for Tissue Synthesis (CRAFTS) for pathology-specific text-to-image synthesis. CRAFTS incorporates a novel alignment mechanism that suppresses semantic drift to ensure biological accuracy. This model generates diverse pathological images spanning 30 cancer types, with quality rigorously validated by objective metrics and pathologist evaluations.
arXiv Detail & Related papers (2025-12-15T10:22:43Z) - DeepGI: Explainable Deep Learning for Gastrointestinal Image Classification [0.0]
The study confronts common endoscopic challenges such as variable lighting, fluctuating camera angles, and frequent imaging artifacts. The best-performing models, VGG16 and MobileNetV2, each achieved a test accuracy of 96.5%. The approach includes explainable AI via Grad-CAM visualization, enabling identification of the image regions most influential to model predictions.
arXiv Detail & Related papers (2025-11-26T22:35:57Z) - MIRNet: Integrating Constrained Graph-Based Reasoning with Pre-training for Diagnostic Medical Imaging [67.74482877175797]
MIRNet is a novel framework that integrates self-supervised pre-training with constrained graph-based reasoning. We introduce TongueAtlas-4K, a benchmark comprising 4,000 images annotated with 22 diagnostic labels.
arXiv Detail & Related papers (2025-11-13T06:30:41Z) - Self-Supervised Anatomical Consistency Learning for Vision-Grounded Medical Report Generation [61.350584471060756]
Vision-grounded medical report generation aims to produce clinically accurate descriptions of medical images. We propose Self-Supervised Anatomical Consistency Learning (SS-ACL) to align generated reports with corresponding anatomical regions. SS-ACL constructs a hierarchical anatomical graph inspired by the invariant top-down inclusion structure of human anatomy.
arXiv Detail & Related papers (2025-09-30T08:59:06Z) - Fine-tuning Vision Language Models with Graph-based Knowledge for Explainable Medical Image Analysis [44.0659716298839]
Current staging models for Diabetic Retinopathy (DR) offer little interpretability. We present a novel method that integrates graph representation learning with vision-language models (VLMs) to deliver explainable DR diagnosis.
arXiv Detail & Related papers (2025-03-12T20:19:07Z) - RadAlign: Advancing Radiology Report Generation with Vision-Language Concept Alignment [10.67889367763112]
RadAlign is a novel framework that combines the predictive accuracy of vision-language models with the reasoning capabilities of large language models. Our framework maintains strong clinical interpretability while reducing hallucinations, advancing automated medical imaging and report analysis through integrated predictive and generative AI.
arXiv Detail & Related papers (2025-01-13T17:55:32Z) - GCS-M3VLT: Guided Context Self-Attention based Multi-modal Medical Vision Language Transformer for Retinal Image Captioning [3.5948668755510136]
We propose a novel vision-language model for retinal image captioning that combines visual and textual features through a guided context self-attention mechanism. Experiments on the DeepEyeNet dataset demonstrate a 0.023 BLEU@4 improvement, along with significant qualitative advancements.
arXiv Detail & Related papers (2024-12-23T03:49:29Z) - Radiology Report Generation Using Transformers Conditioned with Non-imaging Data [55.17268696112258]
This paper proposes a novel multi-modal transformer network that integrates chest x-ray (CXR) images and associated patient demographic information.
The proposed network uses a convolutional neural network to extract visual features from CXRs and a transformer-based encoder-decoder network that combines the visual features with semantic text embeddings of patient demographic information.
arXiv Detail & Related papers (2023-11-18T14:52:26Z) - Vision-Language Modelling For Radiological Imaging and Reports In The Low Data Regime [70.04389979779195]
This paper explores training medical vision-language models (VLMs) where the visual and language inputs are embedded into a common space.
We explore several candidate methods to improve low-data performance, including adapting generic pre-trained models to novel image and text domains.
Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variable sized training datasets of paired chest X-rays and radiological reports.
arXiv Detail & Related papers (2023-03-30T18:20:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.