Anatomy-VLM: A Fine-grained Vision-Language Model for Medical Interpretation
- URL: http://arxiv.org/abs/2511.08402v1
- Date: Wed, 12 Nov 2025 01:57:26 GMT
- Title: Anatomy-VLM: A Fine-grained Vision-Language Model for Medical Interpretation
- Authors: Difei Gu, Yunhe Gao, Mu Zhou, Dimitris Metaxas,
- Abstract summary: We introduce Anatomy-VLM, a vision-language model that incorporates multi-scale information.<n>First, we design a model encoder to localize key anatomical features from entire medical images.<n>Second, these regions are enriched with structured knowledge for contextually-aware interpretation.<n>Third, the model encoder aligns multi-scale medical information to generate clinically-interpretable disease prediction.
- Score: 12.39187443971813
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Accurate disease interpretation from radiology remains challenging due to imaging heterogeneity. Achieving expert-level diagnostic decisions requires integration of subtle image features with clinical knowledge. Yet major vision-language models (VLMs) treat images as holistic entities and overlook fine-grained image details that are vital for disease diagnosis. Clinicians analyze images by utilizing their prior medical knowledge and identify anatomical structures as important region of interests (ROIs). Inspired from this human-centric workflow, we introduce Anatomy-VLM, a fine-grained, vision-language model that incorporates multi-scale information. First, we design a model encoder to localize key anatomical features from entire medical images. Second, these regions are enriched with structured knowledge for contextually-aware interpretation. Finally, the model encoder aligns multi-scale medical information to generate clinically-interpretable disease prediction. Anatomy-VLM achieves outstanding performance on both in- and out-of-distribution datasets. We also validate the performance of Anatomy-VLM on downstream image segmentation tasks, suggesting that its fine-grained alignment captures anatomical and pathology-related knowledge. Furthermore, the Anatomy-VLM's encoder facilitates zero-shot anatomy-wise interpretation, providing its strong expert-level clinical interpretation capabilities.
Related papers
- Multi-View Stenosis Classification Leveraging Transformer-Based Multiple-Instance Learning Using Real-World Clinical Data [76.89269238957593]
Coronary artery stenosis is a leading cause of cardiovascular disease, diagnosed by analyzing the coronary arteries from multiple angiography views.<n>We propose SegmentMIL, a transformer-based multi-view multiple-instance learning framework for patient-level stenosis classification.
arXiv Detail & Related papers (2026-02-02T13:07:52Z) - RAU: Reference-based Anatomical Understanding with Vision Language Models [26.06602931463068]
We introduce RAU, a framework for reference-based anatomical understanding with vision-language models (VLMs)<n>We first show that a VLM learns to identify anatomical regions through relative spatial reasoning between reference and target images.<n>Next, we demonstrate that the VLM-derived spatial cues can be seamlessly integrated with the fine-grained segmentation capability of SAM2.
arXiv Detail & Related papers (2025-09-26T14:32:03Z) - GRASPing Anatomy to Improve Pathology Segmentation [67.98147643529309]
We introduce GRASP, a modular plug-and-play framework that enhances pathology segmentation models.<n>We evaluate GRASP on two PET/CT datasets, conduct systematic ablation studies, and investigate the framework's inner workings.
arXiv Detail & Related papers (2025-08-05T12:26:36Z) - Causal Disentanglement for Robust Long-tail Medical Image Generation [80.15257897500578]
We propose a novel medical image generation framework, which generates independent pathological and structural features.<n>We leverage a diffusion model guided by pathological findings to model pathological features, enabling the generation of diverse counterfactual images.
arXiv Detail & Related papers (2025-04-20T01:54:18Z) - Fine-tuning Vision Language Models with Graph-based Knowledge for Explainable Medical Image Analysis [44.0659716298839]
Current staging models for Diabetic Retinopathy (DR) are hardly interpretable.<n>We present a novel method that integrates graph representation learning with vision-language models (VLMs) to deliver explainable DR diagnosis.
arXiv Detail & Related papers (2025-03-12T20:19:07Z) - Large-scale and Fine-grained Vision-language Pre-training for Enhanced CT Image Understanding [17.783231335173486]
We propose a fine-grained vision-language model (fVLM) for anatomy-level CT image interpretation.<n>Fine-grained alignment, however, faces considerable false-negative challenges.<n>We curated the largest CT dataset to date, comprising imaging and report data from 69,086 patients.
arXiv Detail & Related papers (2025-01-24T14:50:48Z) - U-Net in Medical Image Segmentation: A Review of Its Applications Across Modalities [0.0]
Recent advancements in Artificial Intelligence (AI) and Deep Learning (DL) have transformed medical image segmentation (MIS)<n>These models enable efficient, precise pixel-wise classification across various imaging modalities.<n>This review explores various medical imaging techniques, examines the U-Net architectures and their adaptations, and discusses their application across different modalities.
arXiv Detail & Related papers (2024-12-03T08:11:06Z) - Advancing Medical Image Segmentation: Morphology-Driven Learning with Diffusion Transformer [4.672688418357066]
We propose a novel Transformer Diffusion (DTS) model for robust segmentation in the presence of noise.
Our model, which analyzes the morphological representation of images, shows better results than the previous models in various medical imaging modalities.
arXiv Detail & Related papers (2024-08-01T07:35:54Z) - Anatomy-guided Pathology Segmentation [56.883822515800205]
We develop a generalist segmentation model that combines anatomical and pathological information, aiming to enhance the segmentation accuracy of pathological features.
Our Anatomy-Pathology Exchange (APEx) training utilizes a query-based segmentation transformer which decodes a joint feature space into query-representations for human anatomy.
In doing so, we are able to report the best results across the board on FDG-PET-CT and Chest X-Ray pathology segmentation tasks with a margin of up to 3.3% as compared to strong baseline methods.
arXiv Detail & Related papers (2024-07-08T11:44:15Z) - Dr-LLaVA: Visual Instruction Tuning with Symbolic Clinical Grounding [53.629132242389716]
Vision-Language Models (VLM) can support clinicians by analyzing medical images and engaging in natural language interactions.
VLMs often exhibit "hallucinogenic" behavior, generating textual outputs not grounded in contextual multimodal information.
We propose a new alignment algorithm that uses symbolic representations of clinical reasoning to ground VLMs in medical knowledge.
arXiv Detail & Related papers (2024-05-29T23:19:28Z) - Act Like a Radiologist: Towards Reliable Multi-view Correspondence
Reasoning for Mammogram Mass Detection [49.14070210387509]
We propose an Anatomy-aware Graph convolutional Network (AGN) for mammogram mass detection.
AGN is tailored for mammogram mass detection and endows existing detection methods with multi-view reasoning ability.
Experiments on two standard benchmarks reveal that AGN significantly exceeds the state-of-the-art performance.
arXiv Detail & Related papers (2021-05-21T06:48:34Z) - Few-shot Medical Image Segmentation using a Global Correlation Network
with Discriminative Embedding [60.89561661441736]
We propose a novel method for few-shot medical image segmentation.
We construct our few-shot image segmentor using a deep convolutional network trained episodically.
We enhance discriminability of deep embedding to encourage clustering of the feature domains of the same class.
arXiv Detail & Related papers (2020-12-10T04:01:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.