Fine-tuning Vision Language Models with Graph-based Knowledge for Explainable Medical Image Analysis
- URL: http://arxiv.org/abs/2503.09808v1
- Date: Wed, 12 Mar 2025 20:19:07 GMT
- Title: Fine-tuning Vision Language Models with Graph-based Knowledge for Explainable Medical Image Analysis
- Authors: Chenjun Li, Laurin Lux, Alexander H. Berger, Martin J. Menten, Mert R. Sabuncu, Johannes C. Paetzold
- Abstract summary: Current staging models for Diabetic Retinopathy (DR) are largely uninterpretable. We present a novel method that integrates graph representation learning with vision-language models (VLMs) to deliver explainable DR diagnosis.
- Score: 44.38638601819933
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Accurate staging of Diabetic Retinopathy (DR) is essential for guiding timely interventions and preventing vision loss. However, current staging models are largely uninterpretable, and most public datasets contain no clinical reasoning or interpretation beyond image-level labels. In this paper, we present a novel method that integrates graph representation learning with vision-language models (VLMs) to deliver explainable DR diagnosis. Our approach leverages optical coherence tomography angiography (OCTA) images by constructing biologically informed graphs that encode key retinal vascular features such as vessel morphology and spatial connectivity. A graph neural network (GNN) then performs DR staging, while integrated gradients highlight the critical nodes, edges, and individual features that drive the classification decisions. We collect this graph-based knowledge, which attributes the model's predictions to physiological structures and their characteristics, and transform it into textual descriptions for VLMs. We perform instruction-tuning with these textual descriptions and the corresponding images to train a student VLM. This final agent can classify the disease and explain its decision in a human-interpretable way based solely on a single image input. Experimental evaluations on both proprietary and public datasets demonstrate that our method not only improves classification accuracy but also offers more clinically interpretable results. An expert study further demonstrates that our method provides more accurate diagnostic explanations and paves the way for precise localization of pathologies in OCTA images.
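To make the pipeline concrete, below is a minimal, hypothetical sketch of the graph-to-text stage using PyTorch Geometric and Captum: a small GCN stages DR from a vessel graph, Integrated Gradients attributes the prediction to node features, and the strongest attribution is rendered as a textual finding for instruction-tuning the student VLM. Feature names, dimensions, and the rendering step are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, global_mean_pool
from captum.attr import IntegratedGradients

VESSEL_FEATURES = ["vessel_length", "tortuosity", "mean_radius"]  # assumed node features

class DRStagingGNN(torch.nn.Module):
    def __init__(self, in_dim=3, hidden=32, num_stages=4):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden)
        self.conv2 = GCNConv(hidden, hidden)
        self.head = torch.nn.Linear(hidden, num_stages)

    def forward(self, x, edge_index, batch):
        h = F.relu(self.conv1(x, edge_index))
        h = F.relu(self.conv2(h, edge_index))
        return self.head(global_mean_pool(h, batch))  # one logit vector per graph

model = DRStagingGNN().eval()
x = torch.rand(10, 3)                        # 10 vessel segments, 3 features each
edge_index = torch.tensor([[0, 1, 2], [1, 2, 3]])
batch = torch.zeros(10, dtype=torch.long)    # all nodes belong to one OCTA graph

stage = model(x, edge_index, batch).argmax(dim=1).item()

def forward_fn(node_feats):
    # Captum adds a leading batch dim; each "example" here is the full node matrix.
    return torch.stack([model(xi, edge_index, batch) for xi in node_feats]).squeeze(1)

ig = IntegratedGradients(forward_fn)
attr = ig.attribute(x.unsqueeze(0), target=stage).squeeze(0)  # (nodes, features)

# Render the strongest attribution as text for VLM instruction tuning.
node, feat = divmod(attr.abs().argmax().item(), len(VESSEL_FEATURES))
print(f"Predicted stage {stage}: driven mainly by {VESSEL_FEATURES[feat]} "
      f"of vessel segment {node}.")
```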
Related papers
- From Gaze to Insight: Bridging Human Visual Attention and Vision Language Model Explanation for Weakly-Supervised Medical Image Segmentation [46.99748372216857]
Vision-language models (VLMs) provide semantic context through textual descriptions but lack the fine-grained precision required for explanation.
We propose a teacher-student framework that integrates both gaze and language supervision, leveraging their complementary strengths.
Our method achieves Dice scores of 80.78%, 80.53%, and 84.22% on the three evaluation datasets, respectively, improving 3-5% over gaze baselines without increasing the annotation burden.
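As a hedged illustration of how gaze and language supervision might combine in such a teacher-student setup, here is a minimal loss sketch under assumed tensor shapes; the weighting and helper names are not from the paper.

```python
import torch
import torch.nn.functional as F

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss on probability maps of shape (B, 1, H, W)."""
    inter = (pred * target).sum(dim=(1, 2, 3))
    union = pred.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    return 1 - ((2 * inter + eps) / (union + eps)).mean()

def weak_supervision_loss(student_logits, gaze_heatmap, teacher_logits, alpha=0.5):
    probs = torch.sigmoid(student_logits)
    gaze_term = dice_loss(probs, gaze_heatmap)                       # gaze supervision
    distill_term = F.mse_loss(probs, torch.sigmoid(teacher_logits))  # language-teacher distillation
    return alpha * gaze_term + (1 - alpha) * distill_term
```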
arXiv Detail & Related papers (2025-04-15T16:32:15Z)
- From Pixels to Histopathology: A Graph-Based Framework for Interpretable Whole Slide Image Analysis [81.19923502845441]
We develop a graph-based framework that constructs WSI graph representations.
We build tissue representations (nodes) that follow biological boundaries rather than arbitrary patches.
In our method's final step, we solve the diagnostic task through a graph attention network.
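A minimal sketch of that final step, assuming precomputed tissue-node features; the layer sizes here are illustrative, not the paper's configuration.

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GATConv, global_mean_pool

class WSIGraphClassifier(torch.nn.Module):
    def __init__(self, in_dim=64, hidden=64, heads=4, num_classes=2):
        super().__init__()
        self.gat1 = GATConv(in_dim, hidden, heads=heads)
        self.gat2 = GATConv(hidden * heads, hidden, heads=1)
        self.head = torch.nn.Linear(hidden, num_classes)

    def forward(self, x, edge_index, batch):
        h = F.elu(self.gat1(x, edge_index))   # attention over neighboring tissue regions
        h = F.elu(self.gat2(h, edge_index))
        return self.head(global_mean_pool(h, batch))  # slide-level diagnosis
```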
arXiv Detail & Related papers (2025-03-14T20:15:04Z)
- Interpretable Retinal Disease Prediction Using Biology-Informed Heterogeneous Graph Representations [40.8160960729546]
Interpretability is crucial for building trust in machine learning models for medical diagnostics. This work proposes a method that surpasses the performance of established machine learning models.
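For intuition, a biology-informed heterogeneous retinal graph could be assembled in PyTorch Geometric roughly as follows; the node and edge types here are assumptions for illustration, not the paper's schema.

```python
import torch
from torch_geometric.data import HeteroData

data = HeteroData()
data["vessel"].x = torch.rand(12, 4)     # e.g. length, tortuosity, radius, density
data["icp_area"].x = torch.rand(5, 2)    # intercapillary-area features
data["vessel", "connects", "vessel"].edge_index = torch.tensor([[0, 1], [1, 2]])
data["vessel", "borders", "icp_area"].edge_index = torch.tensor([[0, 3], [0, 1]])
print(data)  # heterogeneous graph ready for a typed GNN
```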
arXiv Detail & Related papers (2025-02-23T19:27:47Z)
- Learning Generalized Medical Image Representations through Image-Graph Contrastive Pretraining [11.520404630575749]
We develop an Image-Graph Contrastive Learning framework that pairs chest X-rays with structured report knowledge graphs automatically extracted from radiology notes.
Our approach uniquely encodes the disconnected graph components via a relational graph convolution network and transformer attention.
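The pairing objective can be illustrated with a standard symmetric InfoNCE loss between image and graph embeddings; this is a generic sketch, with the relational-GCN and transformer encoders abstracted away as precomputed embeddings.

```python
import torch
import torch.nn.functional as F

def image_graph_contrastive_loss(img_emb, graph_emb, temperature=0.07):
    """Symmetric InfoNCE between paired image and report-graph embeddings (B, D)."""
    img = F.normalize(img_emb, dim=1)
    gph = F.normalize(graph_emb, dim=1)
    logits = img @ gph.t() / temperature   # (B, B) similarity matrix
    labels = torch.arange(img.size(0))     # matching pairs lie on the diagonal
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
```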
arXiv Detail & Related papers (2024-05-15T12:27:38Z)
- Semi-Supervised Graph Representation Learning with Human-centric Explanation for Predicting Fatty Liver Disease [2.992602379681373]
This study explores the potential of graph representation learning within a semi-supervised learning framework.
Our approach constructs a subject similarity graph to identify risk patterns from health checkup data.
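A subject-similarity graph of this kind is often built with a k-nearest-neighbor rule; below is a small sketch under that assumption (the feature count and k are made up).

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

checkups = np.random.rand(100, 12)   # 100 subjects, 12 checkup measurements (assumed)
adj = kneighbors_graph(checkups, n_neighbors=5, mode="connectivity")
rows, cols = adj.nonzero()
edge_index = np.stack([rows, cols])  # edge list ready for a semi-supervised GNN
print(edge_index.shape)
```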
arXiv Detail & Related papers (2024-03-05T08:59:45Z)
- Multimodal brain age estimation using interpretable adaptive population-graph learning [58.99653132076496]
We propose a framework that learns a population graph structure optimized for the downstream task.
An attention mechanism assigns weights to a set of imaging and non-imaging features.
By visualizing the attention weights that were the most important for the graph construction, we increase the interpretability of the graph.
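A toy sketch of the mechanism, assuming attention is a learnable per-feature weighting that shapes pairwise distances for graph construction; this is not the authors' exact formulation.

```python
import torch

class FeatureAttention(torch.nn.Module):
    def __init__(self, num_features):
        super().__init__()
        self.scores = torch.nn.Parameter(torch.zeros(num_features))

    def forward(self, x):                      # x: (num_subjects, num_features)
        w = torch.softmax(self.scores, dim=0)  # interpretable per-feature weights
        d = torch.cdist(x * w, x * w)          # weighted pairwise subject distances
        return d, w

att = FeatureAttention(num_features=8)
dist, weights = att(torch.rand(30, 8))
print(weights)  # visualizing these weights is what makes the graph interpretable
```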
arXiv Detail & Related papers (2023-07-10T15:35:31Z)
- Dynamic Graph Enhanced Contrastive Learning for Chest X-ray Report Generation [92.73584302508907]
We propose a knowledge graph with dynamic structure and nodes to facilitate medical report generation with contrastive learning.
In detail, the fundamental structure of our graph is pre-constructed from general knowledge.
Each image feature is integrated with its very own updated graph before being fed into the decoder module for report generation.
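One plausible reading of that integration step, sketched as a simple relevance-weighted graph update before decoding; the dimensions and update rule are assumptions, not the paper's method.

```python
import torch

def update_graph_with_image(node_emb, img_feat):
    """node_emb: (N, D) pre-built knowledge-graph nodes; img_feat: (D,) image feature."""
    attn = torch.softmax(node_emb @ img_feat, dim=0)   # (N,) relevance of each node
    updated = node_emb + attn.unsqueeze(1) * img_feat  # inject image evidence per node
    return torch.cat([img_feat, updated.mean(dim=0)])  # decoder input of size 2D

dec_in = update_graph_with_image(torch.rand(20, 256), torch.rand(256))
print(dec_in.shape)  # torch.Size([512])
```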
arXiv Detail & Related papers (2023-03-18T03:53:43Z)
- Pixel-Level Explanation of Multiple Instance Learning Models in Biomedical Single Cell Images [52.527733226555206]
We investigate the use of four attribution methods to explain multiple instance learning models.
We study two datasets of acute myeloid leukemia comprising over 100,000 single-cell images.
We compare attribution maps with the annotations of a medical expert to see how the model's decision-making differs from the human standard.
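Such comparisons are commonly scored with overlap metrics; here is a sketch using IoU between the top-attributed pixels and the expert mask (the metric choice and threshold are assumptions).

```python
import numpy as np

def attribution_vs_expert_iou(attr_map, expert_mask, top_frac=0.1):
    """Binarize the top `top_frac` attributed pixels and compute IoU with the mask."""
    thresh = np.quantile(attr_map, 1 - top_frac)
    salient = attr_map >= thresh
    inter = np.logical_and(salient, expert_mask).sum()
    union = np.logical_or(salient, expert_mask).sum()
    return inter / union if union else 0.0

iou = attribution_vs_expert_iou(np.random.rand(64, 64), np.random.rand(64, 64) > 0.9)
print(f"IoU with expert annotation: {iou:.3f}")
```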
arXiv Detail & Related papers (2023-03-15T14:00:11Z)
- Deep Co-Attention Network for Multi-View Subspace Learning [73.3450258002607]
We propose a deep co-attention network for multi-view subspace learning.
It aims to extract both the common information and the complementary information in an adversarial setting.
In particular, it uses a novel cross reconstruction loss and leverages the label information to guide the construction of the latent representation.
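An illustrative cross reconstruction loss in this spirit, with stand-in linear decoders; the adversarial and label-guidance terms are omitted, and all shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def cross_reconstruction_loss(x1, x2, z1, z2, dec1, dec2):
    """Reconstruct view 1 from view 2's latent code and vice versa."""
    return F.mse_loss(dec1(z2), x1) + F.mse_loss(dec2(z1), x2)

dec1 = torch.nn.Linear(16, 32)   # latent (16) -> view-1 space (32)
dec2 = torch.nn.Linear(16, 24)   # latent (16) -> view-2 space (24)
loss = cross_reconstruction_loss(torch.rand(8, 32), torch.rand(8, 24),
                                 torch.rand(8, 16), torch.rand(8, 16), dec1, dec2)
```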
arXiv Detail & Related papers (2021-02-15T18:46:44Z)
- Context Matters: Graph-based Self-supervised Representation Learning for Medical Images [21.23065972218941]
We introduce a novel approach with two levels of self-supervised representation learning objectives.
We use graph neural networks to incorporate the relationship between different anatomical regions.
Our model can identify clinically relevant regions in the images.
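A minimal sketch of relating anatomical regions with a GNN, as the summary describes; the region embeddings and anatomical-adjacency edges are assumed inputs, and the self-supervised objectives are left out.

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class RegionRelationEncoder(torch.nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.conv = GCNConv(dim, dim)

    def forward(self, region_emb, anatomy_edges):
        # region_emb: (num_regions, dim); anatomy_edges: (2, num_adjacencies)
        return F.relu(self.conv(region_emb, anatomy_edges))

enc = RegionRelationEncoder()
out = enc(torch.rand(6, 128), torch.tensor([[0, 1, 2], [1, 2, 3]]))
```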
arXiv Detail & Related papers (2020-12-11T16:26:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and accepts no responsibility for any consequences of its use.