From Classification to Cross-Modal Understanding: Leveraging Vision-Language Models for Fine-Grained Renal Pathology
- URL: http://arxiv.org/abs/2511.11984v1
- Date: Sat, 15 Nov 2025 01:44:11 GMT
- Title: From Classification to Cross-Modal Understanding: Leveraging Vision-Language Models for Fine-Grained Renal Pathology
- Authors: Zhenhao Guo, Rachit Saluja, Tianyuan Yao, Quan Liu, Junchao Zhu, Haibo Wang, Daniel Reisenbüchler, Yuankai Huo, Benjamin Liechty, David J. Pisapia, Kenji Ikemura, Steven Salvatoree, Surya Seshane, Mert R. Sabuncu, Yihe Yang, Ruining Deng,
- Abstract summary: We model fine-grained glomerular subtyping as a clinically realistic few-shot problem.<n>We evaluate both pathology-specialized and general-purpose vision-language models under this setting.
- Score: 9.268389327736735
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fine-grained glomerular subtyping is central to kidney biopsy interpretation, but clinically valuable labels are scarce and difficult to obtain. Existing computational pathology approaches instead tend to evaluate coarse diseased classification under full supervision with image-only models, so it remains unclear how vision-language models (VLMs) should be adapted for clinically meaningful subtyping under data constraints. In this work, we model fine-grained glomerular subtyping as a clinically realistic few-shot problem and systematically evaluate both pathology-specialized and general-purpose vision-language models under this setting. We assess not only classification performance (accuracy, AUC, F1) but also the geometry of the learned representations, examining feature alignment between image and text embeddings and the separability of glomerular subtypes. By jointly analyzing shot count, model architecture and domain knowledge, and adaptation strategy, this study provides guidance for future model selection and training under real clinical data constraints. Our results indicate that pathology-specialized vision-language backbones, when paired with the vanilla fine-tuning, are the most effective starting point. Even with only 4-8 labeled examples per glomeruli subtype, these models begin to capture distinctions and show substantial gains in discrimination and calibration, though additional supervision continues to yield incremental improvements. We also find that the discrimination between positive and negative examples is as important as image-text alignment. Overall, our results show that supervision level and adaptation strategy jointly shape both diagnostic performance and multimodal structure, providing guidance for model selection, adaptation strategies, and annotation investment.
Related papers
- Investigating the Impact of Histopathological Foundation Models on Regressive Prediction of Homologous Recombination Deficiency [52.50039435394964]
We systematically evaluate foundation models for regression-based tasks.<n>We extract patch-level features from whole slide images (WSI) using five state-of-the-art foundation models.<n>Models are trained to predict continuous HRD scores based on these extracted features across breast, endometrial, and lung cancer cohorts.
arXiv Detail & Related papers (2026-01-29T14:06:50Z) - Self-Supervised Anatomical Consistency Learning for Vision-Grounded Medical Report Generation [61.350584471060756]
Vision-grounded medical report generation aims to produce clinically accurate descriptions of medical images.<n>We propose Self-Supervised Anatomical Consistency Learning (SS-ACL) to align generated reports with corresponding anatomical regions.<n>SS-ACL constructs a hierarchical anatomical graph inspired by the invariant top-down inclusion structure of human anatomy.
arXiv Detail & Related papers (2025-09-30T08:59:06Z) - Glo-VLMs: Leveraging Vision-Language Models for Fine-Grained Diseased Glomerulus Classification [7.87247433522498]
We introduce Glo-VLMs, a systematic framework designed to explore the adaptation of vision-language models to fine-grained glomerular classification.<n>Our approach leverages curated pathology images alongside clinical text prompts to facilitate joint image-text representation learning.<n>To ensure a fair comparison, we evaluate all models using standardized multi-class metrics, aiming to clarify the practical requirements and potential of large pretrained models for specialized clinical research applications.
arXiv Detail & Related papers (2025-08-21T21:05:44Z) - DiagR1: A Vision-Language Model Trained via Reinforcement Learning for Digestive Pathology Diagnosis [7.5173141954286775]
We construct a large scale gastrointestinal pathology dataset containing both microscopic descriptions and diagnostic conclusions.<n>This design guides the model to better capture image specific features and maintain semantic consistency in generation.<n>Our solution outperforms state of the art models with 18.7% higher clinical relevance, 32.4% improved structural completeness, and 41.2% fewer diagnostic errors.
arXiv Detail & Related papers (2025-07-24T14:12:20Z) - The Skin Game: Revolutionizing Standards for AI Dermatology Model Comparison [0.6144680854063939]
Deep Learning approaches in dermatological image classification have shown promising results, yet the field faces significant methodological challenges that impede proper evaluation.<n>This paper presents a systematic analysis of current methodological practices in skin disease classification research, revealing substantial inconsistencies in data preparation, augmentation strategies, and performance reporting.<n>We propose comprehensive methodological recommendations for model development, evaluation, and clinical deployment, emphasizing rigorous data preparation, systematic error analysis, and specialized protocols for different image types.
arXiv Detail & Related papers (2025-02-04T17:15:36Z) - Visual Prompt Engineering for Vision Language Models in Radiology [0.17183214167143138]
Contrastive Language-Image Pretraining (CLIP) offers a promising solution by enabling zero-shot classification through multimodal large-scale pretraining.<n>While CLIP effectively captures global image content, radiology requires a more localized focus on specific pathology regions to enhance both interpretability and diagnostic accuracy.<n>We explore the potential of incorporating visual cues into zero-shot classification, embedding visual markers, such as arrows, bounding boxes, and circles, directly into radiological images to guide model attention.
arXiv Detail & Related papers (2024-08-28T13:53:27Z) - Transformer-Based Self-Supervised Learning for Histopathological Classification of Ischemic Stroke Clot Origin [0.0]
Identifying the thromboembolism source in ischemic stroke is crucial for treatment and secondary prevention.
This study describes a self-supervised deep learning approach in digital pathology of emboli for classifying ischemic stroke clot origin.
arXiv Detail & Related papers (2024-05-01T23:40:12Z) - Fairness Evolution in Continual Learning for Medical Imaging [47.52603262576663]
This study examines how bias evolves across tasks using domain-specific fairness metrics and how different CL strategies impact this evolution.<n>Our results show that Learning without Forgetting and Pseudo-Label achieve optimal classification performance, but Pseudo-Label is less biased.
arXiv Detail & Related papers (2024-04-10T09:48:52Z) - Strategies to Improve Real-World Applicability of Laparoscopic Anatomy Segmentation Models [6.8726432208129555]
We systematically analyze the impact of class characteristics, training and test data composition, and modeling parameters on eight segmentation metrics.
Our findings support two adjustments to account for data biases in surgical data science.
arXiv Detail & Related papers (2024-03-25T21:08:26Z) - Robust and Interpretable Medical Image Classifiers via Concept
Bottleneck Models [49.95603725998561]
We propose a new paradigm to build robust and interpretable medical image classifiers with natural language concepts.
Specifically, we first query clinical concepts from GPT-4, then transform latent image features into explicit concepts with a vision-language model.
arXiv Detail & Related papers (2023-10-04T21:57:09Z) - Vision-Language Modelling For Radiological Imaging and Reports In The
Low Data Regime [70.04389979779195]
This paper explores training medical vision-language models (VLMs) where the visual and language inputs are embedded into a common space.
We explore several candidate methods to improve low-data performance, including adapting generic pre-trained models to novel image and text domains.
Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variable sized training datasets of paired chest X-rays and radiological reports.
arXiv Detail & Related papers (2023-03-30T18:20:00Z) - Rethinking Semi-Supervised Medical Image Segmentation: A
Variance-Reduction Perspective [51.70661197256033]
We propose ARCO, a semi-supervised contrastive learning framework with stratified group theory for medical image segmentation.
We first propose building ARCO through the concept of variance-reduced estimation and show that certain variance-reduction techniques are particularly beneficial in pixel/voxel-level segmentation tasks.
We experimentally validate our approaches on eight benchmarks, i.e., five 2D/3D medical and three semantic segmentation datasets, with different label settings.
arXiv Detail & Related papers (2023-02-03T13:50:25Z) - Semi-supervised Medical Image Classification with Relation-driven
Self-ensembling Model [71.80319052891817]
We present a relation-driven semi-supervised framework for medical image classification.
It exploits the unlabeled data by encouraging the prediction consistency of given input under perturbations.
Our method outperforms many state-of-the-art semi-supervised learning methods on both single-label and multi-label image classification scenarios.
arXiv Detail & Related papers (2020-05-15T06:57:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.