PathFLIP: Fine-grained Language-Image Pretraining for Versatile Computational Pathology
- URL: http://arxiv.org/abs/2512.17621v1
- Date: Fri, 19 Dec 2025 14:26:50 GMT
- Title: PathFLIP: Fine-grained Language-Image Pretraining for Versatile Computational Pathology
- Authors: Fengchun Liu, Songhan Jiang, Linghan Cai, Ziyue Wang, Yongbing Zhang,
- Abstract summary: We propose PathFLIP (Pathology Fine-grained Language-Image Pretraining), a novel framework for holistic Whole Slide Image (WSI) interpretation. PathFLIP decomposes slide-level captions into region-level subcaptions and generates text-conditioned region embeddings to facilitate precise visual-language grounding.
- Score: 6.821738567680833
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While Vision-Language Models (VLMs) have achieved notable progress in computational pathology (CPath), the gigapixel scale and spatial heterogeneity of Whole Slide Images (WSIs) continue to pose challenges for multimodal understanding. Existing alignment methods struggle to capture fine-grained correspondences between textual descriptions and visual cues across thousands of patches from a slide, compromising their performance on downstream tasks. In this paper, we propose PathFLIP (Pathology Fine-grained Language-Image Pretraining), a novel framework for holistic WSI interpretation. PathFLIP decomposes slide-level captions into region-level subcaptions and generates text-conditioned region embeddings to facilitate precise visual-language grounding. By harnessing Large Language Models (LLMs), PathFLIP can seamlessly follow diverse clinical instructions and adapt to varied diagnostic contexts. Furthermore, it exhibits versatile capabilities across multiple paradigms, efficiently handling slide-level classification and retrieval, fine-grained lesion localization, and instruction following. Extensive experiments demonstrate that PathFLIP outperforms existing large-scale pathological VLMs on four representative benchmarks while requiring significantly less training data, paving the way for fine-grained, instruction-aware WSI interpretation in clinical practice.
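The abstract's core idea, generating a region embedding conditioned on a sub-caption, can be illustrated with a minimal attention-pooling sketch. This is a hypothetical single-head dot-product formulation for illustration only; the function name, dimensions, and temperature are assumptions, not PathFLIP's actual architecture.

```python
import numpy as np

def text_conditioned_region_embedding(patch_embs, text_emb, temperature=0.07):
    """Attention-pool patch embeddings using a text query (illustrative sketch).

    The text embedding attends over all patch embeddings; the softmax-weighted
    sum of patches yields one text-conditioned region embedding.
    """
    # Cosine-normalize so dot products are bounded similarity scores.
    p = patch_embs / np.linalg.norm(patch_embs, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb)
    scores = p @ t / temperature                  # (num_patches,) similarities
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                      # softmax attention weights
    return weights @ patch_embs                   # (dim,) region embedding

rng = np.random.default_rng(0)
patches = rng.normal(size=(2000, 64))             # thousands of patch features
query = rng.normal(size=64)                       # one sub-caption embedding
region = text_conditioned_region_embedding(patches, query)
print(region.shape)                               # (64,)
```

Patches most similar to the sub-caption dominate the weighted sum, which is one plausible way a model could ground region-level text in a gigapixel slide without supervising every patch.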
Related papers
- Towards Fine-Grained Vision-Language Alignment for Few-Shot Anomaly Detection [65.29550320117526]
We propose a novel framework named FineGrainedAD to improve anomaly localization performance. Experiments demonstrate that the proposed FineGrainedAD achieves superior overall performance in few-shot settings.
arXiv Detail & Related papers (2025-10-30T13:09:00Z) - GMAT: Grounded Multi-Agent Clinical Description Generation for Text Encoder in Vision-Language MIL for Whole Slide Image Classification [4.922864692096282]
Multiple Instance Learning (MIL) is the leading approach for whole slide image (WSI) classification. Recent work has introduced vision-language models (VLMs) into MIL pipelines to incorporate medical knowledge. We propose a vision-language MIL framework with two key contributions.
arXiv Detail & Related papers (2025-08-02T09:59:39Z) - Slide-Level Prompt Learning with Vision Language Models for Few-Shot Multiple Instance Learning in Histopathology [21.81603581614496]
We address the challenge of few-shot classification in histopathology whole slide images (WSIs). Our method distinguishes itself by utilizing pathological prior knowledge from language models to identify crucial local tissue types (patches) for WSI classification. Our approach effectively aligns patch images with tissue types, and we fine-tune our model via prompt learning using only a few labeled WSIs per category.
arXiv Detail & Related papers (2025-03-21T15:40:37Z) - ViLa-MIL: Dual-scale Vision-Language Multiple Instance Learning for Whole Slide Image Classification [52.405499816861635]
The multiple instance learning (MIL)-based framework has become the mainstream approach for processing whole slide images (WSIs). We propose a dual-scale vision-language multiple instance learning (ViLa-MIL) framework for whole slide image classification.
arXiv Detail & Related papers (2025-02-12T13:28:46Z) - Contrastive Localized Language-Image Pre-Training [60.4967533101887]
Contrastive Language-Image Pre-training (CLIP) has been a celebrated method for training vision encoders to generate image/text representations. We propose Contrastive Localized Language-Image Pre-training (CLOC) by complementing CLIP with a region-text contrastive loss and modules. CLOC enables high-quality regional embeddings for image region recognition and retrieval tasks.
arXiv Detail & Related papers (2024-10-03T17:56:09Z) - Pathology-knowledge Enhanced Multi-instance Prompt Learning for Few-shot Whole Slide Image Classification [19.070685830687285]
In clinical settings, restricted access to pathology slides is inevitable due to patient privacy concerns and the prevalence of rare or emerging diseases.
This paper proposes a multi-instance prompt learning framework enhanced with pathology knowledge.
Our method demonstrates superior performance in three challenging clinical tasks, significantly outperforming comparative few-shot methods.
arXiv Detail & Related papers (2024-07-15T15:31:55Z) - CPLIP: Zero-Shot Learning for Histopathology with Comprehensive Vision-Language Alignment [40.811510317145675]
CPLIP is a new unsupervised technique to enhance the alignment of images and text in histopathology.
Evaluated across multiple histopathology tasks, CPLIP shows notable improvements in zero-shot learning scenarios.
To encourage further research and replication, the code for CPLIP is available on GitHub.
arXiv Detail & Related papers (2024-06-07T18:39:58Z) - Hierarchical Text-to-Vision Self Supervised Alignment for Improved Histopathology Representation Learning [64.1316997189396]
We present a novel language-tied self-supervised learning framework, Hierarchical Language-tied Self-Supervision (HLSS) for histopathology images.
Our resulting model achieves state-of-the-art performance on two medical imaging benchmarks, OpenSRH and TCGA datasets.
arXiv Detail & Related papers (2024-03-21T17:58:56Z) - PathM3: A Multimodal Multi-Task Multiple Instance Learning Framework for Whole Slide Image Classification and Captioning [35.24716774767677]
We present PathM3, a multi-task, multiple instance learning framework for WSI classification and captioning.
Our method overcomes data scarcity in WSI-level captions by leveraging limited WSI diagnostic caption data.
arXiv Detail & Related papers (2024-03-13T21:19:12Z) - A self-supervised framework for learning whole slide representations [52.774822784847565]
We present Slide Pre-trained Transformers (SPT) for gigapixel-scale self-supervision of whole slide images.
We benchmark SPT visual representations on five diagnostic tasks across three biomedical microscopy datasets.
arXiv Detail & Related papers (2024-02-09T05:05:28Z) - VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment [52.489874804051304]
VoLTA is a new vision-language pre-training paradigm that utilizes only image-caption data yet achieves fine-grained region-level image understanding.
VoLTA pushes multi-modal fusion deep into the uni-modal backbones during pre-training.
Experiments on a wide range of vision- and vision-language downstream tasks demonstrate the effectiveness of VoLTA.
arXiv Detail & Related papers (2022-10-09T01:49:58Z)
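Several entries above (CLIP, CLOC, CPLIP, VoLTA) build on the same contrastive image-text objective. A generic symmetric InfoNCE sketch, not the exact loss of any one paper, is shown below; the function name and temperature are illustrative assumptions.

```python
import numpy as np

def clip_style_loss(img_embs, txt_embs, temperature=0.07):
    """Symmetric InfoNCE loss over matched image/text pairs (generic sketch).

    Row i of img_embs is assumed to pair with row i of txt_embs, so the
    matching pairs lie on the diagonal of the similarity matrix.
    """
    i = img_embs / np.linalg.norm(img_embs, axis=1, keepdims=True)
    t = txt_embs / np.linalg.norm(txt_embs, axis=1, keepdims=True)
    logits = i @ t.T / temperature                # (N, N) similarity matrix
    diag = np.arange(len(logits))                 # indices of matching pairs

    def xent(l):
        # Row-wise cross-entropy against the diagonal targets.
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[diag, diag].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(1)
imgs = rng.normal(size=(8, 32))
loss_mismatched = clip_style_loss(imgs, rng.normal(size=(8, 32)))
loss_matched = clip_style_loss(imgs, imgs)        # perfectly aligned pairs
print(loss_matched < loss_mismatched)             # aligned pairs score lower loss
```

Region-level variants such as CLOC apply the same objective to region embeddings and region captions instead of whole images and whole captions.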
This list is automatically generated from the titles and abstracts of the papers in this site.