Initialization matters in few-shot adaptation of vision-language models for histopathological image classification
- URL: http://arxiv.org/abs/2602.18766v1
- Date: Sat, 21 Feb 2026 09:08:40 GMT
- Title: Initialization matters in few-shot adaptation of vision-language models for histopathological image classification
- Authors: Pablo Meseguer, Rocío del Amor, Valery Naranjo
- Abstract summary: We propose Zero-Shot Multiple-Instance Learning (ZS-MIL) for zero-shot slide-level classification problems. ZS-MIL uses the class-level embeddings of the VLM text encoder as the classification layer's starting point to compute each sample's bag-level probabilities.
- Score: 1.3642432845689427
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-language models (VLMs) pre-trained on datasets of histopathological image-caption pairs have enabled zero-shot slide-level classification. The ability of VLM image encoders to extract discriminative features also opens the door to supervised fine-tuning for whole-slide image (WSI) classification, ideally using few labeled samples. Slide-level prediction frameworks require multiple instance learning (MIL) due to the gigapixel size of WSIs. Following patch-level feature extraction and aggregation, MIL frameworks rely on linear classifiers trained on top of the slide-level aggregated features. Classifier weight initialization has a large influence on linear probing performance in efficient transfer learning (ETL) approaches based on few-shot learning. In this work, we propose Zero-Shot Multiple-Instance Learning (ZS-MIL) to address the limitations of random classifier initialization, which underperforms zero-shot prediction in MIL problems. ZS-MIL uses the class-level embeddings of the VLM text encoder as the classification layer's starting point to compute each sample's bag-level probabilities. Through multiple experiments, we demonstrate the robustness of ZS-MIL compared to well-known weight initialization techniques, both in terms of performance and variability, in an ETL few-shot scenario for subtyping prediction.
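The abstract's core idea, initializing the MIL classification layer with the text encoder's class-level embeddings so that the untrained head already reproduces zero-shot prediction, can be sketched as follows. Function and variable names here are illustrative assumptions, not the paper's API, and simple mean pooling stands in for whichever MIL aggregator the authors actually use:

```python
import numpy as np

def zs_mil_logits(patch_features: np.ndarray, text_embeddings: np.ndarray) -> np.ndarray:
    """Zero-shot-initialized MIL classification (sketch).

    patch_features: (n_patches, dim) features from the VLM image encoder.
    text_embeddings: (n_classes, dim) class-level VLM text embeddings, used
    as the linear classifier's weights in place of a random initialization.
    """
    # Aggregate patch features into one slide-level (bag) embedding;
    # mean pooling is one simple MIL aggregator.
    bag = patch_features.mean(axis=0)
    bag = bag / np.linalg.norm(bag)
    # L2-normalize the class embeddings so logits are cosine similarities,
    # i.e. the zero-shot scores before any fine-tuning.
    w = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    return w @ bag  # bag-level class logits
```

In an ETL setting these weights would then be fine-tuned on the few labeled slides, rather than kept frozen.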
Related papers
- Towards Fine-Grained Vision-Language Alignment for Few-Shot Anomaly Detection [65.29550320117526]
We propose a novel framework named FineGrainedAD to improve anomaly localization performance. Experiments demonstrate that the proposed FineGrainedAD achieves superior overall performance in few-shot settings.
arXiv Detail & Related papers (2025-10-30T13:09:00Z)
- Probabilistic Prototype Calibration of Vision-Language Models for Generalized Few-shot Semantic Segmentation [75.18058114915327]
Generalized Few-Shot Semantic Segmentation (GFSS) aims to extend a segmentation model to novel classes with only a few annotated examples. We propose FewCLIP, a probabilistic prototype calibration framework over multi-modal prototypes from the pretrained CLIP. We show FewCLIP significantly outperforms state-of-the-art approaches across both GFSS and class-incremental settings.
arXiv Detail & Related papers (2025-06-28T18:36:22Z)
- Benchmarking histopathology foundation models in a multi-center dataset for skin cancer subtyping [1.927195358774599]
Pretraining on large-scale, in-domain datasets grants histopathology foundation models (FMs) the ability to learn task-agnostic data representations. In computational pathology, automated whole slide image analysis requires multiple instance learning (MIL) frameworks due to the gigapixel scale of the slides. Our work presents a novel benchmark for evaluating histopathology FMs as patch-level feature extractors within a MIL classification framework.
arXiv Detail & Related papers (2025-06-23T14:12:16Z)
- Unbiased Max-Min Embedding Classification for Transductive Few-Shot Learning: Clustering and Classification Are All You Need [83.10178754323955]
Few-shot learning enables models to generalize from only a few labeled examples. We propose the Unbiased Max-Min Embedding Classification (UMMEC) method, which addresses the key challenges in few-shot learning. Our method significantly improves classification performance with minimal labeled data, advancing the state-of-the-art in transductive few-shot learning.
arXiv Detail & Related papers (2025-03-28T07:23:07Z)
- Slide-Level Prompt Learning with Vision Language Models for Few-Shot Multiple Instance Learning in Histopathology [21.81603581614496]
We address the challenge of few-shot classification in histopathology whole slide images (WSIs). Our method distinguishes itself by utilizing pathological prior knowledge from language models to identify crucial local tissue types (patches) for WSI classification. Our approach effectively aligns patch images with tissue types, and we fine-tune our model via prompt learning using only a few labeled WSIs per category.
arXiv Detail & Related papers (2025-03-21T15:40:37Z)
- ViLa-MIL: Dual-scale Vision-Language Multiple Instance Learning for Whole Slide Image Classification [52.405499816861635]
Multiple instance learning (MIL)-based frameworks have become the mainstream approach for processing whole slide images (WSIs). We propose a dual-scale vision-language multiple instance learning (ViLa-MIL) framework for whole slide image classification.
arXiv Detail & Related papers (2025-02-12T13:28:46Z)
- MI-VisionShot: Few-shot adaptation of vision-language models for slide-level classification of histopathological images [1.927195358774599]
MI-VisionShot is a training-free adaptation method on top of vision-language models to predict slide-level labels.
Our framework takes advantage of the excellent representation learning of VLM to create prototype-based classifiers.
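A training-free prototype-based classifier of the kind this summary describes can be sketched as follows; the function name and the cosine nearest-prototype rule are illustrative assumptions, not MI-VisionShot's exact method:

```python
import numpy as np

def prototype_predict(support_feats: np.ndarray, support_labels: np.ndarray,
                      query_feat: np.ndarray):
    """Nearest-prototype classification (sketch): each class prototype is
    the mean of its few-shot support embeddings, and a query is assigned
    to the class whose prototype is closest in cosine similarity.
    No training is involved, only averaging and a dot product.
    """
    classes = np.unique(support_labels)
    protos = np.stack([support_feats[support_labels == c].mean(axis=0)
                       for c in classes])
    protos = protos / np.linalg.norm(protos, axis=1, keepdims=True)
    q = query_feat / np.linalg.norm(query_feat)
    return classes[int(np.argmax(protos @ q))]
```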
arXiv Detail & Related papers (2024-10-21T11:01:20Z)
- Queryable Prototype Multiple Instance Learning with Vision-Language Models for Incremental Whole Slide Image Classification [10.667645628712542]
Whole Slide Image (WSI) classification has significant applications in clinical pathology. This paper proposes the first vision-language-based framework with Queryable Prototype Multiple Instance Learning (QPMIL-VL), specially designed for incremental WSI classification.
arXiv Detail & Related papers (2024-10-14T14:49:34Z)
- Position: From Correlation to Causation: Max-Pooling-Based Multi-Instance Learning Leads to More Robust Whole Slide Image Classification [51.95824566163554]
We argue that well-trained max-pooling-based MIL models can make predictions based on causal factors and avoid relying on spurious correlations. We propose a simple yet effective max-pooling-based MIL method (FocusMIL) that outperforms existing mainstream attention-based methods on two datasets.
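The max-pooling aggregation this summary argues for can be contrasted with mean pooling in a minimal sketch (illustrative only, not the FocusMIL implementation):

```python
import numpy as np

def mil_bag_score(instance_scores: np.ndarray, pooling: str = "max") -> float:
    """Aggregate per-patch positive-class scores into one bag (slide) score.

    Under max-pooling, a single strongly positive patch flags the whole
    slide, matching the standard MIL assumption that a bag is positive if
    any instance is; under mean-pooling that signal is diluted by the many
    negative patches.
    """
    if pooling == "max":
        return float(instance_scores.max())
    return float(instance_scores.mean())
```

For a slide with one tumor patch among many benign ones, the max-pooled score stays high while the mean-pooled score is dragged toward the negatives.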
arXiv Detail & Related papers (2024-08-18T12:15:22Z)
- Rethinking Pre-Trained Feature Extractor Selection in Multiple Instance Learning for Whole Slide Image Classification [2.375943263571389]
Multiple instance learning (MIL) has become a preferred method for gigapixel whole slide image (WSI) classification without requiring patch-level annotations. This study systematically evaluates MIL feature extractors across three dimensions: pre-training dataset, backbone model, and pre-training method.
arXiv Detail & Related papers (2024-08-02T10:34:23Z)
- RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition [78.97487780589574]
Multimodal Large Language Models (MLLMs) excel at classifying fine-grained categories.
This paper introduces a Retrieving And Ranking augmented method for MLLMs.
Our proposed approach not only addresses the inherent limitations in fine-grained recognition but also preserves the model's comprehensive knowledge base.
arXiv Detail & Related papers (2024-03-20T17:59:55Z)
- CLAMP: Contrastive LAnguage Model Prompt-tuning [89.96914454453791]
We show that large language models can achieve good image classification performance when adapted via contrastive prompt-tuning.
Our approach beats state-of-the-art mLLMs by 13% and slightly outperforms contrastive learning with a custom text model.
arXiv Detail & Related papers (2023-12-04T05:13:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.