Evaluating Cell Type Inference in Vision Language Models Under Varying Visual Context
- URL: http://arxiv.org/abs/2506.12683v1
- Date: Sun, 15 Jun 2025 01:50:16 GMT
- Title: Evaluating Cell Type Inference in Vision Language Models Under Varying Visual Context
- Authors: Samarth Singhal, Sandeep Singhal
- Abstract summary: Vision-Language Models (VLMs) have rapidly advanced alongside Large Language Models (LLMs). This study evaluates the capabilities of prominent generative VLMs, such as GPT-4.1 and Gemini 2.5 Pro, for histopathology image classification tasks.
- Score: 0.16385815610837165
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language Models (VLMs) have rapidly advanced alongside Large Language Models (LLMs). This study evaluates the capabilities of prominent generative VLMs, such as GPT-4.1 and Gemini 2.5 Pro, accessed via APIs, for histopathology image classification tasks, including cell typing. Using diverse datasets from public and private sources, we apply zero-shot and one-shot prompting methods to assess VLM performance, comparing them against custom-trained Convolutional Neural Networks (CNNs). Our findings demonstrate that while one-shot prompting significantly improves VLM performance over zero-shot ($p \approx 1.005 \times 10^{-5}$ based on Kappa scores), these general-purpose VLMs currently underperform supervised CNNs on most tasks. This work underscores both the promise and limitations of applying current VLMs to specialized domains like pathology via in-context learning. All code and instructions for reproducing the study can be accessed from the repository https://www.github.com/a12dongithub/VLMCCE.
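To make the evaluation setup concrete, the sketch below contrasts zero-shot and one-shot prompting of a chat-style VLM API for cell typing and scores the predictions with Cohen's kappa, the agreement metric cited in the abstract. It is a minimal illustration assuming an OpenAI-compatible client; the model name, label set, prompts, and file paths are placeholders, not the authors' released pipeline (see the linked repository for that).

```python
# Minimal sketch of zero-shot vs. one-shot VLM prompting for cell typing.
# Assumes an OpenAI-compatible chat API; labels, prompts, and paths are
# illustrative placeholders, not the authors' released code.
import base64
from openai import OpenAI
from sklearn.metrics import cohen_kappa_score

client = OpenAI()  # reads OPENAI_API_KEY from the environment
LABELS = ["lymphocyte", "neutrophil", "epithelial", "macrophage"]  # hypothetical label set

def image_part(path: str) -> dict:
    """Encode a local image patch as a data-URL content part."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}

def classify(query_path: str, example: tuple[str, str] | None = None) -> str:
    """Zero-shot if `example` is None, otherwise one-shot with (image_path, label)."""
    content = [{"type": "text",
                "text": f"Classify the cell in the image as one of: {', '.join(LABELS)}. "
                        "Answer with the label only."}]
    if example is not None:
        ex_path, ex_label = example
        content += [{"type": "text", "text": "Example image:"}, image_part(ex_path),
                    {"type": "text", "text": f"Label: {ex_label}. Now classify this image:"}]
    content.append(image_part(query_path))
    resp = client.chat.completions.create(
        model="gpt-4.1",  # one of the VLMs evaluated in the paper
        messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content.strip().lower()

# Compare the two prompting modes on a small labelled set and report kappa agreement.
test_set = [("patch_001.png", "lymphocyte"), ("patch_002.png", "epithelial")]  # placeholder paths
truth = [label for _, label in test_set]
zero_shot = [classify(path) for path, _ in test_set]
one_shot = [classify(path, example=("support_patch.png", "neutrophil")) for path, _ in test_set]
print("zero-shot kappa:", cohen_kappa_score(truth, zero_shot))
print("one-shot  kappa:", cohen_kappa_score(truth, one_shot))
```

In the paper's setting, such per-method kappa scores (against ground truth, over each dataset) are what feed the significance comparison between zero-shot and one-shot prompting and the comparison against the supervised CNN baselines.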
Related papers
- Roboflow100-VL: A Multi-Domain Object Detection Benchmark for Vision-Language Models [35.79522480146796]
We introduce Roboflow100-VL, a large-scale collection of 100 multi-modal object detection datasets. We evaluate state-of-the-art models on our benchmark in zero-shot, few-shot, semi-supervised, and fully-supervised settings.
arXiv Detail & Related papers (2025-05-27T01:24:29Z)
- Unbiased Region-Language Alignment for Open-Vocabulary Dense Prediction [80.67150791183126]
Pre-trained vision-language models (VLMs) have demonstrated impressive zero-shot recognition capability, but still underperform in dense prediction tasks. We propose DenseVLM, a framework designed to learn unbiased region-language alignment from powerful pre-trained VLM representations. We show that DenseVLM can directly replace the original VLM in open-vocabulary object detection and image segmentation methods.
arXiv Detail & Related papers (2024-12-09T06:34:23Z)
- Vision Language Models are In-Context Value Learners [89.29486557646624]
We present Generative Value Learning (GVL), a universal value function estimator that leverages the world knowledge embedded in vision-language models (VLMs) to predict task progress.
Without any robot- or task-specific training, GVL can predict effective values in-context, in both zero-shot and few-shot settings, for more than 300 distinct real-world tasks.
arXiv Detail & Related papers (2024-11-07T09:17:50Z)
- EZ-HOI: VLM Adaptation via Guided Prompt Learning for Zero-Shot HOI Detection [21.091101582856183]
We introduce a novel prompt learning-based framework for Efficient Zero-Shot HOI detection (EZ-HOI). First, we introduce Large Language Model (LLM) and VLM guidance for learnable prompts, integrating detailed HOI descriptions and visual semantics to adapt VLMs to HOI tasks. We show that our framework achieves state-of-the-art performance across various zero-shot settings with only 10.35% to 33.95% of the trainable parameters compared to existing methods.
arXiv Detail & Related papers (2024-10-31T13:06:29Z)
- Boosting Vision-Language Models for Histopathology Classification: Predict all at once [11.644118356081531]
We introduce a transductive approach to vision-language models for histopathology.
Our approach is highly efficient, processing $10^5$ patches in just a few seconds.
arXiv Detail & Related papers (2024-09-03T13:24:12Z)
- The Neglected Tails in Vision-Language Models [51.79913798808725]
We show that vision-language models (VLMs) excel in zero-shot recognition but their performance varies greatly across different visual concepts.
We propose REtrieval-Augmented Learning (REAL) to mitigate the imbalanced performance of zero-shot VLMs.
arXiv Detail & Related papers (2024-01-23T01:25:00Z)
- Adapting Pre-trained Language Models to Vision-Language Tasks via Dynamic Visual Prompting [83.21164539349273]
Pre-trained language models (PLMs) have played an increasing role in multimedia research.
In this paper, we focus on exploring PLMs as a stand-alone model for vision-language reasoning tasks.
We propose a novel transfer learning approach for PLMs, termed Dynamic Visual Prompting (DVP).
arXiv Detail & Related papers (2023-06-01T07:19:28Z)
- Test-Time Adaptation with CLIP Reward for Zero-Shot Generalization in Vision-Language Models [76.410400238974]
We propose TTA with feedback to rectify the model output and prevent the model from becoming blindly confident.
A CLIP model is adopted as the reward model during TTA and provides feedback for the VLM.
The proposed reinforcement learning with CLIP feedback (RLCF) framework is highly flexible and universal.
arXiv Detail & Related papers (2023-05-29T11:03:59Z)
- Vision-Language Models for Vision Tasks: A Survey [62.543250338410836]
Vision-Language Models (VLMs) learn rich vision-language correlation from web-scale image-text pairs.
This paper provides a systematic review of visual language models for various visual recognition tasks.
arXiv Detail & Related papers (2023-04-03T02:17:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.