Zero-shot Nuclei Detection via Visual-Language Pre-trained Models
- URL: http://arxiv.org/abs/2306.17659v1
- Date: Fri, 30 Jun 2023 13:44:13 GMT
- Title: Zero-shot Nuclei Detection via Visual-Language Pre-trained Models
- Authors: Yongjian Wu, Yang Zhou, Jiya Saiyin, Bingzheng Wei, Maode Lai,
Jianzhong Shou, Yubo Fan, Yan Xu
- Abstract summary: Large-scale visual-language pre-trained models have proven their excellent performance in downstream object detection for natural scenes.
However, zero-shot nuclei detection on H&E images via VLPMs remains underexplored.
In this paper, we attempt to explore the potential of the object-level VLPM, Grounded Language-Image Pre-training model, for zero-shot nuclei detection.
- Score: 19.526504045149895
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale visual-language pre-trained models (VLPM) have proven their
excellent performance in downstream object detection for natural scenes.
However, zero-shot nuclei detection on H&E images via VLPMs remains
underexplored. The large gap between medical images and the web-originated
text-image pairs used for pre-training makes it a challenging task. In this
paper, we attempt to explore the potential of the object-level VLPM, Grounded
Language-Image Pre-training (GLIP) model, for zero-shot nuclei detection.
Concretely, an automatic prompt design pipeline is devised based on the
association-binding trait of VLPMs and the image-to-text VLPM BLIP, avoiding
empirical manual prompt engineering. We further establish a self-training
framework, using the automatically designed prompts to generate the preliminary
results as pseudo labels from GLIP and refine the predicted boxes in an
iterative manner. Our method achieves remarkable performance for label-free
nuclei detection, surpassing the compared methods. Most importantly, our work
demonstrates that the VLPM pre-trained on natural image-text pairs exhibits
astonishing potential for downstream tasks in the medical field as well. Code
will be released at https://github.com/wuyongjianCODE/VLPMNuD.
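A rough sketch of the self-training framework described in the abstract above: zero-shot detections from a prompted VLPM seed the pseudo labels, which are then filtered and replaced by a student's predictions over several rounds. This is a minimal sketch under stated assumptions; the callables (`generate_prompts`, `zero_shot_detect`, `train_and_predict`) are hypothetical stand-ins, not the released VLPMNuD API, and it illustrates only the loop structure, not the paper's exact box-refinement procedure.

```python
# Hypothetical sketch of an iterative self-training loop for zero-shot nuclei
# detection. The callables are placeholders (not the VLPMNuD API):
# `generate_prompts` mimics BLIP-based auto-prompting, `zero_shot_detect`
# mimics GLIP inference, and `train_and_predict` mimics training a student
# detector on pseudo labels and re-predicting boxes.
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float, float]  # (x1, y1, x2, y2, score)


def self_training_loop(
    images: Sequence,
    generate_prompts: Callable[[Sequence], str],
    zero_shot_detect: Callable[[object, str], List[Box]],
    train_and_predict: Callable[[Sequence, List[List[Box]]], List[List[Box]]],
    num_rounds: int = 3,
    score_thresh: float = 0.5,
) -> List[List[Box]]:
    # Auto-design text prompts instead of hand-crafting them.
    prompts = generate_prompts(images)

    # Preliminary zero-shot detections act as the initial pseudo labels.
    pseudo_labels = [
        [b for b in zero_shot_detect(img, prompts) if b[4] >= score_thresh]
        for img in images
    ]

    # Each round: train on the current pseudo labels, then replace them with
    # the newly predicted, confidence-filtered boxes.
    for _ in range(num_rounds):
        predictions = train_and_predict(images, pseudo_labels)
        pseudo_labels = [
            [b for b in preds if b[4] >= score_thresh] for preds in predictions
        ]
    return pseudo_labels
```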
Related papers
- AttriPrompter: Auto-Prompting with Attribute Semantics for Zero-shot Nuclei Detection via Visual-Language Pre-trained Models [20.071868757497278]
This paper investigates the potential of object-level visual-language pre-training for zero-shot nuclei detection.
We propose an innovative auto-prompting pipeline, named AttriPrompter, to avoid subjective manual prompt design.
Our method exhibits remarkable performance in label-free nuclei detection, outperforming all existing unsupervised methods.
arXiv Detail & Related papers (2024-10-22T08:48:41Z)
- OVLW-DETR: Open-Vocabulary Light-Weighted Detection Transformer [63.141027246418]
We propose Open-Vocabulary Light-Weighted Detection Transformer (OVLW-DETR), a deployment friendly open-vocabulary detector with strong performance and low latency.
We provide an end-to-end training recipe that transfers knowledge from a vision-language model (VLM) to the object detector through simple alignment.
Experimental results demonstrate that the proposed approach is superior to existing real-time open-vocabulary detectors on the standard zero-shot LVIS benchmark.
arXiv Detail & Related papers (2024-07-15T12:15:27Z)
- Just Shift It: Test-Time Prototype Shifting for Zero-Shot Generalization with Vision-Language Models [19.683461002518147]
Test-Time Prototype Shifting (TPS) is a pioneering approach designed to adapt vision-language models to test datasets using unlabeled test inputs.
TPS not only facilitates optimization-free prototype reuse for subsequent predictions but also enables seamless integration with current advancements in prompt engineering.
A notable aspect of our framework is its significantly reduced memory and computational demands when compared to conventional text-prompt tuning methods.
arXiv Detail & Related papers (2024-03-19T17:54:34Z)
- Advancing Visual Grounding with Scene Knowledge: Benchmark and Method [74.72663425217522]
Visual grounding (VG) aims to establish fine-grained alignment between vision and language.
Most existing VG datasets are constructed using simple description texts.
We propose a novel benchmark of Scene Knowledge-guided Visual Grounding.
arXiv Detail & Related papers (2023-07-21T13:06:02Z)
- Text-to-Image Diffusion Models are Zero-Shot Classifiers [8.26990105697146]
We investigate text-to-image diffusion models by proposing a method for evaluating them as zero-shot classifiers.
We apply our method to Stable Diffusion and Imagen, using it to probe fine-grained aspects of the models' knowledge.
They perform competitively with CLIP on a wide range of zero-shot image classification datasets.
arXiv Detail & Related papers (2023-03-27T14:15:17Z)
- Unleashing Text-to-Image Diffusion Models for Visual Perception [84.41514649568094]
VPD (Visual Perception with a pre-trained diffusion model) is a new framework that exploits the semantic information of a pre-trained text-to-image diffusion model in visual perception tasks.
We show that VPD can be quickly adapted to downstream visual perception tasks.
arXiv Detail & Related papers (2023-03-03T18:59:47Z)
- Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection [87.39089806069707]
We propose a fine-grained Visual-Text Prompt-driven self-training paradigm for Open-Vocabulary Detection (VTP-OVD).
During the adapting stage, we enable VLM to obtain fine-grained alignment by using learnable text prompts to resolve an auxiliary dense pixel-wise prediction task.
Experiments show that our method achieves state-of-the-art performance for open-vocabulary object detection, e.g., 31.5% mAP on unseen classes of COCO.
arXiv Detail & Related papers (2022-11-02T03:38:02Z)
- F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models [54.21757555804668]
We present F-VLM, a simple open-vocabulary object detection method built upon Frozen Vision and Language Models.
F-VLM simplifies the current multi-stage training pipeline by eliminating the need for knowledge distillation or detection-tailored pretraining.
arXiv Detail & Related papers (2022-09-30T17:59:52Z)
- Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions [92.47566804182338]
We investigate if a strong V&L representation model can be learned through unsupervised pre-training without image-caption corpora.
In particular, we propose to conduct "mask-and-predict" pre-training on text-only and image-only corpora.
We find that such a simple approach achieves performance close to a model pre-trained with aligned data on four English V&L benchmarks.
arXiv Detail & Related papers (2020-10-24T08:17:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.