Prompt as Knowledge Bank: Boost Vision-language model via Structural Representation for zero-shot medical detection
- URL: http://arxiv.org/abs/2502.16223v1
- Date: Sat, 22 Feb 2025 13:22:25 GMT
- Title: Prompt as Knowledge Bank: Boost Vision-language model via Structural Representation for zero-shot medical detection
- Authors: Yuguang Yang, Tongfei Chen, Haoyu Huang, Linlin Yang, Chunyu Xie, Dawei Leng, Xianbin Cao, Baochang Zhang
- Abstract summary: We propose StructuralGLIP, which encodes prompts into a latent knowledge bank layer-by-layer. In each layer, we select highly similar features from both the image representation and the knowledge bank, forming structural representations that capture nuanced relationships between image patches and target descriptions. Experiments demonstrate that StructuralGLIP achieves a +4.1% AP improvement over prior state-of-the-art methods across seven zero-shot medical detection benchmarks.
- Score: 32.99689130650503
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Zero-shot medical detection can improve detection performance without relying on annotated medical images, even beyond a fine-tuned model, and therefore holds great clinical value. Recent studies achieve this by leveraging grounded vision-language models (GLIP), using detailed disease descriptions as prompts for the target disease name during the inference phase. However, these methods typically treat prompts as context equivalent to the target name, making it difficult to assign specific disease knowledge based on visual information, which leads to a coarse alignment between images and target descriptions. In this paper, we propose StructuralGLIP, which introduces an auxiliary branch to encode prompts into a latent knowledge bank layer-by-layer, enabling more context-aware and fine-grained alignment. Specifically, in each layer, we select highly similar features from both the image representation and the knowledge bank, forming structural representations that capture nuanced relationships between image patches and target descriptions. These features are then fused across modalities to further enhance detection performance. Extensive experiments demonstrate that StructuralGLIP achieves a +4.1% AP improvement over prior state-of-the-art methods across seven zero-shot medical detection benchmarks, and consistently improves fine-tuned models by +3.2% AP on endoscopy image datasets.
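To make the mechanism concrete, here is a minimal PyTorch-style sketch of the layer-wise selection-and-fusion idea described in the abstract. All module names, dimensions, and the top-k selection rule are illustrative assumptions, not the authors' released code: image patch tokens query a prompt-derived knowledge bank, keep their top-k most similar entries, and fuse the selected knowledge back into the visual stream.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StructuralFusionLayer(nn.Module):
    """One layer of the (hypothetical) selection-and-fusion mechanism:
    image patch tokens query a latent knowledge bank built from prompt
    tokens, keep only their top-k most similar entries, and fuse the
    selected knowledge back into the visual features."""

    def __init__(self, dim: int, top_k: int = 4):
        super().__init__()
        self.top_k = top_k
        self.to_bank = nn.Linear(dim, dim)   # encodes prompts into the bank
        self.fuse = nn.Linear(2 * dim, dim)  # cross-modal fusion

    def forward(self, img_tokens, prompt_tokens):
        # img_tokens: (B, N, D) patch features; prompt_tokens: (B, M, D)
        bank = self.to_bank(prompt_tokens)                     # (B, M, D)
        sim = torch.einsum("bnd,bmd->bnm",
                           F.normalize(img_tokens, dim=-1),
                           F.normalize(bank, dim=-1))          # (B, N, M)
        topv, topi = sim.topk(self.top_k, dim=-1)              # per-patch selection
        weights = topv.softmax(dim=-1)                         # (B, N, k)
        # gather the selected bank entries for every image patch
        idx = topi.unsqueeze(-1).expand(-1, -1, -1, bank.size(-1))
        selected = torch.gather(
            bank.unsqueeze(1).expand(-1, img_tokens.size(1), -1, -1),
            2, idx)                                            # (B, N, k, D)
        knowledge = (weights.unsqueeze(-1) * selected).sum(2)  # (B, N, D)
        return img_tokens + self.fuse(
            torch.cat([img_tokens, knowledge], dim=-1))        # residual fusion

layer = StructuralFusionLayer(dim=256)
img = torch.randn(2, 196, 256)      # 14x14 patch grid
prompts = torch.randn(2, 32, 256)   # encoded disease-description tokens
out = layer(img, prompts)           # (2, 196, 256)
```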
Related papers
- Crane: Context-Guided Prompt Learning and Attention Refinement for Zero-Shot Anomaly Detections [50.343419243749054]
Anomaly Detection (AD) involves identifying deviations from normal data distributions.
We propose a novel approach that conditions the prompts of the text encoder based on image context extracted from the vision encoder.
Our method achieves state-of-the-art results, improving performance by 2% to 29% across different metrics on 14 datasets.
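A minimal sketch of what image-conditioned prompting could look like; `ImageConditionedPrompt`, the mean pooling, and the projection below are assumptions for illustration rather than Crane's actual design:

```python
import torch
import torch.nn as nn

class ImageConditionedPrompt(nn.Module):
    """Hypothetical sketch: learnable text prompt tokens are shifted by a
    context vector pooled from the vision encoder, so the text encoder
    sees prompts conditioned on the current image."""

    def __init__(self, n_tokens: int = 8, dim: int = 512):
        super().__init__()
        self.base_prompt = nn.Parameter(torch.randn(n_tokens, dim) * 0.02)
        self.proj = nn.Linear(dim, dim)  # maps image context to prompt space

    def forward(self, image_feats):
        # image_feats: (B, N, D) patch features from the vision encoder
        context = image_feats.mean(dim=1)             # (B, D) global context
        shift = self.proj(context).unsqueeze(1)       # (B, 1, D)
        return self.base_prompt.unsqueeze(0) + shift  # (B, n_tokens, D)

prompter = ImageConditionedPrompt()
feats = torch.randn(4, 196, 512)
prompts = prompter(feats)  # fed to the text encoder alongside class names
```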
arXiv Detail & Related papers (2025-04-15T10:42:25Z)
- PathSegDiff: Pathology Segmentation using Diffusion model representations [63.20694440934692]
We propose PathSegDiff, a novel approach for histopathology image segmentation that leverages Latent Diffusion Models (LDMs) as pre-trained feature extractors.
Our method utilizes a pathology-specific LDM, guided by a self-supervised encoder, to extract rich semantic information from H&E stained histopathology images.
Our experiments demonstrate significant improvements over traditional methods on the BCSS and GlaS datasets.
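The core recipe, frozen generative features plus a small trainable head, can be sketched as follows. The stand-in extractor replaces the real LDM U-Net feature hooks, and all names and dimensions are illustrative assumptions:

```python
import torch
import torch.nn as nn

class DiffusionFeatureSegmenter(nn.Module):
    """Hypothetical sketch: a frozen pre-trained feature extractor
    (standing in for an LDM's denoising U-Net) supplies features;
    only a small decoder head is trained for segmentation."""
    def __init__(self, feat_extractor: nn.Module, feat_dim: int, n_classes: int):
        super().__init__()
        self.feat_extractor = feat_extractor.eval()
        for p in self.feat_extractor.parameters():
            p.requires_grad_(False)                  # extractor stays frozen
        self.head = nn.Sequential(
            nn.Conv2d(feat_dim, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, n_classes, 1))

    def forward(self, x):
        with torch.no_grad():
            feats = self.feat_extractor(x)           # (B, C, H', W')
        logits = self.head(feats)
        return nn.functional.interpolate(            # back to input size
            logits, size=x.shape[-2:], mode="bilinear", align_corners=False)

# stand-in extractor for the demo; a real pipeline would hook U-Net blocks
extractor = nn.Sequential(nn.Conv2d(3, 64, 4, stride=4), nn.ReLU())
model = DiffusionFeatureSegmenter(extractor, feat_dim=64, n_classes=2)
masks = model(torch.randn(1, 3, 224, 224))           # (1, 2, 224, 224)
```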
arXiv Detail & Related papers (2025-04-09T14:58:21Z)
- RL4Med-DDPO: Reinforcement Learning for Controlled Guidance Towards Diverse Medical Image Generation using Vision-Language Foundation Models [0.7165255458140439]
Vision-Language Foundation Models (VLFMs) have shown tremendous gains in generating high-resolution, photorealistic natural images.
We propose a multi-stage architecture where a pre-trained VLFM provides a cursory semantic understanding, while a reinforcement learning algorithm refines the alignment through an iterative process.
We demonstrate the effectiveness of our method on a medical skin-imaging dataset, where the generated images exhibit better quality and closer alignment with the prompt than fine-tuned Stable Diffusion.
arXiv Detail & Related papers (2025-03-20T01:51:05Z)
- Leveraging Vision-Language Embeddings for Zero-Shot Learning in Histopathology Images [7.048241543461529]
We propose a novel framework called Multi-Resolution Prompt-guided Hybrid Embedding (MR-PHE) for zero-shot histopathology image classification.
We introduce a hybrid embedding strategy that integrates global image embeddings with weighted patch embeddings.
A similarity-based patch weighting mechanism assigns attention-like weights to patches based on their relevance to class embeddings.
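A minimal sketch of the hybrid-embedding idea just described; the specific weighting rule, mixing coefficient `alpha`, and temperature `tau` are guesses for illustration, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def hybrid_embedding(global_emb, patch_embs, class_embs, alpha=0.5, tau=0.07):
    """Hypothetical sketch: patches are weighted by their similarity to
    the class text embeddings, and the weighted patch summary is mixed
    with the global image embedding."""
    # global_emb: (D,), patch_embs: (P, D), class_embs: (C, D)
    patches = F.normalize(patch_embs, dim=-1)
    classes = F.normalize(class_embs, dim=-1)
    sim = patches @ classes.t()                               # (P, C)
    # attention-like weight per patch = best similarity to any class
    weights = F.softmax(sim.max(dim=-1).values / tau, dim=0)  # (P,)
    patch_summary = (weights.unsqueeze(-1) * patches).sum(0)  # (D,)
    hybrid = alpha * F.normalize(global_emb, dim=-1) + (1 - alpha) * patch_summary
    return F.normalize(hybrid, dim=-1)

D, P, C = 512, 49, 5
emb = hybrid_embedding(torch.randn(D), torch.randn(P, D), torch.randn(C, D))
scores = emb @ F.normalize(torch.randn(C, D), dim=-1).t()  # zero-shot logits
```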
arXiv Detail & Related papers (2025-03-13T12:18:37Z)
- Fine-tuning Vision Language Models with Graph-based Knowledge for Explainable Medical Image Analysis [44.38638601819933]
Current staging models for Diabetic Retinopathy (DR) offer little interpretability.
We present a novel method that integrates graph representation learning with vision-language models (VLMs) to deliver explainable DR diagnosis.
arXiv Detail & Related papers (2025-03-12T20:19:07Z)
- Fake It Till You Make It: Using Synthetic Data and Domain Knowledge for Improved Text-Based Learning for LGE Detection [11.532639713283226]
We use strategies rooted in domain knowledge to train a model for LGE detection using text from clinical reports. We standardize the orientation of the images in an anatomy-informed way to enable better alignment of spatial and text features. Ablation studies elucidate the contribution of each design component to the overall performance of the model.
arXiv Detail & Related papers (2025-02-18T15:30:48Z)
- Hierarchical Text-to-Vision Self Supervised Alignment for Improved Histopathology Representation Learning [64.1316997189396]
We present a novel language-tied self-supervised learning framework, Hierarchical Language-tied Self-Supervision (HLSS), for histopathology images.
Our resulting model achieves state-of-the-art performance on two medical imaging benchmarks, the OpenSRH and TCGA datasets.
arXiv Detail & Related papers (2024-03-21T17:58:56Z)
- Adapting Visual-Language Models for Generalizable Anomaly Detection in Medical Images [68.42215385041114]
This paper introduces a novel lightweight multi-level adaptation and comparison framework to repurpose the CLIP model for medical anomaly detection.
Our approach integrates multiple residual adapters into the pre-trained visual encoder, enabling a stepwise enhancement of visual features across different levels.
Our experiments on medical anomaly detection benchmarks demonstrate that our method significantly surpasses current state-of-the-art models.
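A minimal sketch of multi-level residual adaptation, assuming bottleneck-MLP adapters added to features from a frozen encoder; the adapter shape and placement are illustrative, not the paper's exact design:

```python
import torch
import torch.nn as nn

class ResidualAdapter(nn.Module):
    """Hypothetical sketch of a lightweight residual adapter: a small
    bottleneck MLP whose output is added to frozen backbone features."""
    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, hidden)
        self.up = nn.Linear(hidden, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))  # residual update

# one adapter per chosen level of the frozen visual encoder
dims = [768, 768, 768, 768]
adapters = nn.ModuleList(ResidualAdapter(d) for d in dims)

# given multi-level features from a frozen CLIP-like encoder ...
level_feats = [torch.randn(2, 197, d) for d in dims]
adapted = [ad(f) for ad, f in zip(adapters, level_feats)]
# each adapted level can then be compared with text embeddings to score anomalies
```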
arXiv Detail & Related papers (2024-03-19T09:28:19Z)
- Decomposing Disease Descriptions for Enhanced Pathology Detection: A Multi-Aspect Vision-Language Pre-training Framework [43.453943987647015]
Medical vision-language pre-training has emerged as a frontier of research, enabling zero-shot pathological recognition. Due to the complex semantics of biomedical texts, however, current methods struggle to align medical images with the key pathological findings in unstructured reports. To address this, disease descriptions are decomposed into fine-grained aspects by consulting a large language model and medical experts.
Our framework improves the accuracy of recent methods by up to 8.56% and 17.26% for seen and unseen categories, respectively.
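How aspect-level alignment might be scored at inference, offered only as a hedged sketch; the mean aggregation and per-class aspect lists are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def aspect_scores(image_emb, aspect_embs_per_class):
    """Hypothetical sketch: each disease is described by several
    fine-grained aspect texts (e.g. location, shape, texture); the
    class score aggregates the image's similarity to every aspect."""
    img = F.normalize(image_emb, dim=-1)                 # (D,)
    scores = []
    for aspects in aspect_embs_per_class:                # (A_c, D) per class
        sim = F.normalize(aspects, dim=-1) @ img         # (A_c,)
        scores.append(sim.mean())                        # aggregate aspects
    return torch.stack(scores)                           # (C,)

D = 512
per_class = [torch.randn(4, D), torch.randn(6, D)]  # aspect counts vary per class
logits = aspect_scores(torch.randn(D), per_class)
pred = logits.argmax().item()
```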
arXiv Detail & Related papers (2024-03-12T13:18:22Z)
- Vision-Language Modelling For Radiological Imaging and Reports In The Low Data Regime [70.04389979779195]
This paper explores training medical vision-language models (VLMs) where the visual and language inputs are embedded into a common space.
We explore several candidate methods to improve low-data performance, including adapting generic pre-trained models to novel image and text domains.
Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variable sized training datasets of paired chest X-rays and radiological reports.
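For reference, text-to-image retrieval with a recall@k metric can be sketched as follows; this is a generic formulation, not the paper's exact evaluation code:

```python
import torch
import torch.nn.functional as F

def recall_at_k(text_embs, image_embs, k=5):
    """Hypothetical sketch of the text-to-image retrieval benchmark:
    row i of each matrix is a paired report/X-ray; a query succeeds if
    its true image appears among the top-k most similar images."""
    t = F.normalize(text_embs, dim=-1)
    v = F.normalize(image_embs, dim=-1)
    sim = t @ v.t()                                  # (N, N) cosine similarities
    topk = sim.topk(k, dim=-1).indices               # (N, k) retrieved images
    truth = torch.arange(sim.size(0)).unsqueeze(-1)  # paired index per query
    return (topk == truth).any(dim=-1).float().mean().item()

N, D = 100, 512
r5 = recall_at_k(torch.randn(N, D), torch.randn(N, D), k=5)
```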
arXiv Detail & Related papers (2023-03-30T18:20:00Z)
- Cross-level Contrastive Learning and Consistency Constraint for Semi-supervised Medical Image Segmentation [46.678279106837294]
We propose a cross-level contrastive learning scheme to enhance the representation capacity of local features in semi-supervised medical image segmentation.
With the help of the cross-level contrastive learning and consistency constraint, the unlabelled data can be effectively explored to improve segmentation performance.
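One common way to instantiate a cross-level contrastive objective, offered here only as a hedged sketch; the InfoNCE form and positive-pair definition are assumptions rather than the paper's exact loss:

```python
import torch
import torch.nn.functional as F

def cross_level_contrastive(feats_a, feats_b, tau=0.1):
    """Hypothetical sketch: local features from two decoder levels at the
    same spatial location form positive pairs; all other locations in
    the batch act as negatives (an InfoNCE-style loss)."""
    # feats_*: (B, C, H, W), already projected to a common channel size
    c = feats_a.size(1)
    a = F.normalize(feats_a.flatten(2).transpose(1, 2).reshape(-1, c), dim=-1)
    b = F.normalize(feats_b.flatten(2).transpose(1, 2).reshape(-1, c), dim=-1)
    logits = a @ b.t() / tau                  # (BHW, BHW) similarity matrix
    target = torch.arange(logits.size(0))     # matching locations are positives
    return F.cross_entropy(logits, target)

fa = torch.randn(2, 64, 16, 16)
fb = torch.randn(2, 64, 16, 16)  # upsampled/projected to match fa's shape
loss = cross_level_contrastive(fa, fb)
```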
arXiv Detail & Related papers (2022-02-08T15:12:11Z)
- Few-shot Medical Image Segmentation using a Global Correlation Network with Discriminative Embedding [60.89561661441736]
We propose a novel method for few-shot medical image segmentation.
We construct our few-shot image segmentor using a deep convolutional network trained episodically.
We enhance the discriminability of the deep embedding to encourage clustering of features from the same class.
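A minimal sketch of a masked-prototype global correlation, a standard ingredient in few-shot segmentation; the masked pooling and cosine correlation here are illustrative assumptions, not necessarily the paper's exact network:

```python
import torch
import torch.nn.functional as F

def correlation_map(query_feats, support_feats, support_mask):
    """Hypothetical sketch: a foreground prototype is pooled from the
    support features under its mask and correlated with every query
    location to highlight class-relevant regions."""
    # query_feats/support_feats: (B, C, H, W); support_mask: (B, 1, H, W)
    masked = support_feats * support_mask
    proto = masked.sum(dim=(2, 3)) / support_mask.sum(dim=(2, 3)).clamp(min=1e-6)
    proto = F.normalize(proto, dim=-1)                 # (B, C) prototype
    q = F.normalize(query_feats, dim=1)                # per-location unit vectors
    corr = torch.einsum("bchw,bc->bhw", q, proto)      # cosine correlation map
    return corr.unsqueeze(1)                           # (B, 1, H, W)

q = torch.randn(1, 64, 32, 32)
s = torch.randn(1, 64, 32, 32)
m = (torch.rand(1, 1, 32, 32) > 0.5).float()
corr = correlation_map(q, s, m)  # fed to the segmentation head
```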
arXiv Detail & Related papers (2020-12-10T04:01:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.