IQE-CLIP: Instance-aware Query Embedding for Zero-/Few-shot Anomaly Detection in Medical Domain
- URL: http://arxiv.org/abs/2506.10730v3
- Date: Fri, 20 Jun 2025 06:52:02 GMT
- Title: IQE-CLIP: Instance-aware Query Embedding for Zero-/Few-shot Anomaly Detection in Medical Domain
- Authors: Hong Huang, Weixiang Sun, Zhijian Wu, Jingwen Niu, Donghuan Lu, Xian Wu, Yefeng Zheng
- Abstract summary: IQE-CLIP is an innovative framework for anomaly detection tasks in the medical domain. We introduce class-based prompting tokens and learnable prompting tokens to better adapt CLIP to the medical domain. Our framework achieves state-of-the-art performance on both zero-shot and few-shot tasks.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, the rapid advancement of vision-language models such as CLIP has led to significant progress in zero-/few-shot anomaly detection (ZFSAD) tasks. However, most existing CLIP-based ZFSAD methods commonly assume prior knowledge of categories and rely on carefully crafted prompts tailored to specific scenarios. While such meticulously designed text prompts effectively capture semantic information in the textual space, they fall short of distinguishing normal and anomalous instances within the joint embedding space. Moreover, these ZFSAD methods have been explored predominantly in industrial scenarios, with few efforts devoted to medical tasks. To this end, we propose an innovative framework for ZFSAD tasks in the medical domain, denoted as IQE-CLIP. We reveal that query embeddings, which incorporate both textual and instance-aware visual information, are better indicators of abnormalities. Specifically, we first introduce class-based prompting tokens and learnable prompting tokens for better adaptation of CLIP to the medical domain. Then, we design an instance-aware query module (IQM) to extract region-level contextual information from both text prompts and visual features, enabling the generation of query embeddings that are more sensitive to anomalies. Extensive experiments conducted on six medical datasets demonstrate that IQE-CLIP achieves state-of-the-art performance on both zero-shot and few-shot tasks. We release our code and data at https://github.com/hongh0/IQE-CLIP/.
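The abstract describes query embeddings that blend text-prompt semantics with instance-aware visual context. A minimal, hypothetical NumPy sketch of this idea is shown below: text prompt embeddings cross-attend over image patch features, and the resulting queries score each patch. The function names and shapes are invented for illustration; the actual IQM architecture in IQE-CLIP is not specified here.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def instance_aware_query(text_emb, patch_feats):
    """Cross-attention sketch: text prompt embeddings (queries) attend
    over image patch features (keys/values), yielding query embeddings
    that mix textual semantics with instance-specific visual context."""
    d = text_emb.shape[-1]
    attn = softmax(text_emb @ patch_feats.T / np.sqrt(d))  # (T, P)
    return attn @ patch_feats                              # (T, D)

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# toy shapes: 2 prompts ("normal", "anomalous"), 49 patches, dim 64
rng = np.random.default_rng(0)
text_emb = rng.standard_normal((2, 64))
patch_feats = rng.standard_normal((49, 64))

queries = instance_aware_query(text_emb, patch_feats)

# per-patch anomaly score: cosine similarity to the two query embeddings,
# softmaxed so column 1 reads as an "anomalous" probability
sims = l2norm(patch_feats) @ l2norm(queries).T  # (49, 2)
scores = softmax(sims, axis=-1)[:, 1]           # (49,)
print(queries.shape, scores.shape)
```

Because the queries are computed from the patch features of each input image, the same prompt pair yields different decision boundaries per instance, which is the property the abstract argues plain text prompts lack.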
Related papers
- AF-CLIP: Zero-Shot Anomaly Detection via Anomaly-Focused CLIP Adaptation [8.252046294696585]
We propose AF-CLIP (Anomaly-Focused CLIP), which dramatically enhances CLIP's visual representations to focus on local defects. Our approach introduces a lightweight adapter that emphasizes anomaly-relevant patterns in visual features. The method is also extended to few-shot scenarios via extra memory banks.
arXiv Detail & Related papers (2025-07-26T13:34:38Z)
- ViP$^2$-CLIP: Visual-Perception Prompting with Unified Alignment for Zero-Shot Anomaly Detection [2.622385361961154]
Zero-shot anomaly detection (ZSAD) aims to detect anomalies without any target-domain training samples, relying solely on external auxiliary data. Existing CLIP-based methods attempt to activate the model's ZSAD potential via handcrafted or static learnable prompts. ViP$^2$-CLIP fuses global and multi-scale local visual context to adaptively generate fine-grained textual prompts.
arXiv Detail & Related papers (2025-05-23T10:01:11Z)
- GenCLIP: Generalizing CLIP Prompts for Zero-shot Anomaly Detection [13.67800822455087]
A key challenge in ZSAD is learning general prompts stably and utilizing them effectively. We propose GenCLIP, a novel framework that learns and leverages general prompts more effectively. We introduce a dual-branch inference strategy, in which a vision-enhanced branch captures fine-grained category-specific features while a query-only branch prioritizes generalization.
arXiv Detail & Related papers (2025-04-21T07:38:25Z)
- KAnoCLIP: Zero-Shot Anomaly Detection through Knowledge-Driven Prompt Learning and Enhanced Cross-Modal Integration [9.688664292809785]
Zero-shot anomaly detection (ZSAD) identifies anomalies without needing training samples from the target dataset. Vision-language models like CLIP show potential in ZSAD but have limitations. We introduce KAnoCLIP, a novel ZSAD framework that leverages vision-language models. KAnoCLIP achieves state-of-the-art performance in ZSAD across 12 industrial and medical datasets.
arXiv Detail & Related papers (2025-01-07T13:51:41Z)
- GlocalCLIP: Object-agnostic Global-Local Prompt Learning for Zero-shot Anomaly Detection [5.530212768657544]
We introduce glocal contrastive learning to improve the complementary learning of global and local prompts. The generalization performance of GlocalCLIP in ZSAD was demonstrated on 15 real-world datasets.
arXiv Detail & Related papers (2024-11-09T05:22:13Z)
- VCP-CLIP: A visual context prompting model for zero-shot anomaly segmentation [19.83954061346437]
We propose a visual context prompting model (VCP-CLIP) for the ZSAS task, based on CLIP.
Specifically, we first design a Pre-VCP module to embed global visual information into the text prompt.
We then propose a novel Post-VCP module that adjusts the text embeddings using the fine-grained features of the images.
arXiv Detail & Related papers (2024-07-17T02:54:41Z)
- Hint-enhanced In-Context Learning wakes Large Language Models up for knowledge-intensive tasks [54.153914606302486]
In-context learning (ICL) ability has emerged with the increasing scale of large language models (LLMs).
We propose a new paradigm called Hint-enhanced In-Context Learning (HICL) to explore the power of ICL in open-domain question answering.
arXiv Detail & Related papers (2023-11-03T14:39:20Z)
- DIVKNOWQA: Assessing the Reasoning Ability of LLMs via Open-Domain Question Answering over Knowledge Base and Text [73.68051228972024]
Large Language Models (LLMs) have exhibited impressive generation capabilities, but they suffer from hallucinations when relying on their internal knowledge.
Retrieval-augmented LLMs have emerged as a potential solution to ground LLMs in external knowledge.
arXiv Detail & Related papers (2023-10-31T04:37:57Z)
- Temporal-aware Hierarchical Mask Classification for Video Semantic Segmentation [62.275143240798236]
Video semantic segmentation datasets have limited categories per video.
Less than 10% of queries could be matched to receive meaningful gradient updates during VSS training.
Our method achieves state-of-the-art performance on the latest challenging VSS benchmark VSPW without bells and whistles.
arXiv Detail & Related papers (2023-09-14T20:31:06Z)
- Bootstrap Fine-Grained Vision-Language Alignment for Unified Zero-Shot Anomaly Localization [63.61093388441298]
Contrastive Language-Image Pre-training models have shown promising performance on zero-shot visual recognition tasks.
In this work, we propose AnoCLIP for zero-shot anomaly localization.
arXiv Detail & Related papers (2023-08-30T10:35:36Z)
- Multi-modal Queried Object Detection in the Wild [72.16067634379226]
MQ-Det is an efficient architecture and pre-training strategy for real-world object detection.
It incorporates vision queries into existing language-queried-only detectors.
MQ-Det's simple yet effective architecture and training strategy are compatible with most language-queried object detectors.
arXiv Detail & Related papers (2023-05-30T12:24:38Z)
- Label Words are Anchors: An Information Flow Perspective for Understanding In-Context Learning [77.7070536959126]
In-context learning (ICL) emerges as a promising capability of large language models (LLMs).
In this paper, we investigate the working mechanism of ICL through an information flow lens.
We introduce an anchor re-weighting method to improve ICL performance, a demonstration compression technique to expedite inference, and an analysis framework for diagnosing ICL errors in GPT2-XL.
arXiv Detail & Related papers (2023-05-23T15:26:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.