TokenCLIP: Token-wise Prompt Learning for Zero-shot Anomaly Detection
- URL: http://arxiv.org/abs/2510.21171v2
- Date: Mon, 27 Oct 2025 02:24:03 GMT
- Title: TokenCLIP: Token-wise Prompt Learning for Zero-shot Anomaly Detection
- Authors: Qihang Zhou, Binbin Gao, Guansong Pang, Xin Wang, Jiming Chen, Shibo He,
- Abstract summary: TokenCLIP is a token-wise adaptation framework for zero-shot anomaly detection. It enables dynamic alignment between visual and learnable textual spaces for fine-grained anomaly learning.
- Score: 62.95726973851089
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Adapting CLIP for anomaly detection on unseen objects has shown strong potential in a zero-shot manner. However, existing methods typically rely on a single textual space to align with visual semantics across diverse objects and domains. The indiscriminate alignment hinders the model from accurately capturing varied anomaly semantics. We propose TokenCLIP, a token-wise adaptation framework that enables dynamic alignment between visual and learnable textual spaces for fine-grained anomaly learning. Rather than mapping all visual tokens to a single, token-agnostic textual space, TokenCLIP aligns each token with a customized textual subspace that represents its visual characteristics. Explicitly assigning a unique learnable textual space to each token is computationally intractable and prone to insufficient optimization. We instead expand the token-agnostic textual space into a set of orthogonal subspaces, and then dynamically assign each token to a subspace combination guided by semantic affinity, which jointly supports customized and efficient token-wise adaptation. To this end, we formulate dynamic alignment as an optimal transport problem, where all visual tokens in an image are transported to textual subspaces based on semantic similarity. The transport constraints of OT ensure sufficient optimization across subspaces and encourage them to focus on different semantics. Solving the problem yields a transport plan that adaptively assigns each token to semantically relevant subspaces. A top-k masking is then applied to sparsify the plan and specialize subspaces for distinct visual regions. Extensive experiments demonstrate the superiority of TokenCLIP.
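The abstract describes two concrete steps: solving an entropic optimal transport problem that assigns visual tokens to textual subspaces by semantic affinity, and sparsifying the resulting transport plan with top-k masking. The snippet below is a minimal sketch of that mechanism using a standard log-domain Sinkhorn solver; the uniform marginals, the single anchor vector per subspace, the solver settings, and all tensor shapes are illustrative assumptions, not the authors' implementation.

```python
# Sketch: assign visual tokens to textual subspaces via entropic OT (Sinkhorn),
# then sparsify the transport plan with top-k masking. Illustrative only.
import math
import torch
import torch.nn.functional as F


def sinkhorn(cost: torch.Tensor, n_iters: int = 50, eps: float = 0.05) -> torch.Tensor:
    """Entropic OT with uniform marginals; cost has shape (N_tokens, K_subspaces)."""
    n, k = cost.shape
    log_mu = torch.full((n,), -math.log(n))      # marginal over visual tokens
    log_nu = torch.full((k,), -math.log(k))      # marginal over textual subspaces
    log_gibbs = -cost / eps                      # Gibbs kernel in log domain
    u, v = torch.zeros(n), torch.zeros(k)
    for _ in range(n_iters):                     # alternating dual updates
        u = log_mu - torch.logsumexp(log_gibbs + v[None, :], dim=1)
        v = log_nu - torch.logsumexp(log_gibbs + u[:, None], dim=0)
    return torch.exp(log_gibbs + u[:, None] + v[None, :])   # transport plan (N, K)


def assign_tokens_to_subspaces(visual_tokens: torch.Tensor,
                               subspace_anchors: torch.Tensor,
                               top_k: int = 2) -> torch.Tensor:
    """
    visual_tokens:    (N, D) patch-token embeddings from the image encoder.
    subspace_anchors: (K, D) one representative direction per textual subspace
                      (a stand-in for the learnable orthogonal subspaces).
    Returns a sparse (N, K) weighting of subspaces for each token.
    """
    v = F.normalize(visual_tokens, dim=-1)
    s = F.normalize(subspace_anchors, dim=-1)
    cost = 1.0 - v @ s.t()                       # low cost = high semantic affinity
    plan = sinkhorn(cost)                        # dense transport plan
    _, topk_idx = plan.topk(top_k, dim=1)        # keep each token's k best subspaces
    mask = torch.zeros_like(plan).scatter_(1, topk_idx, 1.0)
    sparse_plan = plan * mask
    return sparse_plan / sparse_plan.sum(dim=1, keepdim=True).clamp_min(1e-8)


# Example: 196 ViT patch tokens, 8 textual subspaces, 512-d embeddings.
plan = assign_tokens_to_subspaces(torch.randn(196, 512), torch.randn(8, 512))
print(plan.shape)                                # torch.Size([196, 8])
```

Per the abstract, the transport constraints are what balance how much mass each subspace receives and encourage subspaces to specialize on different semantics; the uniform marginals above are the simplest stand-in for that constraint.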
Related papers
- SAPL: Semantic-Agnostic Prompt Learning in CLIP for Weakly Supervised Image Manipulation Localization [45.19935082419337]
Malicious image manipulation threatens public safety and requires efficient localization methods. Existing weakly supervised methods rely on image-level binary labels and focus on global classification. We propose Semantic-Agnostic Prompt Learning (SAPL) in CLIP, which learns text prompts that intentionally encode non-semantic, boundary-centric cues.
arXiv Detail & Related papers (2026-01-09T07:25:55Z) - SGDiff: Scene Graph Guided Diffusion Model for Image Collaborative SegCaptioning [53.638998508418545]
This paper introduces a new task, "Image Collaborative Segmentation and Captioning" (SegCaptioning). SegCaptioning aims to translate a straightforward prompt, such as a bounding box around an object, into diverse semantic interpretations represented by (caption, masks) pairs. This task poses significant challenges, including accurately capturing a user's intention from a minimal prompt while simultaneously predicting multiple semantically aligned caption words and masks.
arXiv Detail & Related papers (2025-12-01T18:33:04Z) - CoPatch: Zero-Shot Referring Image Segmentation by Leveraging Untapped Spatial Knowledge in CLIP [26.827036116024914]
CoPatch is a zero-shot RIS framework that enhances spatial representations in both text and image modalities. We show that CoPatch significantly improves spatial grounding in zero-shot RIS across RefCOCO, RefCOCO+, RefCOCOg, and PhraseCut (+2–7 mIoU) without requiring any additional training.
arXiv Detail & Related papers (2025-09-27T04:12:10Z) - Rethinking Visual Token Reduction in LVLMs under Cross-modal Misalignment [38.04426918886084]
Large Vision-Language Models (LVLMs) encode visual inputs as dense sequences of patch-level tokens to capture fine-grained semantics. Previous efforts have explored visual token reduction either prior to or within the large language models (LLMs). We introduce VisionDrop, a training-free, visual-only pruning framework that selects informative visual tokens based on intra-modal (visual-to-visual) attention.
arXiv Detail & Related papers (2025-06-27T14:55:40Z) - Circle-RoPE: Cone-like Decoupled Rotary Positional Embedding for Large Vision-Language Models [49.122200327049676]
Rotary Position Embedding (RoPE) is a widely adopted technique for encoding relative positional information in large language models. When extended to vision-language models (VLMs), RoPE and its variants enforce relative positional dependencies separately within text and image tokens. We introduce Circle-RoPE, a novel encoding scheme designed to eliminate spurious cross-modal biases.
arXiv Detail & Related papers (2025-05-22T09:05:01Z) - Discriminative-Generative Custom Tokens for Vision-Language Models [101.40245125955306]
This paper explores the possibility of learning custom tokens for representing new concepts in Vision-Language Models (VLMs). Our aim is to learn tokens that can be effective for both discriminative and generative tasks while composing well with words to form new input queries.
arXiv Detail & Related papers (2025-02-17T18:13:42Z) - LESS: Label-Efficient and Single-Stage Referring 3D Segmentation [55.06002976797879]
Referring 3D Segmentation is a visual-language task that segments all points of the specified object from a 3D point cloud described by a query sentence.
We propose a novel Referring 3D Segmentation pipeline, Label-Efficient and Single-Stage, dubbed LESS, which is supervised only by efficient binary masks.
We achieve state-of-the-art performance on the ScanRefer dataset, surpassing previous methods by about 3.7% mIoU using only binary labels.
arXiv Detail & Related papers (2024-10-17T07:47:41Z) - InvSeg: Test-Time Prompt Inversion for Semantic Segmentation [33.60580908728705]
InvSeg is a test-time prompt inversion method that tackles open-vocabulary semantic segmentation. We introduce Contrastive Soft Clustering (CSC) to align derived masks with the image's structure information. InvSeg learns context-rich text prompts in embedding space and achieves accurate semantic alignment across modalities.
arXiv Detail & Related papers (2024-10-15T10:20:31Z) - Selective Vision-Language Subspace Projection for Few-shot CLIP [55.361337202198925]
We introduce a method called Selective Vision-Language Subspace Projection (SSP).
SSP incorporates local image features and utilizes them as a bridge to enhance the alignment between image-text pairs.
Our approach entails only training-free matrix calculations and can be seamlessly integrated into advanced CLIP-based few-shot learning frameworks.
arXiv Detail & Related papers (2024-07-24T03:45:35Z) - Tokenize Anything via Prompting [65.93061853439512]
We present a unified, promptable model capable of simultaneously segmenting, recognizing, and captioning anything.
We train a generalizable model with massive segmentation masks, e.g., SA-1B masks, and semantic priors from a pre-trained CLIP model with 5 billion parameters.
We believe this model can be a versatile region-level image tokenizer, capable of encoding general-purpose region context.
arXiv Detail & Related papers (2023-12-14T17:01:02Z) - Towards Robust and Semantically Organised Latent Representations for Unsupervised Text Style Transfer [6.467090475885798]
We introduce EPAAEs (Embedding Perturbed Adversarial AutoEncoders), which complete this perturbation model.
We empirically show that this (a) produces a better organised latent space that clusters stylistically similar sentences together.
We also extend the text style transfer tasks to NLI datasets and show that these more complex definitions of style are learned best by EPAAE.
arXiv Detail & Related papers (2022-05-04T20:04:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.