SSVP: Synergistic Semantic-Visual Prompting for Industrial Zero-Shot Anomaly Detection
- URL: http://arxiv.org/abs/2601.09147v2
- Date: Mon, 19 Jan 2026 02:37:00 GMT
- Title: SSVP: Synergistic Semantic-Visual Prompting for Industrial Zero-Shot Anomaly Detection
- Authors: Chenhao Fu, Han Fang, Xiuzheng Zheng, Wenbo Wei, Yonghua Li, Hao Sun, Xuelong Li
- Abstract summary: We propose Synergistic Semantic-Visual Prompting (SSVP), which efficiently fuses diverse visual encodings to elevate the model's fine-grained perception. SSVP achieves state-of-the-art performance with 93.0% Image-AUROC and 92.2% Pixel-AUROC on MVTec-AD, significantly outperforming existing zero-shot approaches.
- Score: 55.54007781679915
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Zero-Shot Anomaly Detection (ZSAD) leverages Vision-Language Models (VLMs) to enable supervision-free industrial inspection. However, existing ZSAD paradigms are constrained by single visual backbones, which struggle to balance global semantic generalization with fine-grained structural discriminability. To bridge this gap, we propose Synergistic Semantic-Visual Prompting (SSVP), which efficiently fuses diverse visual encodings to elevate the model's fine-grained perception. Specifically, SSVP introduces the Hierarchical Semantic-Visual Synergy (HSVS) mechanism, which deeply integrates DINOv3's multi-scale structural priors into the CLIP semantic space. Subsequently, the Vision-Conditioned Prompt Generator (VCPG) employs cross-modal attention to guide dynamic prompt generation, enabling linguistic queries to precisely anchor to specific anomaly patterns. Furthermore, to address the discrepancy between global scoring and local evidence, the Visual-Text Anomaly Mapper (VTAM) establishes a dual-gated calibration paradigm. Extensive evaluations on seven industrial benchmarks validate the robustness of our method; SSVP achieves state-of-the-art performance with 93.0% Image-AUROC and 92.2% Pixel-AUROC on MVTec-AD, significantly outperforming existing zero-shot approaches.
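The abstract pins down the overall data flow (DINOv3 structural features fused into the CLIP space, vision-conditioned prompt generation, and dual-gated score calibration) but not the module internals. The PyTorch sketch below illustrates one plausible reading of that flow; the projection shapes, gating forms, number of prompt tokens, and the calibration rule are assumptions, not the paper's implementation.

```python
# Hedged sketch of the SSVP pipeline described in the abstract. Only the
# data flow follows the paper's description; all internals are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HSVS(nn.Module):
    """Fuse multi-scale DINOv3 structural features into the CLIP patch space."""
    def __init__(self, clip_dim=768, dino_dim=1024, num_scales=3):
        super().__init__()
        self.proj = nn.ModuleList(
            nn.Linear(dino_dim, clip_dim) for _ in range(num_scales))
        self.gate = nn.Linear(clip_dim, 1)

    def forward(self, clip_patches, dino_scales):
        # clip_patches: (B, N, clip_dim); dino_scales: list of (B, N, dino_dim)
        fused = clip_patches
        for proj, feats in zip(self.proj, dino_scales):
            g = torch.sigmoid(self.gate(fused))        # per-token fusion gate
            fused = fused + g * proj(feats)            # inject structural prior
        return fused

class VCPG(nn.Module):
    """Generate dynamic text prompts conditioned on visual context."""
    def __init__(self, dim=768, num_prompts=4, heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, visual_tokens):
        q = self.queries.expand(visual_tokens.size(0), -1, -1)
        prompts, _ = self.attn(q, visual_tokens, visual_tokens)
        return prompts  # prepended to the text encoder's token sequence

def vtam_calibrate(image_score, anomaly_map, alpha=0.5):
    """Dual-gated calibration: reconcile the global score with local evidence.
    VTAM's exact gating is unspecified; this is one plausible form."""
    local_evidence = anomaly_map.flatten(1).max(dim=1).values
    gate = torch.sigmoid(local_evidence - image_score)  # trust local peaks
    return (1 - alpha * gate) * image_score + alpha * gate * local_evidence
```

In this reading, `vtam_calibrate` reconciles a global image score with the peak of the pixel-level anomaly map, which matches the abstract's stated goal of aligning global scoring with local evidence.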
Related papers
- AG-VAS: Anchor-Guided Zero-Shot Visual Anomaly Segmentation with Large Multimodal Models [21.682989096955467]
AG-VAS (Anchor-Guided Visual Anomaly Segmentation) is a new framework that expands the LMM vocabulary with three learnable semantic anchor tokens. AG-VAS achieves consistent state-of-the-art performance in the zero-shot setting.
arXiv Detail & Related papers (2026-03-01T22:25:23Z)
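The AG-VAS entry above hinges on expanding an LMM's vocabulary with three learnable semantic anchor tokens. A minimal sketch of that idea, assuming a frozen base embedding table, follows; the anchor semantics named in the comment are hypothetical.

```python
# Sketch of the anchor-token idea: extend a language model's embedding table
# with a few learnable "semantic anchor" tokens while keeping the original
# vocabulary frozen. Everything beyond "three learnable anchors" is assumed.
import torch
import torch.nn as nn

class AnchorEmbedding(nn.Module):
    def __init__(self, base_embedding: nn.Embedding, num_anchors: int = 3):
        super().__init__()
        self.base = base_embedding
        self.base.weight.requires_grad_(False)          # freeze original vocab
        dim = base_embedding.embedding_dim
        self.anchors = nn.Parameter(torch.randn(num_anchors, dim) * 0.02)
        self.vocab_size = base_embedding.num_embeddings

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        # ids >= vocab_size index the learnable anchors, e.g. <normal>,
        # <anomaly>, <background> (hypothetical anchor semantics).
        is_anchor = ids >= self.vocab_size
        safe_ids = ids.clamp(max=self.vocab_size - 1)
        out = self.base(safe_ids)
        out[is_anchor] = self.anchors[ids[is_anchor] - self.vocab_size]
        return out
```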
- Open-Vocabulary Domain Generalization in Urban-Scene Segmentation [83.15573353963235]
Domain Generalization in Semantic Segmentation (DG-SS) aims to enable segmentation models to perform robustly in unseen environments. Recent progress in Vision-Language Models (VLMs) has advanced Open-Vocabulary Semantic Segmentation (OV-SS) by enabling models to recognize a broader range of concepts. Yet, these models remain sensitive to domain shifts and struggle to maintain robustness when deployed in unseen environments. We propose S2-Corr, a state-space-driven text-image correlation refinement mechanism that produces more consistent text-image correlations under distribution changes.
arXiv Detail & Related papers (2026-02-21T14:32:27Z)
- OmniVL-Guard: Towards Unified Vision-Language Forgery Detection and Grounding via Balanced RL [63.388513841293616]
Existing forgery detection methods fail to handle the interleaved text, images, and videos prevalent in real-world misinformation. To bridge this gap, this paper aims to develop a unified framework for omnibus vision-language forgery detection and grounding. We propose OmniVL-Guard, a balanced reinforcement learning framework for omnibus vision-language forgery detection and grounding.
arXiv Detail & Related papers (2026-02-11T09:41:36Z)
- SGHA-Attack: Semantic-Guided Hierarchical Alignment for Transferable Targeted Attacks on Vision-Language Models [73.19044613922911]
Large vision-language models (VLMs) are vulnerable to transfer-based adversarial perturbations. We propose SGHA-Attack, a framework that adopts multiple target references and enforces intermediate-layer consistency. Experiments on open-source and commercial black-box VLMs show that SGHA-Attack achieves stronger targeted transferability than prior methods.
arXiv Detail & Related papers (2026-02-02T03:10:41Z)
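The SGHA-Attack summary names two ingredients: multiple target references and intermediate-layer consistency. The sketch below shows one way a PGD-style targeted attack could combine them; the loss form, the hook-based feature capture, and all hyperparameters are assumptions rather than the paper's algorithm.

```python
# Hedged sketch: pull a perturbed image's intermediate features toward the
# mean of several target references, under an L-infinity budget.
import torch
import torch.nn.functional as F

def sgha_style_attack(encoder, layers, image, target_feats, steps=100,
                      eps=8 / 255, lr=1 / 255):
    """encoder: vision encoder; layers: modules whose outputs are matched;
    target_feats: per-layer lists of features from several target references."""
    feats = {}
    hooks = [m.register_forward_hook(
        lambda _m, _i, o, k=k: feats.__setitem__(k, o))
        for k, m in enumerate(layers)]
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        encoder((image + delta).clamp(0, 1))
        # match each intermediate layer to the mean of its target references
        loss = sum(F.mse_loss(feats[k], torch.stack(refs).mean(0))
                   for k, refs in enumerate(target_feats))
        loss.backward()
        with torch.no_grad():
            delta -= lr * delta.grad.sign()             # targeted: descend loss
            delta.clamp_(-eps, eps)
            delta.grad = None
    for h in hooks:
        h.remove()
    return (image + delta).clamp(0, 1).detach()
```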
- VTFusion: A Vision-Text Multimodal Fusion Network for Few-Shot Anomaly Detection [24.88767599022225]
Few-Shot Anomaly Detection (FSAD) has emerged as a critical paradigm for identifying irregularities using scarce normal references. This study proposes VTFusion, a vision-text multimodal fusion framework tailored for FSAD.
arXiv Detail & Related papers (2026-01-23T00:30:24Z)
- VIPER Strike: Defeating Visual Reasoning CAPTCHAs via Structured Vision-Language Inference [4.830055389040475]
Visual Reasoning CAPTCHAs (VRCs) combine visual scenes with natural-language queries that demand compositional inference over objects, attributes, and spatial relations. We present ViPer, a unified attack framework that integrates structured multi-object visual perception with adaptive LLM-based reasoning. ViPer achieves up to 93.2% success, approaching human-level performance across multiple benchmarks.
arXiv Detail & Related papers (2026-01-10T07:01:53Z)
- GTMA: Dynamic Representation Optimization for OOD Vision-Language Models [10.940718051047023]
Vision-Language Models (VLMs) struggle in open-world applications, where out-of-distribution (OOD) concepts often trigger cross-modal alignment collapse. We propose dynamic representation optimization, realized through the Guided Target-language Adaptation (GTMA) framework. Experiments on ImageNet-R and the VISTA-Beyond benchmark demonstrate that GTMA improves zero-shot and few-shot OOD accuracy by up to 15-20 percent over the base VLM.
arXiv Detail & Related papers (2025-12-20T20:44:07Z)
- V-Attack: Targeting Disentangled Value Features for Controllable Adversarial Attacks on LVLMs [66.81402538540458]
We propose V-Attack, a novel method for precise local semantic attacks. V-Attack improves the attack success rate by an average of 36% over state-of-the-art methods.
arXiv Detail & Related papers (2025-11-25T11:51:17Z)
- Constrained Prompt Enhancement for Improving Zero-Shot Generalization of Vision-Language Models [57.357091028792325]
Vision-language models (VLMs) pre-trained on web-scale data exhibit promising zero-shot generalization but often suffer from semantic misalignment. We propose a novel constrained prompt enhancement (CPE) method to improve visual-textual alignment. Our approach consists of two key components: Topology-Guided Synonymous Semantic Generation (TGSSG) and Category-Agnostic Discriminative Region Selection (CADRS).
arXiv Detail & Related papers (2025-08-24T15:45:22Z)
- HOID-R1: Reinforcement Learning for Open-World Human-Object Interaction Detection Reasoning with Multimodal Large Language Model [13.82578761807402]
We introduce HOID-R1, the first HOI detection framework that integrates chain-of-thought (CoT) guided fine-tuning with group relative policy optimization. To mitigate hallucinations in the CoT reasoning, we introduce an "MLLM-as-a-judge" mechanism that supervises the CoT outputs. Experiments show that HOID-R1 achieves state-of-the-art performance on HOI detection benchmarks and outperforms existing methods in open-world generalization to novel scenarios.
arXiv Detail & Related papers (2025-08-15T09:28:57Z)
- ACD-CLIP: Decoupling Representation and Dynamic Fusion for Zero-Shot Anomaly Detection [21.26826497960086]
Pre-trained Vision-Language Models (VLMs) struggle with Zero-Shot Anomaly Detection (ZSAD). We propose a parameter-efficient Convolutional Low-Rank Adaptation (Conv-LoRA) adapter to inject local inductive biases for fine-grained representation. We also introduce a Dynamic Fusion Gateway (DFG) that leverages visual context to adaptively modulate text prompts.
arXiv Detail & Related papers (2025-08-11T10:03:45Z)
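ACD-CLIP's Conv-LoRA is described only as a parameter-efficient adapter that injects local inductive biases. One plausible realization, sketched below, puts a depthwise 3x3 convolution over the patch grid inside a standard low-rank branch; the rank, kernel size, and the assumption that the class token is handled outside this module are mine, not the paper's.

```python
# Sketch of a convolutional low-rank adapter in the spirit of Conv-LoRA:
# a frozen linear layer plus a low-rank branch whose inner map mixes
# neighboring patches, adding local inductive bias.
import torch
import torch.nn as nn

class ConvLoRA(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, grid: int = 14,
                 scale: float = 1.0):
        super().__init__()
        self.base, self.grid, self.scale = base, grid, scale
        base.weight.requires_grad_(False)               # freeze pretrained path
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.conv = nn.Conv2d(rank, rank, 3, padding=1, groups=rank)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)                  # start as identity

    def forward(self, x):
        # x: (B, N, C) patch tokens with N == grid * grid (CLS token excluded,
        # an assumption of this sketch).
        b, n, _ = x.shape
        h = self.down(x).transpose(1, 2).reshape(b, -1, self.grid, self.grid)
        h = self.conv(h).flatten(2).transpose(1, 2)     # local mixing on grid
        return self.base(x) + self.scale * self.up(h)
```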
- Universal Scene Graph Generation [77.53076485727414]
We present the Universal Scene Graph (USG), a novel representation capable of characterizing comprehensive semantic scenes. We also introduce USG-Par, which effectively addresses two key bottlenecks of cross-modal object alignment and out-of-domain challenges.
arXiv Detail & Related papers (2025-03-19T08:55:06Z)
- Interpretable Face Anti-Spoofing: Enhancing Generalization with Multimodal Large Language Models [58.936893810674896]
Face Anti-Spoofing (FAS) is essential for ensuring the security and reliability of facial recognition systems. We introduce a multimodal large language model framework for FAS, termed Interpretable Face Anti-Spoofing (I-FAS). We propose a Spoof-aware Captioning and Filtering (SCF) strategy to generate high-quality captions for FAS images.
arXiv Detail & Related papers (2025-01-03T09:25:04Z)
- Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality [69.76121008898677]
Fine-grained Selective Calibrated CLIP (FSC-CLIP) integrates a local hard negative loss and selective calibrated regularization. Our evaluations show that FSC-CLIP not only achieves compositionality on par with state-of-the-art models but also retains strong multi-modal capabilities.
arXiv Detail & Related papers (2024-10-07T17:16:20Z)
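The FSC-CLIP entry mentions a local hard negative loss. The sketch below gives the generic hard-negative contrastive form such losses typically build on, contrasting each image's true caption against a perturbed (e.g. word-swapped) negative; how FSC-CLIP localizes and calibrates this term is not specified in the summary.

```python
# Generic hard-negative contrastive term: the matched caption should score
# higher than its hard negative for the same image.
import torch
import torch.nn.functional as F

def hard_negative_loss(img_emb, txt_emb, hard_neg_emb, tau=0.07):
    """img_emb, txt_emb, hard_neg_emb: (B, D), L2-normalized embeddings."""
    pos = (img_emb * txt_emb).sum(-1) / tau            # (B,) matched pairs
    neg = (img_emb * hard_neg_emb).sum(-1) / tau       # (B,) swapped captions
    # binary contrast: cross-entropy with the true caption as the label
    return -F.log_softmax(torch.stack([pos, neg], dim=-1), dim=-1)[:, 0].mean()
```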
- HSVA: Hierarchical Semantic-Visual Adaptation for Zero-Shot Learning [74.76431541169342]
Zero-shot learning (ZSL) tackles the unseen class recognition problem, transferring semantic knowledge from seen classes to unseen ones.
We propose a novel hierarchical semantic-visual adaptation (HSVA) framework to align semantic and visual domains.
Experiments on four benchmark datasets demonstrate HSVA achieves superior performance on both conventional and generalized ZSL.
arXiv Detail & Related papers (2021-09-30T14:27:50Z)