Referring Industrial Anomaly Segmentation
- URL: http://arxiv.org/abs/2602.03673v1
- Date: Tue, 03 Feb 2026 15:53:30 GMT
- Title: Referring Industrial Anomaly Segmentation
- Authors: Pengfei Yue, Xiaokang Jiang, Yilin Lu, Jianghang Lin, Shengchuan Zhang, Liujuan Cao,
- Abstract summary: Industrial Anomaly Detection (IAD) is vital for manufacturing, yet traditional methods face challenges.<n>We propose Referring Industrial Anomaly (RIAS), a paradigm leveraging language to guide detection.<n>We introduce the MVTec-Ref dataset to support this, designed with diverse referring expressions and focusing on anomaly patterns.
- Score: 38.4461918800367
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Industrial Anomaly Detection (IAD) is vital for manufacturing, yet traditional methods face significant challenges: unsupervised approaches yield rough localizations requiring manual thresholds, while supervised methods overfit due to scarce, imbalanced data. Both suffer from the "One Anomaly Class, One Model" limitation. To address this, we propose Referring Industrial Anomaly Segmentation (RIAS), a paradigm leveraging language to guide detection. RIAS generates precise masks from text descriptions without manual thresholds and uses universal prompts to detect diverse anomalies with a single model. We introduce the MVTec-Ref dataset to support this, designed with diverse referring expressions and focusing on anomaly patterns, notably with 95% small anomalies. We also propose the Dual Query Token with Mask Group Transformer (DQFormer) benchmark, enhanced by Language-Gated Multi-Level Aggregation (LMA) to improve multi-scale segmentation. Unlike traditional methods using redundant queries, DQFormer employs only "Anomaly" and "Background" tokens for efficient visual-textual integration. Experiments demonstrate RIAS's effectiveness in advancing IAD toward open-set capabilities. Code: https://github.com/swagger-coder/RIAS-MVTec-Ref.
Related papers
- Steering and Rectifying Latent Representation Manifolds in Frozen Multi-modal LLMs for Video Anomaly Detection [52.5174167737992]
Video anomaly detection (VAD) aims to identify abnormal events in videos.<n>We propose SteerVAD, which advances MLLM-based VAD by shifting from passively reading to actively steering and rectifying internal representations.<n>Our method achieves state-of-the-art performance among tuning-free approaches requiring only 1% of training data.
arXiv Detail & Related papers (2026-02-27T13:48:50Z) - IAD-GPT: Advancing Visual Knowledge in Multimodal Large Language Model for Industrial Anomaly Detection [70.02774285130238]
This paper explores the combination of rich text semantics with both image-level and pixel-level information from images.<n>We propose IAD-GPT, a novel paradigm based on MLLMs for Industrial Anomaly Detection.<n>Experiments on MVTec-AD and VisA datasets demonstrate our state-of-the-art performance.
arXiv Detail & Related papers (2025-10-16T02:48:05Z) - LR-IAD:Mask-Free Industrial Anomaly Detection with Logical Reasoning [1.3124513975412255]
Industrial Anomaly Detection (IAD) is critical for ensuring product quality by identifying defects.<n>Existing vision-language models (VLMs) and Multimodal Large Language Models (MLLMs) address some limitations but rely on mask annotations.<n>We propose a reward function that dynamically prioritizes rare defect patterns during training to handle class imbalance.
arXiv Detail & Related papers (2025-04-28T06:52:35Z) - Towards General Visual-Linguistic Face Forgery Detection(V2) [90.6600794602029]
Face manipulation techniques have achieved significant advances, presenting serious challenges to security and social trust.<n>Recent works demonstrate that leveraging multimodal models can enhance the generalization and interpretability of face forgery detection.<n>We propose Face Forgery Text Generator (FFTG), a novel annotation pipeline that generates accurate text descriptions by leveraging forgery masks for initial region and type identification.
arXiv Detail & Related papers (2025-02-28T04:15:36Z) - Towards Zero-Shot Anomaly Detection and Reasoning with Multimodal Large Language Models [29.078437003042357]
Zero-Shot Anomaly Detection (ZSAD) is an emerging AD paradigm.<n>We propose Anomaly-OneVision (Anomaly-OV), the first specialist visual assistant for ZSAD and reasoning.
arXiv Detail & Related papers (2025-02-11T14:50:43Z) - VMAD: Visual-enhanced Multimodal Large Language Model for Zero-Shot Anomaly Detection [19.79027968793026]
Zero-shot anomaly detection (ZSAD) recognizes and localizes anomalies in previously unseen objects.
Existing ZSAD methods are limited by closed-world settings, struggling to unseen defects with predefined prompts.
We propose a novel framework VMAD (Visual-enhanced MLLM Anomaly Detection) that enhances MLLM with visual-based IAD knowledge and fine-grained perception.
arXiv Detail & Related papers (2024-09-30T09:51:29Z) - Learning Multi-Aspect Item Palette: A Semantic Tokenization Framework for Generative Recommendation [55.99632509895994]
We introduce LAMIA, a novel approach for multi-aspect semantic tokenization.<n>Unlike RQ-VAE, which uses a single embedding, LAMIA learns an item palette''--a collection of independent and semantically parallel embeddings.<n>Our results demonstrate significant improvements in recommendation accuracy over existing methods.
arXiv Detail & Related papers (2024-09-11T13:49:48Z) - Myriad: Large Multimodal Model by Applying Vision Experts for Industrial Anomaly Detection [86.24898024621008]
We present a novel large multimodal model applying vision experts for industrial anomaly detection(abbreviated to Myriad)<n>We utilize the anomaly map generated by the vision experts as guidance for LMMs, such that the vision model is guided to pay more attention to anomalous regions.<n>Our proposed method not only performs favorably against state-of-the-art methods, but also inherits the flexibility and instruction-following ability of LMMs in the field of IAD.
arXiv Detail & Related papers (2023-10-29T16:49:45Z) - Towards General Visual-Linguistic Face Forgery Detection [95.73987327101143]
Deepfakes are realistic face manipulations that can pose serious threats to security, privacy, and trust.
Existing methods mostly treat this task as binary classification, which uses digital labels or mask signals to train the detection model.
We propose a novel paradigm named Visual-Linguistic Face Forgery Detection(VLFFD), which uses fine-grained sentence-level prompts as the annotation.
arXiv Detail & Related papers (2023-07-31T10:22:33Z) - 2nd Place Winning Solution for the CVPR2023 Visual Anomaly and Novelty
Detection Challenge: Multimodal Prompting for Data-centric Anomaly Detection [10.682758791557436]
This report introduces the winning solution of the team Segment Any Anomaly for the CVPR2023 Visual Anomaly and Novelty Detection (VAND) challenge.
We present a novel framework, i.e., Segment Any Anomaly + (SAA$+$), for zero-shot anomaly segmentation with multi-modal prompts.
We will release the code of our winning solution for the CVPR2023 VAN.
arXiv Detail & Related papers (2023-06-15T11:49:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.