Towards Fine-Grained Vision-Language Alignment for Few-Shot Anomaly Detection
- URL: http://arxiv.org/abs/2510.26464v1
- Date: Thu, 30 Oct 2025 13:09:00 GMT
- Title: Towards Fine-Grained Vision-Language Alignment for Few-Shot Anomaly Detection
- Authors: Yuanting Fan, Jun Liu, Xiaochen Chen, Bin-Bin Gao, Jian Li, Yong Liu, Jinlong Peng, Chengjie Wang
- Abstract summary: We propose a novel framework named FineGrainedAD to improve anomaly localization performance. Experiments demonstrate that the proposed FineGrainedAD achieves superior overall performance in few-shot settings.
- Score: 65.29550320117526
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Few-shot anomaly detection (FSAD) methods identify anomalous regions from only a few known normal samples. Most existing methods rely on the generalization ability of pre-trained vision-language models (VLMs) to recognize potentially anomalous regions through feature similarity between text descriptions and images. However, lacking detailed textual descriptions, these methods can only match pre-defined image-level descriptions against each visual patch token to identify potential anomalous regions. This causes semantic misalignment between image-level descriptions and patch-level visual anomalies, yielding sub-optimal localization performance. To address these issues, we propose the Multi-Level Fine-Grained Semantic Caption (MFSC), which provides multi-level, fine-grained textual descriptions for existing anomaly detection datasets via an automatic construction pipeline. Building on MFSC, we propose a novel framework named FineGrainedAD to improve anomaly localization performance, consisting of two components: Multi-Level Learnable Prompt (MLLP) and Multi-Level Semantic Alignment (MLSA). MLLP introduces fine-grained semantics into multi-level learnable prompts through an automatic replacement and concatenation mechanism, while MLSA designs a region aggregation strategy and multi-level alignment training so that the learnable prompts better align with their corresponding visual regions. Experiments demonstrate that the proposed FineGrainedAD achieves superior overall performance in few-shot settings on the MVTec-AD and VisA datasets.
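The patch-to-text matching scheme the abstract critiques can be made concrete. Below is a minimal sketch, assuming CLIP-like encoders that produce comparable patch and text embeddings; the function name, the two-prompt (normal vs. anomalous) setup, and the temperature value are illustrative assumptions, not the paper's implementation.

```python
# Sketch: score each visual patch token against image-level "normal" and
# "anomalous" text embeddings; softmax over the two scores gives a
# patch-level anomaly map. All names here are illustrative.
import torch
import torch.nn.functional as F

def patch_anomaly_map(patch_tokens: torch.Tensor,
                      text_normal: torch.Tensor,
                      text_anomalous: torch.Tensor,
                      temperature: float = 0.07) -> torch.Tensor:
    """patch_tokens: (N, D) patch embeddings; text_*: (D,) text embeddings."""
    # Cosine similarity between every patch token and the two descriptions.
    patches = F.normalize(patch_tokens, dim=-1)
    t_norm = F.normalize(text_normal, dim=-1)
    t_anom = F.normalize(text_anomalous, dim=-1)
    sim = torch.stack([patches @ t_norm, patches @ t_anom], dim=-1)  # (N, 2)
    # Probability that each patch matches the "anomalous" description.
    probs = (sim / temperature).softmax(dim=-1)
    return probs[:, 1]  # (N,) anomaly score per patch

# Example with random features: 196 patches (14x14 grid), 512-d embeddings.
scores = patch_anomaly_map(torch.randn(196, 512),
                           torch.randn(512), torch.randn(512))
anomaly_map = scores.reshape(14, 14)  # upsampled to image size in practice
```

Because both text embeddings describe the whole image, every patch is scored against image-level semantics; this is exactly the misalignment that MFSC's multi-level, fine-grained captions are meant to remove.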
Related papers
- PromptMAD: Cross-Modal Prompting for Multi-Class Visual Anomaly Localization [9.018570847586878]
We propose PromptMAD, a cross-modal prompting framework for unsupervised visual anomaly detection and localization. Our method enriches visual reconstruction with semantic context, improving the detection of subtle and textural anomalies. Our architecture also includes a supervised segmentor that fuses multi-scale convolutional features with Transformer-based spatial attention.
arXiv Detail & Related papers (2026-01-30T03:04:06Z) - IAD-GPT: Advancing Visual Knowledge in Multimodal Large Language Model for Industrial Anomaly Detection [70.02774285130238]
This paper explores the combination of rich text semantics with both image-level and pixel-level information from images. We propose IAD-GPT, a novel paradigm based on MLLMs for Industrial Anomaly Detection. Experiments on the MVTec-AD and VisA datasets demonstrate our state-of-the-art performance.
arXiv Detail & Related papers (2025-10-16T02:48:05Z) - ViLa-MIL: Dual-scale Vision-Language Multiple Instance Learning for Whole Slide Image Classification [52.405499816861635]
Multiple instance learning (MIL)-based frameworks have become the mainstream approach for processing whole slide images (WSIs). We propose a dual-scale vision-language multiple instance learning (ViLa-MIL) framework for whole slide image classification.
arXiv Detail & Related papers (2025-02-12T13:28:46Z) - Toward Robust Hyper-Detailed Image Captioning: A Multiagent Approach and Dual Evaluation Metrics for Factuality and Coverage [50.84150600032693]
Multimodal large language models (MLLMs) excel at generating highly detailed captions but often produce hallucinations. We propose a multiagent approach that leverages LLM-MLLM collaboration to correct given captions. Our proposed method significantly enhances the factual accuracy of captions, even improving those generated by GPT-4V.
arXiv Detail & Related papers (2024-12-20T01:37:22Z) - ForgeryGPT: Multimodal Large Language Model For Explainable Image Forgery Detection and Localization [49.12958154544838]
ForgeryGPT is a novel framework that advances the Image Forgery Detection and Localization task. It captures high-order correlations of forged images from diverse linguistic feature spaces. It enables explainable generation and interactive dialogue through a newly customized Large Language Model (LLM) architecture.
arXiv Detail & Related papers (2024-10-14T07:56:51Z) - VMAD: Visual-enhanced Multimodal Large Language Model for Zero-Shot Anomaly Detection [19.79027968793026]
Zero-shot anomaly detection (ZSAD) recognizes and localizes anomalies in previously unseen objects.
Existing ZSAD methods are limited by closed-world settings, struggling to handle unseen defects with predefined prompts.
We propose a novel framework VMAD (Visual-enhanced MLLM Anomaly Detection) that enhances MLLM with visual-based IAD knowledge and fine-grained perception.
arXiv Detail & Related papers (2024-09-30T09:51:29Z) - FiLo: Zero-Shot Anomaly Detection by Fine-Grained Description and High-Quality Localization [31.854923603517264]
We propose a novel zero-shot anomaly detection (ZSAD) method called FiLo.
FiLo comprises two components: adaptively learned Fine-Grained Description (FG-Des) and position-enhanced High-Quality Localization (HQ-Loc); a generic sketch of such learnable fine-grained descriptions appears after this list.
Experimental results on datasets like MVTec and VisA demonstrate that FiLo significantly improves the performance of ZSAD in both detection and localization.
arXiv Detail & Related papers (2024-04-21T14:22:04Z) - Do LLMs Understand Visual Anomalies? Uncovering LLM's Capabilities in Zero-shot Anomaly Detection [18.414762007525137]
Large vision-language models (LVLMs) are proficient in deriving visual representations guided by natural language. Recent explorations have utilized LVLMs to tackle zero-shot visual anomaly detection (VAD) challenges. We present ALFA, a training-free approach designed to address these challenges via a unified model.
arXiv Detail & Related papers (2024-04-15T10:42:22Z) - Bootstrap Fine-Grained Vision-Language Alignment for Unified Zero-Shot Anomaly Localization [63.61093388441298]
Contrastive Language-Image Pre-training (CLIP) models have shown promising performance on zero-shot visual recognition tasks.
In this work, we propose AnoCLIP for zero-shot anomaly localization.
arXiv Detail & Related papers (2023-08-30T10:35:36Z) - VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment [52.489874804051304]
VoLTA is a new vision-language pre-training paradigm that utilizes only image-caption data yet achieves fine-grained region-level image understanding.
VoLTA pushes multi-modal fusion deep into the uni-modal backbones during pre-training.
Experiments on a wide range of vision- and vision-language downstream tasks demonstrate the effectiveness of VoLTA.
arXiv Detail & Related papers (2022-10-09T01:49:58Z)
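As referenced in the FiLo entry, several works above replace hand-written descriptions with learnable prompt tokens (FG-Des in FiLo, MLLP in the main paper). The sketch below shows the generic idea, assuming a frozen CLIP-style text encoder; the class name, shapes, and the region mean-pooling helper are assumptions for illustration, not any paper's actual code.

```python
# Sketch: trainable context tokens are concatenated with the frozen token
# embeddings of a fine-grained phrase (e.g. "scratch on metal surface");
# only the context receives gradients while the VLM stays frozen.
import torch
import torch.nn as nn

class LearnablePrompt(nn.Module):
    """Trainable context tokens prepended to a frozen description."""
    def __init__(self, n_ctx: int = 8, dim: int = 512):
        super().__init__()
        # Only these context vectors are optimized.
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)

    def forward(self, phrase_emb: torch.Tensor) -> torch.Tensor:
        """phrase_emb: (L, D) frozen token embeddings of one description."""
        return torch.cat([self.ctx, phrase_emb], dim=0)  # (n_ctx + L, D)

def aggregate_region(patch_feats: torch.Tensor,
                     mask: torch.Tensor) -> torch.Tensor:
    """Mean-pool patch features (N, D) inside a boolean region mask (N,);
    one plausible reading of a 'region aggregation strategy'."""
    return patch_feats[mask].mean(dim=0)

# Training outline: run the assembled prompt through the frozen text
# encoder, then pull its embedding toward the matching region feature.
prompt = LearnablePrompt()
tokens = prompt(torch.randn(4, 512))  # 4 frozen tokens of one phrase
optimizer = torch.optim.AdamW(prompt.parameters(), lr=1e-3)
```

Tuning only the prompt tokens while keeping the encoders frozen suits the few-shot setting, where there is too little data to fine-tune the VLM itself.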
This list is automatically generated from the titles and abstracts of the papers on this site.