Fugu-MT Paper Translation (Abstract): Towards Fine-Grained Vision-Language Alignment for Few-Shot Anomaly Detection

Paper summary: Towards Fine-Grained Vision-Language Alignment for Few-Shot Anomaly Detection

arxiv url: http://arxiv.org/abs/2510.26464v1
Date: Thu, 30 Oct 2025 13:09:00 GMT
Status: Translation complete
System update date: 2025-10-31 16:05:09.817467
Title: Towards Fine-Grained Vision-Language Alignment for Few-Shot Anomaly Detection
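Title (reference translation): Towards Fine-Grained Vision-Language Alignment for Few-Shot Anomaly Detection
Authors: Yuanting Fan, Jun Liu, Xiaochen Chen, Bin-Bin Gao, Jian Li, Yong Liu, Jinlong Peng, Chengjie Wang
Abstract summary: We propose a novel framework named FineGrainedAD to improve anomaly localization performance. Experiments show that the proposed FineGrainedAD achieves superior overall performance in few-shot settings.
Reference score (independently computed attention score): 65.29550320117526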
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Few-shot anomaly detection (FSAD) methods identify anomalous regions with few known normal samples. Most existing methods rely on the generalization ability of pre-trained vision-language models (VLMs) to recognize potentially anomalous regions through feature similarity between text descriptions and images. However, due to the lack of detailed textual descriptions, these methods can only pre-define image-level descriptions to match each visual patch token to identify potential anomalous regions, which leads to semantic misalignment between image descriptions and patch-level visual anomalies and results in sub-optimal localization performance. To address the above issues, we propose the Multi-Level Fine-Grained Semantic Caption (MFSC) to provide multi-level and fine-grained textual descriptions for existing anomaly detection datasets with an automatic construction pipeline. Based on the MFSC, we propose a novel framework named FineGrainedAD to improve anomaly localization performance, which consists of two components: Multi-Level Learnable Prompt (MLLP) and Multi-Level Semantic Alignment (MLSA). MLLP introduces fine-grained semantics into multi-level learnable prompts through an automatic replacement and concatenation mechanism, while MLSA designs a region aggregation strategy and multi-level alignment training to help learnable prompts better align with corresponding visual regions. Experiments demonstrate that the proposed FineGrainedAD achieves superior overall performance in few-shot settings on the MVTec-AD and VisA datasets.
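Abstract (reference translation): Few-shot anomaly detection (FSAD) methods identify anomalous regions using only a few known normal samples. Most existing methods rely on the generalization ability of pre-trained vision-language models (VLMs) to recognize potentially anomalous regions through feature similarity between text descriptions and images. However, because detailed textual descriptions are lacking, these methods can only pre-define image-level descriptions and match them against each visual patch token to identify potentially anomalous regions. This causes semantic misalignment between the image descriptions and patch-level visual anomalies, and results in sub-optimal localization performance. To address these issues, we propose the Multi-Level Fine-Grained Semantic Caption (MFSC), which provides multi-level, fine-grained textual descriptions for existing anomaly detection datasets through an automatic construction pipeline. Based on the MFSC, we propose a novel framework named FineGrainedAD, which consists of two components: Multi-Level Learnable Prompt (MLLP) and Multi-Level Semantic Alignment (MLSA). MLLP introduces fine-grained semantics into multi-level learnable prompts through an automatic replacement and concatenation mechanism, while MLSA designs a region aggregation strategy and multi-level alignment training so that the learnable prompts align better with their corresponding visual regions. Experiments show that the proposed FineGrainedAD achieves superior overall performance in few-shot settings on the MVTec-AD and VisA datasets.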
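The baseline that the abstract criticizes, matching every visual patch token against image-level text descriptions, can be illustrated with a small sketch. This is not FineGrainedAD, MLLP, or MLSA. It is only a generic patch-to-text similarity scorer written under assumptions: the function names, the temperature of 0.07, and the 2-D toy embeddings are all hypothetical stand-ins for real VLM features.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def patch_anomaly_scores(patch_feats, normal_text, abnormal_text, temperature=0.07):
    """Return one anomaly probability per patch.

    Each patch embedding is compared with a "normal" and an "abnormal"
    text embedding; a two-way softmax over the temperature-scaled
    similarities gives P(abnormal). The temperature is an assumed value.
    """
    scores = []
    for feat in patch_feats:
        s_normal = cosine(feat, normal_text) / temperature
        s_abnormal = cosine(feat, abnormal_text) / temperature
        # Subtract the max before exponentiating for numerical stability.
        m = max(s_normal, s_abnormal)
        e_normal = math.exp(s_normal - m)
        e_abnormal = math.exp(s_abnormal - m)
        scores.append(e_abnormal / (e_normal + e_abnormal))
    return scores

# Toy 2-D embeddings standing in for real VLM features (hypothetical values).
normal_text = [1.0, 0.0]
abnormal_text = [0.0, 1.0]
patches = [[0.9, 0.1], [0.2, 0.8], [1.0, 0.0]]
scores = patch_anomaly_scores(patches, normal_text, abnormal_text)
```

In this sketch the same two text embeddings are reused for every patch, which is the image-level matching the abstract describes as a source of semantic misalignment; the paper's proposal replaces such coarse descriptions with multi-level, fine-grained ones.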

Related papers