Fugu-MT Paper Translation (Abstract): AG-VAS: Anchor-Guided Zero-Shot Visual Anomaly Segmentation with Large Multimodal Models

Paper overview: AG-VAS: Anchor-Guided Zero-Shot Visual Anomaly Segmentation with Large Multimodal Models

arxiv url: http://arxiv.org/abs/2603.01305v1
Date: Sun, 01 Mar 2026 22:25:23 GMT
Status: Translation complete
Last updated in system: 2026-03-03 19:50:56.619122
Title: AG-VAS: Anchor-Guided Zero-Shot Visual Anomaly Segmentation with Large Multimodal Models
Authors: Zhen Qu, Xian Tao, Xiaoyi Bao, Dingrong Wang, ShiChen Qu, Zhengtao Zhang, Xingang Wang
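Abstract summary: AG-VAS (Anchor-Guided Visual Anomaly Segmentation) is a new framework that expands the LMM vocabulary with three learnable semantic anchor tokens. AG-VAS achieves consistent state-of-the-art performance in the zero-shot setting.
Reference score (attention score computed independently by this site): 21.682989096955467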
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large multimodal models (LMMs) exhibit strong task generalization capabilities, offering new opportunities for zero-shot visual anomaly segmentation (ZSAS). However, existing LMM-based segmentation approaches still face fundamental limitations: anomaly concepts are inherently abstract and context-dependent, lacking stable visual prototypes, and the weak alignment between high-level semantic embeddings and pixel-level spatial features hinders precise anomaly localization. To address these challenges, we present AG-VAS (Anchor-Guided Visual Anomaly Segmentation), a new framework that expands the LMM vocabulary with three learnable semantic anchor tokens ([SEG], [NOR], and [ANO]), establishing a unified anchor-guided segmentation paradigm. Specifically, [SEG] serves as an absolute semantic anchor that translates abstract anomaly semantics into explicit, spatially grounded visual entities (e.g., holes or scratches), while [NOR] and [ANO] act as relative anchors that model the contextual contrast between normal and abnormal patterns across categories. To further enhance cross-modal alignment, we introduce a Semantic-Pixel Alignment Module (SPAM) that aligns language-level semantic embeddings with high-resolution visual features, along with an Anchor-Guided Mask Decoder (AGMD) that performs anchor-conditioned mask prediction for precise anomaly localization. In addition, we curate Anomaly-Instruct20K, a large-scale instruction dataset that organizes anomaly knowledge into structured descriptions of appearance, shape, and spatial attributes, facilitating effective learning and integration of the proposed semantic anchors. Extensive experiments on six industrial and medical benchmarks demonstrate that AG-VAS achieves consistent state-of-the-art performance in the zero-shot setting.
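Illustrative sketch (not from the paper): the abstract names two ideas that a toy example can make concrete, namely extending a vocabulary with the anchor tokens [SEG], [NOR], and [ANO], and predicting a mask conditioned on an anchor embedding. The abstract does not specify how SPAM or AGMD are implemented, so everything below is an assumption: the helper names `extend_vocab` and `anchor_mask` are hypothetical, the embeddings are plain Python lists, and dot-product scoring followed by a sigmoid threshold is a common stand-in, not necessarily the authors' method.

```python
import math
import random

# Token names are taken from the abstract; everything else is hypothetical.
ANCHOR_TOKENS = ["[SEG]", "[NOR]", "[ANO]"]


def extend_vocab(vocab, embeddings, new_tokens, dim, rng):
    """Append unseen tokens to a toy vocabulary and add one embedding row each."""
    for tok in new_tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
            # Small random initialisation; in a real model these rows would be trained.
            embeddings.append([rng.gauss(0.0, 0.02) for _ in range(dim)])
    return vocab, embeddings


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def anchor_mask(anchor_vec, pixel_feats, threshold=0.5):
    """Score every pixel feature against the anchor by dot product, then threshold."""
    mask = []
    for row in pixel_feats:
        mask.append([
            1 if sigmoid(sum(a * f for a, f in zip(anchor_vec, feat))) > threshold else 0
            for feat in row
        ])
    return mask


# Toy usage: a 2-word vocabulary with 4-dimensional embeddings.
rng = random.Random(0)
vocab = {"a": 0, "photo": 1}
embeddings = [[0.0] * 4 for _ in vocab]
vocab, embeddings = extend_vocab(vocab, embeddings, ANCHOR_TOKENS, dim=4, rng=rng)

# A hand-set anchor vector and a 2x2 grid of pixel features, chosen for clarity.
anchor = [1.0, 0.0, 0.0, 0.0]
feats = [
    [[2.0, 0.0, 0.0, 0.0], [-2.0, 0.0, 0.0, 0.0]],
    [[-1.0, 0.0, 0.0, 0.0], [3.0, 0.0, 0.0, 0.0]],
]
print(anchor_mask(anchor, feats))
```

In this toy setup a pixel is marked anomalous when its feature points in the same direction as the anchor vector; a trained system would learn both the anchor embeddings and the pixel features rather than setting them by hand.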
