Fugu-MT Paper Translation (Abstract): The Loupe: A Plug-and-Play Attention Module for Amplifying Discriminative Features in Vision Transformers

Paper overview: The Loupe: A Plug-and-Play Attention Module for Amplifying Discriminative Features in Vision Transformers

arxiv url: http://arxiv.org/abs/2508.16663v1
Date: Wed, 20 Aug 2025 19:07:21 GMT
Status: Translation complete
Last updated in system: 2025-08-26 18:43:45.0995
Title: The Loupe: A Plug-and-Play Attention Module for Amplifying Discriminative Features in Vision Transformers
Authors: Naren Sengodan
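Abstract summary: The module is designed to be inserted into pre-trained backbones such as the Swin Transformer. The Loupe is trained end-to-end with a composite loss function that implicitly guides the model to focus on the most discriminative object parts. Experiments on the challenging CUB-200-2011 dataset show that The Loupe improves the accuracy of a Swin-Base model from 85.40% to 88.06%, a gain of 2.66 percentage points.
Reference score (attention score computed independently by the site): 0.0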
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Fine-Grained Visual Classification (FGVC) is a critical and challenging area within computer vision, demanding the identification of highly subtle, localized visual cues. The importance of FGVC extends to critical applications such as biodiversity monitoring and medical diagnostics, where precision is paramount. While large-scale Vision Transformers have achieved state-of-the-art performance, their decision-making processes often lack the interpretability required for trust and verification in such domains. In this paper, we introduce The Loupe, a novel, lightweight, and plug-and-play attention module designed to be inserted into pre-trained backbones like the Swin Transformer. The Loupe is trained end-to-end with a composite loss function that implicitly guides the model to focus on the most discriminative object parts without requiring explicit part-level annotations. Our unique contribution lies in demonstrating that a simple, intrinsic attention mechanism can act as a powerful regularizer, significantly boosting performance while simultaneously providing clear visual explanations. Our experimental evaluation on the challenging CUB-200-2011 dataset shows that The Loupe improves the accuracy of a Swin-Base model from 85.40% to 88.06%, a significant gain of 2.66%. Crucially, our qualitative analysis of the learned attention maps reveals that The Loupe effectively localizes semantically meaningful features, providing a valuable tool for understanding and trusting the model's decision-making process.
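
The abstract does not give the internals of The Loupe or the terms of its composite loss, so the following is only a minimal sketch of the general idea it describes: a lightweight gate that scores each spatial token from a backbone stage, reweights the features by that score, and adds an auxiliary term to the classification loss. Everything named here (`loupe_like_gate`, `composite_loss`, the single linear scoring projection, the sigmoid gate, the mean-attention penalty and its weight) is a hypothetical assumption for illustration, not the paper's architecture or loss.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping a raw score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def loupe_like_gate(tokens, weights, bias=0.0):
    """Hypothetical spatial attention gate (NOT the paper's exact module).

    tokens  : list of N feature vectors, each of length C, standing in for
              the token features produced by a pre-trained backbone stage.
    weights : length-C projection vector (assumed learnable parameters).
    Returns the per-token attention map and the reweighted tokens.
    """
    # One scalar score per token, squashed to (0, 1).
    attn = [
        sigmoid(sum(w * f for w, f in zip(weights, tok)) + bias)
        for tok in tokens
    ]
    # Scale every feature of a token by that token's attention value.
    gated = [[a * f for f in tok] for a, tok in zip(attn, tokens)]
    return attn, gated


def composite_loss(ce_loss, attn, sparsity_weight=0.1):
    """Assumed composite loss: classification loss plus a penalty on the
    mean attention value, which nudges the map toward a few tokens.
    The abstract does not state the actual loss terms."""
    return ce_loss + sparsity_weight * sum(attn) / len(attn)


if __name__ == "__main__":
    # Two toy tokens with two channels each; only the first has signal.
    tokens = [[1.0, 0.0], [0.0, 0.0]]
    attn, gated = loupe_like_gate(tokens, weights=[2.0, 0.0])
    print("attention map:", attn)
    print("gated tokens:", gated)
    print("loss:", composite_loss(ce_loss=1.0, attn=attn))
```

In this toy setup the attention list doubles as the kind of map the abstract says can be visualized: reshaped to the token grid, higher values mark the regions the gate emphasizes.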
