Fugu-MT Paper Translation (Abstract): Complementary Text-Guided Attention for Zero-Shot Adversarial Robustness

Paper overview: Complementary Text-Guided Attention for Zero-Shot Adversarial Robustness

arxiv url: http://arxiv.org/abs/2603.18598v1
Date: Thu, 19 Mar 2026 08:11:46 GMT
Status: Translation complete
Last updated in system: 2026-03-20 17:19:06.024492
Title: Complementary Text-Guided Attention for Zero-Shot Adversarial Robustness
Title (reference translation): Complementary text-guided attention for zero-shot adversarial robustness
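Authors: Lu Yu, Haiyang Zhang, Changsheng Xu
Abstract summary: We propose Text-Guided Attention for Zero-Shot Robustness (TGA-ZSR). Our goal is to maintain the generalization of the CLIP model and enhance its adversarial robustness. To overcome the method's occasional focus on irrelevant or spurious features, we propose a novel approach called Complementary Text-Guided Attention (Comp-TGA).
Reference score (independently computed attention measure): 57.104158692005775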
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Owing to their impressive zero-shot capabilities, pre-trained vision-language models (e.g., CLIP) have attracted widespread attention and adoption across various domains. Nonetheless, CLIP has been observed to be susceptible to adversarial examples. Through experimental analysis, we have observed a phenomenon wherein adversarial perturbations induce shifts in text-guided attention. Building upon this observation, we propose a simple yet effective strategy: Text-Guided Attention for Zero-Shot Robustness (TGA-ZSR). This framework incorporates two components: Local Attention Refinement Module and Global Attention Constraint Module. Our goal is to maintain the generalization of the CLIP model and enhance its adversarial robustness. Additionally, the Global Attention Constraint Module acquires text-guided attention from both the target and original models using clean examples. Its objective is to maintain model performance on clean samples while enhancing overall robustness. However, we observe that the method occasionally focuses on irrelevant or spurious features, which can lead to suboptimal performance and undermine its robustness in certain scenarios. To overcome this limitation, we further propose a novel approach called Complementary Text-Guided Attention (Comp-TGA). This method integrates two types of foreground attention: attention guided by the class prompt and reversed attention driven by the non-class prompt. These complementary attention mechanisms allow the model to capture a more comprehensive and accurate representation of the foreground. The experiments validate that TGA-ZSR and Comp-TGA yield 9.58% and 11.95% improvements, respectively, in zero-shot robust accuracy over the current state-of-the-art techniques across 16 datasets.
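Abstract (reference translation): Because of their impressive zero-shot capabilities, pre-trained vision-language models (e.g., CLIP) have drawn wide attention and been adopted across many domains. Even so, CLIP has been observed to be vulnerable to adversarial examples. Experimental analysis revealed a phenomenon in which adversarial perturbations induce changes in text-guided attention. Based on this, the paper proposes Text-Guided Attention for Zero-Shot Robustness (TGA-ZSR). The framework contains two components: the Local Attention Refinement Module and the Global Attention Constraint Module. The goal is to maintain the generalization of the CLIP model while improving its adversarial robustness. In addition, the Global Attention Constraint Module obtains text-guided attention from both the target model and the original model using clean examples. Its purpose is to preserve model performance on clean samples while improving overall robustness. However, the method sometimes focuses on irrelevant or spurious features, which can lead to suboptimal performance and weaken its robustness in certain scenarios. To overcome this limitation, the paper proposes a new approach called Complementary Text-Guided Attention (Comp-TGA). This method integrates two types of foreground attention: attention guided by the class prompt and reversed attention guided by the non-class prompt. These complementary attention mechanisms let the model capture a more comprehensive and accurate representation of the foreground. In the experiments, TGA-ZSR and Comp-TGA improve zero-shot robust accuracy over the current state-of-the-art techniques by 9.58% and 11.95%, respectively, across 16 datasets.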
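The abstract does not give formulas, so the following is only a minimal sketch of the idea of combining class-prompt attention with reversed non-class-prompt attention. Everything concrete here is an assumption made for illustration, not the paper's definition: per-patch attention is taken to be the min-max normalized cosine similarity between a patch feature and a text embedding, the two maps are fused by a plain average, and the function names (`text_guided_attention`, `complementary_attention`, `attention_shift`) are hypothetical. Toy 2-D vectors stand in for CLIP features.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def minmax(values):
    """Rescale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def text_guided_attention(patch_feats, text_feat):
    """ASSUMPTION: one score per patch, from similarity to the text embedding."""
    return minmax([cosine(p, text_feat) for p in patch_feats])


def complementary_attention(patch_feats, class_feat, non_class_feat):
    """ASSUMPTION: average class-prompt attention with reversed
    (1 - attention) non-class-prompt attention to estimate the foreground."""
    class_att = text_guided_attention(patch_feats, class_feat)
    non_class_att = text_guided_attention(patch_feats, non_class_feat)
    reversed_att = [1.0 - a for a in non_class_att]
    return [(c + r) / 2.0 for c, r in zip(class_att, reversed_att)]


def attention_shift(att_a, att_b):
    """L2 distance between two attention maps (hypothetical shift measure)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(att_a, att_b)))


# Toy data: three "patches" and two "prompt" embeddings (all made up).
patches = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
class_prompt = [1.0, 0.0]       # stands in for e.g. "a photo of a dog"
non_class_prompt = [0.0, 1.0]   # stands in for a non-class prompt

foreground = complementary_attention(patches, class_prompt, non_class_prompt)
clean_att = text_guided_attention(patches, class_prompt)
perturbed = [[0.6, 0.8], [0.0, 1.0], [0.7, 0.7]]  # first patch perturbed
shift = attention_shift(clean_att, text_guided_attention(perturbed, class_prompt))
print(foreground, shift)
```

In this toy setup the patch aligned with the class prompt receives the highest fused score and the patch aligned with the non-class prompt the lowest, while `shift` becomes nonzero once a patch feature is perturbed, which mirrors the attention shift the abstract describes.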

Related papers