Fugu-MT paper translation (abstract): Towards Fine-Grained Robustness: Attention-Guided Test-Time Prompt Tuning for Vision-Language Models

Paper summary: Towards Fine-Grained Robustness: Attention-Guided Test-Time Prompt Tuning for Vision-Language Models

arxiv url: http://arxiv.org/abs/2605.19956v1
Date: Tue, 19 May 2026 15:10:06 GMT
Status: translation complete
System update date: 2026-05-20 15:03:09.464291
Title: Towards Fine-Grained Robustness: Attention-Guided Test-Time Prompt Tuning for Vision-Language Models
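Title (reference translation): Towards Fine-Grained Robustness: Attention-Guided Test-Time Prompt Tuning for Vision-Language Models
Authors: Jia-Wei Hai, Yijun Wang, Xiu-Shen Wei
Abstract summary: A-TPT is a semantics-preserving method designed for test-time adaptation. It first refines the gradient attention rollout mechanism to identify semantically meaningful regions that survive adversarial attacks. These regions then guide spatially varying augmentation intensities and a multi-view ensemble for prompt tuning and inference.
Reference score (independently computed attention level): 22.43559255963294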
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision-Language Models (VLMs), such as CLIP, have achieved significant zero-shot performance on downstream tasks with various fine-tuning adaptation methods. However, recent studies have proven that adversarial attacks can significantly degrade the inference ability of VLMs, posing substantial risks to their practical applications. Prevalent test-time adaptation methods typically rely on multi-view augmentation to implement various fine-tuning strategies, which struggle to identify semantic information and are prone to destroying discriminative regions in fine-grained scenarios. To address these limitations, we propose Attention-Guided Test-Time Prompt Tuning (A-TPT), a semantics-preserving method designed for test-time adaptation. We first refine the gradient attention rollout mechanism to identify semantically meaningful regions surviving under adversarial attacks. Furthermore, we leverage them to guide the spatially varying augmentation intensities and multi-view ensemble for prompt tuning and inference. Extensive experiments demonstrate that A-TPT outperforms existing test-time adaptation methods on both adversarial and clean data. Codes are available at https://github.com/SEU-VIPGroup/A-TPT .
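Abstract (reference translation): Vision-language models (VLMs) such as CLIP achieve strong zero-shot performance on downstream tasks with a variety of fine-tuning adaptation methods. However, recent studies show that adversarial attacks can significantly degrade the inference ability of VLMs, which poses substantial risks to their practical application. Common test-time adaptation methods rely on multi-view augmentation to implement their fine-tuning strategies; they struggle to identify semantic information and tend to destroy discriminative regions in fine-grained scenarios. To address these limitations, we propose Attention-Guided Test-Time Prompt Tuning (A-TPT), a semantics-preserving method for test-time adaptation. We first refine the gradient attention rollout mechanism to identify semantically meaningful regions that survive adversarial attacks. We then use these regions to guide spatially varying augmentation intensities and a multi-view ensemble for prompt tuning and inference. Experiments show that A-TPT outperforms existing test-time adaptation methods on both adversarial and clean data. The code is available at https://github.com/SEU-VIPGroup/A-TPT .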
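
The abstract names attention rollout and spatially varying augmentation intensities without giving formulas. As a rough illustration only, the sketch below shows the plain attention rollout computation (head-averaged attention mixed with the identity and multiplied across layers). It is not the gradient-weighted or refined variant that A-TPT describes. The function names, the `residual_weight` parameter, and the `augmentation_strength` rule (weaker augmentation where saliency is high) are hypothetical choices made for this example and are not taken from the paper or its repository.

```python
import numpy as np


def attention_rollout(attentions, residual_weight=0.5):
    """Plain attention rollout over a stack of transformer layers.

    attentions: list of arrays shaped (heads, tokens, tokens), one per layer,
    each row already normalized to sum to 1 (softmax output).
    Returns a (tokens, tokens) matrix of accumulated attention.
    """
    num_tokens = attentions[0].shape[-1]
    identity = np.eye(num_tokens)
    rollout = identity.copy()
    for layer_attn in attentions:
        # Average over heads, then mix in the identity to model the residual path.
        fused = layer_attn.mean(axis=0)
        fused = residual_weight * fused + (1.0 - residual_weight) * identity
        # Re-normalize rows so the matrix stays row-stochastic.
        fused = fused / fused.sum(axis=-1, keepdims=True)
        rollout = fused @ rollout
    return rollout


def cls_saliency(rollout):
    """Saliency of each patch token as seen from the class token (row 0)."""
    scores = rollout[0, 1:]  # drop the class token's attention to itself
    span = scores.max() - scores.min()
    return (scores - scores.min()) / (span + 1e-12)


def augmentation_strength(saliency, max_strength=0.3):
    """Hypothetical rule: augment salient patches less, background patches more."""
    return max_strength * (1.0 - saliency)
```

With a ViT-style encoder, `attentions` would hold one attention tensor per layer for a single image, and the resulting per-patch strengths could be reshaped to the patch grid to modulate how strongly each region is perturbed when the augmented views are generated.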
