Fugu-MT Paper Translation (Summary): Token Coordinated Prompt Attention is Needed for Visual Prompting

Paper summary: Token Coordinated Prompt Attention is Needed for Visual Prompting

arxiv url: http://arxiv.org/abs/2505.02406v1
Date: Mon, 05 May 2025 06:59:26 GMT
Status: Translation complete
System update date: 2025-05-06 18:49:35.581079
Title: Token Coordinated Prompt Attention is Needed for Visual Prompting
Title (reference translation): The Need for Token Coordinated Prompt Attention in Visual Prompting
Authors: Zichen Liu, Xu Zou, Gang Hua, Jiahuan Zhou
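Abstract summary: This paper proposes the Token Coordinated Prompt Attention (TCPA) module. The prompts are disentangled into CLS Prompts and Image Prompts, which interact exclusively with the CLS token and the image tokens, respectively, through attention mechanisms. Because different image tokens correspond to different image patches and carry diverse information, matched prompts are automatically assigned to individual tokens.
Reference score (independently computed attention metric): 28.018671250553137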
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Visual prompting techniques are widely used to efficiently fine-tune pretrained Vision Transformers (ViT) by learning a small set of shared prompts for all tokens. However, existing methods overlook the unique roles of different tokens in conveying discriminative information and interact with all tokens using the same prompts, thereby limiting the representational capacity of ViT. This often leads to indistinguishable and biased prompt-extracted features, hindering performance. To address this issue, we propose a plug-and-play Token Coordinated Prompt Attention (TCPA) module, which assigns specific coordinated prompts to different tokens for attention-based interactions. Firstly, recognizing the distinct functions of CLS and image tokens (global information aggregation and local feature extraction, respectively), we disentangle the prompts into CLS Prompts and Image Prompts, which interact exclusively with CLS tokens and image tokens through attention mechanisms. This enhances their respective discriminative abilities. Furthermore, as different image tokens correspond to distinct image patches and contain diverse information, we employ a matching function to automatically assign coordinated prompts to individual tokens. This enables more precise attention interactions, improving the diversity and representational capacity of the extracted features. Extensive experiments across various benchmarks demonstrate that TCPA significantly enhances the diversity and discriminative power of the extracted features. The code is available at https://github.com/zhoujiahuan1991/ICML2025-TCPA.
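Abstract (reference translation): Visual prompting techniques are widely used to fine-tune pretrained Vision Transformers (ViT) efficiently by learning a small set of prompts shared across all tokens. Existing methods, however, overlook the distinct roles that different tokens play in conveying discriminative information and use the same prompts to interact with every token, which limits the representational capacity of ViT. This often yields indistinguishable and biased prompt-extracted features and degrades performance. To address this problem, the authors propose a plug-and-play Token Coordinated Prompt Attention (TCPA) module, which assigns specific coordinated prompts to different tokens for attention-based interaction. First, recognizing that CLS tokens and image tokens serve different functions (global information aggregation and local feature extraction, respectively), the prompts are disentangled into CLS Prompts and Image Prompts, which interact exclusively with CLS tokens and image tokens through attention mechanisms. This improves the discriminative ability of each. Furthermore, because different image tokens correspond to different image patches and contain diverse information, a matching function automatically assigns coordinated prompts to individual tokens. This enables more precise attention interactions and improves the diversity and representational capacity of the extracted features. Extensive experiments on various benchmarks show that TCPA significantly improves the diversity and discriminative power of the extracted features. The code is available at https://github.com/zhoujiahuan1991/ICML2025-TCPA.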
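
Illustrative sketch: the abstract describes two ideas, namely separate prompts for the CLS token and for image tokens, and a matching function that picks prompts per image token. The pure-Python example below is a minimal sketch of those two ideas only, not the authors' implementation. Everything specific in it is an assumption made for illustration: the function names (`tcpa_like_attention`, `match_prompt`), cosine similarity against a per-group key as the matching function, a single attention head with identity projections, and the toy vectors.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    # Cosine similarity; the small epsilon avoids division by zero.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)) + 1e-12)

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, context):
    # Single-head scaled dot-product attention with identity Q/K/V
    # projections (a simplification; real ViT layers learn projections).
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, c) / scale for c in context])
    return [sum(w * c[i] for w, c in zip(weights, context))
            for i in range(len(query))]

def match_prompt(token, prompt_keys):
    # ASSUMED matching function: index of the key most similar to the token.
    sims = [cosine(token, k) for k in prompt_keys]
    return sims.index(max(sims))

def tcpa_like_attention(cls_token, image_tokens, cls_prompts,
                        image_prompt_pool, prompt_keys):
    tokens = [cls_token] + image_tokens
    # The CLS token sees the tokens plus the CLS prompts only.
    cls_out = attend(cls_token, tokens + cls_prompts)
    image_out, assignments = [], []
    for tok in image_tokens:
        # Each image token sees the tokens plus its matched prompt group only.
        idx = match_prompt(tok, prompt_keys)
        assignments.append(idx)
        image_out.append(attend(tok, tokens + image_prompt_pool[idx]))
    return cls_out, image_out, assignments

# Toy data (hypothetical): 4-dimensional tokens, two image prompt groups.
cls_token = [0.1, 0.2, 0.3, 0.4]
image_tokens = [[1.0, 0.0, 0.0, 0.0],
                [0.0, 1.0, 0.0, 0.0],
                [0.9, 0.1, 0.0, 0.0]]
cls_prompts = [[0.5, 0.5, 0.5, 0.5]]
prompt_keys = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
image_prompt_pool = [[[0.2, 0.0, 0.0, 0.1]], [[0.0, 0.3, 0.1, 0.0]]]

cls_out, image_out, assignments = tcpa_like_attention(
    cls_token, image_tokens, cls_prompts, image_prompt_pool, prompt_keys)
print(assignments)  # prints [0, 1, 0]
```

In this toy run the first and third image tokens are matched to prompt group 0 and the second to group 1, so tokens with similar content share prompts while dissimilar tokens do not. How the actual method defines and trains its matching function is not stated in the abstract; the linked repository is the place to check.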
