Fugu-MT Paper Translation (Abstract): Towards Mechanistic Defenses Against Typographic Attacks in CLIP

Paper overview: Towards Mechanistic Defenses Against Typographic Attacks in CLIP

arxiv url: http://arxiv.org/abs/2508.20570v1
Date: Thu, 28 Aug 2025 09:08:30 GMT
Status: Translation complete
Last updated in system: 2025-08-29 18:12:02.259895
Title: Towards Mechanistic Defenses Against Typographic Attacks in CLIP
Title (reference translation): Towards Mechanistic Defenses Against Typographic Attacks in CLIP
Authors: Lorenz Hufe, Constantin Venhoff, Maximilian Dreyer, Sebastian Lapuschkin, Wojciech Samek
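Abstract summary: We analyze how CLIP vision encoders behave under typographic attacks. We propose a method to defend CLIP models against typographic attacks by selectively ablating a typographic circuit. We release a family of dyslexic CLIP models that are significantly more robust against typographic attacks.
Reference score (independently computed attention metric): 23.69564867168339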
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Typographic attacks exploit multi-modal systems by injecting text into images, leading to targeted misclassifications, malicious content generation and even Vision-Language Model jailbreaks. In this work, we analyze how CLIP vision encoders behave under typographic attacks, locating specialized attention heads in the latter half of the model's layers that causally extract and transmit typographic information to the cls token. Building on these insights, we introduce a method to defend CLIP models against typographic attacks by selectively ablating a typographic circuit, consisting of attention heads. Without requiring finetuning, our method improves performance by up to 19.6% on a typographic variant of ImageNet-100, while reducing standard ImageNet-100 accuracy by less than 1%. Notably, our training-free approach remains competitive with current state-of-the-art typographic defenses that rely on finetuning. To this end, we release a family of dyslexic CLIP models which are significantly more robust against typographic attacks. These models serve as suitable drop-in replacements for a broad range of safety-critical applications, where the risks of text-based manipulation outweigh the utility of text recognition.
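Abstract (reference translation): Typographic attacks exploit multi-modal systems by injecting text into images, leading to targeted misclassification, malicious content generation, and even vision-language model jailbreaks. In this work, we analyze how CLIP vision encoders behave under typographic attacks and locate specialized attention heads, in the latter half of the model's layers, that causally extract and transmit typographic information to the cls token. Building on these findings, we propose a method that defends CLIP models against typographic attacks by selectively ablating a typographic circuit made up of attention heads. Without any finetuning, the method improves performance on a typographic variant of ImageNet-100 by up to 19.6% while lowering standard ImageNet-100 accuracy by less than 1%. Notably, this training-free approach remains competitive with current state-of-the-art typographic defenses that rely on finetuning. To this end, we release a family of dyslexic CLIP models that are significantly more robust against typographic attacks. These models serve as suitable drop-in replacements in a broad range of safety-critical applications where the risk of text-based manipulation outweighs the utility of text recognition.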
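The head-ablation idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say how heads are ablated, so zero-ablation is assumed here, and the head indices, the two-dimensional "CLS space", and the function names `ablate_heads` and `cls_readout` are all hypothetical. The sketch relies only on the fact that a transformer's attention output is a sum of per-head contributions, so removing a head amounts to dropping its term from that sum.

```python
from typing import Dict, List, Set, Tuple

Head = Tuple[int, int]  # (layer index, head index); the indices below are made up


def ablate_heads(contribs: Dict[Head, List[float]], circuit: Set[Head]) -> Dict[Head, List[float]]:
    """Zero-ablate: replace the contribution of every head in `circuit` with zeros."""
    return {
        head: [0.0] * len(vec) if head in circuit else list(vec)
        for head, vec in contribs.items()
    }


def cls_readout(contribs: Dict[Head, List[float]]) -> List[float]:
    """Sum the per-head contributions written to the CLS position."""
    dim = len(next(iter(contribs.values())))
    total = [0.0] * dim
    for vec in contribs.values():
        for i, x in enumerate(vec):
            total[i] += x
    return total


# Toy 2-d CLS space (an assumption for illustration):
# axis 0 = visual-object evidence, axis 1 = written-text evidence.
contribs: Dict[Head, List[float]] = {
    (10, 3): [0.5, 0.0],   # hypothetical head carrying object information
    (11, 7): [0.0, 2.0],   # hypothetical head carrying typographic information
    (11, 2): [1.0, 0.25],  # hypothetical mixed head, left untouched
}
typographic_circuit: Set[Head] = {(11, 7)}

before = cls_readout(contribs)
after = cls_readout(ablate_heads(contribs, typographic_circuit))
print(before)  # [1.5, 2.25]
print(after)   # [1.5, 0.25]
```

In this toy setting the object evidence is unchanged while most of the text evidence disappears, which mirrors the trade-off the abstract reports: a large gain under typographic attack for a small loss in standard accuracy. In a real CLIP model the per-head contributions would come from the vision encoder's attention layers rather than from hand-written vectors.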


Related papers