Fugu-MT paper translation (abstract): Laundering AI Authority with Adversarial Examples

Paper overview: Laundering AI Authority with Adversarial Examples

arxiv url: http://arxiv.org/abs/2605.04261v1
Date: Tue, 05 May 2026 19:55:28 GMT
Status: Translation complete
Last updated in system: 2026-05-07 18:41:07.525995
Title: Laundering AI Authority with Adversarial Examples
Title (reference translation): Laundering AI Authority with Adversarial Examples
Authors: Jie Zhang, Pura Peetathawatchai, Florian Tramèr, Avital Shafran
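Abstract summary: Vision-language models (VLMs) are increasingly deployed as trusted authorities. We show that adversarial examples break the assumption that these systems perceive the same visual content as their users, enabling AI authority laundering. Our attacks do not compromise model alignment.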
Reference score (independently computed attention score): 32.761654180537434
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision-language models (VLMs) are increasingly deployed as trusted authorities: fact-checking images on social media, comparing products, and moderating content. Users implicitly trust that these systems perceive the same visual content as they do. We show that adversarial examples break this assumption, enabling AI authority laundering: an attacker subtly perturbs an image so that the VLM produces confident and authoritative responses about the wrong input. Unlike jailbreaks or prompt injections, our attacks do not compromise model alignment; the attack operates entirely at the perceptual level. We demonstrate that standard attacks against publicly available CLIP models transfer reliably to production VLMs, including GPT-5.4, Claude Opus 4.6, Gemini 3, and Grok 4.2. Across four attack surfaces, we show that authority laundering can amplify misinformation, disparage individuals, evade content moderation, and manipulate product recommendations. Our attacks have high success rates: in hundreds of attacks targeting identity manipulation and NSFW evasion, we measure success rates of 22-100% across six models. No novel attack algorithm is required: basic techniques known for over a decade suffice, establishing a lower bound on attacker capability that should concern defenders. Our results demonstrate that visual adversarial robustness is now a practical, and still largely unsolved, safety problem.
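Abstract (reference translation): Vision-language models (VLMs) are increasingly deployed as trusted authorities, for example fact-checking on social media, comparing products, and moderating content. Users implicitly trust that these systems perceive the same visual content as they do. An attacker subtly perturbs an image so that the VLM produces confident and authoritative responses about the wrong input. Unlike jailbreaks or prompt injections, our attacks do not compromise model alignment. We show that standard attacks against publicly available CLIP models transfer reliably to production VLMs, including GPT-5.4, Claude Opus 4.6, Gemini 3, and Grok 4.2. Across four attack surfaces, we show that it is possible to amplify misinformation, disparage individuals, evade content moderation, and manipulate product recommendations. In hundreds of attacks targeting identity manipulation and NSFW evasion, we measure success rates of 22-100% across six models. No novel attack algorithm is required: basic techniques known for over a decade suffice, establishing a lower bound on attacker capability that should concern defenders. Our results show that visual adversarial robustness is now a practical, and still largely unsolved, safety problem.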
