Fugu-MT paper translation (abstract): Reading Between the Pixels: Linking Text-Image Embedding Alignment to Typographic Attack Success on Vision-Language Models

Paper summary: Reading Between the Pixels: Linking Text-Image Embedding Alignment to Typographic Attack Success on Vision-Language Models

arxiv url: http://arxiv.org/abs/2604.12371v1
Date: Tue, 14 Apr 2026 06:59:27 GMT
Status: translation complete
Last updated in system: 2026-04-15 19:11:32.302761
Title: Reading Between the Pixels: Linking Text-Image Embedding Alignment to Typographic Attack Success on Vision-Language Models
Title (reference translation): Reading Between the Pixels: Linking Text-Image Embedding Alignment to Typographic Attack Success in Vision-Language Models
Authors: Ravikumar Balakrishnan, Sanket Mendapara, Ankit Garg,
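Abstract summary: This paper studies typographic prompt injection attacks on vision-language models (VLMs). In practice, the attack surface is heterogeneous: adversarial text appears at varying font sizes and under diverse visual conditions.
Reference score (attention score computed by this site): 4.577407934990345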
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We study typographic prompt injection attacks on vision-language models (VLMs), where adversarial text is rendered as images to bypass safety mechanisms, posing a growing threat as VLMs serve as the perceptual backbone of autonomous agents, from browser automation and computer-use systems to camera-equipped embodied agents. In practice, the attack surface is heterogeneous: adversarial text appears at varying font sizes and under diverse visual conditions, while the growing ecosystem of VLMs exhibits substantial variation in vulnerability, complicating defensive approaches. Evaluating 1,000 prompts from SALAD-Bench across four VLMs, namely, GPT-4o, Claude Sonnet 4.5, Mistral-Large-3, and Qwen3-VL-4B-Instruct under varying font sizes (6--28px) and visual transformations (rotation, blur, noise, contrast changes), we find: (1) font size significantly affects attack success rate (ASR), with very small fonts (6px) yielding near-zero ASR while mid-range fonts achieve peak effectiveness; (2) text attacks are more effective than image attacks for GPT-4o (36% vs 8%) and Claude (47% vs 22%), while Qwen3-VL and Mistral show comparable ASR across modalities; (3) text-image embedding distance from two multimodal embedding models (JinaCLIP and Qwen3-VL-Embedding) shows strong negative correlation with ASR across all four models (r = -0.71 to -0.93, p < 0.01); (4) heavy degradations increase embedding distance by 10--12% and reduce ASR by 34--96%, while rotation asymmetrically affects models (Mistral drops 50%, GPT-4o unchanged). These findings highlight that model-specific robustness patterns preclude one-size-fits-all defenses and offer empirical guidance for practitioners selecting VLM backbones for agentic systems operating in adversarial environments.
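Abstract (reference translation): This paper studies typographic prompt injection attacks on vision-language models (VLMs), in which adversarial text is rendered as images to bypass safety mechanisms. The threat grows as VLMs become the perceptual backbone of autonomous agents, from browser automation and computer-use systems to camera-equipped embodied agents. In practice, the attack surface is heterogeneous: adversarial text appears at varying font sizes and under diverse visual conditions, while the growing VLM ecosystem varies substantially in vulnerability, which complicates defensive approaches. Evaluating 1,000 SALAD-Bench prompts on four VLMs (GPT-4o, Claude Sonnet 4.5, Mistral-Large-3, and Qwen3-VL-4B-Instruct) under varying font sizes (6--28px) and visual transformations (rotation, blur, noise, contrast changes), the authors find: (1) font size significantly affects attack success rate (ASR), with very small fonts (6px) yielding near-zero ASR while mid-range fonts achieve peak effectiveness; (2) text attacks are more effective than image attacks for GPT-4o (36% vs 8%) and Claude (47% vs 22%), while Qwen3-VL and Mistral show comparable ASR across modalities; (3) text-image embedding distance from two multimodal embedding models (JinaCLIP and Qwen3-VL-Embedding) shows a strong negative correlation with ASR across all four models (r = -0.71 to -0.93, p < 0.01); (4) heavy degradations increase embedding distance by 10--12% and reduce ASR by 34--96%, while rotation affects models asymmetrically (Mistral drops 50%, GPT-4o is unchanged). These results indicate that model-specific robustness patterns preclude one-size-fits-all defenses, and they provide empirical guidance for practitioners selecting VLM backbones for agentic systems operating in adversarial environments.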
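The relationship in finding (3) can be illustrated with a minimal sketch. It is not the authors' code: the abstract does not state which distance metric or correlation coefficient was used, so cosine distance and Pearson's r are assumptions here, and the function names, the two-dimensional "embeddings", and the per-condition outcomes are all hypothetical toy values chosen only to make the example run.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def attack_success_rate(outcomes):
    """Fraction of prompts for which the attack succeeded (1 = success)."""
    return sum(outcomes) / len(outcomes)

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data, NOT results from the paper.
# text_embedding stands in for the embedding of the prompt as plain text;
# each condition pairs an embedding of the rendered image with
# per-prompt attack outcomes under that visual condition.
text_embedding = [1.0, 0.0]
conditions = {
    "clean": ([1.0, 0.1], [1, 1, 1, 0]),
    "mild_blur": ([1.0, 0.5], [1, 1, 0, 0]),
    "heavy_blur": ([1.0, 2.0], [1, 0, 0, 0]),
}

distances = []
asrs = []
for name, (image_embedding, outcomes) in conditions.items():
    distances.append(cosine_distance(text_embedding, image_embedding))
    asrs.append(attack_success_rate(outcomes))

r = pearson_r(distances, asrs)
print(f"Pearson r between embedding distance and ASR: {r:.2f}")
```

With real data, the embeddings would come from a multimodal embedding model and the outcomes from a judge over model responses. In this toy setup the correlation is negative by construction, because the outcomes were chosen to fall as distance grows.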

Related papers

VisualLeakBench: Auditing the Fragility of Large Vision-Language Models against PII Leakage and Social Engineering [14.756677328512907]
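VisualLeakBench is an evaluation suite for auditing LVLMs against OCR injection and contextual PII leakage. It uses 1,000 synthetic adversarial images covering 8 PII types, with validation on 50 real-world images. The authors release the dataset and code for reproducible robustness and safety evaluation of deployment-relevant vision-language systems.
Paper reference translation (metadata) (2026-03-11T05:47:24Z)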
VLM-RobustBench: A Comprehensive Benchmark for Robustness of Vision-Language Models [64.56065206447788]
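Vision-language models (VLMs) achieve strong performance on standard, high-quality datasets. VLM-RobustBench is a benchmark spanning 49 augmentation types across noise, blur, weather, digital, and geometric perturbations. Low-severity spatial perturbations often degrade performance more than visually more severe photometric corruptions.
Paper reference translation (metadata) (2026-03-06T10:58:02Z)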
VERA-V: Variational Inference Framework for Jailbreaking Vision-Language Models [19.867040067010674]
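This paper introduces VERA-V, a variational inference framework that recasts multimodal jailbreak discovery as learning a joint posterior distribution over paired text-image prompts. The authors train a lightweight attacker that approximates the posterior, enabling efficient sampling of diverse jailbreaks. In experiments on the HarmBench and HADES benchmarks, VERA-V consistently outperforms state-of-the-art baselines.
Paper reference translation (metadata) (2025-10-20T17:12:10Z)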
On Robustness of Vision-Language-Action Model against Multi-Modal Perturbations [52.1029745126386]
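For vision-language-action (VLA) models, robustness to real-world perturbations is essential for deployment. This paper proposes RobustVLA, which is robust to perturbations of VLA inputs and outputs. In experiments on LIBERO, RobustVLA outperforms the baselines by 12.6% with the pi0 backbone and by 10.4% with the OpenVLA backbone.
Paper reference translation (metadata) (2025-09-26T14:42:23Z)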
Invisible Injections: Exploiting Vision-Language Models Through Steganographic Prompt Embedding [0.0]
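Vision-language models (VLMs) have revolutionized multimodal AI applications but have introduced new, largely unexplored security vulnerabilities. This paper presents the first comprehensive study of steganographic prompt injection attacks against VLMs. The proposed method shows that current VLM architectures can inadvertently extract and execute hidden prompts during normal image processing.
Paper reference translation (metadata) (2025-07-30T00:34:20Z)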
REVEAL: Multi-turn Evaluation of Image-Input Harms for Vision LLM [0.098314893665023]
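This paper introduces the REVEAL Framework, a scalable, automated pipeline for evaluating image-input harms in vision large language models (VLLMs). Five VLLMs (GPT-4o, Llama-3.2, Qwen2-VL, Phi3.5V, and Pixtral) were evaluated across three key harm categories: sexual harm, violence, and misinformation. GPT-4o showed the most balanced performance as measured by the authors' Safety-Usability Index (SUI), followed closely by Pixtral.
Paper reference translation (metadata) (2025-05-07T10:09:55Z)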
Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models [92.79804303337522]
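Vision-language models (VLMs) are vulnerable to safety alignment issues. This paper introduces MLAI, a novel jailbreak framework that leverages scenario-aware image generation for semantic alignment. Extensive experiments demonstrate MLAI's significant impact, with attack success rates of 77.75% on MiniGPT-4 and 82.80% on LLaVA-2.
Paper reference translation (metadata) (2024-11-27T02:40:29Z)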
Safeguarding Vision-Language Models Against Patched Visual Prompt Injectors [31.383591942592467]
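Vision-language models (VLMs) offer an innovative way to combine visual and textual data to enhance understanding and interaction. Patch-based adversarial attacks are considered the most realistic threat model in physical vision applications. This work introduces SmoothVLM, a defense mechanism rooted in smoothing techniques, to protect VLMs from the threat of patched visual prompt injectors.
Paper reference translation (metadata) (2024-05-17T04:19:19Z)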

The related papers list is generated automatically from the titles and abstracts of papers on this site.

The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any results arising from its use.