Fugu-MT Paper Translation (Abstract): Unveiling the Fragility of Vision-Language Models: Multi-Modal Adversarial Synergy via Texture-Constrained Perturbations and Cross-Modal Optimization

Paper overview: Unveiling the Fragility of Vision-Language Models: Multi-Modal Adversarial Synergy via Texture-Constrained Perturbations and Cross-Modal Optimization

arxiv url: http://arxiv.org/abs/2605.26501v1
Date: Tue, 26 May 2026 03:28:28 GMT
Status: Translation complete
Last updated in system: 2026-05-27 17:51:41.603354
Title: Unveiling the Fragility of Vision-Language Models: Multi-Modal Adversarial Synergy via Texture-Constrained Perturbations and Cross-Modal Optimization
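Title (reference translation): Unveiling the Fragility of Vision-Language Models: Multi-Modal Adversarial Synergy via Texture-Constrained Perturbations and Cross-Modal Optimization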
Authors: Xiang Fang, Wanlong Fang, Changshuo Wang
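Abstract summary: We introduce Multi-Modal Adversarial Synergy (MMAS), a framework for universal, black-box multi-modal attacks against LVLMs. MMAS simultaneously generates a universal adversarial perturbation for images and a learnable prompt perturbation for text. Our experiments show the strong universal adversarial capability of the attack against LVLMs.
Reference score (independently computed attention score): 15.851694572297612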
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Vision-Language Models (LVLMs) have transformed multi-modal understanding, excelling in tasks like image captioning and visual question answering by integrating visual and textual inputs. However, their robustness against adversarial attacks, particularly those exploiting both modalities, remains underexplored, posing risks to critical applications like autonomous driving and content moderation. Existing attacks focus on single modalities or require impractical white-box access, limiting their real-world relevance. In this paper, we introduce Multi-Modal Adversarial Synergy (MMAS), a groundbreaking framework that crafts universal, black-box multi-modal attacks against LVLMs. MMAS simultaneously generates a texture scale-constrained universal adversarial perturbation for images and a learnable prompt perturbation for text, optimized jointly using only model queries. The image perturbation leverages wavelet-based texture constraints to ensure imperceptibility and robustness across diverse visual inputs. The text perturbation, constrained by an L-norm in the embedding space, maintains semantic coherence while steering outputs toward a target. A novel cross-modal regularization term aligns the perturbations' gradient directions, enhancing their synergistic impact and transferability across tasks and models. Extensive experiments show the strong universal adversarial capabilities of our proposed attack against prevalent LVLMs.
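Abstract (reference translation): Large Vision-Language Models (LVLMs) have transformed multi-modal understanding by integrating visual and textual inputs, excelling at tasks such as image captioning and visual question answering. However, their robustness to adversarial attacks, especially attacks that exploit both modalities, remains underexplored, which poses risks to critical applications such as autonomous driving and content moderation. Existing attacks either focus on a single modality or require impractical white-box access, which limits their real-world relevance. This paper introduces Multi-Modal Adversarial Synergy (MMAS), a groundbreaking framework for universal, black-box multi-modal attacks against LVLMs. MMAS simultaneously generates a texture scale-constrained universal adversarial perturbation for images and a learnable prompt perturbation for text, jointly optimized using only model queries. The image perturbation uses wavelet-based texture constraints to ensure imperceptibility and robustness across diverse visual inputs. The text perturbation, constrained by an L-norm in the embedding space, maintains semantic coherence while steering outputs toward a target. A new cross-modal regularization term aligns the gradient directions of the perturbations, increasing their synergistic effect and their transferability across tasks and models. Extensive experiments show the strong universal adversarial capability of the proposed attack against prevalent LVLMs.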
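
The abstract names three mechanisms without giving formulas: optimization from model queries alone, a norm bound on the text perturbation, and alignment of the two perturbations' gradient directions. The following is a minimal sketch of how such pieces might fit together, not the authors' implementation. Everything in it is an assumption made for illustration: `query_loss` is a hypothetical toy stand-in for a black-box model score, gradients are estimated with two-point random finite differences, the unspecified L-norm is taken to be L2, alignment is measured with cosine similarity, and both perturbations are given the same dimension so their gradients can be compared. The wavelet-based texture constraint is omitted, and the alignment value is only reported as a diagnostic rather than fed back into the objective.

```python
import math
import random


def query_loss(delta_img, delta_txt):
    """Hypothetical black-box score; lower means closer to the attack target."""
    target_img = [0.5, -0.2, 0.1]  # toy values, not taken from the paper
    target_txt = [0.3, -0.1, 0.2]
    loss_img = sum((d - t) ** 2 for d, t in zip(delta_img, target_img))
    loss_txt = sum((d - t) ** 2 for d, t in zip(delta_txt, target_txt))
    return loss_img + loss_txt


def estimate_grad(f, x, rng, sigma=1e-3, samples=200):
    """Two-point random finite-difference gradient estimate (queries only)."""
    grad = [0.0] * len(x)
    for _ in range(samples):
        u = [rng.gauss(0.0, 1.0) for _ in x]
        plus = f([xi + sigma * ui for xi, ui in zip(x, u)])
        minus = f([xi - sigma * ui for xi, ui in zip(x, u)])
        scale = (plus - minus) / (2.0 * sigma * samples)
        grad = [g + scale * ui for g, ui in zip(grad, u)]
    return grad


def project_l2(x, radius):
    """Project x onto the L2 ball of the given radius (assumed norm choice)."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm <= radius:
        return list(x)
    return [v * radius / norm for v in x]


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    na = math.sqrt(sum(v * v for v in a))
    nb = math.sqrt(sum(v * v for v in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def attack_step(delta_img, delta_txt, rng, lr=0.1, radius=0.2):
    """One joint update of both perturbations from loss queries alone."""
    g_img = estimate_grad(lambda d: query_loss(d, delta_txt), delta_img, rng)
    g_txt = estimate_grad(lambda d: query_loss(delta_img, d), delta_txt, rng)
    alignment = cosine(g_img, g_txt)  # diagnostic only in this sketch
    new_img = [d - lr * g for d, g in zip(delta_img, g_img)]
    new_txt = project_l2([d - lr * g for d, g in zip(delta_txt, g_txt)], radius)
    return new_img, new_txt, alignment


def run_attack(steps, rng, radius=0.2):
    """Run several joint steps starting from zero perturbations."""
    delta_img, delta_txt, alignment = [0.0] * 3, [0.0] * 3, 0.0
    for _ in range(steps):
        delta_img, delta_txt, alignment = attack_step(
            delta_img, delta_txt, rng, radius=radius
        )
    return delta_img, delta_txt, alignment


if __name__ == "__main__":
    img, txt, align = run_attack(steps=20, rng=random.Random(0))
    print("final loss:", query_loss(img, txt))
    print("text perturbation norm:", math.sqrt(sum(v * v for v in txt)))
    print("last gradient alignment:", align)
```

In this toy setting the projection keeps the text perturbation inside the assumed radius at every step, while the image perturbation is left unconstrained because the texture constraint is not modeled. Turning the cosine term into an actual regularizer would require adding it to the queried objective, which the abstract does not describe in enough detail to reproduce here.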


Related papers