Fugu-MT Paper Translation (Abstract): VisualDAN: Exposing Vulnerabilities in VLMs with Visual-Driven DAN Commands

Paper overview: VisualDAN: Exposing Vulnerabilities in VLMs with Visual-Driven DAN Commands

arxiv url: http://arxiv.org/abs/2510.09699v1
Date: Thu, 09 Oct 2025 16:18:31 GMT
Status: Translation complete
Last updated in system: 2025-10-14 18:06:29.569061
Title: VisualDAN: Exposing Vulnerabilities in VLMs with Visual-Driven DAN Commands
Title (reference translation): VisualDAN: Exposing VLM Vulnerabilities Through Visual-Driven DAN Commands
Authors: Aofan Liu, Lulu Tang
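Abstract summary: This work introduces VisualDAN, a single adversarial image embedded with DAN-style commands. We prepend affirmative prefixes to harmful corpora to trick the model into responding positively to malicious queries. The results show that even a small amount of toxic content can significantly amplify harmful outputs once the model's defenses are compromised.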
Reference score (independently computed attention metric): 5.1114671756882535
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Vision-Language Models (VLMs) have garnered significant attention for their remarkable ability to interpret and generate multimodal content. However, securing these models against jailbreak attacks continues to be a substantial challenge. Unlike text-only models, VLMs integrate additional modalities, introducing novel vulnerabilities such as image hijacking, which can manipulate the model into producing inappropriate or harmful responses. Drawing inspiration from text-based jailbreaks like the "Do Anything Now" (DAN) command, this work introduces VisualDAN, a single adversarial image embedded with DAN-style commands. Specifically, we prepend harmful corpora with affirmative prefixes (e.g., "Sure, I can provide the guidance you need") to trick the model into responding positively to malicious queries. The adversarial image is then trained on these DAN-inspired harmful texts and transformed into the text domain to elicit malicious outputs. Extensive experiments on models such as MiniGPT-4, MiniGPT-v2, InstructBLIP, and LLaVA reveal that VisualDAN effectively bypasses the safeguards of aligned VLMs, forcing them to execute a broad range of harmful instructions that severely violate ethical standards. Our results further demonstrate that even a small amount of toxic content can significantly amplify harmful outputs once the model's defenses are compromised. These findings highlight the urgent need for robust defenses against image-based attacks and offer critical insights for future research into the alignment and security of VLMs.
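Abstract (reference translation): Vision-Language Models (VLMs) have attracted significant attention for their ability to interpret and generate multimodal content. However, securing these models against jailbreak attacks remains a major challenge. Unlike text-only models, VLMs integrate additional modalities, which introduces new vulnerabilities such as image hijacking. Drawing inspiration from text-based jailbreaks such as the "Do Anything Now" (DAN) command, this work introduces VisualDAN, a single adversarial image embedded with DAN-style commands. Specifically, we prepend affirmative prefixes (e.g., "Sure, I can provide the guidance you need") to harmful corpora to trick the model into responding positively to malicious queries. The adversarial image is trained on these DAN-inspired harmful texts and transformed into the text domain to elicit malicious outputs. Extensive experiments on models such as MiniGPT-4, MiniGPT-v2, InstructBLIP, and LLaVA show that VisualDAN effectively bypasses the safeguards of aligned VLMs, forcing them to execute a broad range of harmful instructions that severely violate ethical standards. The results further show that even a small amount of toxic content can significantly amplify harmful outputs once the model's defenses are compromised. These findings highlight the need for robust defenses against image-based attacks and offer important insights for future research on the alignment and security of VLMs.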
