Fugu-MT Paper Translation (Summary): FC-Attack: Jailbreaking Large Vision-Language Models via Auto-Generated Flowcharts

Paper summary: FC-Attack: Jailbreaking Large Vision-Language Models via Auto-Generated Flowcharts

arxiv url: http://arxiv.org/abs/2502.21059v1
Date: Fri, 28 Feb 2025 13:59:11 GMT
Status: Translation complete
Last updated in system: 2025-03-03 16:38:45.928542
Title: FC-Attack: Jailbreaking Large Vision-Language Models via Auto-Generated Flowcharts
Title (reference translation): FC-Attack: Jailbreaking Large Vision-Language Models via Auto-Generated Flowcharts
Authors: Ziyi Zhang, Zhen Sun, Zongmin Zhang, Jihui Guo, Xinlei He
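Abstract summary: This paper proposes FC-Attack, a jailbreak attack method based on auto-generated flowcharts. FC-Attack achieves attack success rates above 90% on Gemini-1.5, LLaVA-NeXT, Qwen2-VL, and InternVL-2.5. To mitigate the attack, the authors explore several defenses and find that AdaShield largely reduces jailbreak performance, but at the cost of a drop in utility.
Reference score (independently computed attention score): 20.323340637767327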
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Vision-Language Models (LVLMs) have become powerful and are widely adopted in some practical applications. However, recent research has revealed their vulnerability to multimodal jailbreak attacks, whereby the model can be induced to generate harmful content, leading to safety risks. Although most LVLMs have undergone safety alignment, recent research shows that the visual modality is still vulnerable to jailbreak attacks. In our work, we discover that by using flowcharts with partially harmful information, LVLMs can be induced to provide additional harmful details. Based on this, we propose a jailbreak attack method based on auto-generated flowcharts, FC-Attack. Specifically, FC-Attack first fine-tunes a pre-trained LLM to create a step-description generator based on benign datasets. The generator is then used to produce step descriptions corresponding to a harmful query, which are transformed into flowcharts in 3 different shapes (vertical, horizontal, and S-shaped) as visual prompts. These flowcharts are then combined with a benign textual prompt to execute a jailbreak attack on LVLMs. Our evaluations using the AdvBench dataset show that FC-Attack achieves over 90% attack success rates on Gemini-1.5, LLaVA-NeXT, Qwen2-VL, and InternVL-2.5 models, outperforming existing LVLM jailbreak methods. Additionally, we investigate factors affecting the attack performance, including the number of steps and the font styles in the flowcharts. Our evaluation shows that FC-Attack can improve the jailbreak performance from 4% to 28% on Claude-3.5 by changing the font style. To mitigate the attack, we explore several defenses and find that AdaShield can largely reduce the jailbreak performance, but at the cost of a utility drop.
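Abstract (reference translation): Large Vision-Language Models (LVLMs) are powerful and widely adopted in some practical applications. However, recent research has revealed their vulnerability to multimodal jailbreak attacks, in which a model can be induced to generate harmful content, creating safety risks. Although most LVLMs have undergone safety alignment, recent research shows that the visual modality remains vulnerable to jailbreak attacks. This work finds that flowcharts containing partially harmful information can induce LVLMs to supply additional harmful details. Based on this finding, the paper proposes FC-Attack, a jailbreak attack method based on auto-generated flowcharts. Specifically, FC-Attack first fine-tunes a pre-trained LLM on benign datasets to create a step-description generator. The generator is then used to produce step descriptions corresponding to a harmful query, which are converted into flowcharts in three different shapes (vertical, horizontal, and S-shaped) that serve as visual prompts. These flowcharts are combined with a benign textual prompt to execute a jailbreak attack on LVLMs. Evaluations on the AdvBench dataset show that FC-Attack achieves attack success rates above 90% on the Gemini-1.5, LLaVA-NeXT, Qwen2-VL, and InternVL-2.5 models, outperforming existing LVLM jailbreak methods. The paper also examines factors that affect attack performance, including the number of steps and the font styles in the flowcharts. The evaluation shows that changing the font style raises FC-Attack's jailbreak performance on Claude-3.5 from 4% to 28%. To mitigate the attack, several defenses are explored; AdaShield largely reduces jailbreak performance, but at the cost of a drop in utility.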


Related papers