Paper summary: Failures to Find Transferable Image Jailbreaks Between Vision-Language Models
- arxiv url: http://arxiv.org/abs/2407.15211v2
- Date: Mon, 16 Dec 2024 01:20:42 GMT
- Status: Translation complete
- Last updated in system: 2024-12-17 13:53:34.684280
- Title: Failures to Find Transferable Image Jailbreaks Between Vision-Language Models
- Title (reference translation): Failure to find transferable image jailbreaks between vision-language models
- Authors: Rylan Schaeffer, Dan Valentine, Luke Bailey, James Chua, Cristóbal Eyzaguirre, Zane Durante, Joe Benton, Brando Miranda, Henry Sleight, John Hughes, Rajashree Agrawal, Mrinank Sharma, Scott Emmons, Sanmi Koyejo, Ethan Perez
- Abstract summary: We focus on a popular class of vision-language models (VLMs) that generate text outputs conditioned on visual and textual inputs.
- Reference score (independently computed attention score): 20.385314634225978
- License:
- Abstract: The integration of new modalities into frontier AI systems offers exciting capabilities, but also increases the possibility such systems can be adversarially manipulated in undesirable ways. In this work, we focus on a popular class of vision-language models (VLMs) that generate text outputs conditioned on visual and textual inputs. We conducted a large-scale empirical study to assess the transferability of gradient-based universal image "jailbreaks" using a diverse set of over 40 open-parameter VLMs, including 18 new VLMs that we publicly release. Overall, we find that transferable gradient-based image jailbreaks are extremely difficult to obtain. When an image jailbreak is optimized against a single VLM or against an ensemble of VLMs, the jailbreak successfully jailbreaks the attacked VLM(s), but exhibits little-to-no transfer to any other VLMs; transfer is not affected by whether the attacked and target VLMs possess matching vision backbones or language models, whether the language model underwent instruction-following and/or safety-alignment training, or many other factors. Only two settings display partially successful transfer: between identically-pretrained and identically-initialized VLMs with slightly different VLM training data, and between different training checkpoints of a single VLM. Leveraging these results, we then demonstrate that transfer can be significantly improved against a specific target VLM by attacking larger ensembles of "highly-similar" VLMs. These results stand in stark contrast to existing evidence of universal and transferable text jailbreaks against language models and transferable adversarial attacks against image classifiers, suggesting that VLMs may be more robust to gradient-based transfer attacks.
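- Abstract (reference translation): Integrating new modalities into frontier AI systems offers exciting capabilities, but also increases the possibility that such systems can be adversarially manipulated in undesirable ways.
In this work, we focus on vision-language models (VLMs) that generate text outputs conditioned on visual and textual inputs.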
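The experimental setup described in the abstract (optimize one image against a single model or an ensemble, then check whether it also works on a model that was not attacked) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' code: the "models" are random linear scorers rather than VLMs, the loss is a squared error against a hypothetical target score, and names such as `optimize_image`, `attacked`, and `held_out` are invented here.

```python
import random


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def model_loss(weights, target, image):
    """Squared error between a toy linear model's score and the attacker's target score."""
    return (dot(weights, image) - target) ** 2


def ensemble_loss(models, image):
    """Average loss of one shared image over every model in the ensemble."""
    return sum(model_loss(w, t, image) for w, t in models) / len(models)


def ensemble_grad(models, image):
    """Gradient of ensemble_loss with respect to the image pixels."""
    grad = [0.0] * len(image)
    for weights, target in models:
        residual = dot(weights, image) - target
        for i, w_i in enumerate(weights):
            grad[i] += 2.0 * residual * w_i / len(models)
    return grad


def optimize_image(models, image, steps=300, lr=0.02):
    """Projected gradient descent: step downhill, then clip pixels back into [0, 1]."""
    image = list(image)
    for _ in range(steps):
        grad = ensemble_grad(models, image)
        image = [min(1.0, max(0.0, p - lr * g)) for p, g in zip(image, grad)]
    return image


rng = random.Random(0)  # fixed seed so the sketch is reproducible
num_pixels = 16


def random_model():
    """Hypothetical stand-in for a model: (weight vector, attacker's target score)."""
    weights = [rng.uniform(-1.0, 1.0) for _ in range(num_pixels)]
    return weights, rng.uniform(-1.0, 1.0)


attacked = [random_model() for _ in range(3)]  # ensemble the image is optimized against
held_out = random_model()                      # model never seen during optimization

start = [0.5] * num_pixels                     # plain gray starting image
adv = optimize_image(attacked, start)

loss_before = ensemble_loss(attacked, start)
loss_after = ensemble_loss(attacked, adv)
transfer_loss = model_loss(held_out[0], held_out[1], adv)

print("loss on attacked ensemble before:", loss_before)
print("loss on attacked ensemble after: ", loss_after)
print("loss on held-out model:          ", transfer_loss)
```

Comparing `loss_after` with `transfer_loss` mirrors the question the paper asks: a low loss on the attacked ensemble says nothing by itself about the held-out model. The toy result should not be read as evidence for or against the paper's findings.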
- IDEATOR: Jailbreaking Large Vision-Language Models Using Themselves [67.30731020715496]
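We propose IDEATOR, a novel jailbreak method that automatically generates malicious image-text pairs for black-box jailbreak attacks.
It jailbreaks MiniGPT-4 with an average of 5.34 queries and achieves high success rates of 82%, 88%, and 75% when transferred to LLaVA, InstructBLIP, and Meta's Chameleon, respectively.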
Paper, reference translation (metadata) (2024-10-29T07:15:56Z)
- AnyAttack: Towards Large-scale Self-supervised Generation of Targeted Adversarial Examples for Vision-Language Models [41.044385916368455]
Vision-Language Models (VLMs) are vulnerable to image-based adversarial attacks.
Paper, reference translation (metadata) (2024-10-07T09:45:18Z)
- Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt [60.54666043358946]
Paper, reference translation (metadata) (2024-06-06T13:00:42Z)
- White-box Multimodal Jailbreaks Against Large Vision-Language Models [61.97578116584653]
Paper, reference translation (metadata) (2024-05-28T07:13:30Z)
- Learning To See But Forgetting To Follow: Visual Instruction Tuning Makes LLMs More Prone To Jailbreak Attacks [41.213482317141356]
Paper, reference translation (metadata) (2024-05-07T15:29:48Z)
- Jailbreaking Attack against Multimodal Large Language Model [69.52466793164618]
We propose a search method for the image Jailbreaking Prompt (imgJP).
The proposed approach shows strong model transferability, since the generated imgJP can be transferred to jailbreak other models.
Paper, reference translation (metadata) (2024-02-04T01:29:24Z)
- Universal and Transferable Adversarial Attacks on Aligned Language Models [118.41733208825278]
Paper, reference translation (metadata) (2023-07-27T17:49:12Z)
- Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Models [52.530286579915284]
This paper proposes the highly transferable Set-level Guidance Attack (SGA).
Paper, reference translation (metadata) (2023-07-26T09:19:21Z)