Fugu-MT Paper Translation (Summary): Jailbreaking Multimodal Large Language Models using Multi-Clip Video

Paper summary: Jailbreaking Multimodal Large Language Models using Multi-Clip Video

arxiv url: http://arxiv.org/abs/2606.02111v1
Date: Mon, 01 Jun 2026 11:43:53 GMT
Status: Translation complete
System update date: 2026-06-02 21:34:31.894916
Title: Jailbreaking Multimodal Large Language Models using Multi-Clip Video
Authors: Choongwon Kang, Seungjong Sun, Hyunmin Jun, Jang Hyun Kim
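Abstract summary: We introduce Multi-Clip Video (MCV) SafetyBench, a dataset of 2,920 videos, to evaluate how the diversity of video inputs affects the vulnerability of MLLMs. Experiments on eight representative video MLLMs show that attack success consistently increases with the number of clips. The results indicate that the video modality is (1) more vulnerable than the image modality, (2) more vulnerable to dynamic videos than to static videos, and (3) more vulnerable when videos contain more diverse contexts.
Reference score (independently computed attention score): 2.1309627532459037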
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: As multimodal large language models (MLLMs) have advanced to process video inputs, concerns have emerged about their potential for malicious misuse. Prior jailbreak studies have shown that safety alignment in MLLMs can be bypassed through visual inputs, yet it remains unclear which properties of video inputs induce this vulnerability. To address this gap, we introduce Multi-Clip Video (MCV) SafetyBench, a dataset of 2,920 videos designed to evaluate how the diversity of video inputs affects the vulnerability of MLLMs. Each video consists of multiple short clips depicting diverse contexts related to a harmful query. Experiments on eight representative video MLLMs show that attack success consistently increases with the number of clips. Our results further indicate that the video modality is (1) more vulnerable than the image modality, (2) more vulnerable to dynamic videos than to static videos, and (3) more vulnerable when videos contain more diverse contexts. Building on these findings, we propose a defense strategy that leverages the relative robustness of the image modality.
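The abstract reports that attack success increases with the number of clips. As a minimal sketch of how such a trend could be tabulated, the snippet below groups evaluation outcomes by clip count and computes the fraction of successful attacks in each group. The function name, the record format, and the toy records are all assumptions made for illustration; they are not the paper's evaluation code, its judging procedure, or its results.

```python
from collections import defaultdict


def attack_success_rate_by_clip_count(records):
    """Return {clip_count: fraction of successful attacks}.

    `records` is an iterable of (clip_count, attack_succeeded) pairs.
    This record format is a hypothetical choice for illustration.
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for clip_count, succeeded in records:
        totals[clip_count] += 1
        if succeeded:
            successes[clip_count] += 1
    # Sort keys so the trend over clip counts is easy to read.
    return {k: successes[k] / totals[k] for k in sorted(totals)}


# Made-up toy records, NOT data or results from the paper.
toy_records = [
    (1, False), (1, False),
    (2, True), (2, False),
    (4, True), (4, True),
]
rates = attack_success_rate_by_clip_count(toy_records)
print(rates)
```

How an individual response is judged as a successful attack is left outside this sketch, since the abstract does not specify it.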

Related papers