Fugu-MT Paper Translation (Abstract): Involuntary Jailbreak

Paper abstract: Involuntary Jailbreak

arxiv url: http://arxiv.org/abs/2508.13246v1
Date: Mon, 18 Aug 2025 10:38:30 GMT
Status: Translation complete
Last updated in system: 2025-08-20 15:36:31.685667
Title: Involuntary Jailbreak
Title (reference translation): Involuntary Jailbreak
Authors: Yangyang Guo, Yangyan Li, Mohan Kankanhalli
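Abstract summary: We present a new vulnerability in large language models (LLMs), which we term involuntary jailbreak. Unlike existing jailbreak attacks, this weakness does not involve a specific attack objective, such as generating instructions for building a bomb. We instruct LLMs to generate several questions that would typically be rejected, along with their corresponding in-depth responses. Notably, this simple prompt strategy consistently jailbreaks the majority of leading LLMs, including Claude Opus 4.1, Grok 4, Gemini 2.5 Pro, and GPT 4.1.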
Reference score (independently computed attention measure): 11.078631999104907
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In this study, we disclose a worrying new vulnerability in Large Language Models (LLMs), which we term involuntary jailbreak. Unlike existing jailbreak attacks, this weakness is distinct in that it does not involve a specific attack objective, such as generating instructions for building a bomb. Prior attack methods predominantly target localized components of the LLM guardrail. In contrast, involuntary jailbreaks may potentially compromise the entire guardrail structure, which our method reveals to be surprisingly fragile. We merely employ a single universal prompt to achieve this goal. In particular, we instruct LLMs to generate several questions that would typically be rejected, along with their corresponding in-depth responses (rather than a refusal). Remarkably, this simple prompt strategy consistently jailbreaks the majority of leading LLMs, including Claude Opus 4.1, Grok 4, Gemini 2.5 Pro, and GPT 4.1. We hope this problem can motivate researchers and practitioners to re-evaluate the robustness of LLM guardrails and contribute to stronger safety alignment in future.
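Abstract (reference translation): In this study, we disclose a new vulnerability in Large Language Models (LLMs), which we term involuntary jailbreak. Unlike existing jailbreak attacks, this weakness does not involve a specific attack objective, such as generating instructions for building a bomb. Prior attack methods mainly targeted localized components of the LLM guardrail. In contrast, involuntary jailbreaks may compromise the entire guardrail structure. We use only a single universal prompt to achieve this goal. In particular, we instruct LLMs to generate several questions that would typically be rejected, along with their corresponding in-depth responses. Notably, this simple prompt strategy consistently jailbreaks the majority of leading LLMs, including Claude Opus 4.1, Grok 4, Gemini 2.5 Pro, and GPT 4.1. We hope this problem motivates researchers and practitioners to re-evaluate the robustness of LLM guardrails and contribute to stronger safety alignment in the future.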

Related papers