Fugu-MT Paper Translation (Abstract): ShallowJail: Steering Jailbreaks against Large Language Models

Paper summary: ShallowJail: Steering Jailbreaks against Large Language Models

arxiv url: http://arxiv.org/abs/2602.07107v1
Date: Fri, 06 Feb 2026 18:35:38 GMT
Status: Translation complete
Last updated in system: 2026-02-10 20:26:24.45347
Title: ShallowJail: Steering Jailbreaks against Large Language Models
Title (reference translation): ShallowJail: Steering Jailbreaks against Large Language Models
Authors: Shang Liu, Hanyu Pei, Zeyan Liu
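Abstract summary: We introduce ShallowJail, a novel attack that exploits shallow alignment in LLMs. ShallowJail can misguide LLMs' responses by manipulating the initial tokens during inference. Extensive experiments demonstrate the effectiveness of ShallowJail, which substantially degrades the safety of state-of-the-art LLM responses.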
Reference score (independently computed attention metric): 7.9152592631238425
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Large Language Models (LLMs) have been successful in numerous fields. Alignment has usually been applied to prevent them from being used for harmful purposes. However, aligned LLMs remain vulnerable to jailbreak attacks that deliberately mislead them into producing harmful outputs. Existing jailbreaks are either black-box, using carefully crafted, unstealthy prompts, or white-box, requiring resource-intensive computation. In light of these challenges, we introduce ShallowJail, a novel attack that exploits shallow alignment in LLMs. ShallowJail can misguide LLMs' responses by manipulating the initial tokens during inference. Through extensive experiments, we demonstrate the effectiveness of ShallowJail, which substantially degrades the safety of state-of-the-art LLM responses.
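Abstract (reference translation): Large language models (LLMs) have succeeded in many fields. Alignment is usually applied to keep them from being used for harmful purposes. However, aligned LLMs remain vulnerable to jailbreak attacks that deliberately mislead them into producing harmful outputs. Existing jailbreaks are either black-box, relying on carefully crafted but conspicuous prompts, or white-box, requiring resource-intensive computation. In light of these challenges, we introduce ShallowJail, a novel attack that exploits shallow alignment in LLMs. ShallowJail can misguide LLMs' responses by manipulating the initial tokens during inference. Extensive experiments demonstrate the effectiveness of ShallowJail, which substantially degrades the safety of state-of-the-art LLM responses.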
