Fugu-MT paper translation (abstract): Depth Charge: Jailbreak Large Language Models from Deep Safety Attention Heads

Paper summary: Depth Charge: Jailbreak Large Language Models from Deep Safety Attention Heads

arxiv url: http://arxiv.org/abs/2603.05772v1
Date: Fri, 06 Mar 2026 00:13:48 GMT
Status: Translation complete
Last updated in system: 2026-03-09 13:17:44.774047
Title: Depth Charge: Jailbreak Large Language Models from Deep Safety Attention Heads
Title (reference translation): Jailbreaking Large Language Models from Deep Safety Attention Heads
Authors: Jinman Wu, Yi Xie, Shiqian Zhao, Xiaofeng Chen
Abstract summary: We propose Safety Attention Head Attack (SAHA), an attention-head-level jailbreak framework.
Reference score (site's own attention metric): 6.934057947128395
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Currently, open-source large language models (OSLLMs) have demonstrated remarkable generative performance. However, because their structure and weights are public, they remain exposed to jailbreak attacks even after alignment. Existing attacks operate primarily at shallow levels, such as the prompt or embedding level, and often fail to expose vulnerabilities rooted in deeper model components, which creates a false sense of security about how well defenses work. In this paper, we propose Safety Attention Head Attack (SAHA), an attention-head-level jailbreak framework that explores the vulnerability in deeper but insufficiently aligned attention heads. SAHA contains two novel designs. First, we reveal that deeper attention layers introduce more vulnerability to jailbreak attacks. Based on this finding, SAHA introduces the Ablation-Impact Ranking head selection strategy to effectively locate the most vital layer for unsafe output. Second, we introduce a boundary-aware perturbation method, i.e., Layer-Wise Perturbation, to probe the generation of unsafe content with minimal perturbation to the attention. This constrained perturbation guarantees higher semantic relevance to the target intent while ensuring evasion. Extensive experiments show the superiority of our method: SAHA improves attack success rate (ASR) by 14% over SOTA baselines, revealing the vulnerability of the attack surface on the attention head. Our code is available at https://anonymous.4open.science/r/SAHA.
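Abstract (reference translation): Open-source large language models (OSLLMs) currently show remarkable generative performance. However, because their structure and weights are public, they remain exposed to jailbreak attacks even after alignment. Existing attacks operate mainly at shallow levels, such as the prompt or embedding level, and often fail to expose vulnerabilities rooted in deeper model components. In this paper, we propose Safety Attention Head Attack (SAHA), an attention-head-level jailbreak framework that investigates vulnerabilities in deeper but insufficiently aligned attention heads. SAHA has two novel designs. First, we show that deeper attention layers introduce more vulnerability to jailbreak attacks. Based on this finding, SAHA introduces the Ablation-Impact Ranking head selection strategy to effectively locate the layer most important for unsafe output. Second, we introduce a boundary-aware perturbation method, i.e., Layer-Wise Perturbation, to probe the generation of unsafe content with minimal perturbation to the attention. This constrained perturbation guarantees higher semantic relevance to the target intent while ensuring evasion. SAHA improves ASR by 14% over SOTA baselines, revealing the vulnerability of the attack surface on the attention head. Our code is available at https://anonymous.4open.science/r/SAHA.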

Related papers