Fugu-MT Paper Translation (Summary): SafeLLM: Unlearning Harmful Outputs from Large Language Models against Jailbreak Attacks

Paper summary: SafeLLM: Unlearning Harmful Outputs from Large Language Models against Jailbreak Attacks

arxiv url: http://arxiv.org/abs/2508.15182v1
Date: Thu, 21 Aug 2025 02:39:14 GMT
Status: Translation complete
Last updated in system: 2025-08-22 16:26:46.150783
Title: SafeLLM: Unlearning Harmful Outputs from Large Language Models against Jailbreak Attacks
Authors: Xiangman Li, Xiaodong Wu, Qi Li, Jianbing Ni, Rongxing Lu
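Abstract summary: Jailbreak attacks pose a serious threat to the safety of large language models. We propose SafeLLM, a novel unlearning-based defense framework. We show that SafeLLM substantially reduces attack success rates while maintaining high general-purpose performance.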
Reference score (independently computed attention score): 29.963044242980345
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Jailbreak attacks pose a serious threat to the safety of Large Language Models (LLMs) by crafting adversarial prompts that bypass alignment mechanisms, causing the models to produce harmful, restricted, or biased content. In this paper, we propose SafeLLM, a novel unlearning-based defense framework that unlearns harmful knowledge from LLMs while preserving linguistic fluency and general capabilities. SafeLLM employs a three-stage pipeline: (1) dynamic unsafe output detection using a hybrid approach that integrates external classifiers with model-internal evaluations; (2) token-level harmful content tracing through feedforward network (FFN) activations to localize harmful knowledge; and (3) constrained optimization to suppress unsafe behavior without degrading overall model quality. SafeLLM achieves targeted and irreversible forgetting by identifying and neutralizing FFN substructures responsible for harmful generation pathways. Extensive experiments on prominent LLMs (Vicuna, LLaMA, and GPT-J) across multiple jailbreak benchmarks show that SafeLLM substantially reduces attack success rates while maintaining high general-purpose performance. Compared to standard defense methods such as supervised fine-tuning and direct preference optimization, SafeLLM offers stronger safety guarantees, more precise control over harmful behavior, and greater robustness to unseen attacks. Moreover, SafeLLM maintains general performance after the harmful knowledge is unlearned. These results highlight unlearning as a promising direction for scalable and effective LLM safety.
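The three-stage pipeline described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify how the classifier and model-internal scores are combined, how FFN activations are attributed, or what the constrained optimization looks like. The sketch below assumes a weighted average for stage 1, a mean activation gap between harmful and benign tokens for stage 2, and, as a plainly simpler stand-in for constrained optimization, hard zeroing of the selected neurons' output weights for stage 3. All function names, parameters, and data are hypothetical.

```python
from statistics import mean


def is_unsafe(classifier_score, self_eval_score, weight=0.5, threshold=0.5):
    """Stage 1 (toy): blend an external classifier score with a model-internal score.

    The weighted average and the threshold are assumptions, not from the paper.
    """
    combined = weight * classifier_score + (1.0 - weight) * self_eval_score
    return combined >= threshold


def trace_neurons(harmful_acts, benign_acts, top_k):
    """Stage 2 (toy): rank FFN neurons by mean activation gap (harmful minus benign).

    Each argument is a list of per-token activation rows (tokens x neurons).
    """
    num_neurons = len(harmful_acts[0])
    gaps = []
    for j in range(num_neurons):
        gap = mean(row[j] for row in harmful_acts) - mean(row[j] for row in benign_acts)
        gaps.append((gap, j))
    gaps.sort(reverse=True)  # largest gap first
    return sorted(j for _, j in gaps[:top_k])


def suppress(w_out, neurons):
    """Stage 3 (stand-in): zero the output weights of the selected neurons.

    The paper uses constrained optimization; hard zeroing is only a placeholder.
    """
    blocked = set(neurons)
    return [[0.0] * len(row) if j in blocked else list(row)
            for j, row in enumerate(w_out)]


# Hypothetical data: rows are tokens, columns are FFN neurons.
harmful_acts = [[0.1, 2.0, 0.2, 1.5], [0.0, 1.8, 0.1, 1.7]]
benign_acts = [[0.1, 0.2, 0.3, 0.1], [0.2, 0.1, 0.2, 0.2]]
# Hypothetical FFN output projection: one row per neuron.
w_out = [[0.5, -0.5], [1.0, 1.0], [0.3, 0.2], [-1.0, 0.4]]

flagged = is_unsafe(classifier_score=0.9, self_eval_score=0.6)
if flagged:
    neurons = trace_neurons(harmful_acts, benign_acts, top_k=2)
    edited = suppress(w_out, neurons)
    print("suppressed neurons:", neurons)
```

In a real model the activations would come from hooks on the FFN layers, and the edit would need a constraint that keeps behavior on benign inputs close to the original, which is the part this sketch leaves out.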
