Fugu-MT Paper Translation (Summary): Self-Destructive Language Model

Paper summary: Self-Destructive Language Model

arxiv url: http://arxiv.org/abs/2505.12186v1
Date: Sun, 18 May 2025 01:08:18 GMT
Status: Translation complete
Last updated in system: 2025-05-20 14:57:11.079532
Title: Self-Destructive Language Model
Title (Japanese reference translation): 自己破壊型言語モデル
Authors: Yuhui Wang, Rongyi Zhu, Ting Wang
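Abstract summary: Harmful fine-tuning attacks pose a major threat to the security of large language models (LLMs). This paper introduces SEAM, an alignment-enhancing defense that transforms LLMs into self-destructive models.
Reference score (attention score computed independently by this site): 13.808746955144771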
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Harmful fine-tuning attacks pose a major threat to the security of large language models (LLMs), allowing adversaries to compromise safety guardrails with minimal harmful data. While existing defenses attempt to reinforce LLM alignment, they fail to address models' inherent "trainability" on harmful data, leaving them vulnerable to stronger attacks with increased learning rates or larger harmful datasets. To overcome this critical limitation, we introduce SEAM, a novel alignment-enhancing defense that transforms LLMs into self-destructive models with intrinsic resilience to misalignment attempts. Specifically, these models retain their capabilities for legitimate tasks while exhibiting substantial performance degradation when fine-tuned on harmful data. The protection is achieved through a novel loss function that couples the optimization trajectories of benign and harmful data, enhanced with adversarial gradient ascent to amplify the self-destructive effect. To enable practical training, we develop an efficient Hessian-free gradient estimate with theoretical error bounds. Extensive evaluation across LLMs and datasets demonstrates that SEAM creates a no-win situation for adversaries: the self-destructive models achieve state-of-the-art robustness against low-intensity attacks and undergo catastrophic performance collapse under high-intensity attacks, rendering them effectively unusable. (warning: this paper contains potentially harmful content generated by LLMs.)
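The abstract mentions an efficient Hessian-free gradient estimate but does not give its form. As general background only, and not as the paper's estimator, one common way to avoid forming a Hessian is to approximate a Hessian-vector product by a central difference of gradients, Hv ≈ (∇f(θ + εv) − ∇f(θ − εv)) / (2ε). The sketch below checks this on a toy quadratic loss; the loss, the function names, and the step size `eps` are all assumptions made for illustration.

```python
# Illustrative sketch only: a generic finite-difference (Hessian-free)
# Hessian-vector product. The toy quadratic loss and every name below are
# assumptions for illustration, not the SEAM loss or the paper's estimator.

def grad_quadratic(A, b, theta):
    """Gradient of f(theta) = 0.5 * theta^T A theta - b^T theta (A symmetric)."""
    n = len(theta)
    return [sum(A[i][j] * theta[j] for j in range(n)) - b[i] for i in range(n)]


def hvp_finite_difference(grad_fn, theta, v, eps=1e-3):
    """Approximate H(theta) @ v using two gradient evaluations, no Hessian."""
    plus = grad_fn([t + eps * vi for t, vi in zip(theta, v)])
    minus = grad_fn([t - eps * vi for t, vi in zip(theta, v)])
    return [(p - m) / (2.0 * eps) for p, m in zip(plus, minus)]


A = [[2.0, 0.5], [0.5, 1.0]]   # Hessian of the toy quadratic
b = [1.0, -1.0]
theta = [0.3, -0.2]            # point at which the product is evaluated
v = [1.0, 2.0]                 # direction vector

approx = hvp_finite_difference(lambda th: grad_quadratic(A, b, th), theta, v)

# For a quadratic the Hessian is exactly A, so A @ v is the reference value.
exact = [sum(A[i][j] * v[j] for j in range(2)) for i in range(2)]
```

For a quadratic the central difference is exact up to floating-point rounding, which makes it a convenient sanity check. For a general loss the approximation error depends on `eps` and on higher-order derivatives.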

Related papers