Fugu-MT paper translation (abstract): Understanding the Effects of Safety Unalignment on Large Language Models

Paper overview: Understanding the Effects of Safety Unalignment on Large Language Models

arxiv url: http://arxiv.org/abs/2604.02574v1
Date: Thu, 02 Apr 2026 23:09:43 GMT
Status: Translation complete
Last updated in system: 2026-04-06 17:20:24.240773
Title: Understanding the Effects of Safety Unalignment on Large Language Models
Authors: John T. Halloran
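Abstract summary: This work studies the impact of unaligning six LLMs of various sizes across malicious and benign tasks. In contrast to JT, the majority of WO-unaligned models are less prone to hallucinations, better retain their original natural-language performance, and are more effective at state-of-the-art adversarial and cyber attacks.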
Reference score (independently computed attention score): 0.5076419064097732
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Safety alignment has become a critical step to ensure LLMs refuse harmful requests while providing helpful and harmless responses. However, despite the ubiquity of safety alignment for deployed frontier models, two separate lines of recent work--jailbreak-tuning (JT) and weight orthogonalization (WO)--have shown that safety guardrails may be largely disabled, resulting in LLMs which comply with harmful requests they would normally refuse. In spite of far-reaching safety implications, analysis has largely been limited to refusal rates of each unalignment method in isolation, leaving their relative effects on adversarial LLM capabilities unknown. To fill this gap, we study the impact of unaligning six popular LLMs of various sizes across a large number of malicious and benign tasks, using both JT and WO. Across the evaluated models, we show that while refusal degradation is split between the two methods, WO produces LLMs far more capable of aiding in malicious activity; in contrast to JT, the majority of WO unaligned models are far less prone to hallucinations, better retain their original natural-language performance, and are more effective at state-of-the-art adversarial and cyber attacks. To thus help mitigate the malicious risks of WO unalignment, we conclude by showing that supervised fine-tuning effectively limits the adversarial attack abilities enabled by WO, without drastically affecting hallucination rates or natural language performance.


Related papers