Fugu-MT Paper Translation (Abstract): Distinguishable Deletion: Unifying Knowledge Erasure and Refusal for Large Language Model Unlearning

Paper overview: Distinguishable Deletion: Unifying Knowledge Erasure and Refusal for Large Language Model Unlearning

arxiv url: http://arxiv.org/abs/2605.16776v1
Date: Sat, 16 May 2026 03:15:35 GMT
Status: Translation complete
Last updated in system: 2026-05-19 17:57:47.017024
Title: Distinguishable Deletion: Unifying Knowledge Erasure and Refusal for Large Language Model Unlearning
Title (reference translation): Distinguishable Deletion: Unifying Knowledge Erasure and Refusal for Large Language Model Unlearning
Authors: Puning Yang, Junchi Yu, Qizhou Wang, Philip Torr, Bo Han, Xiuying Chen
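Abstract summary: Distinguishable Deletion ($\mathrm{D^2}$) is a paradigm that restricts the response distribution in the latent representation rather than specific tokens. The paper introduces an energy index that quantifies the presence of knowledge and the separation between unlearned and retained content. Experiments show that EUA significantly outperforms previous methods, indicating the superiority of $\mathrm{D^2}$.
Reference score (independently computed attention score): 58.725080160369494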
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Mitigating sensitive and harmful outputs is fundamental to ensuring safe deployment of LLMs. Existing approaches typically follow two paradigms: Knowledge Deletion (KD), which erases undesirable information during training, and Distinguishable Refusal (DR), which steers models away from using sensitive knowledge during inference. Despite rapid progress, KD-based unlearning struggles with biased deletion due to suppressing specific token sequences as a substitute for complete knowledge removal, whereas DR-based unlearning risks the re-emergence of harmful knowledge because the underlying knowledge remains intact. To address these issues, we propose Distinguishable Deletion ($\mathrm{D^2}$), a paradigm that restricts the response distribution in the latent representation rather than specific tokens to erase undesirable knowledge, while distinguishing it from retained knowledge, enabling a refusal mechanism to handle unlearned inputs safely and coherently. To implement $\mathrm{D^2}$, we introduce an energy index that quantifies the presence of knowledge and the separation between unlearned and retained content. Mathematical and empirical analyses show that energy is both accurate and efficient, enabling Energy-based Unlearning Alignment (EUA) to enforce energy-boundary unlearning during training and apply an energy-based refusal mechanism at inference. Extensive experiments demonstrate that EUA significantly outperforms previous methods, indicating the superiority of $\mathrm{D^2}$. Our code is available at https://github.com/Puning97/EUA-for-LLM-Unlearning.
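Abstract (reference translation): Mitigating sensitive and harmful outputs is fundamental to the safe deployment of LLMs. Existing approaches generally follow one of two paradigms: Knowledge Deletion (KD), which erases undesirable information during training, and Distinguishable Refusal (DR), which steers the model away from using sensitive knowledge during inference. Despite rapid progress, KD-based unlearning suffers from biased deletion because it suppresses specific token sequences as a substitute for complete knowledge removal, while DR-based unlearning risks the re-emergence of harmful knowledge because the underlying knowledge remains intact. To address these issues, we propose Distinguishable Deletion ($\mathrm{D^2}$), a paradigm that restricts the response distribution in the latent representation rather than specific tokens. To implement $\mathrm{D^2}$, we introduce an energy index that quantifies the presence of knowledge and the separation between unlearned and retained content. Mathematical and empirical analyses show that energy is both accurate and efficient, which lets Energy-based Unlearning Alignment (EUA) enforce energy-boundary unlearning during training and apply an energy-based refusal mechanism at inference. Extensive experiments show that EUA significantly outperforms previous methods, indicating the superiority of $\mathrm{D^2}$. Our code is available at https://github.com/Puning97/EUA-for-LLM-Unlearning.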
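
The abstract does not define the energy index, so the following is only a minimal sketch of the general idea of an energy-based refusal rule, not the authors' EUA implementation. It assumes the free-energy score commonly used in energy-based out-of-distribution detection, E = -T * logsumexp(logits / T), and it assumes that low energy signals retained knowledge while high energy signals unlearned content. The names `energy_score`, `respond_or_refuse`, `threshold`, and the refusal string are hypothetical.

```python
import math


def energy_score(logits, temperature=1.0):
    """Free-energy score -T * logsumexp(logits / T).

    Assumption: this common definition stands in for the paper's
    energy index, which the abstract does not specify.
    """
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return -temperature * log_sum_exp


def respond_or_refuse(logits, threshold, answer,
                      refusal="I can't help with that."):
    """Return `answer` when energy is below `threshold`, else refuse.

    Assumption: energy below the threshold means the input falls on the
    retained side of the boundary; the threshold value is hypothetical.
    """
    if energy_score(logits) < threshold:
        return answer
    return refusal


# Peaked logits give low energy; flat logits give high energy.
confident_logits = [10.0, 0.0, 0.0]
flat_logits = [0.0, 0.0, 0.0]
print(respond_or_refuse(confident_logits, threshold=-5.0, answer="Paris"))
print(respond_or_refuse(flat_logits, threshold=-5.0, answer="Paris"))
```

In this toy setting the peaked logits fall below the threshold and are answered, while the flat logits sit above it and trigger the refusal. In practice the threshold would have to be calibrated on held-out retained and unlearned data.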


Related papers