Fugu-MT Paper Translation (Abstract): Explainable LLM Unlearning Through Reasoning

Paper summary: Explainable LLM Unlearning Through Reasoning

arxiv url: http://arxiv.org/abs/2603.09980v1
Date: Sun, 08 Feb 2026 06:33:07 GMT
Status: Translation complete
Last updated in system: 2026-03-15 16:38:22.518687
Title: Explainable LLM Unlearning Through Reasoning
Authors: Junfeng Liao, Qizhou Wang, Shanshan Ye, Xin Yu, Ling Chen, Zhen Fang,
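Abstract summary: The authors argue that the known failures of gradient-ascent-based unlearning (degraded general capabilities, incomplete knowledge removal, incoherent responses) stem from the absence of explicit guidance on what and how models should unlearn. They propose a reasoning-based unlearning target that satisfies both the specified unlearning scope and the specified post-unlearning response, and report that it achieves more reliable unlearning while preserving general capabilities.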
Reference score (independently computed attention score): 23.128471655470793
License: http://creativecommons.org/licenses/by/4.0/
Abstract: LLM unlearning is essential for mitigating safety, copyright, and privacy concerns in pre-trained large language models (LLMs). Compared to preference alignment, it offers a more explicit approach: removing undesirable knowledge characterized by specific unlearning datasets. In previous work, gradient ascent (GA) and its variants have shown promise for implementing unlearning, yet their untargeted nature results in unintended degradation of general capabilities, incomplete removal of knowledge, and the generation of incoherent responses, among other problems. We argue that these issues stem from the absence of explicit guidance on what and how models should unlearn. To fill this gap, we introduce a novel unlearning target, the reasoning-based unlearning target, which satisfies both the specified unlearning scope and the specified post-unlearning response. Building on this, we propose targeted reasoning unlearning (TRU), which uses the reasoning-based unlearning target as guidance. We apply the target through a cross-entropy supervised loss combined with a GA-based loss, enabling the model to learn the reasoning ability needed for precise knowledge removal while preserving unrelated abilities. We evaluate TRU against strong baselines across multiple benchmarks and LLM backbones, and find that it achieves more reliable unlearning while preserving general capabilities. Moreover, TRU exhibits superior robustness under diverse attack scenarios, which stems from the reasoning ability learned through reasoning-based targets. Overall, our study establishes reasoning-augmented unlearning as a practical paradigm for reliable and explainable LLM unlearning.
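
The abstract says only that TRU combines a cross-entropy supervised loss with a GA-based loss; it does not give the exact formula or weighting. The sketch below shows one plausible way such an objective could be assembled and is not the authors' implementation. It assumes the two terms are simply added, with the GA term written as the negated cross-entropy on the forget data. The function names and the `ga_weight` parameter are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vector of logits."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]


def cross_entropy(logit_rows, target_ids):
    """Mean negative log-likelihood of the target token at each position."""
    if not target_ids or len(logit_rows) != len(target_ids):
        raise ValueError("need one non-empty row of logits per target token")
    total = 0.0
    for logits, target in zip(logit_rows, target_ids):
        total -= log_softmax(logits)[target]
    return total / len(target_ids)


def combined_unlearning_loss(target_logits, target_ids,
                             forget_logits, forget_ids,
                             ga_weight=1.0):
    """Assumed objective: supervised CE plus a weighted gradient-ascent term.

    target_*: model logits and token ids for the reasoning-based target
              response (the text the model should produce after unlearning).
    forget_*: model logits and token ids for the original forget-set answer.
    ga_weight: hypothetical trade-off coefficient, not taken from the paper.
    """
    # Supervised term: minimizing it pulls the model toward the target.
    supervised = cross_entropy(target_logits, target_ids)
    # Gradient ascent on the forget data equals minimizing its negated CE.
    gradient_ascent = -cross_entropy(forget_logits, forget_ids)
    return supervised + ga_weight * gradient_ascent
```

In a real training loop the logits would come from the model being unlearned and the loss would be minimized with an automatic-differentiation framework. Because the negated cross-entropy has no lower bound, GA-style terms are commonly kept in check with a small weight or early stopping; whether TRU does so is not stated in the abstract.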
