Fugu-MT Paper Translation (Abstract): Understanding the Dilemma of Unlearning for Large Language Models

Paper abstract: Understanding the Dilemma of Unlearning for Large Language Models

arxiv url: http://arxiv.org/abs/2509.24675v1
Date: Mon, 29 Sep 2025 12:15:19 GMT
Status: Translation complete
System update date: 2025-09-30 22:32:19.96541
Title: Understanding the Dilemma of Unlearning for Large Language Models
Authors: Qingjie Zhang, Haoting Qian, Zhicong Huang, Cheng Hong, Minlie Huang, Ke Xu, Chao Zhang, Han Qiu
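Abstract summary: Unlearning seeks to remove specific knowledge from large language models (LLMs). The proposed unPact is an interpretable framework for unlearning via prompt attribution and contribution tracking.
Reference score (independently computed attention score): 50.54260066313032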
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Unlearning seeks to remove specific knowledge from large language models (LLMs), but its effectiveness remains contested. On one side, "forgotten" knowledge can often be recovered through interventions such as light fine-tuning; on the other side, unlearning may induce catastrophic forgetting that degrades general capabilities. Despite active exploration of unlearning methods, interpretability analyses of the mechanism are scarce due to the difficulty of tracing knowledge in LLMs' complex architectures. We address this gap by proposing unPact, an interpretable framework for unlearning via prompt attribution and contribution tracking. Specifically, it quantifies each prompt token's influence on outputs, enabling pre- and post-unlearning comparisons to reveal what changes. Across six mainstream unlearning methods, three LLMs, and three benchmarks, we find that: (1) Unlearning appears to be effective by disrupting focus on keywords in the prompt; (2) Much of the knowledge is not truly erased and can be recovered by simply emphasizing these keywords in prompts, without modifying the model's weights; (3) Catastrophic forgetting arises from indiscriminate penalization of all tokens. Taken together, our results suggest an unlearning dilemma: existing methods tend to be either insufficient (knowledge remains recoverable by keyword emphasis) or overly destructive (general performance collapses due to catastrophic forgetting), leaving a gap to reliable unlearning.
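
The abstract does not say how unPact computes its attributions, so the following is only an illustrative sketch of the general idea of comparing per-token influence before and after unlearning. It assumes a simple leave-one-out (occlusion) scheme, which may differ from the paper's method, and uses toy scoring functions as stand-ins for a real model. All names here (leave_one_out_attribution, make_toy_scorer, the example prompt and weights) are hypothetical.

```python
from typing import Callable, Dict, List, Sequence

# A scorer maps a tokenized prompt to a scalar, e.g. the model's confidence
# in the target answer. Here it is a toy stand-in, not a real LLM.
Scorer = Callable[[Sequence[str]], float]


def leave_one_out_attribution(tokens: Sequence[str], score: Scorer) -> List[float]:
    """Contribution of each token = drop in score when that token is removed."""
    base = score(tokens)
    return [
        base - score(list(tokens[:i]) + list(tokens[i + 1:]))
        for i in range(len(tokens))
    ]


def make_toy_scorer(weights: Dict[str, float]) -> Scorer:
    """Hypothetical model stand-in: the score is the sum of per-token weights."""
    def score(tokens: Sequence[str]) -> float:
        return sum(weights.get(t, 0.0) for t in tokens)
    return score


prompt = ["Who", "wrote", "Hamlet", "?"]

# Assumed toy behaviour: after unlearning, the keyword "Hamlet" matters less.
before = make_toy_scorer({"wrote": 1.0, "Hamlet": 2.0})
after = make_toy_scorer({"wrote": 1.0, "Hamlet": 0.25})

attr_before = leave_one_out_attribution(prompt, before)
attr_after = leave_one_out_attribution(prompt, after)

# Per-token change in contribution; a large negative value marks a token
# whose influence was disrupted by unlearning.
shift = [a - b for a, b in zip(attr_after, attr_before)]

for tok, s in zip(prompt, shift):
    print(f"{tok!r}: {s:+.2f}")
```

In this toy setup only the keyword's contribution changes, which mirrors finding (1) in spirit. With a real model, the scorer would wrap a forward pass and the attribution method would need to be chosen to match the paper.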
