Fugu-MT Paper Translation (Abstract): Sparse-Autoencoder-Guided Internal Representation Unlearning for Large Language Models

Paper overview: Sparse-Autoencoder-Guided Internal Representation Unlearning for Large Language Models

arxiv url: http://arxiv.org/abs/2509.15631v1
Date: Fri, 19 Sep 2025 05:48:12 GMT
Status: Translation complete
Last updated in system: 2025-09-22 18:18:11.016085
Title: Sparse-Autoencoder-Guided Internal Representation Unlearning for Large Language Models
Title (reference translation): Sparse-Autoencoder-Guided Internal Representation Unlearning for Large Language Models
Authors: Tomoya Yamashita, Akira Ito, Yuuki Yamanaka, Masanori Yamada, Takayuki Miura, Toshiki Shibahara
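Abstract summary: The paper proposes a novel unlearning method that intervenes in the model's internal activations. By aligning the target's internal activation with those of unknown entities, the method shifts the model's recognition of the target entity from "known" to "unknown". It effectively reduces the model's recall of target knowledge in question-answering tasks without significant damage to non-target knowledge.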
Reference score (independently computed attention score): 8.590330924532092
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As large language models (LLMs) are increasingly deployed across various applications, privacy and copyright concerns have heightened the need for more effective LLM unlearning techniques. Many existing unlearning methods aim to suppress undesirable outputs through additional training (e.g., gradient ascent), which reduces the probability of generating such outputs. While such suppression-based approaches can control model outputs, they may not eliminate the underlying knowledge embedded in the model's internal activations; muting a response is not the same as forgetting it. Moreover, such suppression-based methods often suffer from model collapse. To address these issues, we propose a novel unlearning method that directly intervenes in the model's internal activations. In our formulation, forgetting is defined as a state in which the activation of a forgotten target is indistinguishable from that of "unknown" entities. Our method introduces an unlearning objective that modifies the activation of the target entity away from those of known entities and toward those of unknown entities in a sparse autoencoder latent space. By aligning the target's internal activation with those of unknown entities, we shift the model's recognition of the target entity from "known" to "unknown", achieving genuine forgetting while avoiding over-suppression and model collapse. Empirically, we show that our method effectively aligns the internal activations of the forgotten target, a result that the suppression-based approaches do not reliably achieve. Additionally, our method effectively reduces the model's recall of target knowledge in question-answering tasks without significant damage to the non-target knowledge.
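Abstract (reference translation): As large language models (LLMs) are deployed across a growing range of applications, privacy and copyright concerns have increased the need for more effective LLM unlearning techniques. Many existing unlearning methods aim to suppress undesirable outputs through additional training (for example, gradient ascent), reducing the probability that such outputs are generated. Suppression-based approaches of this kind can control model outputs, but they may not eliminate the underlying knowledge embedded in the model's internal activations; muting a response is not the same as forgetting it. In addition, suppression-based methods often suffer from model collapse. To address these problems, the authors propose a novel unlearning method that intervenes directly in the model's internal activations. In their formulation, forgetting is defined as a state in which the activation of the forgotten target cannot be distinguished from that of "unknown" entities. The method introduces an unlearning objective that, in the latent space of a sparse autoencoder, moves the activation of the target entity away from those of known entities and toward those of unknown entities. By aligning the target's internal activation with those of unknown entities, the model's recognition of the target entity shifts from "known" to "unknown", achieving genuine forgetting while avoiding over-suppression and model collapse. Experiments show that the proposed method effectively aligns the internal activations of the forgotten target, which suppression-based approaches do not reliably achieve. The method also effectively reduces the model's recall of target knowledge in question-answering tasks without significant damage to non-target knowledge.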
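
The abstract describes the unlearning objective only in words and does not state its exact form. The following is a minimal illustrative sketch, not the authors' implementation. It assumes a ReLU sparse-autoencoder encoder and a cosine-similarity loss against the mean latents of known and unknown entities. The function names (`sae_encode`, `cosine`, `alignment_loss`), the centroid-based formulation, and the toy weights are all hypothetical choices made for illustration.

```python
import numpy as np


def sae_encode(h, W_enc, b_enc):
    """Hypothetical ReLU SAE encoder: activations -> non-negative latents."""
    return np.maximum(h @ W_enc + b_enc, 0.0)


def cosine(a, b, eps=1e-8):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


def alignment_loss(z_target, z_known, z_unknown):
    """Assumed objective: low when the target latent resembles the mean
    latent of unknown entities and differs from that of known entities."""
    mu_known = z_known.mean(axis=0)
    mu_unknown = z_unknown.mean(axis=0)
    return cosine(z_target, mu_known) - cosine(z_target, mu_unknown)


# Toy example with an identity encoder (illustrative values only).
W_enc, b_enc = np.eye(2), np.zeros(2)
z_known = sae_encode(np.array([[1.0, 0.0], [2.0, -1.0]]), W_enc, b_enc)
z_unknown = sae_encode(np.array([[0.0, 1.0], [-1.0, 3.0]]), W_enc, b_enc)

# A target that currently looks "known", and one that looks "unknown".
loss_before = alignment_loss(sae_encode(np.array([1.0, 0.0]), W_enc, b_enc),
                             z_known, z_unknown)
loss_after = alignment_loss(sae_encode(np.array([0.0, 1.0]), W_enc, b_enc),
                            z_known, z_unknown)
```

In this toy setup the loss is higher for the target whose latent matches the known entities than for the one whose latent matches the unknown entities, which is the direction an optimizer would follow under this assumed objective. A cosine form is used here only because it is bounded; the paper may use a different distance or weighting.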

List of related papers