Fugu-MT paper translation (abstract): The Erasure Illusion: Stress-Testing the Generalization of LLM Forgetting Evaluation

Paper overview: The Erasure Illusion: Stress-Testing the Generalization of LLM Forgetting Evaluation

arxiv url: http://arxiv.org/abs/2512.19025v1
Date: Mon, 22 Dec 2025 04:42:41 GMT
Status: translation complete
Last updated in system: 2025-12-23 18:54:32.620223
Title: The Erasure Illusion: Stress-Testing the Generalization of LLM Forgetting Evaluation
Title (reference translation): The Erasure Illusion: Stress-Testing the Generalization of LLM Forgetting Evaluation
Authors: Hengrui Jia, Taoran Li, Jonas Guan, Varun Chandrasekaran
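Abstract summary: Machine unlearning aims to remove the influence of specific data from trained models. Current unlearning metrics measure success by monitoring the model's performance degradation on the specific unlearning dataset. This paper proposes an automated stress-testing framework that generates a surrogate dataset, $\tilde{D}_u$.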
Reference score (independently computed attention score): 15.252787015786796
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Machine unlearning aims to remove specific data influences from trained models, a capability essential for adhering to copyright laws and ensuring AI safety. Current unlearning metrics typically measure success by monitoring the model's performance degradation on the specific unlearning dataset ($D_u$). We argue that for Large Language Models (LLMs), this evaluation paradigm is insufficient and potentially misleading. Many real-world uses of unlearning--motivated by copyright or safety--implicitly target not only verbatim content in $D_u$, but also behaviors influenced by the broader generalizations the model derived from it. We demonstrate that LLMs can pass standard unlearning evaluation and appear to have ``forgotten'' the target knowledge, while simultaneously retaining strong capabilities on content that is semantically adjacent to $D_u$. This phenomenon indicates that erasing exact sentences does not necessarily equate to removing the underlying knowledge. To address this gap, we propose \name, an automated stress-testing framework that generates a surrogate dataset, $\tilde{D}_u$. This surrogate set is constructed to be semantically derived from $D_u$ yet sufficiently distinct in embedding space. By comparing unlearning metric scores between $D_u$ and $\tilde{D}_u$, we can stress-test the reliability of the metric itself. Our extensive evaluation across three LLM families (Llama-3-8B, Qwen2.5-7B, and Zephyr-7B-$β$), three distinct datasets, and seven standard metrics reveals widespread inconsistencies. We find that current metrics frequently overestimate unlearning success, failing to detect retained knowledge exposed by our stress-test datasets.
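Abstract (reference translation): Machine unlearning aims to remove the influence of specific data from trained models. Current unlearning metrics typically measure success by monitoring the model's performance degradation on the specific unlearning dataset ($D_u$). We argue that for Large Language Models (LLMs), this evaluation paradigm is insufficient and potentially misleading. Real-world uses of unlearning, motivated by copyright or safety, implicitly target not only the verbatim content of $D_u$ but also behaviors influenced by the broader generalizations the model derived from it. We show that LLMs can pass standard unlearning evaluation and appear to have "forgotten" the target knowledge while retaining strong capabilities on content semantically adjacent to $D_u$. This phenomenon indicates that erasing exact sentences does not necessarily amount to removing the underlying knowledge. To address this gap, we propose \name, an automated stress-testing framework that generates a surrogate dataset, $\tilde{D}_u$. This surrogate set is constructed to be semantically derived from $D_u$ yet sufficiently distinct in embedding space. By comparing unlearning metric scores between $D_u$ and $\tilde{D}_u$, the reliability of the metric itself can be stress-tested. An extensive evaluation across three LLM families (Llama-3-8B, Qwen2.5-7B, and Zephyr-7B-$β$), three distinct datasets, and seven standard metrics reveals widespread inconsistencies. Current metrics frequently overestimate unlearning success and fail to detect the retained knowledge exposed by the stress-test datasets.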
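
The comparison the abstract describes, scoring an unlearning metric on $D_u$ and on a surrogate $\tilde{D}_u$ that is kept distinct in embedding space, can be illustrated with a minimal Python sketch. This is not the authors' framework: the function names, the cosine-similarity filter, the 0.8 and 0.1 thresholds, and the convention that a higher score means more retained knowledge are all assumptions made for illustration only.

```python
from math import sqrt
from statistics import mean


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def is_distinct(candidate_emb, forget_embs, max_sim=0.8):
    """Keep a surrogate candidate only if it is not too close to any D_u item.

    max_sim is a hypothetical threshold, not a value taken from the paper.
    """
    return all(cosine_similarity(candidate_emb, e) <= max_sim for e in forget_embs)


def metric_gap(scores_forget, scores_surrogate):
    """Mean score on the surrogate set minus mean score on D_u.

    Assumed convention: a higher score means more knowledge is retained.
    """
    return mean(scores_surrogate) - mean(scores_forget)


def metric_is_inconsistent(scores_forget, scores_surrogate, tolerance=0.1):
    """Flag the metric when the surrogate set reveals noticeably more retention.

    tolerance is a hypothetical cutoff chosen for this sketch.
    """
    return metric_gap(scores_forget, scores_surrogate) > tolerance
```

Under these assumptions, a large positive gap would mean the model looks "forgotten" on $D_u$ while still performing well on semantically derived content, which is the kind of overestimated unlearning success the abstract reports.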


Related papers