Fugu-MT Paper Translation (Summary): LLM Unlearning Under the Microscope: A Full-Stack View on Methods and Metrics

Paper summary: LLM Unlearning Under the Microscope: A Full-Stack View on Methods and Metrics

arxiv url: http://arxiv.org/abs/2510.07626v1
Date: Wed, 08 Oct 2025 23:47:05 GMT
Status: Translation complete
Last updated in system: 2025-10-10 17:54:14.779194
Title: LLM Unlearning Under the Microscope: A Full-Stack View on Methods and Metrics
Title (reference translation): LLM Unlearning Under the Microscope: A Full-Stack View of Methods and Metrics
Authors: Chongyu Fan, Changsheng Wang, Yancheng Huang, Soumyadeep Pal, Sijia Liu
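Abstract summary: This paper presents a principled taxonomy of twelve recent stateful unlearning methods. It revisits the evaluation of unlearning effectiveness (UE), utility retention (UT), and robustness (Rob). The analysis shows that current evaluations, dominated by multiple-choice question (MCQ) accuracy, offer only a narrow perspective.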
Reference score (independently computed attention score): 10.638045151201084
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Machine unlearning for large language models (LLMs) aims to remove undesired data, knowledge, and behaviors (e.g., for safety, privacy, or copyright) while preserving useful model capabilities. Despite rapid progress over the past two years, research in LLM unlearning remains fragmented, with limited clarity on what constitutes effective unlearning and how it should be rigorously evaluated. In this work, we present a principled taxonomy of twelve recent stateful unlearning methods, grouped into three methodological families: divergence-driven optimization, representation misalignment, and rejection-based targeted unlearning. Building on this taxonomy, we revisit the evaluation of unlearning effectiveness (UE), utility retention (UT), and robustness (Rob), focusing on the WMDP benchmark. Our analysis shows that current evaluations, dominated by multiple-choice question (MCQ) accuracy, offer only a narrow perspective, often overstating success while overlooking the model's actual generation behavior. To address this gap, we introduce open question-answering (Open-QA) metrics that better capture generative performance and reveal the inherent UE-UT tradeoff across method families. Furthermore, we demonstrate that robustness requires finer-grained analysis: for example, vulnerabilities differ substantially between in-domain relearning and out-of-domain fine-tuning, even though both fall under model-level attacks. Through this study, we hope to deliver a full-stack revisit of LLM unlearning and actionable guidance for designing and evaluating future methods.
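Abstract (reference translation): Machine unlearning for large language models (LLMs) aims to remove undesired data, knowledge, and behaviors (for example, for safety, privacy, or copyright reasons) while preserving useful model capabilities. Despite rapid progress over the past two years, LLM unlearning research remains fragmented, and it is still unclear what constitutes effective unlearning and how it should be rigorously evaluated. This work presents a principled taxonomy of twelve recent stateful unlearning methods, grouped into three methodological families: divergence-driven optimization, representation misalignment, and rejection-based targeted unlearning. Based on this taxonomy, it revisits the evaluation of unlearning effectiveness (UE), utility retention (UT), and robustness (Rob), focusing on the WMDP benchmark. The analysis shows that current evaluations, dominated by multiple-choice question (MCQ) accuracy, offer only a narrow perspective and often overstate success while overlooking the model's actual generation behavior. To address this gap, the authors introduce open question-answering (Open-QA) metrics that better capture generative performance and reveal the inherent UE-UT tradeoff across method families. They further show that robustness requires finer-grained analysis: for example, vulnerabilities differ substantially between in-domain relearning and out-of-domain fine-tuning, even though both are model-level attacks. The study aims to provide a full-stack revisit of LLM unlearning and actionable guidance for designing and evaluating future methods.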

Related papers