Fugu-MT Paper Translation (Abstract): Reference-Specific Unlearning Metrics Can Hide the Truth: A Reality Check

Paper summary: Reference-Specific Unlearning Metrics Can Hide the Truth: A Reality Check

arxiv url: http://arxiv.org/abs/2510.12981v1
Date: Tue, 14 Oct 2025 20:50:30 GMT
Status: Translation complete
Last updated in system: 2025-10-16 20:13:28.418049
Title: Reference-Specific Unlearning Metrics Can Hide the Truth: A Reality Check
Title (reference translation): Reference-specific unlearning metrics that can hide the truth
Authors: Sungjun Cho, Dasol Hwang, Frederic Sala, Sangheum Hwang, Kyunghyun Cho, Sungmin Cha
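Abstract summary: This work proposes Functional Alignment for Distributional Equivalence (FADE), a new metric that measures distributional similarity between unlearned and reference models. FADE captures functional alignment across the entire output distribution, providing a principled assessment of genuine unlearning. The findings expose fundamental gaps in current evaluation practices and show that FADE provides a more robust foundation for developing and assessing truly effective unlearning methods.
Reference score (independently computed attention score): 60.77691669644931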
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Current unlearning metrics for generative models evaluate success based on reference responses or classifier outputs rather than assessing the core objective: whether the unlearned model behaves indistinguishably from a model that never saw the unwanted data. This reference-specific approach creates systematic blind spots, allowing models to appear successful while retaining unwanted knowledge accessible through alternative prompts or attacks. We address these limitations by proposing Functional Alignment for Distributional Equivalence (FADE), a novel metric that measures distributional similarity between unlearned and reference models by comparing bidirectional likelihood assignments over generated samples. Unlike existing approaches that rely on predetermined references, FADE captures functional alignment across the entire output distribution, providing a principled assessment of genuine unlearning. Our experiments on the TOFU benchmark for LLM unlearning and the UnlearnCanvas benchmark for text-to-image diffusion model unlearning reveal that methods achieving near-optimal scores on traditional metrics fail to achieve distributional equivalence, with many becoming more distant from the gold standard than before unlearning. These findings expose fundamental gaps in current evaluation practices and demonstrate that FADE provides a more robust foundation for developing and assessing truly effective unlearning methods.
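Abstract (reference translation): Current unlearning metrics for generative models judge success by reference responses or classifier outputs instead of evaluating the core objective. This reference-specific approach produces systematic blind spots: a model can appear successful while still retaining unwanted knowledge that is reachable through alternative prompts or attacks. The proposed Functional Alignment for Distributional Equivalence (FADE) measures distributional similarity between the unlearned model and the reference model by comparing bidirectional likelihood assignments over generated samples. Unlike existing approaches that depend on predetermined references, FADE captures functional alignment across the entire output distribution and gives a principled evaluation of genuine unlearning. Experiments on the TOFU benchmark for LLM unlearning and the UnlearnCanvas benchmark for text-to-image diffusion model unlearning show that methods with near-optimal scores on traditional metrics do not achieve distributional equivalence, and that many end up farther from the gold standard than they were before unlearning. These findings expose fundamental gaps in current evaluation practices and show that FADE provides a more robust foundation for developing and assessing truly effective unlearning methods.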
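Illustration (not from the paper): the abstract describes FADE only at a high level, as a comparison of bidirectional likelihood assignments over generated samples. The sketch below shows one plausible reading of that idea, a Monte Carlo estimate of a symmetric KL-style gap between two toy categorical distributions. It is an assumption for illustration, not the paper's definition of FADE; the function names `sample` and `bidirectional_gap` and the toy distributions are hypothetical.

```python
import math
import random


def sample(dist, n, rng):
    """Draw n outcomes from a categorical distribution given as {outcome: prob}."""
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=n)


def bidirectional_gap(p_unlearned, p_reference, n=10_000, seed=0):
    """Hypothetical bidirectional likelihood gap (NOT the paper's FADE formula).

    Each model scores its own samples and the other model's samples.
    The two average log-likelihood gaps are summed, which estimates the
    symmetric KL divergence. Both distributions must assign non-zero
    probability to every outcome.
    """
    rng = random.Random(seed)
    xs_u = sample(p_unlearned, n, rng)  # samples generated by the unlearned model
    xs_r = sample(p_reference, n, rng)  # samples generated by the reference model

    # Direction 1: unlearned-model samples, scored by both models.
    gap_u = sum(math.log(p_unlearned[x]) - math.log(p_reference[x]) for x in xs_u) / n
    # Direction 2: reference-model samples, scored by both models.
    gap_r = sum(math.log(p_reference[x]) - math.log(p_unlearned[x]) for x in xs_r) / n
    return gap_u + gap_r


# Toy stand-ins for model output distributions (assumed values).
reference = {"a": 0.5, "b": 0.3, "c": 0.2}
close = {"a": 0.45, "b": 0.35, "c": 0.2}
far = {"a": 0.05, "b": 0.05, "c": 0.9}

gap_close = bidirectional_gap(close, reference)
gap_far = bidirectional_gap(far, reference)
```

In this toy setting a model identical to the reference gets a gap of exactly zero, and `far` scores a larger gap than `close`. With real generative models, the dictionaries would be replaced by sampling and log-likelihood calls on the two models.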


Related papers