Fugu-MT Paper Translation (Abstract): The Unlearning Mirage: A Dynamic Framework for Evaluating LLM Unlearning

Paper overview: The Unlearning Mirage: A Dynamic Framework for Evaluating LLM Unlearning

arxiv url: http://arxiv.org/abs/2603.11266v1
Date: Wed, 11 Mar 2026 19:51:33 GMT
Status: translation complete
Last updated in system: 2026-03-13 14:46:25.617648
Title: The Unlearning Mirage: A Dynamic Framework for Evaluating LLM Unlearning
Authors: Raj Sanjay Shah, Jing Huang, Keerthiram Murugesan, Nathalie Baracaldo, Diyi Yang
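Abstract summary: The paper proposes a dynamic framework that stress-tests unlearning using complex structured queries. The approach first elicits knowledge from the target model (pre-unlearning) and constructs targeted probes, ranging from simple queries to multi-hop chains. The framework enables practical and scalable evaluation of unlearning methods without manual construction of forget test sets.
Reference score (attention score computed by this site): 54.67958855362658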
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Unlearning in Large Language Models (LLMs) aims to enhance safety, mitigate biases, and comply with legal mandates, such as the right to be forgotten. However, existing unlearning methods are brittle: minor query modifications, such as multi-hop reasoning and entity aliasing, can recover supposedly forgotten information. As a result, current evaluation metrics often create an illusion of effectiveness, failing to detect these vulnerabilities due to reliance on static, unstructured benchmarks. We propose a dynamic framework that stress tests unlearning robustness using complex structured queries. Our approach first elicits knowledge from the target model (pre-unlearning) and constructs targeted probes, ranging from simple queries to multi-hop chains, allowing precise control over query difficulty. Our experiments show that the framework (1) shows comparable coverage to existing benchmarks by automatically generating semantically equivalent Q&A probes, (2) aligns with prior evaluations, and (3) uncovers new unlearning failures missed by other benchmarks, particularly in multi-hop settings. Furthermore, activation analyses show that single-hop queries typically follow dominant computation pathways, which are more likely to be disrupted by unlearning methods. In contrast, multi-hop queries tend to use alternative pathways that often remain intact, explaining the brittleness of unlearning techniques in multi-hop settings. Our framework enables practical and scalable evaluation of unlearning methods without the need for manual construction of forget test sets, enabling easier adoption for real-world applications. We release the pip package and the code at https://sites.google.com/view/unlearningmirage/home.
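The probe-construction step described in the abstract (elicit facts, then build queries from single-hop to multi-hop chains) can be illustrated with a minimal sketch. This is not the authors' released package or its API. It assumes, purely for illustration, that elicited knowledge is available as (subject, relation, object) triples. The names `Probe`, `build_index`, and `build_probes`, the question template, and the example facts are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical fact format (an assumption, not from the paper):
# (subject, relation, object)
Triple = Tuple[str, str, str]
Step = Tuple[str, str]  # (relation, object) reached by following that relation


@dataclass(frozen=True)
class Probe:
    question: str
    answer: str
    hops: int  # number of facts that must be chained to answer


def build_index(triples: List[Triple]) -> Dict[str, List[Step]]:
    """Map each subject to the (relation, object) steps leaving it."""
    index: Dict[str, List[Step]] = {}
    for subject, relation, obj in triples:
        index.setdefault(subject, []).append((relation, obj))
    return index


def build_probes(triples: List[Triple], target: str, max_hops: int) -> List[Probe]:
    """Enumerate chains of 1..max_hops facts starting at `target`.

    Every chain begins with a fact about `target`, so a multi-hop probe can
    only be answered by using that (supposedly forgotten) fact as an
    intermediate step.
    """
    index = build_index(triples)
    paths: List[List[Step]] = []

    def walk(entity: str, path: List[Step], visited: frozenset) -> None:
        if len(path) == max_hops:
            return
        for relation, obj in index.get(entity, []):
            if obj in visited:  # skip cycles back to an earlier entity
                continue
            new_path = path + [(relation, obj)]
            paths.append(new_path)
            walk(obj, new_path, visited | {obj})

    walk(target, [], frozenset({target}))

    probes = []
    for path in paths:
        # Nest relations: "the r2 of the r1 of <target>"
        phrase = target
        for relation, _ in path:
            phrase = f"the {relation} of {phrase}"
        probes.append(Probe(f"What is {phrase}?", path[-1][1], len(path)))
    return sorted(probes, key=lambda p: p.hops)  # easiest probes first


if __name__ == "__main__":
    # Fictional facts, standing in for knowledge elicited from a model.
    facts = [
        ("Alice Example", "employer", "Acme Labs"),
        ("Acme Labs", "headquarters city", "Springfield"),
    ]
    for probe in build_probes(facts, "Alice Example", max_hops=2):
        print(probe.hops, probe.question, "->", probe.answer)
```

Sorting by hop count mirrors the abstract's point about controlling query difficulty: an evaluator could compare how often an unlearned model still answers the 1-hop probes versus the longer chains.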

Related papers