Fugu-MT Paper Translation (Abstract): Dual-Space Smoothness for Robust and Balanced LLM Unlearning

Paper overview: Dual-Space Smoothness for Robust and Balanced LLM Unlearning

arxiv url: http://arxiv.org/abs/2509.23362v1
Date: Sat, 27 Sep 2025 15:20:37 GMT
Status: Translation complete
Last updated in system: 2025-09-30 22:32:19.181808
Title: Dual-Space Smoothness for Robust and Balanced LLM Unlearning
Title (reference translation): Dual-space smoothness for robust and balanced LLM unlearning
Authors: Han Yan, Zheyuan Liu, Meng Jiang
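Abstract summary: PRISM is a unified framework that enforces dual-space smoothness in the representation and parameter spaces to improve robustness and balance unlearning metrics. PRISM consists of two smoothness optimization stages: (i) a representation-space stage that uses a robustly trained probe to defend against jailbreak attacks, and (ii) a parameter-space stage that decouples retain-forget gradient conflicts, reduces imbalance, and smooths the parameter space to mitigate relearning attacks.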
Reference score (independently computed attention score): 28.56156017984944
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: With the rapid advancement of large language models, Machine Unlearning has emerged to address growing concerns around user privacy, copyright infringement, and overall safety. Yet state-of-the-art (SOTA) unlearning methods often suffer from catastrophic forgetting and metric imbalance, for example by over-optimizing one objective (e.g., unlearning effectiveness, utility preservation, or privacy protection) at the expense of others. In addition, small perturbations in the representation or parameter space can be exploited by relearn and jailbreak attacks. To address these challenges, we propose PRISM, a unified framework that enforces dual-space smoothness in representation and parameter spaces to improve robustness and balance unlearning metrics. PRISM consists of two smoothness optimization stages: (i) a representation space stage that employs a robustly trained probe to defend against jailbreak attacks, and (ii) a parameter-space stage that decouples retain-forget gradient conflicts, reduces imbalance, and smooths the parameter space to mitigate relearning attacks. Extensive experiments on WMDP and MUSE, across conversational-dialogue and continuous-text settings, show that PRISM outperforms SOTA baselines under multiple attacks while achieving a better balance among key metrics.
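Abstract (reference translation): With the rapid progress of large language models, machine unlearning has emerged to address growing concerns about user privacy, copyright infringement, and overall safety. However, state-of-the-art (SOTA) unlearning methods often suffer from catastrophic forgetting and metric imbalance, for example by over-optimizing one objective (such as unlearning effectiveness, utility preservation, or privacy protection) at the expense of the others. In addition, small perturbations in the representation or parameter space can be exploited by relearning and jailbreak attacks. To address these challenges, the authors propose PRISM, a unified framework that enforces dual-space smoothness in the representation and parameter spaces to improve robustness and balance unlearning metrics. PRISM consists of two smoothness optimization stages: (i) a representation-space stage that uses a robustly trained probe to defend against jailbreak attacks, and (ii) a parameter-space stage that decouples retain-forget gradient conflicts, reduces imbalance, and smooths the parameter space to mitigate relearning attacks. Extensive experiments on WMDP and MUSE, across conversational-dialogue and continuous-text settings, show that PRISM outperforms SOTA baselines under multiple attacks while achieving a better balance among key metrics.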
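
The abstract does not say how the parameter-space stage decouples retain-forget gradient conflicts. As a rough illustration of the general idea only, the sketch below assumes a projection-style rule: when the forget gradient points against the retain gradient (negative dot product), the conflicting component is removed. The function name `decouple_forget_gradient` and the projection rule itself are assumptions made for this example, not the authors' method.

```python
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def decouple_forget_gradient(
    forget_grad: List[float], retain_grad: List[float]
) -> List[float]:
    """Hypothetical conflict-removal step (not taken from the paper).

    If the forget gradient conflicts with the retain gradient
    (negative inner product), subtract its projection onto the
    retain gradient so the two are orthogonal afterwards.
    Otherwise return the forget gradient unchanged.
    """
    inner = dot(forget_grad, retain_grad)
    norm_sq = dot(retain_grad, retain_grad)
    if inner >= 0 or norm_sq == 0:
        # No conflict, or no retain direction to project against.
        return list(forget_grad)
    scale = inner / norm_sq
    return [f - scale * r for f, r in zip(forget_grad, retain_grad)]


# Example: the forget gradient pushes against the retain direction
# along the second coordinate, so that component is projected out.
adjusted = decouple_forget_gradient([1.0, -1.0], [0.0, 1.0])
```

After the projection, the adjusted forget gradient has zero inner product with the retain gradient, so under this simplified linear view a step along it would not directly undo progress on the retain objective.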

Related papers