Fugu-MT Paper Translation (Abstract): Exposing the Illusion of Erasure in Knowledge Editing for LLMs

Paper overview: Exposing the Illusion of Erasure in Knowledge Editing for LLMs

arxiv url: http://arxiv.org/abs/2606.23276v1
Date: Mon, 22 Jun 2026 12:53:54 GMT
Status: translation complete
System update date: 2026-06-24 22:54:28.509863
Title: Exposing the Illusion of Erasure in Knowledge Editing for LLMs
Title (reference translation): The Illusion of Erasure in Knowledge Editing for LLMs
Authors: Advik Raj Basani, Anshuman Chhabra,
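Abstract summary: Knowledge Editing (KE) has emerged as a frontier for updating specific facts in LLMs without costly retraining. We show that low-rank updates do not overwrite existing knowledge but instead redistribute it within the model's representation space. Analysis of the loss landscape reveals that edited knowledge lies in narrow, anisotropic regions that are highly sensitive to perturbations.
Reference score (independently computed attention score): 8.788531432978802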
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Knowledge Editing (KE) has emerged as a frontier for updating specific facts in LLMs without costly retraining, but its reliability and underlying mechanisms remain poorly understood. In this work, we examine KE from an adversarial elicitation perspective, revealing that edited knowledge is often not fully erased and continues to surface, with consistent failures observed across diverse model architectures. To explain this behavior, we conduct a mechanistic analysis of popular KE methods. We show that low-rank updates do not overwrite existing knowledge but instead redistribute it within the model's representation space. Furthermore, we find that these methods act as targeted suppression mechanisms that reduce the likelihood of expressing original facts, rather than removing them from the model. Analysis of the loss landscape reveals that edited knowledge lies in narrow, anisotropic regions that are highly sensitive to perturbations, making them highly vulnerable to indirect prompting and adversarial attacks. By exposing these profound architectural vulnerabilities, our work proves that KE algorithms are inherently bypassable and motivates a fundamental reevaluation of how we deploy post-hoc updates in several LLM applications.
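Abstract (reference translation): Knowledge editing (KE) has emerged as a frontier for updating specific facts in LLMs without expensive retraining, but its reliability and underlying mechanisms are still not well understood. This work examines KE from the perspective of adversarial elicitation and shows that edited knowledge is often not fully erased and keeps surfacing, with consistent failures across a variety of model architectures. To explain this behavior, we carry out a mechanistic analysis of popular KE methods. We show that low-rank updates do not overwrite existing knowledge but instead redistribute it within the model's representation space. We further find that these methods act as targeted suppression mechanisms that lower the likelihood of expressing the original facts rather than removing them from the model. Analysis of the loss landscape shows that edited knowledge lies in narrow, anisotropic regions that are highly sensitive to perturbations, which leaves it highly vulnerable to indirect prompting and adversarial attacks. By exposing these serious architectural vulnerabilities, the work proves that KE algorithms are inherently bypassable and motivates a fundamental reevaluation of how post-hoc updates are deployed in several LLM applications.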


Related papers