Fugu-MT paper translation (abstract): Output Vector Editing for Memorization Mitigation in Large Language Models

Paper abstract: Output Vector Editing for Memorization Mitigation in Large Language Models

arxiv url: http://arxiv.org/abs/2606.18767v1
Date: Wed, 17 Jun 2026 07:29:18 GMT
Status: translation complete
Last updated in system: 2026-06-18 17:16:51.043229
Title: Output Vector Editing for Memorization Mitigation in Large Language Models
Authors: Ahmad Dawar Hakimi, Kaiwei Lei, Isabelle Augenstein, Hinrich Schütze
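Abstract summary: Large language models memorize and reproduce sequences from their training data, creating privacy, copyright, and security risks. Existing neuron-level mitigation methods equate editing with zeroing out neuron activations, but the activation only controls whether a neuron engages. We propose output vector editing, a constrained-optimization edit that locates a small set of neurons responsible for a memorized continuation.
Reference score (independently computed attention metric): 68.30351930772788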
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models memorize and reproduce sequences from their training data, creating privacy, copyright, and security risks. Existing neuron-level mitigation methods equate editing with zeroing out neuron activations, but the activation only controls whether a neuron engages; the output vector is what writes to the residual stream and, through superposition, encodes multiple features. We propose output vector editing, a constrained-optimization weight edit that locates a small set of MLP neurons responsible for a memorized continuation and minimally modifies their output vectors to introduce a distractor in vocabulary space, redirecting their residual-stream contributions while leaving activations unchanged. Evaluating on four models from 360M to 7B parameters (SmolLM-360M, OLMo-1B, OLMo-7B, Llama2-7B), we center on OLMo-7B (whose open weights and pretraining corpus enable systematic mining) and mine 6831 memorized sequences, achieving up to 87.9% suppression. The 2.7$\times$ gap over zero ablation on the same located neurons shows the suppression comes from the output-vector edit, not localization alone. Four edit modes span a spectrum from aggressive suppression to minimal redirection; in ensemble they cover 96.5% of memorized sequences, while our recommended single-mode configuration reaches 81.5% with no catastrophic locality failures. We further identify a mechanistic boundary at ${\sim}14\%$ of sequences unreachable by MLP-only editing; while these failures are not attention-driven overall, ablating the top contributing attention heads recovers 60--64% of them, with stronger recovery on continuations that copy tokens from the prefix, positioning attention as a complementary fallback rather than a primary mechanism. Edit mode ordering and the success-locality trade-off transfer across all four models, with success rates scaling with model size rather than family.
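
The contrast the abstract draws between zeroing activations and editing output vectors can be illustrated with a toy sketch. This is not the paper's method: the one-layer ReLU MLP, the random weights, the attribution-based neuron selection, the choice of the runner-up token as distractor, and the closed-form minimum-norm update are all assumptions made here for illustration. It only shows that changing rows of the output matrix can move the distractor logit above the memorized one while every activation stays identical, whereas zero ablation changes the activations themselves.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_mlp, vocab = 16, 64, 50

# Toy weights (hypothetical; a real model would supply these).
W_in = rng.normal(size=(d_model, d_mlp))    # residual stream -> neuron pre-activations
W_out = rng.normal(size=(d_mlp, d_model))   # row i is the output vector of neuron i
U = rng.normal(size=(d_model, vocab))       # toy unembedding: residual stream -> logits


def forward(x, w_out, mask=None):
    """Return (activations, logits) for a single residual-stream vector x."""
    a = np.maximum(W_in.T @ x, 0.0)         # ReLU activations
    if mask is not None:
        a = a * mask                        # zero ablation acts on activations
    resid = x + a @ w_out                   # each neuron writes a_i * w_out[i]
    return a, resid @ U


x = rng.normal(size=d_model)
a, logits = forward(x, W_out)
memorized = int(np.argmax(logits))          # stand-in for the memorized next token
distractor = int(np.argsort(logits)[-2])    # assumed distractor: the runner-up token

# Assumed localization: top-k neurons by direct attribution to the logit gap.
direction = U[:, distractor] - U[:, memorized]
score = a * (W_out @ -direction)            # contribution to (memorized - distractor)
located = np.argsort(score)[-4:]

# Baseline: zero ablation of the located neurons.
mask = np.ones(d_mlp)
mask[located] = 0.0
a_zero, logits_zero = forward(x, W_out, mask)

# Output-vector edit: the smallest (Frobenius-norm) change to the located rows
# such that the distractor logit exceeds the memorized logit by `margin`.
margin = 1.0
gain = (logits[memorized] - logits[distractor]) + margin
a_loc = a[located]
delta = np.outer(a_loc, direction) * gain / (direction @ direction * (a_loc @ a_loc))
W_edit = W_out.copy()
W_edit[located] += delta
a_edit, logits_edit = forward(x, W_edit)

print("activations unchanged by edit:", np.array_equal(a_edit, a))
print("activations unchanged by zero ablation:", np.array_equal(a_zero, a))
print("distractor minus memorized logit after edit:",
      logits_edit[distractor] - logits_edit[memorized])
print("relative edit size:",
      np.linalg.norm(delta) / np.linalg.norm(W_out[located]))
```

The edit only guarantees the ordering of the two chosen tokens; other logits also shift, which is one reason a real method would need additional constraints to preserve behavior on unrelated inputs.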

Related papers