Fugu-MT paper translation (abstract): Learning the Signature of Memorization in Autoregressive Language Models

Paper overview: Learning the Signature of Memorization in Autoregressive Language Models

arxiv url: http://arxiv.org/abs/2604.03199v1
Date: Fri, 03 Apr 2026 17:17:51 GMT
Status: translation complete
Last updated in system: 2026-04-06 17:20:24.551231
Title: Learning the Signature of Memorization in Autoregressive Language Models
Title (reference translation): Learning the Signature of Memorization in Autoregressive Language Models
Authors: David Ilić, Kostadin Cvejoski, David Stanojević, Evgeny Grigorenko
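Abstract summary: We introduce the first transferable learned attack, which obtains unlimited labeled data by fine-tuning any model on any corpus. This removes the shadow model bottleneck and brings membership inference into the deep learning era.
Reference score (independently computed attention score): 3.6048665052465663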
License: http://creativecommons.org/licenses/by/4.0/
Abstract: All prior membership inference attacks for fine-tuned language models use hand-crafted heuristics (e.g., loss thresholding, Min-K%, reference calibration), each bounded by the designer's intuition. We introduce the first transferable learned attack, enabled by the observation that fine-tuning any model on any corpus yields unlimited labeled data, since membership is known by construction. This removes the shadow model bottleneck and brings membership inference into the deep learning era: learning what matters rather than designing it, with generalization through training diversity and scale. We discover that fine-tuning language models produces an invariant signature of memorization detectable across architectural families and data domains. We train a membership inference classifier exclusively on transformer-based models. It transfers zero-shot to Mamba (state-space), RWKV-4 (linear attention), and RecurrentGemma (gated recurrence), achieving 0.963, 0.972, and 0.936 AUC respectively. Each evaluation combines an architecture and dataset never seen during training, yet all three exceed performance on held-out transformers (0.908 AUC). These four families share no computational mechanisms; their only commonality is gradient descent on cross-entropy loss. Even simple likelihood-based methods exhibit strong transfer, confirming the signature exists independently of the detection method. Our method, Learned Transfer MIA (LT-MIA), captures this signal most effectively by reframing membership inference as sequence classification over per-token distributional statistics. On transformers, LT-MIA achieves 2.8× higher TPR at 0.1% FPR than the strongest baseline. The method also transfers to code (0.865 AUC) despite training only on natural language texts. Code and trained classifier are available at https://github.com/JetBrains-Research/learned-mia.
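The abstract names loss thresholding and Min-K% as hand-crafted baselines. The following is a minimal sketch of those two scores, assuming the per-token log-probabilities of a text under the fine-tuned model have already been computed. The function names, the default k = 0.2, and the sample numbers are illustrative assumptions, not taken from the paper or its repository.

```python
from typing import Sequence


def mean_loss_score(token_logprobs: Sequence[float]) -> float:
    """Loss-thresholding score: average negative log-likelihood per token.

    Lower values suggest the text was seen during fine-tuning.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return -sum(token_logprobs) / len(token_logprobs)


def min_k_score(token_logprobs: Sequence[float], k: float = 0.2) -> float:
    """Min-K% style score: mean log-probability of the k fraction of
    least likely tokens. Higher values suggest membership.

    k = 0.2 is an illustrative default, not a value from the paper.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    n = max(1, int(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]  # the n smallest log-probs
    return sum(lowest) / n


# Made-up log-probabilities for illustration only.
member_logprobs = [-0.1, -0.2, -0.05, -0.3, -0.15]
nonmember_logprobs = [-1.2, -0.4, -2.5, -0.9, -3.1]

# A hand-picked threshold on either score turns it into a membership decision.
print(mean_loss_score(member_logprobs) < mean_loss_score(nonmember_logprobs))
print(min_k_score(member_logprobs) > min_k_score(nonmember_logprobs))
```

Both scores collapse a text to a single number that is compared against a fixed threshold, which is the design the abstract contrasts with a classifier learned over per-token statistics.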
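Abstract (reference translation): Prior membership inference attacks on fine-tuned language models use hand-crafted heuristics (e.g., loss thresholding, Min-K%, reference calibration), each bounded by the designer's intuition. We introduce the first transferable learned attack, based on the observation that fine-tuning any model on any corpus yields unlimited labeled data, since membership is known by construction. This removes the shadow model bottleneck and brings membership inference into the deep learning era. Fine-tuning language models produces an invariant signature of memorization that is detectable across architectural families and data domains. We train a membership inference classifier exclusively on transformer-based models. It transfers zero-shot to Mamba (state-space), RWKV-4 (linear attention), and RecurrentGemma (gated recurrence), achieving 0.963, 0.972, and 0.936 AUC respectively. Each evaluation combines an architecture and a dataset never seen during training, yet all three exceed the performance on held-out transformers (0.908 AUC). These four families share no computational mechanisms; their only commonality is gradient descent on cross-entropy loss. Even simple likelihood-based methods show strong transfer, confirming that the signature exists independently of the detection method. The proposed method, Learned Transfer MIA (LT-MIA), captures this signal most effectively. On transformers, LT-MIA achieves 2.8× higher TPR at 0.1% FPR than the strongest baseline. The method also transfers to code (0.865 AUC) despite being trained only on natural-language text. Code and the trained classifier are available at https://github.com/JetBrains-Research/learned-mia.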
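The results are reported as AUC and as TPR at a fixed low FPR. As a rough illustration of how these two metrics can be computed from attack scores, here is a minimal pure-Python sketch. It assumes higher scores mean "more likely a member"; the function names and the sample scores are hypothetical and are not the paper's evaluation code.

```python
from typing import Sequence


def roc_auc(member_scores: Sequence[float],
            nonmember_scores: Sequence[float]) -> float:
    """AUC as the probability that a random member outscores a random
    non-member, counting ties as one half."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


def tpr_at_fpr(member_scores: Sequence[float],
               nonmember_scores: Sequence[float],
               max_fpr: float) -> float:
    """Best true-positive rate over thresholds whose false-positive rate
    stays at or below max_fpr. A text is flagged when score > threshold."""
    # Number of non-members we are allowed to flag by mistake.
    allowed_fp = int(max_fpr * len(nonmember_scores))
    ranked = sorted(nonmember_scores, reverse=True)
    # Using the (allowed_fp + 1)-th highest non-member score as the
    # threshold flags at most allowed_fp non-members.
    threshold = ranked[allowed_fp] if allowed_fp < len(ranked) else float("-inf")
    true_positives = sum(1 for m in member_scores if m > threshold)
    return true_positives / len(member_scores)


# Made-up attack scores for illustration only.
members = [0.9, 0.8, 0.4, 0.3]
nonmembers = [0.7, 0.2, 0.1, 0.05]

print(roc_auc(members, nonmembers))
print(tpr_at_fpr(members, nonmembers, max_fpr=0.25))
```

With this definition, a 0.1% FPR budget permits no false positives at all unless there are at least 1000 non-member examples, so low-FPR figures are only meaningful on large evaluation sets.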

Related papers