Fugu-MT paper translation (abstract): G-Drift MIA: Membership Inference via Gradient-Induced Feature Drift in LLMs

Paper summary: G-Drift MIA: Membership Inference via Gradient-Induced Feature Drift in LLMs

arxiv url: http://arxiv.org/abs/2604.00419v1
Date: Wed, 01 Apr 2026 03:01:27 GMT
Status: Translation complete
Last updated in system: 2026-04-02 16:44:31.809789
Title: G-Drift MIA: Membership Inference via Gradient-Induced Feature Drift in LLMs
Authors: Ravi Ranjan, Utkarsh Grover, Xiaomin Lin, Agoritsa Polyzou
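Abstract summary: Membership inference attacks (MIAs) aim to determine whether a given example was used during training. G-Drift MIA is a white-box membership inference method based on gradient-induced feature drift. G-Drift substantially outperforms confidence-based, perplexity-based, and reference-based attacks.
Reference score (independently computed attention metric): 1.8986796884429726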
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) are trained on massive web-scale corpora, raising growing concerns about privacy and copyright. Membership inference attacks (MIAs) aim to determine whether a given example was used during training. Existing LLM MIAs largely rely on output probabilities or loss values and often perform only marginally better than random guessing when members and non-members are drawn from the same distribution. We introduce G-Drift MIA, a white-box membership inference method based on gradient-induced feature drift. Given a candidate (x,y), we apply a single targeted gradient-ascent step that increases its loss and measure the resulting changes in internal representations, including logits, hidden-layer activations, and projections onto fixed feature directions, before and after the update. These drift signals are used to train a lightweight logistic classifier that effectively separates members from non-members. Across multiple transformer-based LLMs and datasets derived from realistic MIA benchmarks, G-Drift substantially outperforms confidence-based, perplexity-based, and reference-based attacks. We further show that memorized training samples systematically exhibit smaller and more structured feature drift than non-members, providing a mechanistic link between gradient geometry, representation stability, and memorization. In general, our results demonstrate that small, controlled gradient interventions offer a practical tool for auditing the membership of training-data and assessing privacy risks in LLMs.
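The drift measurement described in the abstract can be sketched on a toy model. The following is a minimal illustration, not the authors' implementation: a tiny two-layer MLP stands in for the LLM, and the function names (`forward`, `loss_and_grads`, `ascent_step`, `drift_features`), the step size `lr`, and the fixed `direction` vector are all hypothetical choices made for this example. It takes one gradient-ascent step on the loss of a candidate (x, y) and records how the logits, the hidden activations, and a projection onto a fixed direction change.

```python
import math
import random

def forward(params, x):
    """Toy 2-layer MLP (stand-in for an LLM): returns (hidden, logits)."""
    W1, W2 = params
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    z = [sum(w * hj for w, hj in zip(row, h)) for row in W2]
    return h, z

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def loss_and_grads(params, x, y):
    """Cross-entropy loss for label y and its gradients w.r.t. W1 and W2."""
    W1, W2 = params
    h, z = forward(params, x)
    p = softmax(z)
    loss = -math.log(p[y])
    dz = [pk - (1.0 if k == y else 0.0) for k, pk in enumerate(p)]
    gW2 = [[dzk * hj for hj in h] for dzk in dz]
    dh = [sum(dz[k] * W2[k][j] for k in range(len(dz))) for j in range(len(h))]
    da = [dhj * (1.0 - hj * hj) for dhj, hj in zip(dh, h)]  # tanh derivative
    gW1 = [[daj * xi for xi in x] for daj in da]
    return loss, (gW1, gW2)

def ascent_step(params, grads, lr):
    """One gradient-ASCENT step: adds the gradient, returns new parameters."""
    return tuple(
        [[w + lr * g for w, g in zip(w_row, g_row)] for w_row, g_row in zip(W, G)]
        for W, G in zip(params, grads)
    )

def l2(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def drift_features(params, x, y, direction, lr=0.1):
    """Drift of internal representations before vs. after the ascent step."""
    loss0, grads = loss_and_grads(params, x, y)
    h0, z0 = forward(params, x)
    stepped = ascent_step(params, grads, lr)  # original params are untouched
    h1, z1 = forward(stepped, x)
    loss1, _ = loss_and_grads(stepped, x, y)
    return {
        "logit_drift": l2(z0, z1),
        "hidden_drift": l2(h0, h1),
        # signed change of the hidden state along a fixed (assumed) direction
        "proj_drift": sum(d * (b - a) for d, a, b in zip(direction, h0, h1)),
        "loss_increase": loss1 - loss0,
    }

# Hypothetical candidate and randomly initialised toy weights.
rng = random.Random(0)
W1 = [[rng.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(4)]
W2 = [[rng.uniform(-0.5, 0.5) for _ in range(4)] for _ in range(2)]
direction = [0.5, 0.5, 0.5, 0.5]  # unit-norm direction in hidden space
feats = drift_features((W1, W2), x=[1.0, -0.5, 0.25], y=1, direction=direction)
print(feats)
```

In a setup like the one the abstract describes, feature vectors of this kind would be computed for known members and non-members and then passed to a lightweight logistic classifier. That training stage is omitted here because the toy weights are random and carry no membership signal.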


Related papers