Neural Breadcrumbs: Membership Inference Attacks on LLMs Through Hidden State and Attention Pattern Analysis
- URL: http://arxiv.org/abs/2509.05449v1
- Date: Fri, 05 Sep 2025 19:05:49 GMT
- Title: Neural Breadcrumbs: Membership Inference Attacks on LLMs Through Hidden State and Attention Pattern Analysis
- Authors: Disha Makhija, Manoj Ghuhan Arivazhagan, Vinayshekhar Bannihatti Kumar, Rashmi Gangadharaiah
- Abstract summary: Membership inference attacks (MIAs) reveal whether specific data was used to train machine learning models. Our work explores how examining internal representations, rather than just their outputs, may provide additional insights into potential membership inference signals. Our findings suggest that internal model behaviors can reveal aspects of training data exposure even when output-based signals appear protected.
- Score: 9.529147118376464
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Membership inference attacks (MIAs) reveal whether specific data was used to train machine learning models, serving as important tools for privacy auditing and compliance assessment. Recent studies have reported that MIAs perform only marginally better than random guessing against large language models, suggesting that modern pre-training approaches with massive datasets may be free from privacy leakage risks. Our work offers a complementary perspective to these findings by exploring how examining LLMs' internal representations, rather than just their outputs, may provide additional insights into potential membership inference signals. Our framework, memTrace, follows what we call "neural breadcrumbs", extracting informative signals from transformer hidden states and attention patterns as they process candidate sequences. By analyzing layer-wise representation dynamics, attention distribution characteristics, and cross-layer transition patterns, we detect potential memorization fingerprints that traditional loss-based approaches may not capture. This approach yields strong membership detection across several model families, achieving average AUC scores of 0.85 on popular MIA benchmarks. Our findings suggest that internal model behaviors can reveal aspects of training data exposure even when output-based signals appear protected, highlighting the need for further research into membership privacy and the development of more robust privacy-preserving training techniques for large language models.
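To make the kinds of signals named in the abstract more concrete, here is a minimal sketch in plain Python. It is not the authors' memTrace implementation: the function names, the choice of Shannon entropy as an "attention distribution characteristic", cosine similarity between consecutive layers as a "cross-layer transition pattern", and the hand-written toy inputs are all assumptions made for illustration. The last function shows how an AUC score like the one reported can be computed from per-sample membership scores.

```python
import math


def attention_entropy(row):
    """Shannon entropy (in nats) of one attention distribution.

    `row` is assumed to be non-negative and to sum to 1.
    """
    return -sum(p * math.log(p) for p in row if p > 0.0)


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def layer_transition_similarities(hidden_by_layer):
    """Cosine similarity between pooled hidden vectors of consecutive layers.

    `hidden_by_layer` is a hypothetical list with one pooled vector per layer.
    """
    return [cosine(a, b) for a, b in zip(hidden_by_layer, hidden_by_layer[1:])]


def auc(member_scores, nonmember_scores):
    """Probability that a random member outscores a random non-member.

    Ties count as half a win; this equals the area under the ROC curve.
    """
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


if __name__ == "__main__":
    # Toy inputs, written by hand; they do not come from any real model.
    print(attention_entropy([0.25, 0.25, 0.25, 0.25]))  # uniform attention
    print(layer_transition_similarities([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]))
    print(auc([0.9, 0.8, 0.4], [0.5, 0.3, 0.4]))
```

In a real pipeline, features of this kind would presumably be computed per layer and per attention head from the model's actual hidden states and attention maps, then fed to a trained classifier whose output scores are evaluated with AUC; the abstract does not specify those details, so they are left out here.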
Related papers
- Noise as a Probe: Membership Inference Attacks on Diffusion Models Leveraging Initial Noise [51.179816451161635]
Diffusion models have achieved remarkable progress in image generation, but their increasing deployment raises serious concerns about privacy. In this work, we utilize a critical yet overlooked vulnerability: the widely used noise schedules fail to fully eliminate semantic information in the images. We propose a simple yet effective membership inference attack, which injects semantic information into the initial noise and infers membership by analyzing the model's generation result.
arXiv Detail & Related papers (2026-01-29T12:29:01Z)
- AttenMIA: LLM Membership Inference Attack through Attention Signals [8.170623979629953]
We introduce AttenMIA, a new MIA framework that exploits self-attention patterns inside the transformer model to infer membership. We show that attention-based features consistently outperform baselines, particularly under the important low-false-positive metric. We also show that using AttenMIA to replace other membership inference attacks in a data extraction framework results in training data extraction attacks that outperform the state of the art.
arXiv Detail & Related papers (2026-01-26T03:45:56Z)
- Exposing and Defending Membership Leakage in Vulnerability Prediction Models [13.905375956316632]
Membership Inference Attacks (MIAs) aim to infer whether a particular code sample was used during training. Noise-based Membership Inference Defense (NMID) is a lightweight defense module that applies output masking and Gaussian noise injection to disrupt adversarial inference. Our study highlights critical privacy risks in code analysis and offers actionable defense strategies for securing AI-powered software systems.
arXiv Detail & Related papers (2025-12-09T06:40:51Z)
- P-MIA: A Profiled-Based Membership Inference Attack on Cognitive Diagnosis Models [22.027021891488683]
This paper is the first to systematically investigate membership inference attacks (MIAs) against cognitive diagnosis models (CDMs). We introduce a novel and realistic grey-box threat model that exploits the explainability features of these platforms. We propose a profile-based MIA framework that leverages both the model's final prediction probabilities and the exposed internal knowledge state vectors as features.
arXiv Detail & Related papers (2025-11-06T01:53:04Z)
- The Hidden Cost of Modeling P(X): Vulnerability to Membership Inference Attacks in Generative Text Classifiers [6.542294761666199]
Membership Inference Attacks (MIAs) pose a critical privacy threat by enabling adversaries to determine whether a specific sample was included in a model's training dataset. We show that fully generative classifiers which explicitly model the joint likelihood $P(X,Y)$ are most vulnerable to membership leakage.
arXiv Detail & Related papers (2025-10-17T18:09:33Z)
- Large Language Model Sourcing: A Survey [84.63438376832471]
Large language models (LLMs) have revolutionized artificial intelligence, shifting from supporting objective tasks to empowering subjective decision-making. Due to the black-box nature of LLMs and the human-like quality of their generated content, issues such as hallucinations, bias, unfairness, and copyright infringement become significant. This survey presents a systematic investigation into provenance tracking for content generated by LLMs, organized around four interrelated dimensions.
arXiv Detail & Related papers (2025-10-11T10:52:30Z)
- On the MIA Vulnerability Gap Between Private GANs and Diffusion Models [51.53790101362898]
Generative Adversarial Networks (GANs) and diffusion models have emerged as leading approaches for high-quality image synthesis. We present the first unified theoretical and empirical analysis of the privacy risks faced by differentially private generative models.
arXiv Detail & Related papers (2025-09-03T14:18:22Z)
- When Better Features Mean Greater Risks: The Performance-Privacy Trade-Off in Contrastive Learning [9.660010886245155]
This paper systematically investigates the privacy threats posed by membership inference attacks (MIAs) targeting encoder models. We propose a novel membership inference attack method based on the p-norm of feature vectors, termed the Embedding Lp-Norm Likelihood Attack (LpLA).
arXiv Detail & Related papers (2025-06-06T05:03:29Z)
- Memorization or Interpolation ? Detecting LLM Memorization through Input Perturbation Analysis [8.725781605542675]
Large Language Models (LLMs) achieve remarkable performance through training on massive datasets. LLMs can exhibit concerning behaviors such as verbatim reproduction of training data rather than true generalization. This paper introduces PEARL, a novel approach for detecting memorization in LLMs.
arXiv Detail & Related papers (2025-05-05T20:42:34Z)
- EM-MIAs: Enhancing Membership Inference Attacks in Large Language Models through Ensemble Modeling [2.494935495983421]
This paper proposes a novel ensemble attack method (EM-MIAs) that integrates several existing MIA techniques into an XGBoost-based model to enhance overall attack performance. Experimental results demonstrate that the ensemble model significantly improves both AUC-ROC and accuracy compared to individual attack methods across various large language models and datasets.
arXiv Detail & Related papers (2024-12-23T03:47:54Z)
- Detecting Training Data of Large Language Models via Expectation Maximization [62.28028046993391]
We introduce EM-MIA, a novel membership inference method that iteratively refines membership scores and prefix scores via an expectation-maximization algorithm. EM-MIA achieves state-of-the-art results on WikiMIA.
arXiv Detail & Related papers (2024-10-10T03:31:16Z)
- Noisy Neighbors: Efficient membership inference attacks against LLMs [2.666596421430287]
This paper introduces an efficient methodology that generates noisy neighbors for a target sample by adding noise in the embedding space.
Our findings demonstrate that this approach closely matches the effectiveness of employing shadow models, showing its usability in practical privacy auditing scenarios.
arXiv Detail & Related papers (2024-06-24T12:02:20Z)
- Improving Input-label Mapping with Demonstration Replay for In-context Learning [67.57288926736923]
In-context learning (ICL) is an emerging capability of large autoregressive language models.
We propose a novel ICL method called Sliding Causal Attention (RdSca).
We show that our method significantly improves the input-label mapping in ICL demonstrations.
arXiv Detail & Related papers (2023-10-30T14:29:41Z)
- Assessing Privacy Risks in Language Models: A Case Study on Summarization Tasks [65.21536453075275]
We focus on the summarization task and investigate the membership inference (MI) attack.
We exploit text similarity and the model's resistance to document modifications as potential MI signals.
We discuss several safeguards for training summarization models to protect against MI attacks and discuss the inherent trade-off between privacy and utility.
arXiv Detail & Related papers (2023-10-20T05:44:39Z)
- Unleashing Mask: Explore the Intrinsic Out-of-Distribution Detection Capability [70.72426887518517]
Out-of-distribution (OOD) detection is an indispensable aspect of secure AI when deploying machine learning models in real-world applications.
We propose a novel method, Unleashing Mask, which aims to restore the OOD discriminative capabilities of the well-trained model with ID data.
Our method utilizes a mask to figure out the memorized atypical samples, and then finetune the model or prune it with the introduced mask to forget them.
arXiv Detail & Related papers (2023-06-06T14:23:34Z)
- Enhancing Multiple Reliability Measures via Nuisance-extended Information Bottleneck [77.37409441129995]
In practical scenarios where training data is limited, many predictive signals in the data may instead stem from biases in data acquisition.
We consider an adversarial threat model under a mutual information constraint to cover a wider class of perturbations in training.
We propose an autoencoder-based training to implement the objective, as well as practical encoder designs to facilitate the proposed hybrid discriminative-generative training.
arXiv Detail & Related papers (2023-03-24T16:03:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.