Related papers: AttenMIA: LLM Membership Inference Attack through Attention Signals

AttenMIA: LLM Membership Inference Attack through Attention Signals

URL: http://arxiv.org/abs/2601.18110v1
Date: Mon, 26 Jan 2026 03:45:56 GMT
Title: AttenMIA: LLM Membership Inference Attack through Attention Signals
Authors: Pedram Zaree, Md Abdullah Al Mamun, Yue Dong, Ihsen Alouani, Nael Abu-Ghazaleh,
Abstract summary: We introduce AttenMIA, a new MIA framework that exploits self-attention patterns inside the transformer model to infer membership.<n>We show that attention-based features consistently outperform baselines, particularly under the important low-false-positive metric.<n>We also show that using AttenMIA to replace other membership inference attacks in a data extraction framework results in training data extraction attacks that outperform the state of the art.
Score: 8.170623979629953
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) are increasingly deployed to enable or improve a multitude of real-world applications. Given the large size of their training data sets, their tendency to memorize training data raises serious privacy and intellectual property concerns. A key threat is the membership inference attack (MIA), which aims to determine whether a given sample was included in the model's training set. Existing MIAs for LLMs rely primarily on output confidence scores or embedding-based features, but these signals are often brittle, leading to limited attack success. We introduce AttenMIA, a new MIA framework that exploits self-attention patterns inside the transformer model to infer membership. Attention controls the information flow within the transformer, exposing different patterns for memorization that can be used to identify members of the dataset. Our method uses information from attention heads across layers and combines them with perturbation-based divergence metrics to train an effective MIA classifier. Using extensive experiments on open-source models including LLaMA-2, Pythia, and Opt models, we show that attention-based features consistently outperform baselines, particularly under the important low-false-positive metric (e.g., achieving up to 0.996 ROC AUC & 87.9% TPR@1%FPR on the WikiMIA-32 benchmark with Llama2-13b). We show that attention signals generalize across datasets and architectures, and provide a layer- and head-level analysis of where membership leakage is most pronounced. We also show that using AttenMIA to replace other membership inference attacks in a data extraction framework results in training data extraction attacks that outperform the state of the art. Our findings reveal that attention mechanisms, originally introduced to enhance interpretability, can inadvertently amplify privacy risks in LLMs, underscoring the need for new defenses.

Related papers

What Hard Tokens Reveal: Exploiting Low-confidence Tokens for Membership Inference Attacks against Large Language Models [2.621142288968429]
Membership Inference Attacks (MIAs) attempt to determine whether a specific data sample was included in a model training/fine-tuning dataset.<n>We propose a novel membership inference approach that captures the token-level probabilities for low-confidence (hard) tokens.<n>Experiments on both domain-specific medical datasets and general-purpose benchmarks demonstrate that HT-MIA consistently outperforms seven state-of-the-art MIA baselines.
arXiv Detail & Related papers (2026-01-27T22:31:10Z)
Res-MIA: A Training-Free Resolution-Based Membership Inference Attack on Federated Learning Models [1.9336815376402718]
Membership inference attacks pose a serious threat to the privacy of machine learning models.<n>We introduce Res-MIA, a training-free and black-box membership inference attack.<n>We evaluate the proposed attack on a federated ResNet-18 trained on CIFAR-10.
arXiv Detail & Related papers (2026-01-24T08:58:39Z)
In-Context Probing for Membership Inference in Fine-Tuned Language Models [14.590625376049955]
Membership inference attacks (MIAs) pose a critical privacy threat to fine-tuned large language models (LLMs)<n>We propose ICP-MIA, a novel MIA framework grounded in the theory of training dynamics.<n>ICP-MIA significantly outperforms prior black-box MIAs, particularly at low false positive rates.
arXiv Detail & Related papers (2025-12-18T08:26:26Z)
(Token-Level) InfoRMIA: Stronger Membership Inference and Memorization Assessment for LLMs [13.601386341584545]
Large language models (LLMs) are now trained on nearly all available data, which amplifies the magnitude of information leakage.<n>Standard method to quantify privacy is via membership inference attacks.<n>We present InfoRMIA, a principled information-theoretic formulation of membership inference.
arXiv Detail & Related papers (2025-10-07T04:59:49Z)
Neural Breadcrumbs: Membership Inference Attacks on LLMs Through Hidden State and Attention Pattern Analysis [9.529147118376464]
Membership inference attacks (MIAs) reveal whether specific data was used to train machine learning models.<n>Our work explores how examining internal representations, rather than just their outputs, may provide additional insights into potential membership inference signals.<n>Our findings suggest that internal model behaviors can reveal aspects of training data exposure even when output-based signals appear protected.
arXiv Detail & Related papers (2025-09-05T19:05:49Z)
Attention Tracker: Detecting Prompt Injection Attacks in LLMs [62.247841717696765]
Large Language Models (LLMs) have revolutionized various domains but remain vulnerable to prompt injection attacks.<n>We introduce the concept of the distraction effect, where specific attention heads shift focus from the original instruction to the injected instruction.<n>We propose Attention Tracker, a training-free detection method that tracks attention patterns on instruction to detect prompt injection attacks.
arXiv Detail & Related papers (2024-11-01T04:05:59Z)
Detecting Training Data of Large Language Models via Expectation Maximization [62.28028046993391]
We introduce EM-MIA, a novel membership inference method that iteratively refines membership scores and prefix scores via an expectation-maximization algorithm.<n> EM-MIA achieves state-of-the-art results on WikiMIA.
arXiv Detail & Related papers (2024-10-10T03:31:16Z)
ReCaLL: Membership Inference via Relative Conditional Log-Likelihoods [56.073335779595475]
We propose ReCaLL (Relative Conditional Log-Likelihood) to detect pretraining data by leveraging conditional language modeling capabilities.<n>Our empirical findings show that conditioning member data on non-member prefixes induces a larger decrease in log-likelihood compared to non-member data.<n>We conduct comprehensive experiments and show that ReCaLL achieves state-of-the-art performance on the WikiMIA dataset.
arXiv Detail & Related papers (2024-06-23T00:23:13Z)
Do Membership Inference Attacks Work on Large Language Models? [141.2019867466968]
Membership inference attacks (MIAs) attempt to predict whether a particular datapoint is a member of a target model's training data. We perform a large-scale evaluation of MIAs over a suite of language models trained on the Pile, ranging from 160M to 12B parameters. We find that MIAs barely outperform random guessing for most settings across varying LLM sizes and domains.
arXiv Detail & Related papers (2024-02-12T17:52:05Z)
RelaxLoss: Defending Membership Inference Attacks without Losing Utility [68.48117818874155]
We propose a novel training framework based on a relaxed loss with a more achievable learning target. RelaxLoss is applicable to any classification model with added benefits of easy implementation and negligible overhead. Our approach consistently outperforms state-of-the-art defense mechanisms in terms of resilience against MIAs.
arXiv Detail & Related papers (2022-07-12T19:34:47Z)
How Does Data Augmentation Affect Privacy in Machine Learning? [94.52721115660626]
We propose new MI attacks to utilize the information of augmented data. We establish the optimal membership inference when the model is trained with augmented data.
arXiv Detail & Related papers (2020-07-21T02:21:10Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.