In-Context Probing for Membership Inference in Fine-Tuned Language Models
- URL: http://arxiv.org/abs/2512.16292v2
- Date: Sun, 21 Dec 2025 20:55:37 GMT
- Title: In-Context Probing for Membership Inference in Fine-Tuned Language Models
- Authors: Zhexi Lu, Hongliang Chi, Nathalie Baracaldo, Swanand Ravindra Kadhe, Yuseok Jeon, Lei Yu
- Abstract summary: Membership inference attacks (MIAs) pose a critical privacy threat to fine-tuned large language models (LLMs). We propose ICP-MIA, a novel MIA framework grounded in the theory of training dynamics. ICP-MIA significantly outperforms prior black-box MIAs, particularly at low false positive rates.
- Score: 14.590625376049955
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Membership inference attacks (MIAs) pose a critical privacy threat to fine-tuned large language models (LLMs), especially when models are adapted to domain-specific tasks using sensitive data. While prior black-box MIA techniques rely on confidence scores or token likelihoods, these signals are often entangled with a sample's intrinsic properties - such as content difficulty or rarity - leading to poor generalization and low signal-to-noise ratios. In this paper, we propose ICP-MIA, a novel MIA framework grounded in the theory of training dynamics, particularly the phenomenon of diminishing returns during optimization. We introduce the Optimization Gap as a fundamental signal of membership: at convergence, member samples exhibit minimal remaining loss-reduction potential, while non-members retain significant potential for further optimization. To estimate this gap in a black-box setting, we propose In-Context Probing (ICP), a training-free method that simulates fine-tuning-like behavior via strategically constructed input contexts. We propose two probing strategies: reference-data-based (using semantically similar public samples) and self-perturbation (via masking or generation). Experiments on three tasks and multiple LLMs show that ICP-MIA significantly outperforms prior black-box MIAs, particularly at low false positive rates. We further analyze how reference data alignment, model type, PEFT configurations, and training schedules affect attack effectiveness. Our findings establish ICP-MIA as a practical and theoretically grounded framework for auditing privacy risks in deployed LLMs.
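The Optimization Gap signal described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption for illustration only: `avg_nll` stands in for a black-box scoring API, and the thresholding rule is a simplification; the paper's actual probing strategies and scoring are more involved.

```python
# Hypothetical sketch of the Optimization Gap membership signal.
# `avg_nll(text, prefix=None)` is an assumed black-box API returning the
# model's average per-token negative log-likelihood of `text`, optionally
# conditioned on a prepended context.

def optimization_gap(avg_nll, sample, reference_context):
    """Estimate the remaining loss-reduction potential of `sample`.

    A member sample, already fit at convergence, should improve little
    when a helpful context is prepended; a non-member should improve
    more, yielding a larger gap.
    """
    plain_loss = avg_nll(sample)
    probed_loss = avg_nll(sample, prefix=reference_context)
    return plain_loss - probed_loss  # larger gap -> more likely non-member


def predict_member(avg_nll, sample, reference_context, threshold):
    """Flag `sample` as a training member when its gap falls below `threshold`."""
    return optimization_gap(avg_nll, sample, reference_context) < threshold
```

The `reference_context` here corresponds to the paper's reference-data-based probing (semantically similar public samples); the self-perturbation variant would construct the context from the sample itself instead.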
Related papers
- What Hard Tokens Reveal: Exploiting Low-confidence Tokens for Membership Inference Attacks against Large Language Models [2.621142288968429]
Membership Inference Attacks (MIAs) attempt to determine whether a specific data sample was included in a model's training/fine-tuning dataset. We propose a novel membership inference approach that captures token-level probabilities for low-confidence (hard) tokens. Experiments on both domain-specific medical datasets and general-purpose benchmarks demonstrate that HT-MIA consistently outperforms seven state-of-the-art MIA baselines.
arXiv Detail & Related papers (2026-01-27T22:31:10Z) - PerProb: Indirectly Evaluating Memorization in Large Language Models [13.905375956316632]
We propose PerProb, a label-free framework for indirectly assessing LLM vulnerabilities. PerProb evaluates changes in perplexity and average log probability between data generated by victim and adversary models. We evaluate PerProb's effectiveness across five datasets, revealing varying memorization behaviors and privacy risks.
arXiv Detail & Related papers (2025-12-16T17:10:01Z) - On the Effectiveness of Membership Inference in Targeted Data Extraction from Large Language Models [3.1988753364712115]
Large Language Models (LLMs) are prone to memorizing training data, which poses serious privacy risks. In this study, we integrate multiple MIA techniques into the data extraction pipeline to systematically benchmark their effectiveness.
arXiv Detail & Related papers (2025-12-15T14:05:49Z) - Exposing and Defending Membership Leakage in Vulnerability Prediction Models [13.905375956316632]
Membership Inference Attacks (MIAs) aim to infer whether a particular code sample was used during training. Noise-based Membership Inference Defense (NMID) is a lightweight defense module that applies output masking and Gaussian noise injection to disrupt adversarial inference. Our study highlights critical privacy risks in code analysis and offers actionable defense strategies for securing AI-powered software systems.
arXiv Detail & Related papers (2025-12-09T06:40:51Z) - Neural Breadcrumbs: Membership Inference Attacks on LLMs Through Hidden State and Attention Pattern Analysis [9.529147118376464]
Membership inference attacks (MIAs) reveal whether specific data was used to train machine learning models. Our work explores how examining internal representations, rather than just their outputs, may provide additional insights into potential membership inference signals. Our findings suggest that internal model behaviors can reveal aspects of training data exposure even when output-based signals appear protected.
arXiv Detail & Related papers (2025-09-05T19:05:49Z) - Dynamic Loss-Based Sample Reweighting for Improved Large Language Model Pretraining [55.262510814326035]
Existing reweighting strategies primarily focus on group-level data importance. We introduce novel algorithms for dynamic, instance-level data reweighting. Our framework allows us to devise reweighting strategies deprioritizing redundant or uninformative data.
arXiv Detail & Related papers (2025-02-10T17:57:15Z) - Learn from Downstream and Be Yourself in Multimodal Large Language Model Fine-Tuning [104.27224674122313]
Fine-tuning MLLM has become a common practice to improve performance on specific downstream tasks.
To balance the trade-off between generalization and specialization, we propose measuring the parameter importance for both pre-trained and fine-tuning distributions.
arXiv Detail & Related papers (2024-11-17T01:16:37Z) - Scaling Laws for Predicting Downstream Performance in LLMs [75.28559015477137]
This work focuses on the pre-training loss as a more computation-efficient metric for performance estimation. We present FLP-M, a fundamental approach for performance prediction that addresses the practical need to integrate datasets from multiple sources during pre-training.
arXiv Detail & Related papers (2024-10-11T04:57:48Z) - Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate [118.37653302885607]
We present the Modality Integration Rate (MIR), an effective, robust, and generalized metric to indicate the multi-modal pre-training quality of Large Vision Language Models (LVLMs).
MIR is informative for training data selection, training strategy scheduling, and model architecture design toward better pre-training results.
arXiv Detail & Related papers (2024-10-09T17:59:04Z) - R-SFLLM: Jamming Resilient Framework for Split Federated Learning with Large Language Models [65.04475956174959]
Split federated learning (SFL) is a compute-efficient paradigm in distributed machine learning (ML). A significant challenge in SFL, particularly when deployed over wireless channels, is the susceptibility of transmitted model parameters to adversarial jamming. This paper develops a physical layer framework for resilient SFL with large language models (LLMs) and vision language models (VLMs) over wireless networks.
arXiv Detail & Related papers (2024-07-16T12:21:29Z) - Noisy Neighbors: Efficient membership inference attacks against LLMs [2.666596421430287]
This paper introduces an efficient methodology that generates noisy neighbors for a target sample by adding noise in the embedding space.
Our findings demonstrate that this approach closely matches the effectiveness of employing shadow models, showing its usability in practical privacy auditing scenarios.
arXiv Detail & Related papers (2024-06-24T12:02:20Z) - Towards Robust Federated Learning via Logits Calibration on Non-IID Data [49.286558007937856]
Federated learning (FL) is a privacy-preserving distributed management framework based on collaborative model training of distributed devices in edge networks.
Recent studies have shown that FL is vulnerable to adversarial examples, leading to a significant drop in its performance.
In this work, we adopt the adversarial training (AT) framework to improve the robustness of FL models against adversarial example (AE) attacks.
arXiv Detail & Related papers (2024-03-05T09:18:29Z)
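As an illustration of the noisy-neighbors idea from the list above, the following hedged sketch compares a target sample's loss against losses at Gaussian-perturbed points around its embedding. `loss_fn` is an assumed stand-in for evaluating the model's loss at an embedding, not an API from that paper, and the sharpness intuition is a simplification of its method.

```python
import numpy as np

def noisy_neighbor_score(loss_fn, embedding, n_neighbors=8, sigma=0.05, seed=0):
    """Membership score from the loss landscape around `embedding`.

    Intuition: members tend to sit near sharper loss minima, so small
    Gaussian perturbations raise the loss more relative to the target
    sample itself.
    """
    rng = np.random.default_rng(seed)
    target_loss = loss_fn(embedding)
    neighbor_losses = [
        loss_fn(embedding + rng.normal(0.0, sigma, size=embedding.shape))
        for _ in range(n_neighbors)
    ]
    # Positive and large when neighbors are much worse than the target.
    return float(np.mean(neighbor_losses) - target_loss)
```

Thresholding this score yields a membership test that needs no shadow models, consistent with that paper's claim of practicality for privacy auditing.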
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.