Bayesian Inference of Training Dataset Membership
- URL: http://arxiv.org/abs/2506.00701v1
- Date: Sat, 31 May 2025 20:14:38 GMT
- Title: Bayesian Inference of Training Dataset Membership
- Authors: Yongchao Huang
- Abstract summary: This paper proposes an efficient, interpretable and principled Bayesian inference method for membership inference. By analyzing post-hoc metrics such as prediction error, confidence (entropy), perturbation magnitude, and dataset statistics from a trained ML model, our approach computes posterior probabilities of membership without requiring extensive model training.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Determining whether a dataset was part of a machine learning model's training data pool can reveal privacy vulnerabilities, a challenge often addressed through membership inference attacks (MIAs). Traditional MIAs typically require access to model internals or rely on computationally intensive shadow models. This paper proposes an efficient, interpretable and principled Bayesian inference method for membership inference. By analyzing post-hoc metrics such as prediction error, confidence (entropy), perturbation magnitude, and dataset statistics from a trained ML model, our approach computes posterior probabilities of membership without requiring extensive model training. Experimental results on synthetic datasets demonstrate the method's effectiveness in distinguishing member from non-member datasets. Beyond membership inference, this method can also detect distribution shifts, offering a practical and interpretable alternative to existing approaches.
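To make the idea concrete, here is a minimal sketch of the kind of Bayesian update the abstract describes: post-hoc metrics (e.g. mean prediction error and predictive entropy) are scored under class-conditional Gaussian likelihoods for "member" and "non-member" hypotheses, and Bayes' rule yields a posterior membership probability. The likelihood parameters and the `member_prior` below are illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy.stats import norm

def membership_posterior(metrics, member_params, nonmember_params, member_prior=0.5):
    """Posterior P(member | metrics) via Bayes' rule.

    metrics:          dict of per-dataset summary statistics, e.g.
                      {"error": mean prediction error, "entropy": mean predictive entropy}
    member_params:    {"error": (mu, sigma), "entropy": (mu, sigma)} under H_member
    nonmember_params: same structure under H_nonmember
    """
    # Log-likelihood of the observed metrics under each hypothesis, assuming
    # (for illustration) independent Gaussian likelihoods per metric.
    ll_member = sum(norm.logpdf(v, *member_params[k]) for k, v in metrics.items())
    ll_nonmember = sum(norm.logpdf(v, *nonmember_params[k]) for k, v in metrics.items())

    # Bayes' rule in log space for numerical stability.
    log_joint_m = np.log(member_prior) + ll_member
    log_joint_n = np.log(1.0 - member_prior) + ll_nonmember
    return np.exp(log_joint_m - np.logaddexp(log_joint_m, log_joint_n))

# Toy example: members tend to show lower error and lower entropy.
member_params = {"error": (0.05, 0.03), "entropy": (0.2, 0.1)}
nonmember_params = {"error": (0.25, 0.10), "entropy": (0.8, 0.3)}
print(membership_posterior({"error": 0.07, "entropy": 0.25},
                           member_params, nonmember_params))  # close to 1
```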
Related papers
- Meta-Statistical Learning: Supervised Learning of Statistical Inference [59.463430294611626]
This work demonstrates that the tools and principles driving the success of large language models (LLMs) can be repurposed to tackle distribution-level tasks. We propose meta-statistical learning, a framework inspired by multi-instance learning that reformulates statistical inference tasks as supervised learning problems.
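As a toy illustration of recasting statistical inference as supervised learning (the architecture and task below are my own construction, not the paper's): train a permutation-invariant network that maps an entire sample to an estimate of the generating distribution's standard deviation.

```python
import torch
import torch.nn as nn

# Minimal DeepSets-style estimator: embed each observation, mean-pool,
# then regress the target statistic. Hypothetical setup for illustration.
class SetRegressor(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        self.rho = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x):                 # x: (batch, n_points, 1)
        return self.rho(self.phi(x).mean(dim=1)).squeeze(-1)

def sample_task(batch=128, n=100):
    """Datasets drawn from N(0, sigma^2); the supervision label is sigma."""
    sigma = torch.rand(batch) * 2 + 0.1
    return torch.randn(batch, n, 1) * sigma.view(-1, 1, 1), sigma

model = SetRegressor()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(2000):
    x, sigma = sample_task()
    loss = nn.functional.mse_loss(model(x), sigma)
    opt.zero_grad(); loss.backward(); opt.step()
```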
arXiv Detail & Related papers (2025-02-17T18:04:39Z)
- Model-free Methods for Event History Analysis and Efficient Adjustment (PhD Thesis) [55.2480439325792]
This thesis is a series of independent contributions to statistics unified by a model-free perspective. The first chapter elaborates on how a model-free perspective can be used to formulate flexible methods that leverage prediction techniques from machine learning. The second chapter studies the concept of local independence, which describes whether the evolution of one process is directly influenced by another.
arXiv Detail & Related papers (2025-02-11T19:24:09Z)
- Self-Comparison for Dataset-Level Membership Inference in Large (Vision-)Language Models [73.94175015918059]
We propose a dataset-level membership inference method based on Self-Comparison.
Our method does not require access to ground-truth member data or non-member data drawn from an identical distribution.
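One way to realize a dataset-level self-comparison test (a hedged sketch, not necessarily the paper's exact procedure): compare the model's per-document negative log-likelihoods on the candidate texts against paraphrased versions of the same texts; on member data the model tends to prefer the verbatim wording far more strongly. The `paraphrase` helper is an assumed external rewriter, and GPT-2 is only a stand-in model.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from scipy.stats import wilcoxon

def avg_nll(model, tok, text):
    """Average per-token negative log-likelihood of `text` under the model."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

def self_comparison_pvalue(model, tok, docs, paraphrase):
    """`paraphrase` is an assumed helper (e.g. another LLM) that returns a
    semantics-preserving rewrite of each document."""
    gaps = [avg_nll(model, tok, paraphrase(d)) - avg_nll(model, tok, d) for d in docs]
    # On member data, verbatim text is much easier than its paraphrase,
    # so the gaps skew positive; test that skew.
    return wilcoxon(gaps, alternative="greater").pvalue

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
```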
arXiv Detail & Related papers (2024-10-16T23:05:59Z)
- Bayesian Estimation and Tuning-Free Rank Detection for Probability Mass Function Tensors [17.640500920466984]
This paper presents a novel framework for estimating the joint PMF and automatically inferring its rank from observed data.
We derive a deterministic solution based on variational inference (VI) to approximate the posterior distributions of various model parameters. Additionally, we develop a scalable version of the VI-based approach by leveraging stochastic variational inference (SVI).
Experiments involving both synthetic data and real movie recommendation data illustrate the advantages of our VI and SVI-based methods in terms of estimation accuracy, automatic rank detection, and computational efficiency.
arXiv Detail & Related papers (2024-10-08T20:07:49Z)
- ReCaLL: Membership Inference via Relative Conditional Log-Likelihoods [56.073335779595475]
We propose ReCaLL (Relative Conditional Log-Likelihood) to detect pretraining data by leveraging conditional language modeling capabilities. Our empirical findings show that conditioning member data on non-member prefixes induces a larger decrease in log-likelihood compared to non-member data. We conduct comprehensive experiments and show that ReCaLL achieves state-of-the-art performance on the WikiMIA dataset.
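A hedged sketch of a ReCaLL-style score: compare a candidate text's log-likelihood unconditioned versus conditioned on a known non-member prefix. The prefix choice, and the approximation of total log-likelihood from the mean loss, are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def log_likelihood(model, tok, text, prefix=""):
    """Approximate total log-likelihood of `text`, optionally conditioned on
    `prefix` (prefix positions are masked out of the loss)."""
    text_ids = tok(text, return_tensors="pt").input_ids
    if prefix:
        prefix_ids = tok(prefix, return_tensors="pt").input_ids
        ids = torch.cat([prefix_ids, text_ids], dim=1)
        labels = ids.clone()
        labels[:, :prefix_ids.shape[1]] = -100   # ignore prefix positions
    else:
        ids = text_ids
        labels = ids.clone()
    with torch.no_grad():
        loss = model(ids, labels=labels).loss    # mean NLL over scored tokens
    return -loss.item() * text_ids.shape[1]      # approximate total log-prob

def recall_score(model, tok, text, nonmember_prefix):
    # Ratio of conditional to unconditional log-likelihood; member texts
    # tend to lose more likelihood when conditioned on non-member context.
    return log_likelihood(model, tok, text, nonmember_prefix) / log_likelihood(model, tok, text)
```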
arXiv Detail & Related papers (2024-06-23T00:23:13Z)
- Querying Easily Flip-flopped Samples for Deep Active Learning [63.62397322172216]
Active learning is a machine learning paradigm that aims to improve the performance of a model by strategically selecting and querying unlabeled data.
One effective selection strategy is to base it on the model's predictive uncertainty, which can be interpreted as a measure of how informative a sample is.
This paper proposes the least disagree metric (LDM), defined as the smallest probability of disagreement with the predicted label.
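One hedged way to approximate such a disagreement-based signal (an illustration, not the paper's estimator): perturb the model's weights with Gaussian noise and measure how often the predicted label flips; samples that flip under the smallest perturbations sit closest to the decision boundary and are queried first.

```python
import copy
import torch

@torch.no_grad()
def flip_rate(model, x, noise_scale=0.05, n_draws=20):
    """Fraction of weight-noise draws under which the predicted label flips;
    an illustrative proxy for closeness to the decision boundary."""
    base = model(x).argmax(dim=-1)
    flips = 0.0
    for _ in range(n_draws):
        noisy = copy.deepcopy(model)
        for p in noisy.parameters():
            p.add_(noise_scale * torch.randn_like(p))  # small weight perturbation
        flips += (noisy(x).argmax(dim=-1) != base).float().mean().item()
    return flips / n_draws

# Active-learning query rule: label the unlabeled samples with the highest flip rate first.
```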
arXiv Detail & Related papers (2024-01-18T08:12:23Z)
- Towards Better Modeling with Missing Data: A Contrastive Learning-based Visual Analytics Perspective [7.577040836988683]
Missing data can pose a challenge for machine learning (ML) modeling.
Current approaches are categorized into feature imputation and label prediction.
This study proposes a Contrastive Learning framework to model observed data with missing values.
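A hedged sketch of one way such a contrastive objective can be set up (my own construction, not the paper's exact model): two random observed-feature masks of the same record form a positive pair, and an InfoNCE loss pulls their embeddings together.

```python
import torch
import torch.nn.functional as F

def mask_features(x, keep_prob=0.7):
    """Simulate missingness: zero out features and append the mask as input."""
    m = (torch.rand_like(x) < keep_prob).float()
    return torch.cat([x * m, m], dim=-1)

def info_nce(z1, z2, tau=0.1):
    """Standard InfoNCE loss: matching rows of z1/z2 are positive pairs."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau
    return F.cross_entropy(logits, torch.arange(z1.shape[0]))

encoder = torch.nn.Sequential(torch.nn.Linear(2 * 10, 64), torch.nn.ReLU(),
                              torch.nn.Linear(64, 32))
opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)
x = torch.randn(256, 10)                      # toy complete data
for step in range(500):
    # Two independently masked views of the same records form positives.
    loss = info_nce(encoder(mask_features(x)), encoder(mask_features(x)))
    opt.zero_grad(); loss.backward(); opt.step()
```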
arXiv Detail & Related papers (2023-09-18T13:16:24Z)
- Can Membership Inferencing be Refuted? [31.31060116447964]
We study the reliability of membership inference attacks in practice.
We show that a model owner can plausibly refute the result of a membership inference test on a data point $x$ by constructing a proof of repudiation.
Our results call for a re-evaluation of the implications of membership inference attacks in practice.
arXiv Detail & Related papers (2023-03-07T04:36:35Z)
- Robustness of Machine Learning Models Beyond Adversarial Attacks [0.0]
We show that the widely used concept of adversarial robustness and closely related metrics are not necessarily valid measures of the robustness of ML models.
We propose a flexible approach that models possible perturbations in input data individually for each application.
This is then combined with a probabilistic approach that computes the likelihood that a real-world perturbation will change a prediction.
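A minimal sketch of that probabilistic computation (the perturbation model here is an illustrative assumption): draw perturbations from an application-specific distribution and estimate by Monte Carlo the probability that the prediction changes.

```python
import numpy as np

def prediction_flip_probability(predict, x, sample_perturbation, n_draws=1000, rng=None):
    """Monte Carlo estimate of P(predict(x + delta) != predict(x)), with
    delta drawn from an application-specific perturbation distribution."""
    rng = rng or np.random.default_rng(0)
    base = predict(x)
    flips = sum(predict(x + sample_perturbation(rng, x.shape)) != base
                for _ in range(n_draws))
    return flips / n_draws

# Example perturbation model: sensor noise as zero-mean Gaussian (assumed scale).
sensor_noise = lambda rng, shape: rng.normal(0.0, 0.05, size=shape)
# flip_prob = prediction_flip_probability(model_predict, x, sensor_noise)
```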
arXiv Detail & Related papers (2022-04-21T12:09:49Z)
- MINIMALIST: Mutual INformatIon Maximization for Amortized Likelihood Inference from Sampled Trajectories [61.3299263929289]
Simulation-based inference enables learning the parameters of a model even when its likelihood cannot be computed in practice.
One class of methods uses data simulated with different parameters to infer an amortized estimator for the likelihood-to-evidence ratio.
We show that this approach can be formulated in terms of mutual information between model parameters and simulated data.
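A hedged sketch of the amortized ratio-estimation idea the summary describes (the standard classifier-based formulation, not necessarily the paper's exact objective): train a classifier to distinguish jointly drawn (theta, x) pairs from independently drawn ones; its logit then approximates log p(x|theta) - log p(x), and the expectation of that log-ratio over joint pairs is the mutual information between parameters and data. The simulator and prior below are toy assumptions.

```python
import torch
import torch.nn as nn

def simulator(theta):
    """Toy simulator: x ~ N(theta, 1). Stand-in for an intractable model."""
    return theta + torch.randn_like(theta)

net = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(3000):
    theta = torch.rand(256, 1) * 4 - 2                            # prior draws
    x = simulator(theta)
    joint = torch.cat([theta, x], dim=1)                          # dependent pairs, label 1
    marginal = torch.cat([theta[torch.randperm(256)], x], dim=1)  # shuffled, label 0
    logits = net(torch.cat([joint, marginal]))
    labels = torch.cat([torch.ones(256, 1), torch.zeros(256, 1)])
    loss = bce(logits, labels)
    opt.zero_grad(); loss.backward(); opt.step()

# After training, net's logit on (theta, x) approximates the log
# likelihood-to-evidence ratio used for amortized inference.
```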
arXiv Detail & Related papers (2021-06-03T12:59:16Z)