Blackbox Model Provenance via Palimpsestic Membership Inference
- URL: http://arxiv.org/abs/2510.19796v1
- Date: Wed, 22 Oct 2025 17:30:39 GMT
- Title: Blackbox Model Provenance via Palimpsestic Membership Inference
- Authors: Rohith Kuditipudi, Jing Huang, Sally Zhu, Diyi Yang, Christopher Potts, Percy Liang
- Abstract summary: Let's say Alice trains an open-weight language model and Bob uses a blackbox derivative of Alice's model to produce text. Can Alice prove that Bob is using her model, either by querying Bob's derivative model or from the text alone? We use test statistics that capture correlation between Bob's model or text and the ordering of training examples in Alice's training run.
- Score: 96.73342141272549
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Suppose Alice trains an open-weight language model and Bob uses a blackbox derivative of Alice's model to produce text. Can Alice prove that Bob is using her model, either by querying Bob's derivative model (query setting) or from the text alone (observational setting)? We formulate this question as an independence testing problem--in which the null hypothesis is that Bob's model or text is independent of Alice's randomized training run--and investigate it through the lens of palimpsestic memorization in language models: models are more likely to memorize data seen later in training, so we can test whether Bob is using Alice's model using test statistics that capture correlation between Bob's model or text and the ordering of training examples in Alice's training run. If Alice has randomly shuffled her training data, then any significant correlation amounts to exactly quantifiable statistical evidence against the null hypothesis, regardless of the composition of Alice's training data. In the query setting, we directly estimate (via prompting) the likelihood Bob's model gives to Alice's training examples and order; we correlate the likelihoods of over 40 fine-tunes of various Pythia and OLMo base models ranging from 1B to 12B parameters with the base model's training data order, achieving a p-value on the order of at most 1e-8 in all but six cases. In the observational setting, we try two approaches based on estimating 1) the likelihood of Bob's text overlapping with spans of Alice's training examples and 2) the likelihood of Bob's text with respect to different versions of Alice's model we obtain by repeating the last phase (e.g., 1%) of her training run on reshuffled data. The second approach can reliably distinguish Bob's text from as little as a few hundred tokens; the first does not involve any retraining but requires many more tokens (several hundred thousand) to achieve high power.
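The query-setting test described in the abstract can be sketched in a few lines: given Bob's model's log-likelihoods on Alice's training examples and the (randomly shuffled) positions of those examples in Alice's training run, correlate the two and compute a p-value under the null that they are independent. The sketch below is illustrative only, assuming the per-example likelihood estimates are already in hand; the function names and the Monte Carlo permutation test are this sketch's choices, not the paper's implementation (the paper derives exact p-values under the shuffling null).

```python
# Illustrative sketch of the query-setting independence test: if Bob's model
# derives from Alice's, its log-likelihoods on Alice's training examples
# should correlate with the shuffled training order, since models memorize
# later-seen data more strongly ("palimpsestic" memorization). All names
# and constants here are hypothetical, not from the paper's code.
import random


def spearman_rho(xs, ys):
    """Spearman rank correlation between two equal-length sequences (no ties)."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0.0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


def permutation_p_value(logliks, train_order, trials=10000, seed=0):
    """One-sided Monte Carlo p-value for the null that Bob's per-example
    log-likelihoods are independent of Alice's training order."""
    rng = random.Random(seed)
    observed = spearman_rho(logliks, train_order)
    shuffled = list(train_order)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)  # resample the null: order is exchangeable
        if spearman_rho(logliks, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate a valid p-value.
    return (hits + 1) / (trials + 1)
```

Because Alice shuffled her training data herself, the null distribution of the order is exactly uniform over permutations, which is what makes the resulting p-value exactly quantifiable regardless of what the training data contains.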
Related papers
- Extracting alignment data in open models [50.81383232591576]
We show that it is possible to extract significant amounts of alignment training data from a post-trained model. This data is useful to steer the model to improve certain capabilities such as long-context reasoning, safety, instruction following, and maths. We find that models readily regurgitate training data that was used in post-training phases such as SFT or RL.
arXiv Detail & Related papers (2025-10-21T12:06:00Z) - Pretraining Language Models to Ponder in Continuous Space [50.52734567589996]
We introduce this pondering process into language models by repeatedly invoking the forward process within a single token generation step. We show that the model can learn to ponder in this way through self-supervised learning, without any human annotations.
arXiv Detail & Related papers (2025-05-27T03:47:33Z) - An empirical study of task and feature correlations in the reuse of pre-trained models [1.0128808054306186]
Pre-trained neural networks are commonly used and reused in the machine learning community. This paper introduces an experimental setup through which factors contributing to Bob's empirical success could be studied in silico. We show in controlled real-world scenarios that Bob can effectively reuse Alice's pre-trained network if there are semantic correlations between his and Alice's task.
arXiv Detail & Related papers (2025-05-15T22:51:27Z) - Erasing Without Remembering: Implicit Knowledge Forgetting in Large Language Models [81.62767292169225]
We investigate knowledge forgetting in large language models with a focus on its generalisation. We propose PerMU, a novel probability perturbation-based unlearning paradigm. Experiments are conducted on a diverse range of datasets, including TOFU, Harry Potter, ZsRE, WMDP, and MUSE.
arXiv Detail & Related papers (2025-02-27T11:03:33Z) - Independence Tests for Language Models [47.0749292650885]
Given the weights of two models, can we test whether they were trained independently? We consider two settings: constrained and unconstrained. We propose a new test which matches hidden activations between two models, and which is robust to adversarial transformations and to changes in model architecture.
arXiv Detail & Related papers (2025-02-17T20:01:08Z) - The Query/Hit Model for Sequential Hypothesis Testing [8.242194776558895]
This work introduces the Query/Hit (Q/H) learning model. One agent, Alice, has access to a streaming source, while the other, Bob, does not have direct access to the source. Communication occurs through sequential Q/H pairs: Bob sends a sequence of source symbols (queries), and Alice responds with the waiting time until each query appears in the source stream (hits).
arXiv Detail & Related papers (2025-02-02T00:23:28Z) - Additive-Effect Assisted Learning [17.408937094829007]
We develop a two-stage assisted learning architecture for an agent, Alice, to seek assistance from another agent, Bob.
In the first stage, we propose a privacy-aware hypothesis testing-based screening method for Alice to decide on the usefulness of the data from Bob.
We show that Alice can achieve the oracle performance as if the training were from centralized data, both theoretically and numerically.
arXiv Detail & Related papers (2024-05-13T23:24:25Z) - Offline Reinforcement Learning for Human-Guided Human-Machine
Interaction with Private Information [110.42866062614912]
We study human-guided human-machine interaction involving private information.
We focus on offline reinforcement learning (RL) in this game.
We develop a novel identification result and use it to propose a new off-policy evaluation method.
arXiv Detail & Related papers (2022-12-23T06:26:44Z) - Identifying the value of a random variable unambiguously: Quantum versus classical approaches [44.99833362998488]
Quantum resources may provide an advantage over their classical counterparts.
We construct such a task based on a game, mediated by Referee and played between Alice and Bob.
We show that if Alice sends a limited amount of classical information, then the game cannot be won, while the quantum analogue of the 'limited amount of classical information' is sufficient for winning the game.
arXiv Detail & Related papers (2022-11-16T20:28:49Z) - Asymmetric self-play for automatic goal discovery in robotic manipulation [12.573331269520077]
We rely on asymmetric self-play for goal discovery, where two agents, Alice and Bob, play a game.
We show that this method can discover highly diverse and complex goals without any human priors.
Our method scales, resulting in a single policy that can generalize to many unseen tasks.
arXiv Detail & Related papers (2021-01-13T05:20:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.