Data-Copying in Generative Models: A Formal Framework
- URL: http://arxiv.org/abs/2302.13181v1
- Date: Sat, 25 Feb 2023 22:31:01 GMT
- Title: Data-Copying in Generative Models: A Formal Framework
- Authors: Robi Bhattacharjee, Sanjoy Dasgupta, Kamalika Chaudhuri
- Abstract summary: A formal framework for memorization in generative models, called "data-copying," was proposed by Meehan et al. (2020).
We build upon their work to show that their framework may fail to detect certain kinds of blatant memorization.
We provide a method to detect data-copying, and provably show that it works with high probability when enough data is available.
- Score: 34.84064423819405
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: There has been some recent interest in detecting and addressing memorization
of training data by deep neural networks. A formal framework for memorization
in generative models, called "data-copying," was proposed by Meehan et al.
(2020). We build upon their work to show that their framework may fail to
detect certain kinds of blatant memorization. Motivated by this and the theory
of non-parametric methods, we provide an alternative definition of data-copying
that applies more locally. We provide a method to detect data-copying, and
provably show that it works with high probability when enough data is
available. We also provide lower bounds that characterize the sample
requirement for reliable detection.
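The abstract does not spell out the detection procedure, but the nearest-neighbor intuition behind a local data-copying check can be sketched as follows: compare, region by region, how close generated samples land to the training set versus how close fresh held-out samples do. The random-anchor partition and the median baseline below are illustrative assumptions, not the authors' exact construction.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def local_copy_scores(train, held_out, generated, n_regions=10, seed=0):
    """Return, per region, the fraction of generated points whose distance to
    the training set falls below the median such distance for held-out real
    points in that region.  The random-anchor partition and the median
    baseline are illustrative assumptions, not the paper's construction."""
    rng = np.random.default_rng(seed)
    nn_train = NearestNeighbors(n_neighbors=1).fit(train)
    d_gen = nn_train.kneighbors(generated)[0].ravel()   # distance to nearest training point
    d_real = nn_train.kneighbors(held_out)[0].ravel()

    # Crude spatial partition: assign each point to the nearest of a few
    # randomly chosen training points acting as region anchors.
    anchors = train[rng.choice(len(train), size=n_regions, replace=False)]
    nn_anchor = NearestNeighbors(n_neighbors=1).fit(anchors)
    cell_gen = nn_anchor.kneighbors(generated)[1].ravel()
    cell_real = nn_anchor.kneighbors(held_out)[1].ravel()

    scores = {}
    for c in range(n_regions):
        g, r = d_gen[cell_gen == c], d_real[cell_real == c]
        if len(g) and len(r):
            scores[c] = float(np.mean(g < np.median(r)))
    return scores
```

A region whose score sits well above one half is one where the model's outputs hug the training set more tightly than genuinely new samples from the data distribution would.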
Related papers
- A Geometric Framework for Understanding Memorization in Generative Models [11.263296715798374]
Recent work has shown that deep generative models can be capable of memorizing and reproducing training datapoints when deployed.
These findings call into question the usability of generative models, especially in light of the legal and privacy risks brought about by memorization.
We propose the manifold memorization hypothesis (MMH), a geometric framework that uses the manifold hypothesis to provide a clear language in which to reason about memorization.
arXiv Detail & Related papers (2024-10-31T18:09:01Z)
- Detecting, Explaining, and Mitigating Memorization in Diffusion Models [49.438362005962375]
We introduce a straightforward yet effective method for detecting memorized prompts by inspecting the magnitude of text-conditional predictions.
Our proposed method integrates seamlessly into existing sampling algorithms and delivers high accuracy even at the first generation step.
Building on our detection strategy, we unveil an explainable approach that shows the contribution of individual words or tokens to memorization.
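The exact statistic is not given in this summary; a plausible reading of "magnitude of text-conditional predictions" is the gap between the conditional and unconditional noise predictions at an early sampling step, sketched below. The `denoiser` callable and its signature are hypothetical, not a real library API.

```python
import torch

def text_conditioning_magnitude(denoiser, x_t, t, cond_emb, uncond_emb):
    """Hypothetical sketch: score a prompt by the norm of the gap between the
    text-conditional and unconditional noise predictions at one (early) step.
    Unusually large scores would flag the prompt as likely memorized.
    `denoiser(x_t, t, emb)` is an assumed interface for illustration."""
    with torch.no_grad():
        eps_cond = denoiser(x_t, t, cond_emb)
        eps_uncond = denoiser(x_t, t, uncond_emb)
    # Per-sample L2 norm of the conditional/unconditional prediction gap.
    return (eps_cond - eps_uncond).flatten(start_dim=1).norm(dim=1)
```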
arXiv Detail & Related papers (2024-07-31T16:13:29Z)
- Training-Free Deepfake Voice Recognition by Leveraging Large-Scale Pre-Trained Models [52.04189118767758]
Generalization is a major issue for current audio deepfake detectors.
In this paper we study the potential of large-scale pre-trained models for audio deepfake detection.
arXiv Detail & Related papers (2024-05-03T15:27:11Z)
- Fact Checking Beyond Training Set [64.88575826304024]
We show that the retriever-reader suffers from performance deterioration when it is trained on labeled data from one domain and used in another domain.
We propose an adversarial algorithm to make the retriever component robust against distribution shift.
We then construct eight fact checking scenarios from these datasets, and compare our model to a set of strong baseline models.
arXiv Detail & Related papers (2024-03-27T15:15:14Z)
- Transpose Attack: Stealing Datasets with Bidirectional Training [4.166238443183223]
We show that adversaries can exfiltrate datasets from protected learning environments under the guise of legitimate models.
We propose a novel approach for detecting infected models.
arXiv Detail & Related papers (2023-11-13T15:14:50Z)
- Detecting Pretraining Data from Large Language Models [90.12037980837738]
We study the pretraining data detection problem.
Given a piece of text and black-box access to an LLM without knowing the pretraining data, can we determine if the model was trained on the provided text?
We introduce a new detection method Min-K% Prob based on a simple hypothesis.
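As described here, Min-K% Prob reduces to averaging the log-probabilities of the k% least likely tokens and thresholding the result; the sketch below follows that recipe. The per-token log-probabilities are assumed to come from the LLM under scrutiny, and the example values are made up.

```python
import numpy as np

def min_k_percent_prob(token_log_probs, k=20.0):
    """Average log-probability of the k% lowest-probability tokens.
    Higher values suggest the text was likely seen during pretraining;
    in practice the score is compared against a calibrated threshold."""
    lp = np.sort(np.asarray(token_log_probs, dtype=float))  # ascending order
    n = max(1, int(len(lp) * k / 100.0))
    return float(lp[:n].mean())

# Example with made-up per-token log-probabilities from the target LLM:
score = min_k_percent_prob([-0.2, -5.1, -0.7, -9.3, -0.1, -3.4], k=20)
```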
arXiv Detail & Related papers (2023-10-25T17:21:23Z)
- Tools for Verifying Neural Models' Training Data [29.322899317216407]
"Proof-of-Training-Data" allows a model trainer to convince a Verifier of the training data that produced a set of model weights.
We show experimentally that our verification procedures can catch a wide variety of attacks.
arXiv Detail & Related papers (2023-07-02T23:27:00Z)
- Object Detection in Digitized Documents Using Deep Neural Networks (Détection d'Objets dans les documents numérisés par réseaux de neurones profonds) [0.0]
We study multiple tasks related to document layout analysis such as the detection of text lines, the splitting into acts or the detection of the writing support.
We propose two deep neural models following two different approaches.
arXiv Detail & Related papers (2023-01-27T14:45:45Z)
- Open-sourced Dataset Protection via Backdoor Watermarking [87.15630326131901]
We propose a backdoor-embedding-based dataset watermarking method to protect an open-sourced image-classification dataset.
We use a hypothesis test guided method for dataset verification based on the posterior probability generated by the suspicious third-party model.
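One way to picture the verification step is as a one-sided paired test on the suspicious model's posterior for the watermark's target class, with and without the trigger applied to the same images. The interfaces and significance level below are assumptions for illustration, not the paper's exact protocol.

```python
import numpy as np
from scipy import stats

def verify_watermark(posterior_clean, posterior_triggered, alpha=0.01):
    """One-sided paired t-test: does adding the backdoor trigger raise the
    suspicious model's posterior probability for the watermark target class?
    posterior_clean[i] and posterior_triggered[i] are that model's target-class
    probabilities for the same image without / with the trigger.  Rejecting the
    null suggests the model was trained on the protected dataset.  The
    significance level is an illustrative choice."""
    diff = np.asarray(posterior_triggered) - np.asarray(posterior_clean)
    t_stat, p_two_sided = stats.ttest_1samp(diff, popmean=0.0)
    p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2
    return p_one_sided < alpha, p_one_sided
```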
arXiv Detail & Related papers (2020-10-12T16:16:27Z)
- Automatic Recall Machines: Internal Replay, Continual Learning and the Brain [104.38824285741248]
Replay in neural networks involves training on sequential data with memorized samples, which counteracts forgetting of previous behavior caused by non-stationarity.
We present a method where these auxiliary samples are generated on the fly, given only the model that is being trained for the assessed objective.
Instead, the implicit memory of learned samples within the assessed model itself is exploited.
arXiv Detail & Related papers (2020-06-22T15:07:06Z)
- A Non-Parametric Test to Detect Data-Copying in Generative Models [31.596356325042038]
We formalize a form of overfitting that we call "data-copying": the generative model memorizes and outputs training samples or small variations thereof.
We provide a three sample non-parametric test for detecting data-copying that uses the training set, a separate sample from the target distribution, and a generated sample from the model.
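A stripped-down, global version of this three-sample idea fits in a few lines: compute each generated and each held-out point's distance to its nearest training point, then test (here with a one-sided Mann-Whitney U test) whether the generated distances are systematically smaller. The per-cell partitioning and aggregation used in the paper are omitted for brevity.

```python
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.neighbors import NearestNeighbors

def three_sample_copy_test(train, held_out, generated):
    """Global sketch of the three-sample data-copying test: if generated points
    sit significantly closer to the training set than held-out points do, the
    one-sided Mann-Whitney U test returns a small p-value (evidence of copying).
    The paper applies this within cells of a partition; omitted here."""
    nn = NearestNeighbors(n_neighbors=1).fit(train)
    d_gen = nn.kneighbors(generated)[0].ravel()
    d_real = nn.kneighbors(held_out)[0].ravel()
    # alternative='less': are generated-to-train distances stochastically smaller?
    stat, p_value = mannwhitneyu(d_gen, d_real, alternative='less')
    return p_value
```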
arXiv Detail & Related papers (2020-04-12T18:59:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.