A Non-Parametric Test to Detect Data-Copying in Generative Models
- URL: http://arxiv.org/abs/2004.05675v1
- Date: Sun, 12 Apr 2020 18:59:29 GMT
- Title: A Non-Parametric Test to Detect Data-Copying in Generative Models
- Authors: Casey Meehan, Kamalika Chaudhuri, Sanjoy Dasgupta
- Abstract summary: We formalize a form of overfitting that we call data-copying -- where the generative model memorizes and outputs training samples or small variations thereof.
We provide a three-sample non-parametric test for detecting data-copying that uses the training set, a separate sample from the target distribution, and a generated sample from the model.
- Score: 31.596356325042038
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Detecting overfitting in generative models is an important challenge in
machine learning. In this work, we formalize a form of overfitting that we call
data-copying -- where the generative model memorizes and outputs
training samples or small variations thereof. We provide a three-sample
non-parametric test for detecting data-copying that uses the training set, a
separate sample from the target distribution, and a generated sample from the
model, and study the performance of our test on several canonical models and
datasets.
For code & examples, visit https://github.com/casey-meehan/data-copying
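The gist of the test is easy to sketch. Below is a minimal, hedged illustration of its core comparison, not the released implementation: measure how close generated samples sit to the training set versus how close genuinely fresh samples from the distribution sit to it, and score the difference with a Mann-Whitney U rank test. The function name and the single global comparison are simplifications; the paper's full statistic partitions the instance space (e.g., via k-means on the training set) and averages a per-cell version of this score.

```python
# Minimal sketch of the three-sample test's core idea (illustrative only;
# the paper's full test aggregates a per-cell version of this statistic).
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.neighbors import NearestNeighbors

def data_copying_z(train, held_out, generated):
    """Z-scored Mann-Whitney U comparing distances to the training set.

    Strongly negative values suggest data-copying: generated samples sit
    systematically closer to the training set than held-out samples do.
    """
    nn = NearestNeighbors(n_neighbors=1).fit(train)
    d_real, _ = nn.kneighbors(held_out)   # held-out distance to nearest training point
    d_gen, _ = nn.kneighbors(generated)   # generated distance to nearest training point
    u, _ = mannwhitneyu(d_gen.ravel(), d_real.ravel(), alternative="less")
    m, n = len(d_gen), len(d_real)
    mu, sigma = m * n / 2.0, np.sqrt(m * n * (m + n + 1) / 12.0)
    return (u - mu) / sigma
```

A strongly positive score, by contrast, points at the opposite failure mode the paper also discusses: generated samples that stay unusually far from the training data, i.e., under-fitting of that region.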
Related papers
- Model Equality Testing: Which Model Is This API Serving? [59.005869726179455]
We formalize the detection of such distortions (e.g., from quantization, watermarking, or finetuning of the served model) as Model Equality Testing, a two-sample testing problem.
A test built on a simple string kernel achieves a median of 77.4% power against a range of distortions.
We then apply this test to commercial inference APIs for four Llama models, finding that 11 out of 31 endpoints serve different distributions than reference weights released by Meta.
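For intuition, here is a hedged sketch of a two-sample test between string samples, instantiated with a character n-gram overlap kernel and a permutation-calibrated MMD statistic. The kernel choice and all names below are assumptions made for illustration, not necessarily the paper's exact construction.

```python
# Hedged sketch: two-sample test on strings via MMD with a simple character
# n-gram overlap (Jaccard) kernel, calibrated by a permutation test.
import itertools
import random

def ngrams(s, n=3):
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def kernel(a, b, n=3):
    # Jaccard overlap of character n-gram sets: a simple string kernel.
    A, B = ngrams(a, n), ngrams(b, n)
    return len(A & B) / max(1, len(A | B))

def mmd(xs, ys):
    # MMD^2 estimate from pairwise kernel sums; assumes len(xs), len(ys) >= 2.
    kxx = sum(kernel(a, b) for a, b in itertools.combinations(xs, 2))
    kyy = sum(kernel(a, b) for a, b in itertools.combinations(ys, 2))
    kxy = sum(kernel(a, b) for a in xs for b in ys)
    nx, ny = len(xs), len(ys)
    return kxx / (nx * (nx - 1) / 2) + kyy / (ny * (ny - 1) / 2) - 2 * kxy / (nx * ny)

def permutation_pvalue(xs, ys, trials=200, seed=0):
    # Reject model equality when the observed MMD is extreme under shuffling.
    rng = random.Random(seed)
    observed = mmd(xs, ys)
    pooled = list(xs) + list(ys)
    exceed = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        if mmd(pooled[:len(xs)], pooled[len(xs):]) >= observed:
            exceed += 1
    return (exceed + 1) / (trials + 1)
```

In use, xs would be completions sampled from the API endpoint and ys completions generated locally from the reference weights, under matched prompts and sampling parameters.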
arXiv Detail & Related papers (2024-10-26T18:34:53Z)
- Importance of Disjoint Sampling in Conventional and Transformer Models for Hyperspectral Image Classification [2.1223532600703385]
This paper presents a disjoint sampling approach for training SOTA models on hyperspectral image classification (HSIC) tasks.
By separating training, validation, and test data without overlap, the proposed method facilitates a fairer evaluation of how well a model can classify pixels it was not exposed to during training or validation.
This rigorous methodology is critical for advancing SOTA models and their real-world application to large-scale land mapping with hyperspectral sensors.
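The sampling discipline itself fits in a few lines. The following hedged sketch (invented names, not the paper's pipeline) splits labeled pixel indices so that no pixel can appear in more than one of the train, validation, and test sets.

```python
# Hedged sketch: disjoint train/val/test split over labeled pixel indices.
import numpy as np

def disjoint_split(n_pixels, train_frac=0.6, val_frac=0.2, seed=0):
    """Slicing one permutation guarantees pairwise-disjoint splits, so test
    pixels are never seen during training or validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_pixels)
    n_tr = int(train_frac * n_pixels)
    n_va = int(val_frac * n_pixels)
    return idx[:n_tr], idx[n_tr:n_tr + n_va], idx[n_tr + n_va:]
```

Note that in hyperspectral imagery, spatially adjacent pixels are strongly correlated, so stricter protocols also enforce spatial separation between splits; the sketch shows only index-level disjointness.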
arXiv Detail & Related papers (2024-04-23T11:40:52Z)
- Test-Time Adaptation for Point Cloud Upsampling Using Meta-Learning [17.980649681325406]
We propose a test-time adaptation approach to enhance the generality of point cloud upsampling models.
The proposed approach leverages meta-learning to explicitly learn network parameters for test-time adaptation.
Our framework is generic and can be applied in a plug-and-play manner with existing backbone networks in point cloud upsampling.
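To make "test-time adaptation" concrete, here is a generic, hedged sketch rather than the authors' meta-learning procedure: before upsampling a test cloud, take a few gradient steps on a self-supervised objective built from the input itself, then run the adapted network. The model is assumed to be any PyTorch module mapping a batch of points to a denser cloud; the Chamfer distance is defined inline.

```python
# Generic test-time adaptation sketch (assumed setup, not the paper's exact
# method): adapt a pre-trained upsampler to one test cloud, then upsample.
import copy
import torch

def chamfer(a, b):
    # Symmetric Chamfer distance between point sets a: (N, 3), b: (M, 3).
    d = torch.cdist(a, b)  # (N, M) pairwise Euclidean distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def adapt_and_upsample(model, cloud, steps=5, lr=1e-4):
    """cloud: (N, 3) test point cloud; model: any upsampling nn.Module."""
    adapted = copy.deepcopy(model)  # keep adaptation local to this sample
    opt = torch.optim.SGD(adapted.parameters(), lr=lr)
    for _ in range(steps):
        # Self-supervision: subsample the input, learn to recover the input.
        keep = torch.randperm(cloud.shape[0])[: cloud.shape[0] // 4]
        pred = adapted(cloud[keep].unsqueeze(0)).squeeze(0)
        loss = chamfer(pred, cloud)
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return adapted(cloud.unsqueeze(0)).squeeze(0)
```

The paper's meta-learning component would, roughly, pre-train the network so that a handful of such inner-loop steps moves it far; the sketch shows only the inner loop.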
arXiv Detail & Related papers (2023-08-31T06:44:59Z)
- Data-Copying in Generative Models: A Formal Framework [34.84064423819405]
A formal framework for memorization in generative models, called "data-copying," was proposed by Meehan et al.
We build upon their work to show that their framework may fail to detect certain kinds of blatant memorization.
We provide a method to detect data-copying, and provably show that it works with high probability when enough data is available.
arXiv Detail & Related papers (2023-02-25T22:31:01Z)
- Détection d'Objets dans les documents numérisés par réseaux de neurones profonds (Object Detection in Digitized Documents using Deep Neural Networks) [0.0]
We study multiple tasks related to document layout analysis, such as the detection of text lines, the splitting of documents into acts (individual records), and the detection of the writing support.
We propose two deep neural models following two different approaches.
arXiv Detail & Related papers (2023-01-27T14:45:45Z)
- Generalization Properties of Retrieval-based Models [50.35325326050263]
Retrieval-based machine learning methods have enjoyed success on a wide range of problems.
Despite growing literature showcasing the promise of these models, the theoretical underpinning for such models remains underexplored.
We present a formal treatment of retrieval-based models to characterize their generalization ability.
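As a point of reference for what "retrieval-based" means here, the following hedged sketch implements one simple member of the model family such an analysis covers: a k-nearest-neighbor rule that predicts from the labels of retrieved training points. The names and the choice of k-NN are illustrative assumptions, not the paper's construction.

```python
# Hedged sketch: a k-NN rule as a canonical retrieval-based predictor.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_predict(train_x, train_y, query, k=5):
    """Majority vote over retrieved labels; assumes integer class labels."""
    nn = NearestNeighbors(n_neighbors=k).fit(train_x)
    _, idx = nn.kneighbors(query)   # retrieve k nearest training points
    votes = train_y[idx]            # (n_queries, k) retrieved labels
    return np.array([np.bincount(v).argmax() for v in votes])
```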
arXiv Detail & Related papers (2022-10-06T00:33:01Z)
- TTAPS: Test-Time Adaption by Aligning Prototypes using Self-Supervision [70.05605071885914]
We propose a novel modification of the self-supervised training algorithm SwAV that adds the ability to adapt to single test samples.
We show the success of our method on the common benchmark dataset CIFAR10-C.
arXiv Detail & Related papers (2022-05-18T05:43:06Z)
- Learning to Generalize across Domains on Single Test Samples [126.9447368941314]
We learn to generalize across domains on single test samples.
We formulate the adaptation to the single test sample as a variational Bayesian inference problem.
Our model achieves at least comparable -- and often better -- performance than state-of-the-art methods on multiple benchmarks for domain generalization.
arXiv Detail & Related papers (2022-02-16T13:21:04Z)
- One for More: Selecting Generalizable Samples for Generalizable ReID Model [92.40951770273972]
This paper proposes a one-for-more training objective that takes the generalization ability of selected samples as a loss function.
Our proposed one-for-more based sampler can be seamlessly integrated into the ReID training framework.
arXiv Detail & Related papers (2020-12-10T06:37:09Z)
- Understanding Classifier Mistakes with Generative Models [88.20470690631372]
Deep neural networks are effective on supervised learning tasks, but have been shown to be brittle.
In this paper, we leverage generative models to identify and characterize instances where classifiers fail to generalize.
Our approach is agnostic to class labels from the training set, which makes it applicable to models trained in a semi-supervised way.
arXiv Detail & Related papers (2020-10-05T22:13:21Z)
This list is automatically generated from the titles and abstracts of the papers listed on this site.