Data Contamination: From Memorization to Exploitation
- URL: http://arxiv.org/abs/2203.08242v1
- Date: Tue, 15 Mar 2022 20:37:16 GMT
- Title: Data Contamination: From Memorization to Exploitation
- Authors: Inbal Magar and Roy Schwartz
- Abstract summary: It is not clear to what extent models exploit contaminated data for downstream tasks.
We pretrain BERT models on joint corpora of Wikipedia and labeled downstream datasets, and fine-tune them on the relevant task.
Experiments with two models and three downstream tasks show that exploitation exists in some cases, while in others the models memorize the contaminated data but do not exploit it.
- Score: 5.997909991352044
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pretrained language models are typically trained on massive web-based
datasets, which are often "contaminated" with downstream test sets. It is not
clear to what extent models exploit the contaminated data for downstream tasks.
We present a principled method to study this question. We pretrain BERT models
on joint corpora of Wikipedia and labeled downstream datasets, and fine-tune
them on the relevant task. Comparing performance between samples seen and
unseen during pretraining enables us to define and quantify levels of
memorization and exploitation. Experiments with two models and three downstream
tasks show that exploitation exists in some cases, while in others the models
memorize the contaminated data but do not exploit it. We show that these two
measures are affected by different factors such as the number of duplications
of the contaminated data and the model size. Our results highlight the
importance of analyzing massive web-scale datasets to verify that progress in
NLP is obtained by better language understanding and not better data
exploitation.
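The seen-versus-unseen comparison lends itself to a simple quantitative reading. The following sketch is a minimal illustration, not the authors' code; the function names and the exact gap definitions are assumptions, and the paper gives the precise formulation.

```python
# Minimal sketch: quantify exploitation and memorization as performance gaps
# between downstream test samples that appeared in the pretraining corpus
# ("seen", i.e. contaminated) and samples that did not ("unseen").
# Function names and exact gap definitions are illustrative assumptions.
from typing import Sequence


def accuracy(preds: Sequence[int], labels: Sequence[int]) -> float:
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def exploitation_gap(seen_preds, seen_labels, unseen_preds, unseen_labels) -> float:
    """Gap in fine-tuned downstream accuracy; a clearly positive value suggests
    the model exploits contaminated samples rather than merely recalling them."""
    return accuracy(seen_preds, seen_labels) - accuracy(unseen_preds, unseen_labels)


def memorization_gap(pretrain_score_seen: float, pretrain_score_unseen: float) -> float:
    """Gap on the pretraining objective itself (e.g. masked-token recovery),
    probing memorization independently of downstream exploitation."""
    return pretrain_score_seen - pretrain_score_unseen
```

Comparing the two gaps is what separates the cases the abstract describes: a model can show a large memorization gap while its exploitation gap stays near zero.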
Related papers
- Downstream-Pretext Domain Knowledge Traceback for Active Learning [138.02530777915362]
We propose a downstream-pretext domain knowledge traceback (DOKT) method that traces the data interactions of downstream knowledge and pre-training guidance.
DOKT consists of a traceback diversity indicator and a domain-based uncertainty estimator.
Experiments conducted on ten datasets show that our model outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2024-07-20T01:34:13Z)
- A Closer Look at Benchmarking Self-Supervised Pre-training with Image Classification [51.35500308126506]
Self-supervised learning (SSL) is a machine learning approach where the data itself provides supervision, eliminating the need for external labels.
We study how classification-based evaluation protocols for SSL correlate and how well they predict downstream performance on different dataset types.
arXiv Detail & Related papers (2024-07-16T23:17:36Z)
- A Taxonomy for Data Contamination in Large Language Models [12.643103231497813]
A growing concern is data contamination, where evaluation datasets may be contained in the pretraining corpus.
Decontamination, the process of detecting and removing such data, is a potential solution.
How different types of contamination impact the performance of language models on downstream tasks is not fully understood.
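As one concrete, hedged illustration of the detection step that decontamination depends on, the sketch below flags evaluation examples whose n-grams overlap with an index built from the pretraining corpus; the n-gram size and the any-overlap rule are arbitrary choices for the example, not the procedure proposed in the paper.

```python
# Hedged sketch of a common decontamination heuristic: flag evaluation examples
# whose n-grams overlap with an index built from the pretraining corpus.
# The n-gram size and the any-overlap rule are illustrative assumptions.
from typing import Iterable, Set


def ngrams(text: str, n: int = 8) -> Set[str]:
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_corpus_index(documents: Iterable[str], n: int = 8) -> Set[str]:
    index: Set[str] = set()
    for doc in documents:
        index |= ngrams(doc, n)
    return index


def is_contaminated(eval_example: str, corpus_index: Set[str], n: int = 8) -> bool:
    """True if any n-gram of the evaluation example appears in the corpus index."""
    return bool(ngrams(eval_example, n) & corpus_index)
```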
arXiv Detail & Related papers (2024-07-11T17:50:34Z)
- Extracting Training Data from Unconditional Diffusion Models [76.85077961718875]
Diffusion probabilistic models (DPMs) are being employed as mainstream models for generative artificial intelligence (AI).
We aim to establish a theoretical understanding of memorization in DPMs with 1) a memorization metric for theoretical analysis, 2) an analysis of conditional memorization with informative and random labels, and 3) two better evaluation metrics for measuring memorization.
Based on the theoretical analysis, we propose a novel data extraction method called Surrogate condItional Data Extraction (SIDE) that leverages a classifier trained on generated data as a surrogate condition to extract training data directly from unconditional diffusion models.
arXiv Detail & Related papers (2024-06-18T16:20:12Z)
- Mind the Gap Between Synthetic and Real: Utilizing Transfer Learning to Probe the Boundaries of Stable Diffusion Generated Data [2.6016285265085526]
Student models show a significant drop in accuracy compared to models trained on real data.
By training individual layers using either real or synthetic data, we reveal that the drop mainly stems from the model's final layers.
Our results suggest an improved trade-off between the amount of real training data used and the model's accuracy.
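One way to read this layer-wise analysis is as a probing experiment: freeze a student trained on synthetic images and retrain only the final classification layer on real data. The sketch below shows the mechanics under assumed choices (ResNet-18 as a stand-in student, Adam on the new head); it is not the paper's exact protocol.

```python
# Hedged sketch of the layer-wise probing idea: freeze a student network trained
# on synthetic (e.g. Stable Diffusion generated) images and retrain only its
# final classification layer on real data, to see how much of the accuracy gap
# the last layers account for. ResNet-18 and the optimizer are stand-ins.
from typing import Tuple

import torch
import torch.nn as nn
from torchvision import models


def probe_final_layer(num_classes: int, lr: float = 1e-3) -> Tuple[nn.Module, torch.optim.Optimizer]:
    model = models.resnet18(weights=None)  # placeholder for the synthetic-data-trained student
    for param in model.parameters():
        param.requires_grad = False        # keep everything learned from synthetic data fixed
    model.fc = nn.Linear(model.fc.in_features, num_classes)  # new head, trained on real data
    optimizer = torch.optim.Adam(model.fc.parameters(), lr=lr)
    return model, optimizer
```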
arXiv Detail & Related papers (2024-05-06T07:51:13Z)
- Measuring and Improving Attentiveness to Partial Inputs with Counterfactuals [91.59906995214209]
We propose a new evaluation method, the Counterfactual Attentiveness Test (CAT).
CAT uses counterfactuals by replacing part of the input with its counterpart from a different example, expecting an attentive model to change its prediction.
We show that GPT3 becomes less attentive with an increased number of demonstrations, while its accuracy on the test data improves.
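For a paired-input task such as NLI, the counterfactual construction can be sketched as follows; the predict() interface and the choice to swap the premise are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch of the counterfactual idea for a paired-input task such as NLI:
# swap one part of the input (here, the premise) with that of another example
# and check whether the model's prediction changes. Assumes at least two examples.
import random
from typing import Callable, Sequence, Tuple


def cat_attentiveness(
    examples: Sequence[Tuple[str, str]],   # (premise, hypothesis) pairs
    predict: Callable[[str, str], int],    # black-box model prediction (assumed interface)
    seed: int = 0,
) -> float:
    """Fraction of examples whose prediction flips when a foreign premise is swapped in."""
    rng = random.Random(seed)
    flipped = 0
    for i, (premise, hypothesis) in enumerate(examples):
        j = rng.choice([k for k in range(len(examples)) if k != i])
        foreign_premise = examples[j][0]
        if predict(premise, hypothesis) != predict(foreign_premise, hypothesis):
            flipped += 1
    return flipped / len(examples)
```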
arXiv Detail & Related papers (2023-11-16T06:27:35Z)
- From Zero to Hero: Detecting Leaked Data through Synthetic Data Injection and Model Querying [10.919336198760808]
We introduce a novel methodology to detect leaked data that are used to train classification models.
LDSS involves injecting a small volume of synthetic data, characterized by local shifts in class distribution, into the owner's dataset.
This enables the effective identification of models trained on leaked data through model querying alone.
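A minimal sketch of the querying step implied here: the data owner keeps the injected synthetic probes with their locally shifted labels and checks how often a suspect model reproduces them. The probe format and predict() interface are assumptions; the paper defines the actual LDSS protocol.

```python
# Hedged sketch of leak detection by model querying: count how often a suspect
# model reproduces the deliberately shifted labels of the injected probes.
# Probe format, predict() interface, and thresholding are assumptions.
from typing import Callable, List, Sequence, Tuple


def leaked_data_score(
    probes: Sequence[Tuple[List[float], int]],  # (synthetic feature vector, shifted label)
    predict: Callable[[List[float]], int],      # black-box query to the suspect model
) -> float:
    """Fraction of injected probes whose shifted label the suspect model predicts."""
    hits = sum(predict(x) == shifted_label for x, shifted_label in probes)
    return hits / len(probes)
```

A score well above what reference models trained only on clean data reach would indicate that the leaked dataset was used for training.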
arXiv Detail & Related papers (2023-10-06T10:36:28Z)
- How Well Do Sparse Imagenet Models Transfer? [75.98123173154605]
Transfer learning is a classic paradigm by which models pretrained on large "upstream" datasets are adapted to yield good results on "downstream" datasets.
In this work, we perform an in-depth investigation of this phenomenon in the context of convolutional neural networks (CNNs) trained on the ImageNet dataset.
We show that sparse models can match or even outperform the transfer performance of dense models, even at high sparsities.
arXiv Detail & Related papers (2021-11-26T11:58:51Z)
- On the Efficacy of Adversarial Data Collection for Question Answering: Results from a Large-Scale Randomized Study [65.17429512679695]
In adversarial data collection (ADC), a human workforce interacts with a model in real time, attempting to produce examples that elicit incorrect predictions.
Despite ADC's intuitive appeal, it remains unclear when training on adversarial datasets produces more robust models.
arXiv Detail & Related papers (2021-06-02T00:48:33Z)
- Data Impressions: Mining Deep Models to Extract Samples for Data-free Applications [26.48630545028405]
"Data Impressions" act as proxy to the training data and can be used to realize a variety of tasks.
We show the applicability of data impressions in solving several computer vision tasks.
arXiv Detail & Related papers (2021-01-15T11:37:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.