Training on the Test Model: Contamination in Ranking Distillation
- URL: http://arxiv.org/abs/2411.02284v1
- Date: Mon, 04 Nov 2024 17:11:14 GMT
- Title: Training on the Test Model: Contamination in Ranking Distillation
- Authors: Vishakha Suresh Kalal, Andrew Parry, Sean MacAvaney
- Abstract summary: We investigate the effect of a contaminated teacher model in a distillation setting.
We find that contamination occurs even when the test data represents a small fraction of the teacher's training samples.
- Score: 14.753216172912968
- License:
- Abstract: Neural approaches to ranking based on pre-trained language models are highly effective in ad-hoc search. However, the computational expense of these models can limit their application. As such, a process known as knowledge distillation is frequently applied to allow a smaller, efficient model to learn from an effective but expensive model. A key example of this is the distillation of expensive API-based commercial Large Language Models into smaller production-ready models. However, due to the opacity of training data and processes of most commercial models, one cannot ensure that a chosen test collection has not been observed previously, creating the potential for inadvertent data contamination. We, therefore, investigate the effect of a contaminated teacher model in a distillation setting. We evaluate several distillation techniques to assess the degree to which contamination occurs during distillation. By simulating a "worst-case" setting where the degree of contamination is known, we find that contamination occurs even when the test data represents a small fraction of the teacher's training samples. We, therefore, encourage caution when training using black-box teacher models where data provenance is ambiguous.
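A minimal, hypothetical sketch of one common score-based distillation objective (margin-MSE style) may help make the setup concrete; the toy feature-based student, dimensions, and synthetic data below are illustrative placeholders, not necessarily the models, losses, or data evaluated in the paper.

```python
# Sketch: margin-MSE style ranking distillation (illustrative, not the paper's exact setup).
# The student learns to reproduce the teacher's score margin between a relevant
# and a non-relevant document for each query.
import torch
import torch.nn as nn

class StudentScorer(nn.Module):
    """Toy student ranker scoring a precomputed (query, document) feature vector."""
    def __init__(self, dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.net(feats).squeeze(-1)  # one relevance score per (query, doc) pair

def margin_mse_step(student, optimizer, pos_feats, neg_feats, teacher_pos, teacher_neg):
    """One distillation step: match the student's score margin to the cached teacher margin."""
    student_margin = student(pos_feats) - student(neg_feats)
    teacher_margin = teacher_pos - teacher_neg  # scores produced offline by the teacher
    loss = nn.functional.mse_loss(student_margin, teacher_margin)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    torch.manual_seed(0)
    student = StudentScorer()
    optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
    # Synthetic stand-ins for (query, doc) features and cached teacher scores.
    pos_feats, neg_feats = torch.randn(16, 32), torch.randn(16, 32)
    teacher_pos, teacher_neg = torch.randn(16), torch.randn(16)
    print(margin_mse_step(student, optimizer, pos_feats, neg_feats, teacher_pos, teacher_neg))
```

Note that the teacher enters only through its cached scores: if the teacher has previously observed the test collection, those scores are the channel through which contamination can reach the student, which is the effect the paper examines.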
Related papers
- uDistil-Whisper: Label-Free Data Filtering for Knowledge Distillation in Low-Data Regimes [34.947522647009436]
We show that it is possible to distill large Whisper models into relatively small ones without using any labeled data.
Our models are also 25-50% more compute- and memory-efficient while maintaining performance equal to or better than that of the teacher model.
arXiv Detail & Related papers (2024-07-01T13:07:01Z)
- Small Effect Sizes in Malware Detection? Make Harder Train/Test Splits! [51.668411293817464]
Industry practitioners care about small improvements in malware detection accuracy because their models are deployed to hundreds of millions of machines.
Academic research is often constrained to public datasets on the order of ten thousand samples.
We devise an approach to generate a benchmark of a given difficulty from a pool of available samples.
arXiv Detail & Related papers (2023-12-25T21:25:55Z)
- Estimating Contamination via Perplexity: Quantifying Memorisation in Language Model Evaluation [2.4173424114751114]
We propose a novel method to quantify contamination without access to the full training set.
Our analysis provides evidence of significant memorisation by recent foundation models on popular reading comprehension and summarisation benchmarks, while multiple-choice benchmarks appear less contaminated (a rough perplexity check is sketched after this entry).
arXiv Detail & Related papers (2023-09-19T15:02:58Z)
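As a rough illustration of the perplexity idea above (not the method of that paper), one can score benchmark text with a causal language model and treat unusually low perplexity, relative to comparable held-out text, as a possible sign of memorisation; the model choice ("gpt2") and the sample text are arbitrary placeholders.

```python
# Rough, illustrative perplexity check for possible benchmark memorisation.
# Not the method of the paper above; "gpt2" and the sample text are placeholders.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under the language model (lower = more 'familiar')."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean token-level cross-entropy
    return math.exp(loss.item())

# Benchmark samples scoring far lower than comparable held-out text
# may indicate memorisation, i.e. contamination of the model's training data.
print(perplexity("Example benchmark passage to score."))
```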
- BOOT: Data-free Distillation of Denoising Diffusion Models with Bootstrapping [64.54271680071373]
Diffusion models have demonstrated excellent potential for generating diverse images.
Knowledge distillation has been recently proposed as a remedy that can reduce the number of inference steps to one or a few.
We present a novel technique called BOOT that overcomes these limitations with an efficient data-free distillation algorithm.
arXiv Detail & Related papers (2023-06-08T20:30:55Z)
- HomoDistil: Homotopic Task-Agnostic Distillation of Pre-trained Transformers [49.79405257763856]
This paper focuses on task-agnostic distillation.
It produces a compact pre-trained model that can be easily fine-tuned on various tasks with small computational costs and memory footprints.
We propose Homotopic Distillation (HomoDistil), a novel task-agnostic distillation approach equipped with iterative pruning.
arXiv Detail & Related papers (2023-02-19T17:37:24Z)
- Recognition of Defective Mineral Wool Using Pruned ResNet Models [88.24021148516319]
We developed a visual quality control system for mineral wool.
X-ray images of wool specimens were collected to create a training set of defective and non-defective samples.
We obtained a model with more than 98% accuracy, which, compared to the current procedure used at the company, can recognize 20% more defective products.
arXiv Detail & Related papers (2022-11-01T13:58:02Z)
- Robust and Resource-Efficient Data-Free Knowledge Distillation by Generative Pseudo Replay [5.3330804968579795]
Data-Free Knowledge Distillation (KD) allows knowledge transfer from a trained neural network (teacher) to a more compact one (student) in the absence of original training data.
Existing works use a validation set to monitor the accuracy of the student over real data and report the highest performance throughout the entire process.
However, validation data may not be available at distillation time either, making it infeasible to record the student snapshot that achieved the peak accuracy.
This is challenging because the student experiences knowledge degradation due to the distribution shift of the synthetic data.
We propose to model the distribution of the previously observed synthetic samples so that they can be replayed during distillation, mitigating this degradation.
arXiv Detail & Related papers (2022-01-09T14:14:28Z)
- Churn Reduction via Distillation [54.5952282395487]
We show an equivalence between training with distillation using the base model as the teacher and training with an explicit constraint on the predictive churn.
We then show that distillation performs strongly for low-churn training against a number of recent baselines.
arXiv Detail & Related papers (2021-06-04T18:03:31Z)
- Contrastive Model Inversion for Data-Free Knowledge Distillation [60.08025054715192]
We propose Contrastive Model Inversion (CMI), where data diversity is explicitly modeled as an optimizable objective.
Our main observation is that, under the constraint of the same amount of data, higher data diversity usually indicates stronger instance discrimination.
Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet demonstrate that CMI achieves significantly superior performance when the generated data are used for knowledge distillation.
arXiv Detail & Related papers (2021-05-18T15:13:00Z)