Data leakage in cross-modal retrieval training: A case study
- URL: http://arxiv.org/abs/2302.12258v1
- Date: Thu, 23 Feb 2023 09:51:03 GMT
- Title: Data leakage in cross-modal retrieval training: A case study
- Authors: Benno Weck and Xavier Serra
- Abstract summary: We study the recently proposed SoundDesc benchmark dataset, which was automatically sourced from the BBC Sound Effects web page.
We find that SoundDesc contains several duplicates that cause leakage of training data to the evaluation data.
We propose new training, validation, and testing splits for the dataset that we make available online.
- Score: 16.18916188804986
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The recent progress in text-based audio retrieval was largely propelled by
the release of suitable datasets. Since the manual creation of such datasets is
a laborious task, obtaining data from online resources can be a cheap solution
to create large-scale datasets. We study the recently proposed SoundDesc
benchmark dataset, which was automatically sourced from the BBC Sound Effects
web page. In our analysis, we find that SoundDesc contains several duplicates
that cause leakage of training data to the evaluation data. This data leakage
ultimately leads to overly optimistic retrieval performance estimates in
previous benchmarks. We propose new training, validation, and testing splits
for the dataset that we make available online. To avoid weak contamination of
the test data, we pool audio files that share similar recording setups. In our
experiments, we find that the new splits serve as a more challenging benchmark.
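The abstract describes two concrete steps: detecting duplicated recordings, and re-splitting the data so that similar recordings never land on both sides of the train/test boundary. The sketch below is a minimal, hypothetical illustration of that workflow; the MFCC fingerprinting, the distance threshold, and the use of scikit-learn's GroupShuffleSplit are assumptions made for illustration, not the authors' exact procedure.

```python
"""Illustrative sketch only: group-aware splitting to reduce train/test leakage.

Assumptions (not from the paper): each file is fingerprinted with a
time-averaged MFCC vector, near-duplicates are clustered with a simple
distance threshold, and every cluster is kept in a single split.
"""
import numpy as np
import librosa
from scipy.spatial.distance import cdist
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from sklearn.model_selection import GroupShuffleSplit


def fingerprint(path, sr=22050, n_mfcc=20):
    """Crude per-file fingerprint: mean MFCC vector over time."""
    audio, _ = librosa.load(path, sr=sr, mono=True)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)


def duplicate_groups(paths, threshold=1.0):
    """Group files whose fingerprints are closer than `threshold`.

    Connected components of the "is near-duplicate of" graph become group
    ids, so duplicates and near-duplicates always share a group label.
    """
    feats = np.stack([fingerprint(p) for p in paths])
    dists = cdist(feats, feats)
    adjacency = csr_matrix(dists < threshold)
    _, labels = connected_components(adjacency, directed=False)
    return labels


def leakage_free_split(paths, test_size=0.1, seed=0):
    """Train/test split in which no duplicate cluster is split apart."""
    groups = duplicate_groups(paths)
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(paths, groups=groups))
    return [paths[i] for i in train_idx], [paths[i] for i in test_idx]
```

The same grouping idea extends to pooling files that share a recording setup: any metadata key that identifies a recording session can be used as the group label in place of the fingerprint clusters.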
Related papers
- A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, which negatively impacts training efficiency and model performance.
Data selection has shown promise in identifying the most representative samples from the entire dataset.
We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
arXiv Detail & Related papers (2024-10-15T03:00:58Z) - Improving Pretraining Data Using Perplexity Correlations [56.41097718862742]
We build a new statistical framework for data selection centered around estimates of perplexity-benchmark correlations.
In controlled pretraining experiments at the 160M parameter scale on 8 benchmarks, our approach outperforms DSIR on every benchmark.
arXiv Detail & Related papers (2024-09-09T17:23:29Z) - Training on the Benchmark Is Not All You Need [52.01920740114261]
We propose a simple and effective data leakage detection method based on the contents of multiple-choice options.
Our method is able to work under black-box conditions without access to model training data or weights (a rough sketch of this general idea appears after this list).
We evaluate the degree of data leakage of 31 mainstream open-source LLMs on four benchmark datasets.
arXiv Detail & Related papers (2024-09-03T11:09:44Z) - Dataset Quantization with Active Learning based Adaptive Sampling [11.157462442942775]
We show that maintaining performance is feasible even with uneven sample distributions.
We propose a novel active learning based adaptive sampling strategy to optimize the sample selection.
Our approach outperforms the state-of-the-art dataset compression methods.
arXiv Detail & Related papers (2024-07-09T23:09:18Z) - Retrieval-Augmented Data Augmentation for Low-Resource Domain Tasks [66.87070857705994]
In low-resource settings, the amount of seed data samples to use for data augmentation is very small.
We propose a novel method that augments training data by incorporating a wealth of examples from other datasets.
This approach can ensure that the generated data is not only relevant but also more diverse than what could be achieved using the limited seed data alone.
arXiv Detail & Related papers (2024-02-21T02:45:46Z) - When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale [12.94829977468838]
Large volumes of text data have contributed significantly to the development of large language models.
To date, efforts to prune datasets down to a higher-quality subset have relied on hand-crafted heuristics encoded as rule-based filters.
We take a wider view and explore scalable estimates of data quality that can be used to measure the quality of pretraining data.
arXiv Detail & Related papers (2023-09-08T19:34:05Z) - Dataset Distillation: A Comprehensive Review [76.26276286545284]
Dataset distillation (DD) aims to derive a much smaller dataset of synthetic samples such that models trained on it achieve performance comparable to models trained on the original dataset.
This paper gives a comprehensive review and summary of recent advances in DD and its application.
arXiv Detail & Related papers (2023-01-17T17:03:28Z) - Addressing out-of-distribution label noise in webly-labelled data [8.625286650577134]
Data gathering and annotation using a search engine is a simple alternative to generating a fully human-annotated dataset.
Although web crawling is very time efficient, some of the retrieved images are unavoidably noisy.
Designing robust algorithms for training on noisy data gathered from the web is an important research direction.
arXiv Detail & Related papers (2021-10-26T13:38:50Z) - Continual Learning for Fake Audio Detection [62.54860236190694]
This paper proposes Detecting Fake Without Forgetting, a continual-learning-based method that enables the model to learn new spoofing attacks incrementally.
Experiments are conducted on the ASVspoof 2019 dataset.
arXiv Detail & Related papers (2021-04-15T07:57:05Z)
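As a follow-up to the "Training on the Benchmark Is Not All You Need" entry above, the sketch below illustrates one rough, hypothetical way to probe for benchmark leakage using only inference-time loss values (no access to training data): compare a model's loss on a multiple-choice item in its original option order against shuffled orders. The model name, prompt format, and scoring rule are illustrative assumptions, not necessarily the cited paper's exact method.

```python
"""Hypothetical sketch of probing for multiple-choice benchmark leakage.

Hedged assumption: a model that memorised a benchmark tends to assign a
noticeably lower loss to the options in their original published order
than to the same options in a shuffled order.
"""
import itertools
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # stand-in model; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()


def sequence_loss(text):
    """Average token-level cross-entropy of `text` under the model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    return out.loss.item()


def order_advantage(question, options):
    """Loss gap between the original option order and shuffled orders.

    A strongly negative value (original order is much "easier") is a hint,
    not proof, that the item may have leaked into the training data.
    """
    def render(opts):
        lines = [f"{chr(65 + i)}. {o}" for i, o in enumerate(opts)]
        return question + "\n" + "\n".join(lines)

    original = sequence_loss(render(options))
    shuffled = [
        sequence_loss(render(list(perm)))
        for perm in itertools.permutations(options)
        if list(perm) != list(options)
    ]
    return original - sum(shuffled) / len(shuffled)


print(order_advantage(
    "What is the capital of France?",
    ["Paris", "London", "Berlin", "Madrid"],
))
```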
This list is automatically generated from the titles and abstracts of the papers on this site.