SemDeDup: Data-efficient learning at web-scale through semantic
deduplication
- URL: http://arxiv.org/abs/2303.09540v3
- Date: Wed, 22 Mar 2023 17:22:35 GMT
- Title: SemDeDup: Data-efficient learning at web-scale through semantic
deduplication
- Authors: Amro Abbas, Kushal Tirumala, Dániel Simig, Surya Ganguli, Ari S.
  Morcos
- Abstract summary: We introduce SemDeDup, a method which leverages embeddings from pre-trained models to identify and remove semantic duplicates.
We show that SemDeDup can remove 50% of the data with minimal performance loss, effectively halving training time.
Also, analyzing language models trained on C4, a partially curated dataset, we show that SemDeDup improves over prior approaches while providing efficiency gains.
- Score: 34.38272674518666
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Progress in machine learning has been driven in large part by massive
increases in data. However, large web-scale datasets such as LAION are largely
uncurated beyond searches for exact duplicates, potentially leaving much
redundancy. Here, we introduce SemDeDup, a method which leverages embeddings
from pre-trained models to identify and remove semantic duplicates: data pairs
which are semantically similar, but not exactly identical. Removing semantic
duplicates preserves performance and speeds up learning. Analyzing a subset of
LAION, we show that SemDeDup can remove 50% of the data with minimal
performance loss, effectively halving training time. Moreover, performance
increases out of distribution. Also, analyzing language models trained on C4, a
partially curated dataset, we show that SemDeDup improves over prior approaches
while providing efficiency gains. SemDeDup provides an example of how simple
ways of leveraging quality embeddings can be used to make models learn faster
with less data.
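To make the idea concrete, below is a minimal sketch of embedding-based semantic deduplication as described in the abstract: embed every example with a frozen pre-trained encoder, group the embeddings, and within each group drop all but one member of any near-duplicate set. This is not the authors' released implementation; the k-means clustering step, the cosine-similarity threshold, and the function and parameter names (`semantic_dedup`, `n_clusters`, `threshold`) are illustrative assumptions.

```python
# Illustrative sketch of semantic deduplication via pre-trained embeddings.
# Assumptions (not taken from the paper text): k-means is used to keep pairwise
# comparisons tractable, and `threshold` is a tunable cosine-similarity cutoff.
import numpy as np
from sklearn.cluster import KMeans


def semantic_dedup(embeddings: np.ndarray, n_clusters: int = 100,
                   threshold: float = 0.95, seed: int = 0) -> np.ndarray:
    """Return indices of examples to keep after removing semantic duplicates."""
    # L2-normalize so that dot products equal cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

    # Cluster first so similarity is only computed within each cluster.
    labels = KMeans(n_clusters=n_clusters, random_state=seed,
                    n_init=10).fit_predict(normed)

    keep = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        if len(idx) == 0:
            continue
        sims = normed[idx] @ normed[idx].T  # pairwise cosine similarities
        removed = np.zeros(len(idx), dtype=bool)
        for i in range(len(idx)):
            if removed[i]:
                continue
            keep.append(idx[i])
            # Mark later cluster members that are near-duplicates of the kept one.
            removed |= (sims[i] > threshold) & (np.arange(len(idx)) > i)
    return np.array(sorted(keep))
```

In a web-scale setting such as LAION, the embeddings would come from a model like CLIP, and the threshold can be tuned so that a target fraction of the data (e.g., the 50% reported in the abstract) is retained.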
Related papers
- Data Debiasing with Datamodels (D3M): Improving Subgroup Robustness via Data Selection [80.85902083005237] (arXiv, 2024-06-24)
  We introduce Data Debiasing with Datamodels (D3M), a debiasing approach which isolates and removes specific training examples that drive the model's failures on minority groups.
- FairDeDup: Detecting and Mitigating Vision-Language Fairness Disparities in Semantic Dataset Deduplication [28.495688931328882] (arXiv, 2024-04-24)
  We introduce an easy-to-implement modification to the recent SemDeDup algorithm that can reduce the negative effects that we observe.
  We find that our proposed FairDeDup algorithm consistently leads to improved fairness metrics over SemDeDup on the FairFace and FACET datasets.
- Can LLMs Separate Instructions From Data? And What Do We Even Mean By That? [60.50127555651554] (arXiv, 2024-03-11)
  Large Language Models (LLMs) show impressive results in numerous practical applications, but they lack essential safety features.
  This makes them vulnerable to manipulations such as indirect prompt injections and generally unsuitable for safety-critical tasks.
  We introduce a formal measure for instruction-data separation and an empirical variant that can be computed from a model's outputs.
- Efficient Grammatical Error Correction via Multi-Task Training and Optimized Training Schedule [55.08778142798106] (arXiv, 2023-11-20)
  We propose auxiliary tasks that exploit the alignment between the original and corrected sentences.
  We formulate each task as a sequence-to-sequence problem and perform multi-task training.
  We find that the order of datasets used for training, and even individual instances within a dataset, can have important effects on final performance.
- Semantically Redundant Training Data Removal and Deep Model Classification Performance: A Study with Chest X-rays [5.454938535500864] (arXiv, 2023-09-18)
  We propose an entropy-based sample scoring approach to identify and remove semantically redundant training data.
  Using the publicly available NIH chest X-ray dataset, we demonstrate that a model trained on the resulting informative subset of the training data significantly outperforms a model trained on the full training set.
- Improved Distribution Matching for Dataset Condensation [91.55972945798531] (arXiv, 2023-07-19)
  We propose a novel dataset condensation method based on distribution matching.
  Our simple yet effective method outperforms most previous optimization-oriented methods with much fewer computational resources.
- CLIP: Train Faster with Less Data [3.2575001434344286] (arXiv, 2022-12-02)
  Deep learning models require an enormous amount of data for training.
  Recently, there has been a shift in machine learning from model-centric to data-centric approaches.
  We propose CLIP, i.e., Curriculum Learning with Iterative data Pruning.
- Scaling Laws and Interpretability of Learning from Repeated Data [4.3242395495523525] (arXiv, 2022-05-21)
  We train a family of models where most of the data is unique but a small fraction of it is repeated many times.
  We find a strong double descent phenomenon, in which repeated data can cause test loss to increase midway through training.
  A predictable range of repetition frequencies leads to surprisingly severe degradation in performance.
- Reminding the Incremental Language Model via Data-Free Self-Distillation [26.960750314663294] (arXiv, 2021-10-17)
  Incremental language learning with pseudo-data can alleviate catastrophic forgetting in neural networks.
  We propose reminding the incremental language model via data-free self-distillation (DFSD).
  Our DFSD can exceed previous state-of-the-art methods even when the pseudo-data is reduced by as much as 90%.
- Deduplicating Training Data Makes Language Models Better [50.22588162039083] (arXiv, 2021-07-14)
  Existing language modeling datasets contain many near-duplicate examples and long repetitive substrings.
  Over 1% of the unprompted output of language models trained on these datasets is copied verbatim from the training data.
  We develop two tools that allow us to deduplicate training datasets.
This list is automatically generated from the titles and abstracts of the papers on this site.