SemDeDup: Data-efficient learning at web-scale through semantic
deduplication
- URL: http://arxiv.org/abs/2303.09540v3
- Date: Wed, 22 Mar 2023 17:22:35 GMT
- Title: SemDeDup: Data-efficient learning at web-scale through semantic
deduplication
- Authors: Amro Abbas, Kushal Tirumala, Dániel Simig, Surya Ganguli, Ari S.
  Morcos
- Abstract summary: We introduce SemDeDup, a method which leverages embeddings from pre-trained models to identify and remove semantic duplicates.
We show that SemDeDup can remove 50% of the data with minimal performance loss, effectively halving training time.
Also, analyzing language models trained on C4, a partially curated dataset, we show that SemDeDup improves over prior approaches while providing efficiency gains.
- Score: 34.38272674518666
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Progress in machine learning has been driven in large part by massive
increases in data. However, large web-scale datasets such as LAION are largely
uncurated beyond searches for exact duplicates, potentially leaving much
redundancy. Here, we introduce SemDeDup, a method which leverages embeddings
from pre-trained models to identify and remove semantic duplicates: data pairs
which are semantically similar, but not exactly identical. Removing semantic
duplicates preserves performance and speeds up learning. Analyzing a subset of
LAION, we show that SemDeDup can remove 50% of the data with minimal
performance loss, effectively halving training time. Moreover, performance
increases out of distribution. Also, analyzing language models trained on C4, a
partially curated dataset, we show that SemDeDup improves over prior approaches
while providing efficiency gains. SemDeDup provides an example of how simple
ways of leveraging quality embeddings can be used to make models learn faster
with less data.
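To make the idea concrete, below is a minimal sketch of embedding-based semantic deduplication as described in the abstract: embed every example with a frozen pre-trained encoder, group the embeddings, and within each group drop all but one member of any near-duplicate set. This is not the authors' released implementation; the k-means clustering step, the cosine-similarity threshold, and the function and parameter names (`semantic_dedup`, `n_clusters`, `threshold`) are illustrative assumptions.

```python
# Illustrative sketch of semantic deduplication via pre-trained embeddings.
# Assumptions (not taken from the paper text): k-means is used to keep pairwise
# comparisons tractable, and `threshold` is a tunable cosine-similarity cutoff.
import numpy as np
from sklearn.cluster import KMeans


def semantic_dedup(embeddings: np.ndarray, n_clusters: int = 100,
                   threshold: float = 0.95, seed: int = 0) -> np.ndarray:
    """Return indices of examples to keep after removing semantic duplicates."""
    # L2-normalize so that dot products equal cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

    # Cluster first so similarity is only computed within each cluster.
    labels = KMeans(n_clusters=n_clusters, random_state=seed,
                    n_init=10).fit_predict(normed)

    keep = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        if len(idx) == 0:
            continue
        sims = normed[idx] @ normed[idx].T  # pairwise cosine similarities
        removed = np.zeros(len(idx), dtype=bool)
        for i in range(len(idx)):
            if removed[i]:
                continue
            keep.append(idx[i])
            # Mark later cluster members that are near-duplicates of the kept one.
            removed |= (sims[i] > threshold) & (np.arange(len(idx)) > i)
    return np.array(sorted(keep))
```

In a web-scale setting such as LAION, the embeddings would come from a model like CLIP, and the threshold can be tuned so that a target fraction of the data (e.g., the 50% reported in the abstract) is retained.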
Related papers
- Data Debiasing with Datamodels (D3M): Improving Subgroup Robustness via Data Selection [80.85902083005237] (arXiv, 2024-06-24)
  We introduce Data Debiasing with Datamodels (D3M), a debiasing approach which isolates and removes specific training examples that drive the model's failures on minority groups.
- FairDeDup: Detecting and Mitigating Vision-Language Fairness Disparities in Semantic Dataset Deduplication [28.495688931328882] (arXiv, 2024-04-24)
  We introduce an easy-to-implement modification to the recent SemDeDup algorithm that can reduce the negative effects that we observe.
  We find that our proposed FairDeDup algorithm consistently leads to improved fairness metrics over SemDeDup on the FairFace and FACET datasets.
- Can LLMs Separate Instructions From Data? And What Do We Even Mean By That? [60.50127555651554] (arXiv, 2024-03-11)
  Large Language Models (LLMs) show impressive results in numerous practical applications, but they lack essential safety features.
  This makes them vulnerable to manipulations such as indirect prompt injections and generally unsuitable for safety-critical tasks.
  We introduce a formal measure for instruction-data separation and an empirical variant that can be computed from a model's outputs.
- Efficient Grammatical Error Correction via Multi-Task Training and Optimized Training Schedule [55.08778142798106] (arXiv, 2023-11-20)
  We propose auxiliary tasks that exploit the alignment between the original and corrected sentences.
  We formulate each task as a sequence-to-sequence problem and perform multi-task training.
  We find that the order of datasets used for training, and even individual instances within a dataset, can have important effects on final performance.
- Semantically Redundant Training Data Removal and Deep Model Classification Performance: A Study with Chest X-rays [5.454938535500864] (arXiv, 2023-09-18)
  We propose an entropy-based sample scoring approach to identify and remove semantically redundant training data.
  Using the publicly available NIH chest X-ray dataset, we demonstrate that a model trained on the resulting informative subset of the training data significantly outperforms a model trained on the full training set.
- Improved Distribution Matching for Dataset Condensation [91.55972945798531] (arXiv, 2023-07-19)
  We propose a novel dataset condensation method based on distribution matching.
  Our simple yet effective method outperforms most previous optimization-oriented methods with much fewer computational resources.
- CLIP: Train Faster with Less Data [3.2575001434344286] (arXiv, 2022-12-02)
  Deep learning models require an enormous amount of data for training.
  Recently, there has been a shift in machine learning from model-centric to data-centric approaches.
  We propose CLIP, i.e., Curriculum Learning with Iterative data Pruning.
- Scaling Laws and Interpretability of Learning from Repeated Data [4.3242395495523525] (arXiv, 2022-05-21)
  We train a family of models where most of the data is unique but a small fraction of it is repeated many times.
  We find a strong double descent phenomenon, in which repeated data can cause test loss to increase midway through training.
  A predictable range of repetition frequencies leads to surprisingly severe degradation in performance.
- Reminding the Incremental Language Model via Data-Free Self-Distillation [26.960750314663294] (arXiv, 2021-10-17)
  Incremental language learning with pseudo-data can alleviate catastrophic forgetting in neural networks.
  We propose reminding the incremental language model via data-free self-distillation (DFSD).
  Our DFSD can exceed previous state-of-the-art methods even when the pseudo-data is reduced by as much as 90%.
- Deduplicating Training Data Makes Language Models Better [50.22588162039083] (arXiv, 2021-07-14)
  Existing language modeling datasets contain many near-duplicate examples and long repetitive substrings.
  Over 1% of the unprompted output of language models trained on these datasets is copied verbatim from the training data.
  We develop two tools that allow us to deduplicate training datasets.
This list is automatically generated from the titles and abstracts of the papers on this site.