On Pretraining Data Diversity for Self-Supervised Learning
- URL: http://arxiv.org/abs/2403.13808v3
- Date: Thu, 18 Jul 2024 09:15:00 GMT
- Title: On Pretraining Data Diversity for Self-Supervised Learning
- Authors: Hasan Abed Al Kader Hammoud, Tuhin Das, Fabio Pizzati, Philip Torr, Adel Bibi, Bernard Ghanem
- Abstract summary: We explore the impact of training with more diverse datasets on the performance of self-supervised learning (SSL) under a fixed computational budget.
Our findings consistently demonstrate that increasing pretraining data diversity enhances SSL performance, albeit only when the distribution distance to the downstream data is minimal.
- Score: 57.91495006862553
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We explore the impact of training with more diverse datasets, characterized by the number of unique samples, on the performance of self-supervised learning (SSL) under a fixed computational budget. Our findings consistently demonstrate that increasing pretraining data diversity enhances SSL performance, albeit only when the distribution distance to the downstream data is minimal. Notably, even with an exceptionally large pretraining data diversity achieved through methods like web crawling or diffusion-generated data, among other ways, the distribution shift remains a challenge. Our experiments are comprehensive with seven SSL methods using large-scale datasets such as ImageNet and YFCC100M amounting to over 200 GPU days. Code and trained models are available at https://github.com/hammoudhasan/DiversitySSL
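As a way to picture the fixed-budget protocol described in the abstract, the sketch below holds the number of gradient updates constant while varying the number of unique pretraining samples. It is a minimal illustration rather than the released code; `ssl_method.update` and the dataset object are hypothetical stand-ins for one gradient step of whichever SSL objective is used and the pretraining data.
```python
# Minimal sketch (not the released code): fix the compute budget in gradient
# updates and vary data diversity via the number of unique samples.
import random

def pretrain_fixed_budget(ssl_method, dataset, num_unique, total_updates,
                          batch_size=256, seed=0):
    """Run exactly `total_updates` SSL steps; fewer unique samples simply
    means each sample is revisited more often within the same budget."""
    rng = random.Random(seed)
    # Control diversity: keep only `num_unique` distinct samples.
    subset_idx = rng.sample(range(len(dataset)), num_unique)
    for _ in range(total_updates):
        batch = [dataset[i] for i in (rng.choice(subset_idx) for _ in range(batch_size))]
        ssl_method.update(batch)  # one gradient step of the chosen SSL objective
    return ssl_method
```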
Related papers
- A Closer Look at Benchmarking Self-Supervised Pre-training with Image Classification [51.35500308126506]
Self-supervised learning (SSL) is a machine learning approach where the data itself provides supervision, eliminating the need for external labels.
We study how classification-based evaluation protocols for SSL correlate and how well they predict downstream performance on different dataset types.
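Linear probing on frozen features is a typical classification-based evaluation protocol of the kind studied here; the sketch below is a generic illustration with scikit-learn and hypothetical feature arrays, not the paper's exact setup.
```python
# Linear probing on frozen features -- a common classification-based SSL
# evaluation protocol. Generic sketch; feature arrays are assumed to come
# from a frozen pretrained encoder.
from sklearn.linear_model import LogisticRegression

def linear_probe(train_feats, train_labels, test_feats, test_labels):
    """Fit a linear classifier on frozen features and report test accuracy."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_feats, train_labels)
    return clf.score(test_feats, test_labels)

# Usage (hypothetical arrays extracted once with the frozen encoder):
# acc = linear_probe(f_train, y_train, f_test, y_test)
```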
arXiv Detail & Related papers (2024-07-16T23:17:36Z)
- On Improving the Algorithm-, Model-, and Data- Efficiency of Self-Supervised Learning [18.318758111829386]
We propose an efficient single-branch SSL method based on non-parametric instance discrimination.
We also propose a novel self-distillation loss that minimizes the KL divergence between the probability distribution and its square root version.
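A minimal sketch of such a loss is below; the softmax normalization, the KL direction, and the re-normalization of the square-root distribution are our own assumptions and may differ from the paper's exact formulation.
```python
# Sketch of a self-distillation loss between a predicted distribution and its
# re-normalized square-root version (assumed formulation, not the paper's code).
import torch
import torch.nn.functional as F

def sqrt_self_distillation_loss(logits):
    """KL(sqrt(p)/Z || p), averaged over the batch."""
    p = F.softmax(logits, dim=-1)
    q = torch.sqrt(p)
    q = q / q.sum(dim=-1, keepdim=True)  # re-normalize to a valid distribution
    # F.kl_div expects log-probabilities as input and probabilities as target,
    # and computes KL(target || input).
    return F.kl_div(p.log(), q, reduction="batchmean")

# loss = sqrt_self_distillation_loss(model(images))
```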
arXiv Detail & Related papers (2024-04-30T06:39:04Z)
- Federated Multilingual Models for Medical Transcript Analysis [11.877236847857336]
We present a federated learning system for training a large-scale multi-lingual model.
None of the training data is ever transmitted to any central location.
We show that the global model performance can be further improved by a training step performed locally.
arXiv Detail & Related papers (2022-11-04T01:07:54Z)
- Semi-supervised Learning with Deterministic Labeling and Large Margin Projection [25.398314796157933]
The centrality and diversity of the labeled data strongly influence the performance of semi-supervised learning (SSL).
This study learns a kernelized large-margin metric for a small set of the most stable and most divergent samples, identified based on the OLF structure.
Owing to this design, the accuracy and stability of the OLF-based SSL model are significantly improved over its baseline methods.
arXiv Detail & Related papers (2022-08-17T04:09:35Z)
- DATA: Domain-Aware and Task-Aware Pre-training [94.62676913928831]
We present DATA, a simple yet effective neural architecture search (NAS) approach specialized for self-supervised learning (SSL).
Our method achieves promising results across a wide range of computation costs on downstream tasks, including image classification, object detection and semantic segmentation.
arXiv Detail & Related papers (2022-03-17T02:38:49Z)
- Self-Supervised Learning of Graph Neural Networks: A Unified Review [50.71341657322391]
Self-supervised learning is emerging as a new paradigm for making use of large amounts of unlabeled data.
We provide a unified review of different ways of training graph neural networks (GNNs) using SSL.
Our treatment of SSL methods for GNNs sheds light on the similarities and differences of various methods, setting the stage for developing new methods and algorithms.
arXiv Detail & Related papers (2021-02-22T03:43:45Z)
- Quasi-Global Momentum: Accelerating Decentralized Deep Learning on Heterogeneous Data [77.88594632644347]
Decentralized training of deep learning models is a key element for enabling data privacy and on-device learning over networks.
In realistic learning scenarios, the presence of heterogeneity across different clients' local datasets poses an optimization challenge.
We propose a novel momentum-based method to mitigate this decentralized training difficulty.
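For context, the sketch below illustrates the decentralized setting only: a local heavy-ball momentum step followed by gossip averaging with neighbors. It is not the paper's quasi-global momentum estimator, and all names in it are hypothetical.
```python
# Illustration of decentralized SGD with local momentum and gossip averaging
# (generic setting only, NOT the quasi-global momentum method itself).
import numpy as np

def decentralized_step(params, buffers, grads, neighbors, lr=0.1, beta=0.9):
    """One round: local momentum SGD per worker, then neighbor averaging.

    params    -- list of flat parameter vectors (np.ndarray), one per worker
    buffers   -- list of momentum buffers, one per worker
    grads     -- stochastic gradients computed on each worker's local data
    neighbors -- adjacency list; neighbors[i] are the workers i averages with
    """
    # Local heavy-ball update on each worker.
    for i in range(len(params)):
        buffers[i] = beta * buffers[i] + grads[i]
        params[i] = params[i] - lr * buffers[i]
    # Gossip step: each worker moves to the mean of its neighborhood
    # (itself included); data heterogeneity makes these means disagree.
    return [np.mean([params[j] for j in neighbors[i] + [i]], axis=0)
            for i in range(len(params))], buffers
```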
arXiv Detail & Related papers (2021-02-09T11:27:14Z)
- Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
To keep training on the enlarged dataset efficient, we propose applying a dataset distillation strategy to compress the created dataset into several informative class-wise images.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)
- Federated Visual Classification with Real-World Data Distribution [9.564468846277366]
We characterize the effect real-world data distributions have on distributed learning, using as a benchmark the standard Federated Averaging (FedAvg) algorithm.
We introduce two new large-scale datasets for species and landmark classification, with realistic per-user data splits.
We also develop two new algorithms (FedVC, FedIR) that intelligently resample and reweight over the client pool, bringing large improvements in accuracy and stability in training.
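For reference, a minimal sketch of the FedAvg baseline mentioned above is given below; `local_train` and the client data structures are hypothetical placeholders, and the FedVC/FedIR resampling and reweighting schemes are not implemented here.
```python
# Minimal sketch of one Federated Averaging (FedAvg) communication round over
# flat parameter vectors; local training is abstracted behind `local_train`.
import numpy as np

def fedavg_round(global_params, clients, local_train, clients_per_round=10,
                 rng=np.random):
    """Select clients, train locally, and average models weighted by data size."""
    k = min(clients_per_round, len(clients))
    selected = rng.choice(len(clients), size=k, replace=False)
    updates, weights = [], []
    for idx in selected:
        client_data = clients[idx]
        # Each client starts from the global model and runs a few local epochs.
        updates.append(local_train(global_params, client_data))
        weights.append(len(client_data))  # weight clients by number of examples
    weights = np.asarray(weights, dtype=float)
    weights /= weights.sum()
    # The new global model is the data-weighted average of client models.
    return sum(w * p for w, p in zip(weights, updates))
```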
arXiv Detail & Related papers (2020-03-18T07:55:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.