A Closer Look at Codistillation for Distributed Training
- URL: http://arxiv.org/abs/2010.02838v2
- Date: Mon, 26 Jul 2021 01:51:36 GMT
- Title: A Closer Look at Codistillation for Distributed Training
- Authors: Shagun Sodhani, Olivier Delalleau, Mahmoud Assran, Koustuv Sinha,
Nicolas Ballas, Michael Rabbat
- Abstract summary: We investigate codistillation in a distributed training setup.
We find that even at moderate batch sizes, models trained with codistillation can perform as well as models trained with synchronous data-parallel methods.
- Score: 21.08740153686464
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Codistillation has been proposed as a mechanism to share knowledge among
concurrently trained models by encouraging them to represent the same function
through an auxiliary loss. This contrasts with the more commonly used
fully-synchronous data-parallel stochastic gradient descent methods, where
different model replicas average their gradients (or parameters) at every
iteration and thus maintain identical parameters. We investigate codistillation
in a distributed training setup, complementing previous work which focused on
extremely large batch sizes. Surprisingly, we find that even at moderate batch
sizes, models trained with codistillation can perform as well as models trained
with synchronous data-parallel methods, despite using a much weaker
synchronization mechanism. These findings hold across a range of batch sizes
and learning rate schedules, as well as different kinds of models and datasets.
Obtaining this level of accuracy, however, requires properly accounting for the
regularization effect of codistillation, which we highlight through several
empirical observations. Overall, this work contributes to a better
understanding of codistillation and how to best take advantage of it in a
distributed computing environment.
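To make the contrast in the abstract concrete, below is a minimal single-process sketch of codistillation on a toy classification task: each of two peer models minimizes its own task loss plus an auxiliary KL term that pulls its predictions toward the other's, instead of averaging gradients at every iteration as fully synchronous data-parallel SGD does. The tiny linear models, SGD optimizers, and the auxiliary-loss weight `alpha` are illustrative assumptions, not details from the paper; in a real distributed run the peer predictions would typically be exchanged only occasionally and may be stale.

```python
# Minimal codistillation sketch (illustrative assumptions, not the paper's exact setup).
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Two "replicas" that would normally live on different workers.
model_a = nn.Linear(32, 10)
model_b = nn.Linear(32, 10)
opt_a = torch.optim.SGD(model_a.parameters(), lr=0.1)
opt_b = torch.optim.SGD(model_b.parameters(), lr=0.1)

alpha = 0.5  # weight of the codistillation (auxiliary) term -- an assumption


def codistillation_loss(logits, peer_logits, targets, alpha):
    """Task loss plus a KL term pulling this model toward its peer's predictions."""
    task = F.cross_entropy(logits, targets)
    # The peer's predictions are treated as fixed targets (no gradient through the peer).
    distill = F.kl_div(
        F.log_softmax(logits, dim=-1),
        F.softmax(peer_logits.detach(), dim=-1),
        reduction="batchmean",
    )
    return task + alpha * distill


for step in range(100):
    x = torch.randn(64, 32)           # synthetic batch
    y = torch.randint(0, 10, (64,))   # synthetic labels

    logits_a, logits_b = model_a(x), model_b(x)

    # Each model matches the other's outputs rather than averaging gradients
    # (or parameters) every iteration as fully synchronous data-parallel SGD would.
    loss_a = codistillation_loss(logits_a, logits_b, y, alpha)
    loss_b = codistillation_loss(logits_b, logits_a, y, alpha)

    opt_a.zero_grad()
    loss_a.backward()
    opt_a.step()

    opt_b.zero_grad()
    loss_b.backward()
    opt_b.step()
```

Detaching the peer's logits keeps each model's update local, which is what makes codistillation a much weaker synchronization mechanism than per-iteration gradient averaging.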
Related papers
- DDIL: Improved Diffusion Distillation With Imitation Learning [57.3467234269487]
Diffusion models excel at generative modeling (e.g., text-to-image) but sampling requires multiple denoising network passes.
Progressive distillation and consistency distillation have shown promise by reducing the number of passes.
We show that DDIL consistently improves on the baseline algorithms of progressive distillation (PD), latent consistency models (LCM), and Distribution Matching Distillation (DMD2).
arXiv Detail & Related papers (2024-10-15T18:21:47Z)
- Collaborative Heterogeneous Causal Inference Beyond Meta-analysis [68.4474531911361]
We propose a collaborative inverse propensity score estimator for causal inference with heterogeneous data.
Our method shows significant improvements over the methods based on meta-analysis when heterogeneity increases.
arXiv Detail & Related papers (2024-04-24T09:04:36Z)
- Deep Clustering with Diffused Sampling and Hardness-aware Self-distillation [4.550555443103878]
This paper proposes a novel end-to-end deep clustering method with diffused sampling and hardness-aware self-distillation (HaDis)
Results on five challenging image datasets demonstrate the superior clustering performance of our HaDis method over the state-of-the-art.
arXiv Detail & Related papers (2024-01-25T09:33:49Z)
- Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation [63.180725016463974]
Cross-modal retrieval relies on well-matched large-scale datasets that are laborious in practice.
We introduce a novel noisy correspondence learning framework, namely Self-Reinforcing Errors Mitigation (SREM).
arXiv Detail & Related papers (2023-12-27T09:03:43Z)
- Class-Balancing Diffusion Models [57.38599989220613]
Class-Balancing Diffusion Models (CBDM) are trained with a distribution adjustment regularizer as a solution.
We benchmark generation results on the CIFAR100/CIFAR100LT datasets and show outstanding performance on the downstream recognition task.
arXiv Detail & Related papers (2023-04-30T20:00:14Z)
- Functional Ensemble Distillation [18.34081591772928]
We investigate how to best distill an ensemble's predictions using an efficient model.
We find that learning the distilled model via a simple augmentation scheme in the form of mixup augmentation significantly boosts the performance.
arXiv Detail & Related papers (2022-06-05T14:07:17Z)
- Structured Pruning Learns Compact and Accurate Models [28.54826400747667]
We propose a task-specific structured pruning method, CoFi (Coarse- and Fine-grained Pruning).
CoFi delivers highly parallelizable subnetworks and matches distillation methods in both accuracy and latency.
Our experiments on GLUE and SQuAD datasets show that CoFi yields models with over 10x speedups with a small accuracy drop.
arXiv Detail & Related papers (2022-04-01T13:09:56Z)
- Diversity Matters When Learning From Ensembles [20.05842308307947]
Deep ensembles excel in large-scale image classification tasks both in terms of prediction accuracy and calibration.
Despite being simple to train, the computation and memory cost of deep ensembles limits their practicability.
We propose a simple approach for reducing this gap, i.e., making the performance of the distilled model close to that of the full ensemble.
arXiv Detail & Related papers (2021-10-27T03:44:34Z)
- Contrastive Model Inversion for Data-Free Knowledge Distillation [60.08025054715192]
We propose Contrastive Model Inversion, where the data diversity is explicitly modeled as an optimizable objective.
Our main observation is that, under the constraint of the same amount of data, higher data diversity usually indicates stronger instance discrimination.
Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet demonstrate that CMI achieves significantly superior performance when the generated data are used for knowledge distillation.
arXiv Detail & Related papers (2021-05-18T15:13:00Z)
- Contrastive learning of strong-mixing continuous-time stochastic processes [53.82893653745542]
Contrastive learning is a family of self-supervised methods where a model is trained to solve a classification task constructed from unlabeled data.
We show that a properly constructed contrastive learning task can be used to estimate the transition kernel for small-to-mid-range intervals in the diffusion case.
arXiv Detail & Related papers (2021-03-03T23:06:47Z)
- Robust Correction of Sampling Bias Using Cumulative Distribution Functions [19.551668880584973]
Varying domains and biased datasets can lead to differences between the training and the target distributions.
Current approaches for alleviating this often rely on estimating the ratio of training and target probability density functions.
arXiv Detail & Related papers (2020-10-23T22:13:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.