A Closer Look at Codistillation for Distributed Training
- URL: http://arxiv.org/abs/2010.02838v2
- Date: Mon, 26 Jul 2021 01:51:36 GMT
- Title: A Closer Look at Codistillation for Distributed Training
- Authors: Shagun Sodhani, Olivier Delalleau, Mahmoud Assran, Koustuv Sinha,
Nicolas Ballas, Michael Rabbat
- Abstract summary: We investigate codistillation in a distributed training setup.
We find that even at moderate batch sizes, models trained with codistillation can perform as well as models trained with synchronous data-parallel methods.
- Score: 21.08740153686464
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Codistillation has been proposed as a mechanism to share knowledge among
concurrently trained models by encouraging them to represent the same function
through an auxiliary loss. This contrasts with the more commonly used
fully-synchronous data-parallel stochastic gradient descent methods, where
different model replicas average their gradients (or parameters) at every
iteration and thus maintain identical parameters. We investigate codistillation
in a distributed training setup, complementing previous work which focused on
extremely large batch sizes. Surprisingly, we find that even at moderate batch
sizes, models trained with codistillation can perform as well as models trained
with synchronous data-parallel methods, despite using a much weaker
synchronization mechanism. These findings hold across a range of batch sizes
and learning rate schedules, as well as different kinds of models and datasets.
Obtaining this level of accuracy, however, requires properly accounting for the
regularization effect of codistillation, which we highlight through several
empirical observations. Overall, this work contributes to a better
understanding of codistillation and how to best take advantage of it in a
distributed computing environment.
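To make the contrast in the abstract concrete, below is a minimal single-process sketch of codistillation on a toy classification task: each of two peer models minimizes its own task loss plus an auxiliary KL term that pulls its predictions toward the other's, instead of averaging gradients at every iteration as fully synchronous data-parallel SGD does. The tiny linear models, SGD optimizers, and the auxiliary-loss weight `alpha` are illustrative assumptions, not details from the paper; in a real distributed run the peer predictions would typically be exchanged only occasionally and may be stale.

```python
# Minimal codistillation sketch (illustrative assumptions, not the paper's exact setup).
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Two "replicas" that would normally live on different workers.
model_a = nn.Linear(32, 10)
model_b = nn.Linear(32, 10)
opt_a = torch.optim.SGD(model_a.parameters(), lr=0.1)
opt_b = torch.optim.SGD(model_b.parameters(), lr=0.1)

alpha = 0.5  # weight of the codistillation (auxiliary) term -- an assumption


def codistillation_loss(logits, peer_logits, targets, alpha):
    """Task loss plus a KL term pulling this model toward its peer's predictions."""
    task = F.cross_entropy(logits, targets)
    # The peer's predictions are treated as fixed targets (no gradient through the peer).
    distill = F.kl_div(
        F.log_softmax(logits, dim=-1),
        F.softmax(peer_logits.detach(), dim=-1),
        reduction="batchmean",
    )
    return task + alpha * distill


for step in range(100):
    x = torch.randn(64, 32)           # synthetic batch
    y = torch.randint(0, 10, (64,))   # synthetic labels

    logits_a, logits_b = model_a(x), model_b(x)

    # Each model matches the other's outputs rather than averaging gradients
    # (or parameters) every iteration as fully synchronous data-parallel SGD would.
    loss_a = codistillation_loss(logits_a, logits_b, y, alpha)
    loss_b = codistillation_loss(logits_b, logits_a, y, alpha)

    opt_a.zero_grad()
    loss_a.backward()
    opt_a.step()

    opt_b.zero_grad()
    loss_b.backward()
    opt_b.step()
```

Detaching the peer's logits keeps each model's update local, which is what makes codistillation a much weaker synchronization mechanism than per-iteration gradient averaging.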
Related papers
- DDIL: Improved Diffusion Distillation With Imitation Learning [57.3467234269487]
Diffusion models excel at generative modeling (e.g., text-to-image) but sampling requires multiple denoising network passes.
Progressive distillation and consistency distillation have shown promise by reducing the number of passes.
We show that DDIL consistently improves on the baseline algorithms of progressive distillation (PD), latent consistency models (LCM), and Distribution Matching Distillation (DMD2).
arXiv Detail & Related papers (2024-10-15T18:21:47Z)
- Collaborative Heterogeneous Causal Inference Beyond Meta-analysis [68.4474531911361]
We propose a collaborative inverse propensity score estimator for causal inference with heterogeneous data.
Our method shows significant improvements over the methods based on meta-analysis when heterogeneity increases.
arXiv Detail & Related papers (2024-04-24T09:04:36Z)
- Deep Clustering with Diffused Sampling and Hardness-aware Self-distillation [4.550555443103878]
This paper proposes a novel end-to-end deep clustering method with diffused sampling and hardness-aware self-distillation (HaDis)
Results on five challenging image datasets demonstrate the superior clustering performance of our HaDis method over the state-of-the-art.
arXiv Detail & Related papers (2024-01-25T09:33:49Z)
- Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation [63.180725016463974]
Cross-modal retrieval relies on well-matched large-scale datasets that are laborious in practice.
We introduce a novel noisy correspondence learning framework, namely Self-Reinforcing Errors Mitigation (SREM).
arXiv Detail & Related papers (2023-12-27T09:03:43Z)
- Class-Balancing Diffusion Models [57.38599989220613]
Class-Balancing Diffusion Models (CBDM) are trained with a distribution adjustment regularizer as a solution.
We benchmark generation results on the CIFAR100/CIFAR100LT datasets and show outstanding performance on the downstream recognition task.
arXiv Detail & Related papers (2023-04-30T20:00:14Z)
- Functional Ensemble Distillation [18.34081591772928]
We investigate how to best distill an ensemble's predictions using an efficient model.
We find that learning the distilled model via a simple augmentation scheme in the form of mixup augmentation significantly boosts the performance.
arXiv Detail & Related papers (2022-06-05T14:07:17Z)
- Structured Pruning Learns Compact and Accurate Models [28.54826400747667]
We propose a task-specific structured pruning method, CoFi (Coarse- and Fine-grained Pruning).
CoFi delivers highly parallelizable subnetworks and matches distillation methods in both accuracy and latency.
Our experiments on GLUE and SQuAD datasets show that CoFi yields models with over 10x speedups with a small accuracy drop.
arXiv Detail & Related papers (2022-04-01T13:09:56Z)
- Diversity Matters When Learning From Ensembles [20.05842308307947]
Deep ensembles excel in large-scale image classification tasks both in terms of prediction accuracy and calibration.
Despite being simple to train, the computation and memory cost of deep ensembles limits their practicability.
We propose a simple approach for reducing this gap, i.e., making the performance of the distilled model close to that of the full ensemble.
arXiv Detail & Related papers (2021-10-27T03:44:34Z)
- Contrastive Model Inversion for Data-Free Knowledge Distillation [60.08025054715192]
We propose Contrastive Model Inversion, where the data diversity is explicitly modeled as an optimizable objective.
Our main observation is that, under the constraint of the same amount of data, higher data diversity usually indicates stronger instance discrimination.
Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet demonstrate that CMI achieves significantly superior performance when the generated data are used for knowledge distillation.
arXiv Detail & Related papers (2021-05-18T15:13:00Z)
- Contrastive learning of strong-mixing continuous-time stochastic processes [53.82893653745542]
Contrastive learning is a family of self-supervised methods where a model is trained to solve a classification task constructed from unlabeled data.
We show that a properly constructed contrastive learning task can be used to estimate the transition kernel for small-to-mid-range intervals in the diffusion case.
arXiv Detail & Related papers (2021-03-03T23:06:47Z)
- Robust Correction of Sampling Bias Using Cumulative Distribution Functions [19.551668880584973]
Varying domains and biased datasets can lead to differences between the training and the target distributions.
Current approaches for alleviating this often rely on estimating the ratio of training and target probability density functions.
arXiv Detail & Related papers (2020-10-23T22:13:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.