Revisiting Locally Supervised Learning: an Alternative to End-to-end
Training
- URL: http://arxiv.org/abs/2101.10832v1
- Date: Tue, 26 Jan 2021 15:02:18 GMT
- Title: Revisiting Locally Supervised Learning: an Alternative to End-to-end
Training
- Authors: Yulin Wang, Zanlin Ni, Shiji Song, Le Yang, Gao Huang
- Abstract summary: We propose an information propagation (InfoPro) loss, which encourages local modules to preserve as much useful information as possible.
We show that InfoPro is capable of achieving competitive performance with less than 40% of the memory footprint of E2E training.
- Score: 36.43515074019875
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Due to the need to store the intermediate activations for back-propagation,
end-to-end (E2E) training of deep networks usually suffers from a high GPU
memory footprint. This paper aims to address this problem by revisiting
locally supervised learning, where a network is split into gradient-isolated
modules and trained with local supervision. We experimentally show that simply
training local modules with E2E loss tends to collapse task-relevant
information at early layers, and hence hurts the performance of the full model.
To avoid this issue, we propose an information propagation (InfoPro) loss,
which encourages local modules to preserve as much useful information as
possible, while progressively discarding task-irrelevant information. As InfoPro
loss is difficult to compute in its original form, we derive a feasible upper
bound as a surrogate optimization objective, yielding a simple but effective
algorithm. In fact, we show that the proposed method boils down to minimizing
the combination of a reconstruction loss and a normal cross-entropy/contrastive
term. Extensive empirical results on five datasets (i.e., CIFAR, SVHN, STL-10,
ImageNet and Cityscapes) validate that InfoPro is capable of achieving
competitive performance with less than 40% of the memory footprint of E2E
training, while allowing the use of higher-resolution training data or larger
batch sizes under the same GPU memory constraint. Our method also enables
training local modules asynchronously for potential training acceleration. Code
is available at: https://github.com/blackfeather-wang/InfoPro-Pytorch.
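For intuition, below is a minimal PyTorch sketch of the kind of per-module objective the abstract describes: each gradient-isolated module is updated with a local loss combining a reconstruction term (preserve useful information) and a standard cross-entropy term (discard task-irrelevant information), and only detached features are passed forward. The tiny module architecture, the one-layer decoder, the choice of reconstruction target, and the `lambda_recon` weighting are illustrative assumptions, not the paper's exact configuration; see the linked repository for the official implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalModule(nn.Module):
    """One gradient-isolated module trained with a local reconstruction + cross-entropy loss."""
    def __init__(self, in_ch, out_ch, num_classes, lambda_recon=0.5):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(),
        )
        self.decoder = nn.Conv2d(out_ch, in_ch, 3, padding=1)  # auxiliary reconstruction head
        self.classifier = nn.Linear(out_ch, num_classes)       # auxiliary local classification head
        self.lambda_recon = lambda_recon

    def forward(self, x, targets=None):
        h = self.body(x)
        local_loss = None
        if targets is not None:
            recon = self.decoder(h)                                  # keep information about the module input
            logits = self.classifier(F.adaptive_avg_pool2d(h, 1).flatten(1))
            local_loss = (self.lambda_recon * F.mse_loss(recon, x)
                          + F.cross_entropy(logits, targets))        # task-relevant supervision
        return h.detach(), local_loss                                # detach: no gradient reaches earlier modules

# Usage sketch: each module gets its own optimizer and is updated only by its local loss.
modules = [LocalModule(3, 32, 10), LocalModule(32, 64, 10)]
opts = [torch.optim.SGD(m.parameters(), lr=0.1) for m in modules]
x, y = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
for m, opt in zip(modules, opts):
    x, loss = m(x, y)
    opt.zero_grad(); loss.backward(); opt.step()
```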
Related papers
- Effective pruning of web-scale datasets based on complexity of concept
clusters [48.125618324485195]
We present a method for pruning large-scale multimodal datasets for training CLIP-style models on ImageNet.
We find that training on a smaller set of high-quality data can lead to higher performance with significantly lower training costs.
We achieve a new state-of-the-art ImageNet zero-shot accuracy and a competitive average zero-shot accuracy on 38 evaluation tasks.
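As a hedged illustration of the general idea only (not the paper's exact procedure), the sketch below prunes a dataset by clustering its embeddings and budgeting the number of retained examples per cluster according to a simple complexity proxy (within-cluster dispersion); the proxy, the k-means clustering, and the budgeting rule are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def prune_by_cluster_complexity(embeddings, keep_fraction=0.3, n_clusters=100, seed=0):
    """Return indices of examples to keep, budgeted per cluster by a complexity proxy."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(embeddings)
    dists = np.linalg.norm(embeddings - km.cluster_centers_[km.labels_], axis=1)

    # Complexity proxy: average distance of a cluster's members to their centroid.
    complexity = np.array([dists[km.labels_ == c].mean() if (km.labels_ == c).any() else 0.0
                           for c in range(n_clusters)])
    budget = complexity / complexity.sum() * keep_fraction * len(embeddings)

    keep, rng = [], np.random.default_rng(seed)
    for c in range(n_clusters):
        members = np.flatnonzero(km.labels_ == c)
        if len(members) == 0:
            continue
        k = min(len(members), max(1, int(budget[c])))
        keep.append(rng.choice(members, size=k, replace=False))
    return np.concatenate(keep)   # indices of the retained examples
```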
arXiv Detail & Related papers (2024-01-09T14:32:24Z) - Go beyond End-to-End Training: Boosting Greedy Local Learning with
Context Supply [0.12187048691454236]
Greedy local learning partitions the network into gradient-isolated modules and trains them with supervision from local preliminary losses.
As the network is split into a larger number of gradient-isolated modules, the performance of the local learning scheme degrades substantially.
We propose a ContSup scheme, which incorporates context supply between isolated modules to compensate for information loss.
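A rough PyTorch sketch of the context-supply idea follows: each gradient-isolated module additionally consumes a detached context signal (here, a resized copy of an earlier representation or of the input) that is fused with its features before the local loss is computed. The form of the context and the fusion by concatenation are illustrative assumptions, not the paper's specific design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextModule(nn.Module):
    """A gradient-isolated module that also receives a detached context signal."""
    def __init__(self, in_ch, ctx_ch, out_ch, num_classes):
        super().__init__()
        self.fuse = nn.Conv2d(in_ch + ctx_ch, out_ch, 3, padding=1)  # fuse features with context
        self.body = nn.Sequential(nn.BatchNorm2d(out_ch), nn.ReLU())
        self.head = nn.Linear(out_ch, num_classes)                   # local supervision head

    def forward(self, feat, context, targets=None):
        context = F.interpolate(context, size=feat.shape[-2:])       # match spatial resolution
        h = self.body(self.fuse(torch.cat([feat, context.detach()], dim=1)))
        loss = None
        if targets is not None:
            pooled = F.adaptive_avg_pool2d(h, 1).flatten(1)
            loss = F.cross_entropy(self.head(pooled), targets)
        return h.detach(), loss   # detached output: no gradient crosses the module boundary
```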
arXiv Detail & Related papers (2023-12-12T10:25:31Z) - PREM: A Simple Yet Effective Approach for Node-Level Graph Anomaly
Detection [65.24854366973794]
Node-level graph anomaly detection (GAD) plays a critical role in identifying anomalous nodes from graph-structured data in domains such as medicine, social networks, and e-commerce.
We introduce a simple method termed PREprocessing and Matching (PREM for short) to improve the efficiency of GAD.
Our approach streamlines GAD, reducing time and memory consumption while maintaining powerful anomaly detection capabilities.
arXiv Detail & Related papers (2023-10-18T02:59:57Z) - Selectivity Drives Productivity: Efficient Dataset Pruning for Enhanced
Transfer Learning [66.20311762506702]
Dataset pruning (DP) has emerged as an effective way to improve data efficiency.
We propose two new DP methods, label mapping and feature mapping, for supervised and self-supervised pretraining settings.
We show that source data classes can be pruned by 40% to 80% without sacrificing downstream performance.
arXiv Detail & Related papers (2023-10-13T00:07:49Z) - Decouple Graph Neural Networks: Train Multiple Simple GNNs Simultaneously Instead of One [60.5818387068983]
Graph neural networks (GNNs) suffer from severe training inefficiency.
We propose to decouple a multi-layer GNN as multiple simple modules for more efficient training.
We show that the proposed framework is highly efficient with reasonable performance.
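A loose sketch of this decoupling idea is shown below, under simplifying assumptions: a precomputed dense normalized adjacency matrix, single-layer graph-convolution modules, and plain local cross-entropy supervision on each module. It illustrates the general scheme rather than the paper's specific algorithm.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleGNNModule(nn.Module):
    """A single-layer graph module trained with its own classification head."""
    def __init__(self, in_dim, hid_dim, num_classes):
        super().__init__()
        self.lin = nn.Linear(in_dim, hid_dim)
        self.head = nn.Linear(hid_dim, num_classes)   # local supervision head

    def forward(self, adj_norm, x, labels=None, mask=None):
        # adj_norm: dense [N, N] normalized adjacency (with self-loops), assumed precomputed.
        h = F.relu(adj_norm @ self.lin(x))            # one round of message passing
        loss = None
        if labels is not None:
            loss = F.cross_entropy(self.head(h)[mask], labels[mask])
        return h.detach(), loss                        # detach: the next module trains separately
```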
arXiv Detail & Related papers (2023-04-20T07:21:32Z) - On-Device Training Under 256KB Memory [62.95579393237751]
We propose an algorithm-system co-design framework to make on-device training possible with only 256KB of memory.
Our framework is the first solution to enable tiny on-device training of convolutional neural networks under 256KB and 1MB Flash.
arXiv Detail & Related papers (2022-06-30T17:59:08Z) - BackLink: Supervised Local Training with Backward Links [2.104758015212034]
This work proposes a novel local training algorithm, BackLink, which introduces inter-module backward dependency and allows errors to flow between modules.
Our method can achieve up to a 79% reduction in memory cost and a 52% reduction in simulation runtime in ResNet110 compared to standard BP.
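The sketch below illustrates one plausible form of such a backward link, assuming the local loss of module i is computed through the first block of module i+1 so that error signals cross the module boundary; the depth of the shared path and the loss form are assumptions for illustration only.

```python
import torch.nn.functional as F

def local_step(module_i, next_first_block, head, x, targets, optimizer):
    """One local update for module_i whose loss path runs through part of module i+1."""
    h = module_i(x)
    z = next_first_block(h)                                   # backward link: loss path crosses the boundary
    logits = head(F.adaptive_avg_pool2d(z, 1).flatten(1))
    loss = F.cross_entropy(logits, targets)                   # gradients reach module_i and next_first_block
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                          # optimizer covers module_i, next_first_block, head
    return h.detach()                                         # detached features feed module i+1's own step
```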
arXiv Detail & Related papers (2022-05-14T21:49:47Z) - Acceleration of Federated Learning with Alleviated Forgetting in Local
Training [61.231021417674235]
Federated learning (FL) enables distributed optimization of machine learning models while protecting privacy.
We propose FedReg, an algorithm to accelerate FL with alleviated knowledge forgetting in the local training stage.
Our experiments demonstrate that FedReg significantly improves the convergence rate of FL, especially when the neural network architecture is deep.
arXiv Detail & Related papers (2022-03-05T02:31:32Z) - Data optimization for large batch distributed training of deep neural
networks [0.19336815376402716]
Current practice for distributed training of deep neural networks faces the challenges of communication bottlenecks when operating at scale.
We propose a data optimization approach that utilizes machine learning to implicitly smooth out the loss landscape, resulting in fewer local minima.
Our approach filters out data points that are less important to feature learning, enabling us to train models with larger batch sizes faster and with improved accuracy.
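As a rough illustration only (the paper's actual importance criterion is not reproduced here), the sketch below scores examples with a warm-start model and drops the lowest-loss fraction before large-batch training; the loss-based scoring rule is an assumed stand-in.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, Subset

@torch.no_grad()
def select_informative_subset(model, dataset, drop_fraction=0.2, device="cpu"):
    """Score examples with a warm-start model and keep the highest-loss fraction."""
    model.eval().to(device)
    losses = []
    for x, y in DataLoader(dataset, batch_size=256):
        per_example = F.cross_entropy(model(x.to(device)), y.to(device), reduction="none")
        losses.append(per_example.cpu())
    losses = torch.cat(losses)
    n_keep = int((1 - drop_fraction) * len(losses))
    keep = losses.argsort(descending=True)[:n_keep]   # drop the lowest-loss (least informative) examples
    return Subset(dataset, keep.tolist())             # train on this subset with larger batches
```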
arXiv Detail & Related papers (2020-12-16T21:22:02Z) - Improving compute efficacy frontiers with SliceOut [31.864949424541344]
We introduce SliceOut -- a dropout-inspired scheme to train deep learning models faster without impacting final test accuracy.
At test time, turning off SliceOut performs an implicit ensembling across a linear number of architectures that preserves test accuracy.
This leads to faster processing of large computational workloads overall, and significantly reduces the resulting energy consumption and CO2 emissions.
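A hedged sketch of the dropout-style slicing idea is given below: during training a contiguous block of hidden units is sliced out of the weight matrices so the matrix multiplications genuinely shrink (unlike standard dropout's masking), and the full layer is used at test time. Restricting the slice to the hidden dimension of a two-layer MLP and the uniform choice of slice are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SliceoutMLP(nn.Module):
    """A two-layer MLP whose hidden units are sliced out contiguously during training."""
    def __init__(self, dim, hidden, keep_fraction=0.75):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden)
        self.w2 = nn.Linear(hidden, dim)
        self.keep = int(hidden * keep_fraction)

    def forward(self, x):
        if self.training:
            start = torch.randint(0, self.w1.out_features - self.keep + 1, (1,)).item()
            sl = slice(start, start + self.keep)
            h = F.relu(F.linear(x, self.w1.weight[sl], self.w1.bias[sl]))   # genuinely smaller matmul
            h = h * (self.w1.out_features / self.keep)                      # dropout-style rescaling
            return F.linear(h, self.w2.weight[:, sl], self.w2.bias)
        return self.w2(F.relu(self.w1(x)))   # full network at test time (implicit ensembling)
```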
arXiv Detail & Related papers (2020-07-21T15:59:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.