Revisiting Locally Supervised Learning: an Alternative to End-to-end
Training
- URL: http://arxiv.org/abs/2101.10832v1
- Date: Tue, 26 Jan 2021 15:02:18 GMT
- Title: Revisiting Locally Supervised Learning: an Alternative to End-to-end
Training
- Authors: Yulin Wang, Zanlin Ni, Shiji Song, Le Yang, Gao Huang
- Abstract summary: We propose an information propagation (InfoPro) loss, which encourages local modules to preserve as much useful information as possible.
We show that InfoPro is capable of achieving competitive performance with less than 40% of the memory footprint of E2E training.
- Score: 36.43515074019875
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Due to the need to store the intermediate activations for back-propagation,
end-to-end (E2E) training of deep networks usually suffers from a high GPU
memory footprint. This paper aims to address this problem by revisiting
locally supervised learning, where a network is split into gradient-isolated
modules and trained with local supervision. We experimentally show that simply
training local modules with E2E loss tends to collapse task-relevant
information at early layers, and hence hurts the performance of the full model.
To avoid this issue, we propose an information propagation (InfoPro) loss,
which encourages local modules to preserve as much useful information as
possible, while progressively discarding task-irrelevant information. As InfoPro
loss is difficult to compute in its original form, we derive a feasible upper
bound as a surrogate optimization objective, yielding a simple but effective
algorithm. In fact, we show that the proposed method boils down to minimizing
the combination of a reconstruction loss and a normal cross-entropy/contrastive
term. Extensive empirical results on five datasets (i.e., CIFAR, SVHN, STL-10,
ImageNet and Cityscapes) validate that InfoPro is capable of achieving
competitive performance with less than 40% of the memory footprint of E2E
training, while allowing the use of higher-resolution training data or larger
batch sizes under the same GPU memory constraint. Our method also enables
training local modules asynchronously for potential training acceleration. Code
is available at: https://github.com/blackfeather-wang/InfoPro-Pytorch.
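For intuition, below is a minimal PyTorch sketch of the kind of per-module objective the abstract describes: each gradient-isolated module is updated with a local loss combining a reconstruction term (preserve useful information) and a standard cross-entropy term (discard task-irrelevant information), and only detached features are passed forward. The tiny module architecture, the one-layer decoder, the choice of reconstruction target, and the `lambda_recon` weighting are illustrative assumptions, not the paper's exact configuration; see the linked repository for the official implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalModule(nn.Module):
    """One gradient-isolated module trained with a local reconstruction + cross-entropy loss."""
    def __init__(self, in_ch, out_ch, num_classes, lambda_recon=0.5):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(),
        )
        self.decoder = nn.Conv2d(out_ch, in_ch, 3, padding=1)  # auxiliary reconstruction head
        self.classifier = nn.Linear(out_ch, num_classes)       # auxiliary local classification head
        self.lambda_recon = lambda_recon

    def forward(self, x, targets=None):
        h = self.body(x)
        local_loss = None
        if targets is not None:
            recon = self.decoder(h)                                  # keep information about the module input
            logits = self.classifier(F.adaptive_avg_pool2d(h, 1).flatten(1))
            local_loss = (self.lambda_recon * F.mse_loss(recon, x)
                          + F.cross_entropy(logits, targets))        # task-relevant supervision
        return h.detach(), local_loss                                # detach: no gradient reaches earlier modules

# Usage sketch: each module gets its own optimizer and is updated only by its local loss.
modules = [LocalModule(3, 32, 10), LocalModule(32, 64, 10)]
opts = [torch.optim.SGD(m.parameters(), lr=0.1) for m in modules]
x, y = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
for m, opt in zip(modules, opts):
    x, loss = m(x, y)
    opt.zero_grad(); loss.backward(); opt.step()
```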
Related papers
- Effective pruning of web-scale datasets based on complexity of concept
clusters [48.125618324485195]
We present a method for pruning large-scale multimodal datasets for training CLIP-style models on ImageNet.
We find that training on a smaller set of high-quality data can lead to higher performance with significantly lower training costs.
We achieve a new state-of-the-art ImageNet zero-shot accuracy and a competitive average zero-shot accuracy on 38 evaluation tasks.
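As a hedged illustration of the general idea only (not the paper's exact procedure), the sketch below prunes a dataset by clustering its embeddings and budgeting the number of retained examples per cluster according to a simple complexity proxy (within-cluster dispersion); the proxy, the k-means clustering, and the budgeting rule are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def prune_by_cluster_complexity(embeddings, keep_fraction=0.3, n_clusters=100, seed=0):
    """Return indices of examples to keep, budgeted per cluster by a complexity proxy."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(embeddings)
    dists = np.linalg.norm(embeddings - km.cluster_centers_[km.labels_], axis=1)

    # Complexity proxy: average distance of a cluster's members to their centroid.
    complexity = np.array([dists[km.labels_ == c].mean() if (km.labels_ == c).any() else 0.0
                           for c in range(n_clusters)])
    budget = complexity / complexity.sum() * keep_fraction * len(embeddings)

    keep, rng = [], np.random.default_rng(seed)
    for c in range(n_clusters):
        members = np.flatnonzero(km.labels_ == c)
        if len(members) == 0:
            continue
        k = min(len(members), max(1, int(budget[c])))
        keep.append(rng.choice(members, size=k, replace=False))
    return np.concatenate(keep)   # indices of the retained examples
```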
arXiv Detail & Related papers (2024-01-09T14:32:24Z) - Go beyond End-to-End Training: Boosting Greedy Local Learning with
Context Supply [0.12187048691454236]
Greedy local learning partitions the network into gradient-isolated modules and trains them with supervision from local preliminary losses.
As the network is split into a larger number of gradient-isolated modules, the performance of the local learning scheme degrades substantially.
We propose a ContSup scheme, which incorporates context supply between isolated modules to compensate for information loss.
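A rough PyTorch sketch of the context-supply idea follows: each gradient-isolated module additionally consumes a detached context signal (here, a resized copy of an earlier representation or of the input) that is fused with its features before the local loss is computed. The form of the context and the fusion by concatenation are illustrative assumptions, not the paper's specific design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextModule(nn.Module):
    """A gradient-isolated module that also receives a detached context signal."""
    def __init__(self, in_ch, ctx_ch, out_ch, num_classes):
        super().__init__()
        self.fuse = nn.Conv2d(in_ch + ctx_ch, out_ch, 3, padding=1)  # fuse features with context
        self.body = nn.Sequential(nn.BatchNorm2d(out_ch), nn.ReLU())
        self.head = nn.Linear(out_ch, num_classes)                   # local supervision head

    def forward(self, feat, context, targets=None):
        context = F.interpolate(context, size=feat.shape[-2:])       # match spatial resolution
        h = self.body(self.fuse(torch.cat([feat, context.detach()], dim=1)))
        loss = None
        if targets is not None:
            pooled = F.adaptive_avg_pool2d(h, 1).flatten(1)
            loss = F.cross_entropy(self.head(pooled), targets)
        return h.detach(), loss   # detached output: no gradient crosses the module boundary
```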
arXiv Detail & Related papers (2023-12-12T10:25:31Z) - PREM: A Simple Yet Effective Approach for Node-Level Graph Anomaly
Detection [65.24854366973794]
Node-level graph anomaly detection (GAD) plays a critical role in identifying anomalous nodes from graph-structured data in domains such as medicine, social networks, and e-commerce.
We introduce a simple method termed PREprocessing and Matching (PREM for short) to improve the efficiency of GAD.
Our approach streamlines GAD, reducing time and memory consumption while maintaining powerful anomaly detection capabilities.
arXiv Detail & Related papers (2023-10-18T02:59:57Z) - Selectivity Drives Productivity: Efficient Dataset Pruning for Enhanced
Transfer Learning [66.20311762506702]
Dataset pruning (DP) has emerged as an effective way to improve data efficiency.
We propose two new DP methods, label mapping and feature mapping, for supervised and self-supervised pretraining settings.
We show that source data classes can be pruned by 40% to 80% without sacrificing downstream performance.
arXiv Detail & Related papers (2023-10-13T00:07:49Z) - Decouple Graph Neural Networks: Train Multiple Simple GNNs Simultaneously Instead of One [60.5818387068983]
Graph neural networks (GNNs) suffer from severe training inefficiency.
We propose to decouple a multi-layer GNN as multiple simple modules for more efficient training.
We show that the proposed framework is highly efficient with reasonable performance.
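A loose sketch of this decoupling idea is shown below, under simplifying assumptions: a precomputed dense normalized adjacency matrix, single-layer graph-convolution modules, and plain local cross-entropy supervision on each module. It illustrates the general scheme rather than the paper's specific algorithm.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleGNNModule(nn.Module):
    """A single-layer graph module trained with its own classification head."""
    def __init__(self, in_dim, hid_dim, num_classes):
        super().__init__()
        self.lin = nn.Linear(in_dim, hid_dim)
        self.head = nn.Linear(hid_dim, num_classes)   # local supervision head

    def forward(self, adj_norm, x, labels=None, mask=None):
        # adj_norm: dense [N, N] normalized adjacency (with self-loops), assumed precomputed.
        h = F.relu(adj_norm @ self.lin(x))            # one round of message passing
        loss = None
        if labels is not None:
            loss = F.cross_entropy(self.head(h)[mask], labels[mask])
        return h.detach(), loss                        # detach: the next module trains separately
```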
arXiv Detail & Related papers (2023-04-20T07:21:32Z) - On-Device Training Under 256KB Memory [62.95579393237751]
We propose an algorithm-system co-design framework to make on-device training possible with only 256KB of memory.
Our framework is the first solution to enable tiny on-device training of convolutional neural networks under 256KB and 1MB Flash.
arXiv Detail & Related papers (2022-06-30T17:59:08Z) - BackLink: Supervised Local Training with Backward Links [2.104758015212034]
This work proposes a novel local training algorithm, BackLink, which introduces inter-module backward dependency and allows errors to flow between modules.
Our method can achieve up to a 79% reduction in memory cost and a 52% reduction in simulation runtime in ResNet110 compared to standard BP.
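The sketch below illustrates one plausible form of such a backward link, assuming the local loss of module i is computed through the first block of module i+1 so that error signals cross the module boundary; the depth of the shared path and the loss form are assumptions for illustration only.

```python
import torch.nn.functional as F

def local_step(module_i, next_first_block, head, x, targets, optimizer):
    """One local update for module_i whose loss path runs through part of module i+1."""
    h = module_i(x)
    z = next_first_block(h)                                   # backward link: loss path crosses the boundary
    logits = head(F.adaptive_avg_pool2d(z, 1).flatten(1))
    loss = F.cross_entropy(logits, targets)                   # gradients reach module_i and next_first_block
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                          # optimizer covers module_i, next_first_block, head
    return h.detach()                                         # detached features feed module i+1's own step
```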
arXiv Detail & Related papers (2022-05-14T21:49:47Z) - Acceleration of Federated Learning with Alleviated Forgetting in Local
Training [61.231021417674235]
Federated learning (FL) enables distributed optimization of machine learning models while protecting privacy.
We propose FedReg, an algorithm to accelerate FL with alleviated knowledge forgetting in the local training stage.
Our experiments demonstrate that FedReg significantly improves the convergence rate of FL, especially when the neural network architecture is deep.
arXiv Detail & Related papers (2022-03-05T02:31:32Z) - Data optimization for large batch distributed training of deep neural
networks [0.19336815376402716]
Current practice for distributed training of deep neural networks faces the challenges of communication bottlenecks when operating at scale.
We propose a data optimization approach that utilizes machine learning to implicitly smooth out the loss landscape, resulting in fewer local minima.
Our approach filters out data points that are less important to feature learning, enabling us to train models with larger batch sizes faster and with improved accuracy.
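As a rough illustration only (the paper's actual importance criterion is not reproduced here), the sketch below scores examples with a warm-start model and drops the lowest-loss fraction before large-batch training; the loss-based scoring rule is an assumed stand-in.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, Subset

@torch.no_grad()
def select_informative_subset(model, dataset, drop_fraction=0.2, device="cpu"):
    """Score examples with a warm-start model and keep the highest-loss fraction."""
    model.eval().to(device)
    losses = []
    for x, y in DataLoader(dataset, batch_size=256):
        per_example = F.cross_entropy(model(x.to(device)), y.to(device), reduction="none")
        losses.append(per_example.cpu())
    losses = torch.cat(losses)
    n_keep = int((1 - drop_fraction) * len(losses))
    keep = losses.argsort(descending=True)[:n_keep]   # drop the lowest-loss (least informative) examples
    return Subset(dataset, keep.tolist())             # train on this subset with larger batches
```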
arXiv Detail & Related papers (2020-12-16T21:22:02Z) - Improving compute efficacy frontiers with SliceOut [31.864949424541344]
We introduce SliceOut -- a dropout-inspired scheme to train deep learning models faster without impacting final test accuracy.
At test time, turning off SliceOut performs an implicit ensembling across a linear number of architectures that preserves test accuracy.
This leads to faster processing of large computational workloads overall, and significantly reduces the resulting energy consumption and CO2 emissions.
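A hedged sketch of the dropout-style slicing idea is given below: during training a contiguous block of hidden units is sliced out of the weight matrices so the matrix multiplications genuinely shrink (unlike standard dropout's masking), and the full layer is used at test time. Restricting the slice to the hidden dimension of a two-layer MLP and the uniform choice of slice are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SliceoutMLP(nn.Module):
    """A two-layer MLP whose hidden units are sliced out contiguously during training."""
    def __init__(self, dim, hidden, keep_fraction=0.75):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden)
        self.w2 = nn.Linear(hidden, dim)
        self.keep = int(hidden * keep_fraction)

    def forward(self, x):
        if self.training:
            start = torch.randint(0, self.w1.out_features - self.keep + 1, (1,)).item()
            sl = slice(start, start + self.keep)
            h = F.relu(F.linear(x, self.w1.weight[sl], self.w1.bias[sl]))   # genuinely smaller matmul
            h = h * (self.w1.out_features / self.keep)                      # dropout-style rescaling
            return F.linear(h, self.w2.weight[:, sl], self.w2.bias)
        return self.w2(F.relu(self.w1(x)))   # full network at test time (implicit ensembling)
```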
arXiv Detail & Related papers (2020-07-21T15:59:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.