Efficient ConvBN Blocks for Transfer Learning and Beyond
- URL: http://arxiv.org/abs/2305.11624v2
- Date: Wed, 28 Feb 2024 14:34:06 GMT
- Title: Efficient ConvBN Blocks for Transfer Learning and Beyond
- Authors: Kaichao You, Guo Qin, Anchang Bao, Meng Cao, Ping Huang, Jiulong Shan,
Mingsheng Long
- Abstract summary: A ConvBN block can operate in three modes: Train, Eval, and Deploy.
This paper focuses on the trade-off between stability and efficiency in ConvBN blocks.
We propose a novel Tune mode to bridge the gap between Eval mode and Deploy mode.
- Score: 62.53078191019456
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Convolution-BatchNorm (ConvBN) blocks are integral components in various
computer vision tasks and other domains. A ConvBN block can operate in three
modes: Train, Eval, and Deploy. While the Train mode is indispensable for
training models from scratch, the Eval mode is suitable for transfer learning
and beyond, and the Deploy mode is designed for the deployment of models. This
paper focuses on the trade-off between stability and efficiency in ConvBN
blocks: Deploy mode is efficient but suffers from training instability; Eval
mode is widely used in transfer learning but lacks efficiency. To solve the
dilemma, we theoretically reveal the reason behind the diminished training
stability observed in the Deploy mode. Subsequently, we propose a novel Tune
mode to bridge the gap between Eval mode and Deploy mode. The proposed Tune
mode is as stable as Eval mode for transfer learning, and its computational
efficiency closely matches that of the Deploy mode. Through extensive
experiments in object detection, classification, and adversarial example
generation across $5$ datasets and $12$ model architectures, we demonstrate
that the proposed Tune mode retains the performance while significantly
reducing GPU memory footprint and training time, thereby contributing efficient
ConvBN blocks for transfer learning and beyond. Our method has been integrated
into both PyTorch (general machine learning framework) and MMCV/MMEngine
(computer vision framework). Practitioners need only one line of code to enjoy our
efficient ConvBN blocks, thanks to PyTorch's built-in machine learning compilers.
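The trade-off among the three modes can be made concrete with standard BatchNorm folding arithmetic. Below is a minimal PyTorch sketch, not the authors' implementation: fold_bn_into_conv shows Deploy-style folding of frozen BN statistics into the convolution weights, and TuneConvBN illustrates, under the assumption that Tune mode applies the same folding on the fly with the frozen running statistics, how a single convolution per forward pass can keep the original parameters trainable. All names are illustrative.
```python
# Minimal sketch (assumptions labeled above, not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

def fold_bn_into_conv(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Deploy mode: bake frozen BN statistics into the conv weights once."""
    # BN(conv(x)) = gamma * (W*x + b - mean) / sqrt(var + eps) + beta
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      conv.stride, conv.padding, conv.dilation, conv.groups, bias=True)
    fused.weight.data = conv.weight.data * scale.view(-1, 1, 1, 1)
    conv_bias = conv.bias.data if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.data = (conv_bias - bn.running_mean) * scale + bn.bias.data
    return fused

class TuneConvBN(nn.Module):
    """Tune-style forward (sketch): fold the frozen statistics into the weights
    on the fly, so only one convolution is computed per forward pass and its
    pre-BN output need not be materialized, while the parameters stay trainable."""
    def __init__(self, conv: nn.Conv2d, bn: nn.BatchNorm2d):
        super().__init__()
        self.conv, self.bn = conv, bn

    def forward(self, x):
        scale = self.bn.weight / torch.sqrt(self.bn.running_var + self.bn.eps)
        weight = self.conv.weight * scale.view(-1, 1, 1, 1)
        bias = self.conv.bias if self.conv.bias is not None else torch.zeros_like(self.bn.running_mean)
        bias = (bias - self.bn.running_mean) * scale + self.bn.bias
        return F.conv2d(x, weight, bias, self.conv.stride,
                        self.conv.padding, self.conv.dilation, self.conv.groups)
```
In Eval mode, by contrast, the convolution output is materialized and then normalized, which is the source of the extra memory and compute the abstract refers to; the exact parameter and gradient treatment in the paper's Tune mode may differ from this sketch.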
Related papers
- Train Faster, Perform Better: Modular Adaptive Training in Over-Parameterized Models [31.960749305728488]
We introduce a novel concept dubbed the modular neural tangent kernel (mNTK).
We show that the quality of a module's learning is tightly associated with its mNTK's principal eigenvalue $\lambda_{\max}$.
We propose a novel training strategy termed Modular Adaptive Training (MAT) to update only those modules whose $\lambda_{\max}$ exceeds a dynamic threshold.
arXiv Detail & Related papers (2024-05-13T07:46:48Z) - ATOM: Asynchronous Training of Massive Models for Deep Learning in a Decentralized Environment [7.916080032572087]
ATOM is a resilient distributed training framework designed for asynchronous training of vast models in a decentralized setting.
ATOM aims to accommodate a complete LLM on one host (peer) through seamless model swapping, and concurrently trains multiple copies across various peers to optimize training throughput.
Our experiments using different GPT-3 model configurations reveal that, in scenarios with suboptimal network connections, ATOM can improve training efficiency by up to $20\times$ compared with state-of-the-art decentralized pipeline parallelism approaches.
arXiv Detail & Related papers (2024-03-15T17:43:43Z) - Unlocking Deep Learning: A BP-Free Approach for Parallel Block-Wise
Training of Neural Networks [9.718519843862937]
We introduce a block-wise BP-free (BWBPF) neural network that leverages local error signals to optimize sub-neural networks separately.
Our experimental results consistently show that this approach can identify transferable decoupled architectures for VGG and ResNet variations.
arXiv Detail & Related papers (2023-12-20T08:02:33Z) - Harnessing Manycore Processors with Distributed Memory for Accelerated
Training of Sparse and Recurrent Models [43.1773057439246]
Current AI training infrastructure is dominated by single instruction multiple data (SIMD) and systolic array architectures.
We explore sparse and recurrent model training on a massively parallel multiple instruction multiple data architecture with distributed local memory.
arXiv Detail & Related papers (2023-11-07T23:18:35Z) - Fast Propagation is Better: Accelerating Single-Step Adversarial
Training via Sampling Subnetworks [69.54774045493227]
A drawback of adversarial training is the computational overhead introduced by the generation of adversarial examples.
We propose to exploit the interior building blocks of the model to improve efficiency.
Compared with previous methods, our method not only reduces the training cost but also achieves better model robustness.
arXiv Detail & Related papers (2023-10-24T01:36:20Z) - Decouple Graph Neural Networks: Train Multiple Simple GNNs Simultaneously Instead of One [60.5818387068983]
Graph neural networks (GNNs) suffer from severe inefficiency.
We propose to decouple a multi-layer GNN as multiple simple modules for more efficient training.
We show that the proposed framework is highly efficient with reasonable performance.
arXiv Detail & Related papers (2023-04-20T07:21:32Z) - ProgFed: Effective, Communication, and Computation Efficient Federated Learning by Progressive Training [65.68511423300812]
We propose ProgFed, a progressive training framework for efficient and effective federated learning.
ProgFed inherently reduces computation and two-way communication costs while maintaining the strong performance of the final models.
Our results show that ProgFed converges at the same rate as standard training on full models.
arXiv Detail & Related papers (2021-10-11T14:45:00Z) - A Deep Value-network Based Approach for Multi-Driver Order Dispatching [55.36656442934531]
We propose a deep reinforcement learning based solution for order dispatching.
We conduct large scale online A/B tests on DiDi's ride-dispatching platform.
Results show that CVNet consistently outperforms other recently proposed dispatching methods.
arXiv Detail & Related papers (2021-06-08T16:27:04Z) - Scaling Distributed Deep Learning Workloads beyond the Memory Capacity
with KARMA [58.040931661693925]
We propose a strategy that combines redundant recomputing and out-of-core methods.
We achieve an average of 1.52x speedup in six different models over the state-of-the-art out-of-core methods.
Our data parallel out-of-core solution can outperform complex hybrid model parallelism in training large models, e.g., Megatron-LM and Turing-NLG.
arXiv Detail & Related papers (2020-08-26T07:24:34Z) - Mode-Assisted Unsupervised Learning of Restricted Boltzmann Machines [7.960229223744695]
We show that properly combining standard gradient updates with an off-gradient direction improves their training dramatically over traditional gradient methods.
This approach, which we call mode training, promotes faster training and stability, in addition to a lower converged relative entropy (KL divergence).
The mode training we suggest is quite versatile, as it can be applied in conjunction with any given gradient method, and is easily extended to more general energy-based neural network structures.
arXiv Detail & Related papers (2020-01-15T21:12:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.