Making EfficientNet More Efficient: Exploring Batch-Independent
Normalization, Group Convolutions and Reduced Resolution Training
- URL: http://arxiv.org/abs/2106.03640v1
- Date: Mon, 7 Jun 2021 14:10:52 GMT
- Title: Making EfficientNet More Efficient: Exploring Batch-Independent
Normalization, Group Convolutions and Reduced Resolution Training
- Authors: Dominic Masters, Antoine Labatie, Zach Eaton-Rosen and Carlo Luschi
- Abstract summary: We focus on improving the practical efficiency of the state-of-the-art EfficientNet models on a new class of accelerator, the Graphcore IPU.
We do this by extending this family of models in the following ways: (i) generalising depthwise convolutions to group convolutions; (ii) adding proxy-normalized activations to match batch normalization performance with batch-independent statistics; and (iii) reducing compute by lowering the training resolution and inexpensively fine-tuning at higher resolution.
- Score: 8.411385346896413
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Much recent research has been dedicated to improving the efficiency of
training and inference for image classification. This effort has commonly
focused on explicitly improving theoretical efficiency, often measured as
ImageNet validation accuracy per FLOP. These theoretical savings have, however,
proven challenging to achieve in practice, particularly on high-performance
training accelerators.
In this work, we focus on improving the practical efficiency of the
state-of-the-art EfficientNet models on a new class of accelerator, the
Graphcore IPU. We do this by extending this family of models in the following
ways: (i) generalising depthwise convolutions to group convolutions; (ii)
adding proxy-normalized activations to match batch normalization performance
with batch-independent statistics; (iii) reducing compute by lowering the
training resolution and inexpensively fine-tuning at higher resolution. We find
that these three methods improve the practical efficiency for both training and
inference. Our code will be made available online.
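As an illustration of points (i) and (iii), the sketch below shows one way a depthwise convolution can be generalised to a group convolution in PyTorch, and how weights trained at a reduced resolution can be reused for inexpensive fine-tuning at a higher resolution. It is a minimal sketch under assumed settings (channel count, group size of 16, image sizes), not the paper's implementation; point (ii), proxy-normalized activations, is omitted.

    # Illustrative PyTorch sketch (assumed settings, not the authors' code).
    import torch
    import torch.nn as nn

    channels = 96        # assumed channel count
    group_size = 16      # assumed group size; depthwise is the special case group_size = 1

    # (i) Depthwise convolution: one filter per input channel (groups == channels).
    depthwise = nn.Conv2d(channels, channels, kernel_size=3, padding=1, groups=channels)

    # Generalisation to a group convolution: each filter mixes `group_size` channels.
    grouped = nn.Conv2d(channels, channels, kernel_size=3, padding=1,
                        groups=channels // group_size)

    # (iii) Reduced-resolution training followed by fine-tuning at higher resolution.
    # Convolutions and global pooling are resolution-agnostic, so the same weights
    # can be trained on small images and then fine-tuned on larger ones.
    model = nn.Sequential(nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(), grouped,
                          nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(channels, 1000))
    low_res = torch.randn(8, 3, 160, 160)     # assumed training resolution
    high_res = torch.randn(8, 3, 224, 224)    # assumed fine-tuning resolution
    logits_low = model(low_res)               # bulk of training happens here
    logits_high = model(high_res)             # brief fine-tuning pass at high resolution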
Related papers
- PYRA: Parallel Yielding Re-Activation for Training-Inference Efficient Task Adaptation [61.57833648734164]
We propose a novel Parallel Yielding Re-Activation (PYRA) method for training-inference efficient task adaptation.
PYRA outperforms all competing methods under both low and high compression rates.
arXiv Detail & Related papers (2024-03-14T09:06:49Z)
- Accelerating Neural Network Training: A Brief Review [0.5825410941577593]
This study examines approaches to expediting the training of deep neural networks (DNNs).
It covers methodologies including Gradient Accumulation (GA), Automatic Mixed Precision (AMP), and Pin Memory (PM); a minimal sketch combining these techniques appears after this list.
arXiv Detail & Related papers (2023-12-15T18:43:45Z)
- No Train No Gain: Revisiting Efficient Training Algorithms For Transformer-based Language Models [31.080446886440757]
In this work, we revisit three categories of such algorithms: dynamic architectures (layer stacking, layer dropping), batch selection (selective backprop, RHO loss), and efficient optimizers (Lion, Sophia).
We find that their training, validation, and downstream gains vanish compared to a baseline with a fully-decayed learning rate.
We define an evaluation protocol that enables computation to be done on arbitrary machines by mapping all computation time to a reference machine, which we call reference system time.
arXiv Detail & Related papers (2023-07-12T20:10:14Z)
- Towards Compute-Optimal Transfer Learning [82.88829463290041]
We argue that zero-shot structured pruning of pretrained models allows them to increase compute efficiency with minimal reduction in performance.
Our results show that pruning convolutional filters of pretrained models can lead to more than 20% performance improvement in low computational regimes.
arXiv Detail & Related papers (2023-04-25T21:49:09Z)
- Efficient Few-Shot Object Detection via Knowledge Inheritance [62.36414544915032]
Few-shot object detection (FSOD) aims at learning a generic detector that can adapt to unseen tasks with scarce training samples.
We present an efficient pretrain-transfer framework (PTF) baseline with no additional computational cost.
We also propose an adaptive length re-scaling (ALR) strategy to alleviate the vector length inconsistency between the predicted novel weights and the pretrained base weights.
arXiv Detail & Related papers (2022-03-23T06:24:31Z)
- Pruning Convolutional Filters using Batch Bridgeout [14.677724755838556]
State-of-the-art computer vision models are rapidly increasing in capacity, where the number of parameters far exceeds the number required to fit the training set.
This overparameterization results in better optimization and generalization performance.
To reduce inference costs, convolutional filters in trained neural networks can be pruned to reduce run-time memory and computational requirements during inference.
We propose the use of Batch Bridgeout, a sparsity-inducing regularization scheme, to train neural networks so that they can be pruned efficiently with minimal degradation in performance; a generic structured filter-pruning sketch appears after this list.
arXiv Detail & Related papers (2020-09-23T01:51:47Z)
- Adaptive Serverless Learning [114.36410688552579]
We propose a novel adaptive decentralized training approach, which can compute the learning rate from data dynamically.
Our theoretical results reveal that the proposed algorithm can achieve linear speedup with respect to the number of workers.
To reduce the communication overhead, we further propose a communication-efficient adaptive decentralized training approach.
arXiv Detail & Related papers (2020-08-24T13:23:02Z)
- Towards Practical Lipreading with Distilled and Efficient Models [57.41253104365274]
Lipreading has witnessed a lot of progress due to the resurgence of neural networks.
Recent works have placed emphasis on aspects such as improving performance by finding the optimal architecture or improving generalization.
There is still a significant gap between the current methodologies and the requirements for an effective deployment of lipreading in practical scenarios.
We propose a series of innovations that significantly bridge that gap: first, we raise the state-of-the-art performance by a wide margin on LRW and LRW-1000, to 88.5% and 46.6% respectively, using self-distillation.
arXiv Detail & Related papers (2020-07-13T16:56:27Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
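As referenced in the entry for "Accelerating Neural Network Training: A Brief Review" above, the following is a minimal sketch of how Gradient Accumulation (GA), Automatic Mixed Precision (AMP) and Pin Memory (PM) are commonly combined in PyTorch. The model, data and hyperparameters are placeholders, not taken from any paper listed here.

    # Minimal sketch (placeholder model/data): GA + AMP + pinned-memory loading.
    import torch
    from torch import nn
    from torch.utils.data import DataLoader, TensorDataset

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10)).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss()

    # Pin Memory (PM): page-locked host buffers speed up host-to-GPU copies.
    data = TensorDataset(torch.randn(512, 3, 32, 32), torch.randint(0, 10, (512,)))
    loader = DataLoader(data, batch_size=32, pin_memory=True)

    accum_steps = 4                                                 # Gradient Accumulation (GA)
    scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))  # AMP loss scaling

    optimizer.zero_grad()
    for step, (x, y) in enumerate(loader):
        x, y = x.to(device, non_blocking=True), y.to(device, non_blocking=True)
        with torch.autocast(device_type=device, enabled=(device == "cuda")):  # AMP autocast
            loss = loss_fn(model(x), y) / accum_steps     # scale loss for accumulation
        scaler.scale(loss).backward()                     # accumulate scaled gradients
        if (step + 1) % accum_steps == 0:
            scaler.step(optimizer)                        # unscale and apply the update
            scaler.update()
            optimizer.zero_grad()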
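Similarly, the structured filter-pruning sketch promised above is given below. It uses PyTorch's built-in pruning utilities to zero out the lowest-L2-norm convolutional filters; it illustrates filter pruning in general and is not an implementation of Batch Bridgeout or of the zero-shot pruning method in "Towards Compute-Optimal Transfer Learning".

    # Generic structured filter pruning (illustrative only).
    import torch
    from torch import nn
    from torch.nn.utils import prune

    model = nn.Sequential(
        nn.Conv2d(3, 32, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.Conv2d(32, 64, kernel_size=3, padding=1),
    )

    # Zero out 25% of the output filters (dim=0) with the smallest L2 norm per conv layer.
    for module in model.modules():
        if isinstance(module, nn.Conv2d):
            prune.ln_structured(module, name="weight", amount=0.25, n=2, dim=0)
            prune.remove(module, "weight")   # bake the pruning mask into the weight tensor

    # Pruned filters are zeroed but still present; realising the memory/compute savings
    # requires physically removing them (e.g. rebuilding the layer with fewer channels).
    with torch.no_grad():
        print(model(torch.randn(1, 3, 32, 32)).shape)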