Towards Better Accuracy-efficiency Trade-offs: Divide and Co-training
- URL: http://arxiv.org/abs/2011.14660v3
- Date: Sat, 20 Mar 2021 14:03:54 GMT
- Title: Towards Better Accuracy-efficiency Trade-offs: Divide and Co-training
- Authors: Shuai Zhao, Liguang Zhou, Wenxiao Wang, Deng Cai, Tin Lun Lam,
Yangsheng Xu
- Abstract summary: We argue that increasing the number of networks (ensemble) can achieve better accuracy-efficiency trade-offs than purely increasing the width.
Small networks can achieve better ensemble performance than a single large network with few or no extra parameters or FLOPs.
- Score: 24.586453683904487
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The width of a neural network matters since increasing the width will
necessarily increase the model capacity. However, the performance of a network
does not improve linearly with the width and soon gets saturated. In this case,
we argue that increasing the number of networks (ensemble) can achieve better
accuracy-efficiency trade-offs than purely increasing the width. To test this claim,
one large network is divided into several small ones with respect to its parameters
and regularization components. Each of these small networks has a fraction of
the original one's parameters. We then train these small networks together and
make them see various views of the same data to increase their diversity.
During this co-training process, networks can also learn from each other. As a
result, small networks can achieve better ensemble performance than the large
one with few or no extra parameters or FLOPs. The small networks can also achieve
faster inference than the large one by running concurrently on different
devices. We validate our argument with 8 different neural architectures on
common benchmarks through extensive experiments. The code is available at
https://github.com/mzhaoshuai/Divide-and-Co-training.
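
Below is a minimal PyTorch sketch of the idea described in the abstract, not the authors' implementation (see the repository above for that). The names SmallNet, co_train_step, and ensemble_predict are hypothetical, and the mutual-distillation (KL) term is one plausible way to realize "networks can also learn from each other"; the abstract does not specify the exact loss. Each small network receives a differently augmented view of the same batch, and at inference their softmax outputs are averaged.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SmallNet(nn.Module):
    """Hypothetical stand-in for one narrow backbone (a fraction of the wide model)."""
    def __init__(self, width=64, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, width, 3, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.classifier = nn.Linear(width, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x))

def co_train_step(nets, views, targets, optimizer, T=3.0, alpha=0.5):
    # Each small network gets its own augmented view of the same batch.
    logits = [net(v) for net, v in zip(nets, views)]
    loss = 0.0
    for i, li in enumerate(logits):
        ce = F.cross_entropy(li, targets)
        # Soft KL toward the other networks' (detached) predictions,
        # so the networks can also learn from each other (an assumed loss form).
        kl = sum(
            F.kl_div(F.log_softmax(li / T, dim=1),
                     F.softmax(lj.detach() / T, dim=1),
                     reduction="batchmean") * T * T
            for j, lj in enumerate(logits) if j != i
        ) / (len(logits) - 1)
        loss = loss + ce + alpha * kl
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def ensemble_predict(nets, x):
    # Average the softmax outputs; each small net could run concurrently
    # on a different device for faster inference than one wide network.
    return torch.stack([F.softmax(net(x), dim=1) for net in nets]).mean(0)

With, for example, nets = [SmallNet(), SmallNet()] and a single optimizer over both networks' parameters, views would simply be two independently augmented copies of the same batch.
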
Related papers
- Network Fission Ensembles for Low-Cost Self-Ensembles [20.103367702014474]
We propose a low-cost ensemble learning and inference method called Network Fission Ensembles (NFE)
We first prune some of the weights to reduce the training burden.
We then group the remaining weights into several sets and create multiple auxiliary paths using each set to construct multi-exits.
arXiv Detail & Related papers (2024-08-05T08:23:59Z) - Kronecker-Factored Approximate Curvature for Modern Neural Network
Architectures [85.76673783330334]
Two different settings of linear weight-sharing layers motivate two flavours of Kronecker-Factored Approximate Curvature (K-FAC)
We show they are exact for deep linear networks with weight-sharing in their respective setting.
We observe little difference between these two K-FAC variations when using them to train both a graph neural network and a vision transformer.
arXiv Detail & Related papers (2023-11-01T16:37:00Z) - Feature-Learning Networks Are Consistent Across Widths At Realistic
Scales [72.27228085606147]
We study the effect of width on the dynamics of feature-learning neural networks across a variety of architectures and datasets.
Early in training, wide neural networks trained on online data not only have identical loss curves but also agree in their point-wise test predictions throughout training.
We observe, however, that ensembles of narrower networks perform worse than a single wide network.
arXiv Detail & Related papers (2023-05-28T17:09:32Z) - Width is Less Important than Depth in ReLU Neural Networks [40.83290846983707]
We show that any target network with inputs in $mathbbRd$ can be approximated by a width $O(d)$ network.
We extend our results to constructing networks with bounded weights, and to constructing networks with width at most $d+2$.
arXiv Detail & Related papers (2022-02-08T13:07:22Z) - Greedy Network Enlarging [53.319011626986004]
We propose a greedy network enlarging method based on the reallocation of computations.
By modifying the computations at different stages step by step, the enlarged network is equipped with an optimal allocation and utilization of MACs.
With application of our method on GhostNet, we achieve state-of-the-art 80.9% and 84.3% ImageNet top-1 accuracies.
arXiv Detail & Related papers (2021-07-31T08:36:30Z) - BCNet: Searching for Network Width with Bilaterally Coupled Network [56.14248440683152]
We introduce a new supernet called Bilaterally Coupled Network (BCNet) to address this issue.
In BCNet, each channel is fairly trained and responsible for the same amount of network widths, thus each network width can be evaluated more accurately.
Our method achieves state-of-the-art or competitive performance compared with other baseline methods.
arXiv Detail & Related papers (2021-05-21T18:54:03Z) - Bit-Mixer: Mixed-precision networks with runtime bit-width selection [72.32693989093558]
Bit-Mixer is the first method to train a meta-quantized network where, at test time, any layer can change its bit-width without affecting the overall network's ability to perform highly accurate inference.
We show that our method can result in mixed precision networks that exhibit the desirable flexibility properties for on-device deployment without compromising accuracy.
arXiv Detail & Related papers (2021-03-31T17:58:47Z) - Rescaling CNN through Learnable Repetition of Network Parameters [2.137666194897132]
We present a novel rescaling strategy for CNNs based on learnable repetition of its parameters.
We show that small base networks, when rescaled, can provide performance comparable to deeper networks with as few as 6% of the deeper network's optimization parameters.
arXiv Detail & Related papers (2021-01-14T15:03:25Z) - Multigrid-in-Channels Architectures for Wide Convolutional Neural
Networks [6.929025509877642]
We present a multigrid approach that combats the quadratic growth of the number of parameters with respect to the number of channels in standard convolutional neural networks (CNNs)
Our examples from supervised image classification show that applying this strategy to residual networks and MobileNetV2 considerably reduces the number of parameters without negatively affecting accuracy.
arXiv Detail & Related papers (2020-06-11T20:28:36Z) - Network Adjustment: Channel Search Guided by FLOPs Utilization Ratio [101.84651388520584]
This paper presents a new framework named network adjustment, which considers network accuracy as a function of FLOPs.
Experiments on standard image classification datasets and a wide range of base networks demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2020-04-06T15:51:00Z)