Mixture-of-Supernets: Improving Weight-Sharing Supernet Training with
Architecture-Routed Mixture-of-Experts
- URL: http://arxiv.org/abs/2306.04845v1
- Date: Thu, 8 Jun 2023 00:35:36 GMT
- Title: Mixture-of-Supernets: Improving Weight-Sharing Supernet Training with
Architecture-Routed Mixture-of-Experts
- Authors: Ganesh Jawahar, Haichuan Yang, Yunyang Xiong, Zechun Liu, Dilin Wang,
Fei Sun, Meng Li, Aasish Pappu, Barlas Oguz, Muhammad Abdul-Mageed, Laks V.
S. Lakshmanan, Raghuraman Krishnamoorthi, Vikas Chandra
- Abstract summary: We propose mixture-of-supernets, where mixture-of-experts (MoE) is adopted to enhance the expressive power of the supernet model.
Compared to existing weight-sharing supernets for NLP, our method minimizes retraining time, greatly improving training efficiency.
- Score: 52.71174872516908
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Weight-sharing supernets have become a vital component for performance
estimation in state-of-the-art (SOTA) neural architecture search (NAS)
frameworks. Although a supernet can directly generate different subnetworks
without retraining, there is no guarantee of the quality of these subnetworks
because of weight sharing. In NLP tasks such as machine translation and
pre-trained language modeling, we observe that, given the same model
architecture, there is a large performance gap between the supernet and training
from scratch. Hence, the supernet cannot be used directly, and retraining is
necessary after finding the optimal architectures.
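To make the weight-sharing coupling concrete, below is a minimal, illustrative sketch (class and variable names are ours, not the paper's): every sampled subnetwork computes with slices of one shared weight tensor, so architectures of all sizes are entangled through the same parameters.

```python
import torch
import torch.nn as nn

class ShareLinear(nn.Module):
    """Conventional weight-sharing: one max-size weight, subnets take slices."""
    def __init__(self, max_in: int, max_out: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(max_out, max_in) * 0.02)
        self.bias = nn.Parameter(torch.zeros(max_out))

    def forward(self, x: torch.Tensor, out_dim: int) -> torch.Tensor:
        # Every subnet reuses the top-left slice of the same tensor, so small
        # and large architectures are coupled through the shared weights.
        in_dim = x.shape[-1]
        return x @ self.weight[:out_dim, :in_dim].t() + self.bias[:out_dim]

layer = ShareLinear(max_in=512, max_out=512)
x = torch.randn(8, 256)       # a sampled subnet with hidden size 256
y = layer(x, out_dim=128)     # its weights overlap with every other subnet's
print(y.shape)                # torch.Size([8, 128])
```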
In this work, we propose mixture-of-supernets, a generalized supernet
formulation that adopts mixture-of-experts (MoE) to enhance the expressive
power of the supernet model with negligible training overhead. In this way,
different subnetworks do not share the model weights directly, but through an
architecture-based routing mechanism. As a result, the model weights of different
subnetworks are customized towards their specific architectures, and the weight
generation is learned by gradient descent. Compared to existing weight-sharing
supernets for NLP, our method can minimize the retraining time, greatly
improving training efficiency. In addition, the proposed method achieves
SOTA performance in NAS for building fast machine translation models, yielding a
better latency-BLEU tradeoff than HAT, the SOTA NAS method for MT. We
also achieve SOTA performance in NAS for building memory-efficient,
task-agnostic BERT models, outperforming NAS-BERT and AutoDistil across various
model sizes.
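The following is a minimal sketch of this idea as we read it from the abstract, not the authors' implementation: each layer keeps a small set of expert weight tensors, and a learned router maps an architecture descriptor to mixing coefficients, so each subnetwork effectively gets weights customized to its architecture. The expert count, router design, and toy architecture encoding are all our assumptions.

```python
import torch
import torch.nn as nn

class MoSLinear(nn.Module):
    """Illustrative mixture-of-supernets layer (not the authors' code):
    per-architecture weights are a learned convex mix of expert weights."""
    def __init__(self, max_in: int, max_out: int, num_experts: int = 2,
                 arch_feat_dim: int = 8):
        super().__init__()
        self.experts = nn.Parameter(
            torch.randn(num_experts, max_out, max_in) * 0.02)
        self.bias = nn.Parameter(torch.zeros(max_out))
        # Router maps an architecture descriptor (e.g. widths/depths) to
        # mixing coefficients; it is trained jointly by gradient descent.
        self.router = nn.Sequential(
            nn.Linear(arch_feat_dim, num_experts), nn.Softmax(dim=-1))

    def forward(self, x, arch_feat, out_dim):
        coef = self.router(arch_feat)                     # (num_experts,)
        w = torch.einsum("e,eoi->oi", coef, self.experts) # mixed weight
        in_dim = x.shape[-1]
        return x @ w[:out_dim, :in_dim].t() + self.bias[:out_dim]

layer = MoSLinear(512, 512)
arch = torch.tensor([256., 128., 0., 0., 0., 0., 0., 0.])  # toy encoding
y = layer(torch.randn(8, 256), arch, out_dim=128)
print(y.shape)  # torch.Size([8, 128])
```

Training would then proceed as in standard supernet training, sampling subnetworks and backpropagating through both the mixed weights and the router.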
Related papers
- Auto-Train-Once: Controller Network Guided Automatic Network Pruning from Scratch [72.26822499434446]
Auto-Train-Once (ATO) is an innovative network pruning algorithm designed to automatically reduce the computational and storage costs of DNNs.
We provide a comprehensive convergence analysis as well as extensive experiments, and the results show that our approach achieves state-of-the-art performance across various model architectures.
arXiv Detail & Related papers (2024-03-21T02:33:37Z)
- TODM: Train Once Deploy Many Efficient Supernet-Based RNN-T Compression For On-device ASR Models [30.758876520227666]
TODM is a new approach to efficiently training many sizes of hardware-friendly on-device ASR models, with GPU-hours comparable to those of a single training job.
We introduce a novel combination of three techniques to improve the outcomes of the TODM Supernet.
Results demonstrate that our TODM Supernet either matches or surpasses the performance of manually tuned models, with up to a 3% relative improvement in word error rate (WER).
arXiv Detail & Related papers (2023-09-05T04:47:55Z)
- NASRec: Weight Sharing Neural Architecture Search for Recommender Systems [40.54254555949057]
We propose NASRec, a paradigm that trains a single supernet and efficiently produces abundant models/sub-architectures by weight sharing.
Our results on three Click-Through Rate (CTR) prediction benchmarks show that NASRec can outperform both manually designed models and existing NAS methods.
arXiv Detail & Related papers (2022-07-14T20:15:11Z)
- Supernet Training for Federated Image Classification under System Heterogeneity [15.2292571922932]
In this work, we propose a novel framework to consider both scenarios, namely Federation of Supernet Training (FedSup).
It is inspired by how averaging parameters in the model aggregation stage of Federated Learning (FL) is similar to weight-sharing in supernet training.
Under our framework, we present an efficient algorithm (E-FedSup) by sending the sub-model to clients in the broadcast stage for reducing communication costs and training overhead.
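As a purely hypothetical illustration of that analogy (not the paper's algorithm), the sketch below averages heterogeneous client sub-model slices into the shared supernet weight, averaging each coordinate only over the clients that actually cover it:

```python
import torch

def aggregate_subnet_updates(server_weight, client_weights):
    """Hypothetical FedAvg-style aggregation over heterogeneous sub-models:
    each client trains a slice of the supernet weight, and the server
    averages every coordinate over the clients that cover it."""
    accum = torch.zeros_like(server_weight)
    count = torch.zeros_like(server_weight)
    for w in client_weights:       # w is an (out, in) slice of the supernet
        o, i = w.shape
        accum[:o, :i] += w
        count[:o, :i] += 1
    mask = count > 0
    server_weight[mask] = accum[mask] / count[mask]
    return server_weight

server = torch.zeros(4, 4)
clients = [torch.ones(2, 2), 3 * torch.ones(4, 4)]
print(aggregate_subnet_updates(server, clients))
# shared top-left 2x2 block averages to 2.0; the rest keeps the 4x4 client's 3.0
```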
arXiv Detail & Related papers (2022-06-03T02:21:01Z)
- Evolutionary Neural Cascade Search across Supernetworks [68.8204255655161]
We introduce ENCAS - Evolutionary Neural Cascade Search.
ENCAS can be used to search over multiple pretrained supernetworks.
We test ENCAS on common computer vision benchmarks.
arXiv Detail & Related papers (2022-03-08T11:06:01Z)
- Enabling NAS with Automated Super-Network Generation [60.72821429802335]
Recent Neural Architecture Search (NAS) solutions have produced impressive results by training super-networks and then deriving subnetworks from them.
We present BootstrapNAS, a software framework for automatic generation of super-networks for NAS.
arXiv Detail & Related papers (2021-12-20T21:45:48Z)
- An Analysis of Super-Net Heuristics in Weight-Sharing NAS [70.57382341642418]
We show that simple random search achieves competitive performance to complex state-of-the-art NAS algorithms when the super-net is properly trained.
arXiv Detail & Related papers (2021-10-04T02:18:44Z)
- AlphaNet: Improved Training of Supernet with Alpha-Divergence [28.171262066145616]
We propose to improve the supernet training with a more generalized alpha-divergence.
We apply the proposed alpha-divergence based supernet training to both slimmable neural networks and weight-sharing NAS.
Specifically, our discovered model family, AlphaNet, outperforms prior-art models on a wide range of FLOPs regimes.
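For context, the summary only names a "more generalized alpha-divergence"; a standard form of the alpha-divergence between normalized distributions p (teacher) and q (student), which recovers both KL distillation losses in the limits, is shown below. The specific adaptive variant the paper trains with is detailed in the linked abstract.

```latex
\[
D_{\alpha}(p \,\|\, q)
  = \frac{1}{\alpha(\alpha - 1)}
    \left( \sum_i p_i^{\alpha} \, q_i^{\,1-\alpha} - 1 \right),
\qquad
\lim_{\alpha \to 1} D_{\alpha}(p \,\|\, q) = \mathrm{KL}(p \,\|\, q),
\quad
\lim_{\alpha \to 0} D_{\alpha}(p \,\|\, q) = \mathrm{KL}(q \,\|\, p).
\]
```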
arXiv Detail & Related papers (2021-02-16T04:23:55Z)
- BigNAS: Scaling Up Neural Architecture Search with Big Single-Stage Models [59.95091850331499]
We propose BigNAS, an approach that challenges the conventional wisdom that post-processing of the weights is necessary to get good prediction accuracies.
Our discovered model family, BigNASModels, achieves top-1 accuracies ranging from 76.5% to 80.9%.
arXiv Detail & Related papers (2020-03-24T23:00:49Z)