Towards Accurate and Compact Architectures via Neural Architecture
Transformer
- URL: http://arxiv.org/abs/2102.10301v1
- Date: Sat, 20 Feb 2021 09:38:10 GMT
- Title: Towards Accurate and Compact Architectures via Neural Architecture
Transformer
- Authors: Yong Guo, Yin Zheng, Mingkui Tan, Qi Chen, Zhipeng Li, Jian Chen,
Peilin Zhao, Junzhou Huang
- Abstract summary: It is necessary to optimize the operations inside an architecture to improve the performance without introducing extra computational cost.
We have proposed a Neural Architecture Transformer (NAT) method which casts the optimization problem into a Markov Decision Process (MDP).
We propose a Neural Architecture Transformer++ (NAT++) method which further enlarges the set of candidate transitions to improve the performance of architecture optimization.
- Score: 95.4514639013144
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Designing effective architectures is one of the key factors behind the
success of deep neural networks. Existing deep architectures are either
manually designed or automatically searched by some Neural Architecture Search
(NAS) methods. However, even a well-designed/searched architecture may still
contain many insignificant or redundant modules/operations. Thus, it is
necessary to optimize the operations inside an architecture to improve the
performance without introducing extra computational cost. To this end, we have
proposed a Neural Architecture Transformer (NAT) method which casts the
optimization problem into a Markov Decision Process (MDP) and seeks to replace
redundant operations with more efficient ones, such as skip or null
connections. Note that NAT only considers a small number of possible transitions
and thus comes with a limited search/transition space. As a result, such a
small search space may hamper the performance of architecture optimization. To
address this issue, we propose a Neural Architecture Transformer++ (NAT++)
method which further enlarges the set of candidate transitions to improve the
performance of architecture optimization. Specifically, we present a two-level
transition rule to obtain valid transitions, i.e., allowing operations to change
to more efficient types (e.g., convolution → separable convolution) or smaller
kernel sizes (e.g., 5×5 → 3×3). Note that different operations may have
different sets of valid transitions. We further propose a Binary-Masked Softmax
(BMSoftmax) layer to rule out invalid transitions. Extensive
experiments on several benchmark datasets show that the transformed
architecture significantly outperforms both its original counterpart and the
architectures optimized by existing methods.
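To make the two-level transition rule and the BMSoftmax layer concrete, the sketch below gives a minimal illustration in PyTorch. It is not the authors' implementation: the operation vocabulary, the VALID_TRANSITIONS table, and the policy logits are hypothetical placeholders. The point is only that each operation gets its own binary mask of valid transitions, and the masked softmax assigns zero probability to every transition the rule forbids.

```python
import torch
import torch.nn.functional as F

# Hypothetical operation vocabulary (illustrative only, not the paper's exact set).
OPS = ["null", "skip_connect", "sep_conv_3x3", "sep_conv_5x5", "conv_3x3", "conv_5x5"]
OP_IDX = {name: i for i, name in enumerate(OPS)}

# One possible encoding of the two-level transition rule: an operation may keep its type,
# switch to a more efficient type (e.g. conv -> separable conv), shrink its kernel
# (e.g. 5x5 -> 3x3), or collapse to skip/null. Different operations therefore have
# different sets of valid transitions.
VALID_TRANSITIONS = {
    "conv_5x5":     ["conv_5x5", "conv_3x3", "sep_conv_5x5", "sep_conv_3x3", "skip_connect", "null"],
    "conv_3x3":     ["conv_3x3", "sep_conv_3x3", "skip_connect", "null"],
    "sep_conv_5x5": ["sep_conv_5x5", "sep_conv_3x3", "skip_connect", "null"],
    "sep_conv_3x3": ["sep_conv_3x3", "skip_connect", "null"],
    "skip_connect": ["skip_connect", "null"],
    "null":         ["null"],
}

def transition_mask(current_op: str) -> torch.Tensor:
    """Binary mask over the vocabulary: 1 for valid transitions of current_op, 0 otherwise."""
    mask = torch.zeros(len(OPS))
    for op in VALID_TRANSITIONS[current_op]:
        mask[OP_IDX[op]] = 1.0
    return mask

def binary_masked_softmax(logits: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Softmax restricted to valid transitions: invalid entries receive probability zero."""
    neg_inf = torch.finfo(logits.dtype).min
    masked_logits = torch.where(mask.bool(), logits, torch.full_like(logits, neg_inf))
    return F.softmax(masked_logits, dim=-1)

# Example: score all candidate operations for an edge that currently holds a 5x5 convolution,
# zero out forbidden transitions, and sample a replacement from the remaining ones.
logits = torch.randn(len(OPS))  # stand-in for a learned policy's scores
probs = binary_masked_softmax(logits, transition_mask("conv_5x5"))
new_op = OPS[torch.multinomial(probs, 1).item()]
print(probs.tolist(), "->", new_op)
```

In the paper's MDP view, such masked probabilities would parameterize the policy that picks a replacement operation for each edge; here a random logit vector stands in for the learned policy.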
Related papers
- AsCAN: Asymmetric Convolution-Attention Networks for Efficient Recognition and Generation [48.82264764771652]
We introduce AsCAN -- a hybrid architecture, combining both convolutional and transformer blocks.
AsCAN supports a variety of tasks: recognition, segmentation, class-conditional image generation.
We then scale the same architecture to solve a large-scale text-to-image task and show state-of-the-art performance.
arXiv Detail & Related papers (2024-11-07T18:43:17Z)
- Full Stack Optimization of Transformer Inference: a Survey [58.55475772110702]
Transformer models achieve superior accuracy across a wide range of applications.
The amount of compute and bandwidth required for inference of recent Transformer models is growing at a significant rate.
There has been an increased focus on making Transformer models more efficient.
arXiv Detail & Related papers (2023-02-27T18:18:13Z)
- Learning Interpretable Models Through Multi-Objective Neural Architecture Search [0.9990687944474739]
We propose a framework to optimize for both task performance and "introspectability," a surrogate metric for aspects of interpretability.
We demonstrate that jointly optimizing for task error and introspectability leads to more disentangled and debuggable architectures that perform within error.
arXiv Detail & Related papers (2021-12-16T05:50:55Z)
- Rethinking Architecture Selection in Differentiable NAS [74.61723678821049]
Differentiable Neural Architecture Search is one of the most popular NAS methods for its search efficiency and simplicity.
We propose an alternative perturbation-based architecture selection that directly measures each operation's influence on the supernet.
We find that several failure modes of DARTS can be greatly alleviated with the proposed selection method.
arXiv Detail & Related papers (2021-08-10T00:53:39Z)
- iDARTS: Differentiable Architecture Search with Stochastic Implicit Gradients [75.41173109807735]
Differentiable ARchiTecture Search (DARTS) has recently become the mainstream of neural architecture search (NAS).
We tackle the hypergradient computation in DARTS based on the implicit function theorem.
We show that the architecture optimisation with the proposed method, named iDARTS, is expected to converge to a stationary point.
arXiv Detail & Related papers (2021-06-21T00:44:11Z)
- Operation Embeddings for Neural Architecture Search [15.033712726016255]
We propose the replacement of fixed operator encoding with learnable representations in the optimization process.
Our method produces top-performing architectures that share similar operation and graph patterns.
arXiv Detail & Related papers (2021-05-11T09:17:10Z)
- Differentiable Neural Architecture Transformation for Reproducible Architecture Improvement [3.766702945560518]
We propose differentiable neural architecture transformation that is reproducible and efficient.
Extensive experiments on two datasets, i.e., CIFAR-10 and Tiny ImageNet, show that the proposed method clearly outperforms NAT.
arXiv Detail & Related papers (2020-06-15T09:03:48Z)
- Stage-Wise Neural Architecture Search [65.03109178056937]
Modern convolutional networks such as ResNet and NASNet have achieved state-of-the-art results in many computer vision applications.
These networks consist of stages, which are sets of layers that operate on representations in the same resolution.
It has been demonstrated that increasing the number of layers in each stage improves the prediction ability of the network.
However, the resulting architecture becomes computationally expensive in terms of floating point operations, memory requirements and inference time.
arXiv Detail & Related papers (2020-04-23T14:16:39Z)